git.karo-electronics.de Git - karo-tx-linux.git/log

mm: compaction: make __compact_pgdat() and compact_pgdat() return void

These functions always return 0. Formalise this.

Cc: Jason Liu <r64343@freescale.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: fix BUG on madvise early failure

Commit "mm: make madvise(MADV_WILLNEED) support swap file prefetch" has
allowed a situation where blk_finish_plug() would be called without
blk_start_plug() being called before that, which would lead to a BUG:

[   57.320031] kernel BUG at block/blk-core.c:2981!
[   57.320031] invalid opcode: 0000 [#3] PREEMPT SMP DEBUG_PAGEALLOC
[   57.320031] Modules linked in:
[   57.320031] CPU 4
[   57.320031] Pid: 7013, comm: trinity Tainted: G      D W    3.8.0-rc3-next-20130114-sasha-00016-ga107525-dirty #261
[   57.320031] RIP: 0010:[<ffffffff819e3bfa>]  [<ffffffff819e3bfa>] blk_flush_plug_list+0x2a/0x270
[   57.320031] RSP: 0018:ffff880014cc5e68  EFLAGS: 00010297
[   57.320031] RAX: 0000000091827364 RBX: ffff880014cc5e78 RCX: 0000000000000001
[   57.320031] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880014cc5f18
[   57.340707] RBP: ffff880014cc5ec8 R08: 0000000000000002 R09: 0000000000000000
[   57.340707] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000a
[   57.340707] R13: 000000000000001c R14: ffff880014cc5f18 R15: ffff880014cc5f18
[   57.340707] FS:  00007f2ba9ee4700(0000) GS:ffff880013c00000(0000) knlGS:0000000000000000
[   57.340707] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   57.340707] CR2: 00000000008db558 CR3: 0000000014d0a000 CR4: 00000000000406e0
[   57.340707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   57.340707] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   57.340707] Process trinity (pid: 7013, threadinfo ffff880014cc4000, task ffff880014ed8000)
[   57.340707] Stack:
[   57.340707]  ffffffff8123b2e4 00007f2ba9ee46a8 ffff880014cc5e78 ffff880014cc5e78
[   57.340707]  ffff880014c240b0 ffff880014c24110 ffffffff00000001 ffff880014cc5f18
[   57.340707]  000000000000000a 000000000000001c 0000000000000aec ffff880014cc5f18
[   57.340707] Call Trace:
[   57.340707]  [<ffffffff8123b2e4>] ? sys_madvise+0x2a4/0x2e0
[   57.340707]  [<ffffffff819e3e53>] blk_finish_plug+0x13/0x40
[   57.340707]  [<ffffffff8123b268>] sys_madvise+0x228/0x2e0
[   57.340707]  [<ffffffff8107df14>] ? syscall_trace_enter+0x24/0x2e0
[   57.340707]  [<ffffffff83d33d58>] tracesys+0xe1/0xe6
[   57.340707] Code: 00 55 b8 64 73 82 91 48 89 e5 41 57 41 56 49 89 fe 41 55 41 54 53 48 8d 5d b0 48 83 ec 38 48 89 5d b0 48 39 07 48 89 5d b8 74 06 <0f> 0b 0f 1f 40 00 48 8d 45 c0 44 0f b6 e6 48 8b 57 18 48 89 45
[   57.340707] RIP  [<ffffffff819e3bfa>] blk_flush_plug_list+0x2a/0x270
[   57.340707]  RSP <ffff880014cc5e68>

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm-make-madvisemadv_willneed-support-swap-file-prefetch-fix

fix CONFIG_SWAP=n build

Cc: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: make madvise(MADV_WILLNEED) support swap file prefetch

Make madvise(MADV_WILLNEED) support swap file prefetch. If memory is
swapout, this syscall can do swapin prefetch. It has no impact if the
memory isn't swapout.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mmotm-memcgvmscan-do-not-break-out-targeted-reclaim-without-reclaimed-pagespatch-fix-fix

use conventional comparison order

Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ying Han <yinghan@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mmotm: memcgvmscan-do-not-break-out-targeted-reclaim-without-reclaimed-pages.patch fix

We should break out of the hierarchy loop only if nr_reclaimed exceeded
nr_to_reclaim and not vice-versa. This patch fixes the condition.

Signed-off-by: Ying Han <yinghan@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memcg,vmscan: do not break out targeted reclaim without reclaimed pages

Targeted (hard resp.  soft) reclaim has traditionally tried to scan one
group with decreasing priority until nr_to_reclaim (SWAP_CLUSTER_MAX
pages) is reclaimed or all priorities are exhausted.  The reclaim is then
retried until the limit is met.

This approach, however, doesn't work well with deeper hierarchies where
groups higher in the hierarchy do not have any or only very few pages
(this usually happens if those groups do not have any tasks and they have
only re-parented pages after some of their children is removed).  Those
groups are reclaimed with decreasing priority pointlessly as there is
nothing to reclaim from them.

An easiest fix is to break out of the memcg iteration loop in shrink_zone
only if the whole hierarchy has been visited or sufficient pages have been
reclaimed.  This is also more natural because the reclaimer expects that
the hierarchy under the given root is reclaimed.  As a result we can
simplify the soft limit reclaim which does its own iteration.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Ying Han <yinghan@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Li Zefan <lizefan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/ksm.c: use new hashtable implementation

Switch ksm to use the new hashtable implementation. This reduces the
amount of generic unrelated code in the ksm module.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/huge_memory.c: use new hashtable implementation

Switch hugemem to use the new hashtable implementation. This reduces the
amount of generic unrelated code in the hugemem.

This also removes the dymanic allocation of the hash table. The upside is
that we save a pointer dereference when accessing the hashtable, but we
lose 8KB if CONFIG_TRANSPARENT_HUGEPAGE is enabled but the processor
doesn't support hugepages.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: compaction: do not accidentally skip pageblocks in the migrate scanner

Compaction uses the ALIGN macro incorrectly with the migrate scanner by
adding pageblock_nr_pages to a PFN.  It happened to work when initially
implemented as the starting PFN was also aligned but with caching restarts
and isolating in smaller chunks this is no longer always true.

The impact is that the migrate scanner scans outside its current
pageblock.  As pfn_valid() is still checked properly it does not cause any
failure and the impact of the bug is that in some cases it will scan more
than necessary when it crosses a page boundary but by no more than
COMPACT_CLUSTER_MAX.  It is highly unlikely this is even measurable but
it's still wrong so this patch addresses the problem.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/vmscan.c:__zone_reclaim(): replace max_t() with max()

"mm: vmscan: save work scanning (almost) empty LRU lists" made
SWAP_CLUSTER_MAX an unsigned long.

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc.c:__setup_per_zone_wmarks: make min_pages unsigned long

`int' is an inappropriate type for a number-of-pages counter.

While we're there, use the clamp() macro.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: reduce rmap overhead for ex-KSM page copies created on swap faults

When ex-KSM pages are faulted from swap cache, the fault handler is not
capable of re-establishing anon_vma-spanning KSM pages. In this case, a
copy of the page is created instead, just like during a COW break.

These freshly made copies are known to be exclusive to the faulting VMA
and there is no reason to go look for this page in parent and sibling
processes during rmap operations.

Use page_add_new_anon_rmap() for these copies. This also puts them on the
proper LRU lists and marks them SwapBacked, so we can get rid of doing
this ad-hoc in the KSM copy code.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm-vmscan-compaction-works-against-zones-not-lruvecs-fix

no need for min_t

Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: vmscan: compaction works against zones, not lruvecs

The restart logic for when reclaim operates back to back with compaction
is currently applied on the lruvec level.  But this does not make sense,
because the container of interest for compaction is a zone as a whole, not
the zone pages that are part of a certain memory cgroup.

Negative impact is bounded.  For one, the code checks that the lruvec has
enough reclaim candidates, so it does not risk getting stuck on a
condition that can not be fulfilled.  And the unfairness of hammering on
one particular memory cgroup to make progress in a zone will be amortized
by the round robin manner in which reclaim goes through the memory
cgroups.  Still, this can lead to unnecessary allocation latencies when
the code elects to restart on a hard to reclaim or small group when there
are other, more reclaimable groups in the zone.

Move this logic to the zone level and restart reclaim for all memory
cgroups in a zone when compaction requires more free pages from it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm-vmscan-clean-up-get_scan_count-fix

avoid using unintialized_var()

Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: vmscan: clean up get_scan_count()

Reclaim pressure balance between anon and file pages is calculated through
a tuple of numerators and a shared denominator.

Exceptional cases that want to force-scan anon or file pages configure the
numerators and denominator such that one list is preferred, which is not
necessarily the most obvious way:

    fraction[0] = 1;
    fraction[1] = 0;
    denominator = 1;
    goto out;

Make this easier by making the force-scan cases explicit and use the
fractionals only in case they are calculated from reclaim history.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: vmscan: improve comment on low-page cache handling

Fix comment style and elaborate on why anonymous memory is force-scanned
when file cache runs low.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: vmscan: clarify how swappiness, highest priority, memcg interact

A swappiness of 0 has a slightly different meaning for global reclaim (may
swap if file cache really low) and memory cgroup reclaim (never swap,
ever).

In addition, global reclaim at highest priority will scan all LRU lists
equal to their size and ignore other balancing heuristics.  UNLESS
swappiness forbids swapping, then the lists are balanced based on recent
reclaim effectiveness.  UNLESS file cache is running low, then anonymous
pages are force-scanned.

This (total mess of a) behaviour is implicit and not obvious from the way
the code is organized.  At least make it apparent in the code flow and
document the conditions.  It will be it easier to come up with sane
semantics later.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Satoru Moriya <satoru.moriya@hds.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: vmscan: save work scanning (almost) empty LRU lists

In certain cases (kswapd reclaim, memcg target reclaim), a fixed minimum
amount of pages is scanned from the LRU lists on each iteration, to make
progress.

Do not make this minimum bigger than the respective LRU list size,
however, and save some busy work trying to isolate and reclaim pages that
are not there.

Empty LRU lists are quite common with memory cgroups in NUMA environments
because there exists a set of LRU lists for each zone for each memory
cgroup, while the memory of a single cgroup is expected to stay on just
one node.  The number of expected empty LRU lists is thus

  memcgs * (nodes - 1) * lru types

Each attempt to reclaim from an empty LRU list does expensive size
comparisons between lists, acquires the zone's lru lock etc.  Avoid that.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: memcg: only evict file pages when we have plenty

e986850 ("mm, vmscan: only evict file pages when we have plenty") makes a
point of not going for anonymous memory while there is still enough
inactive cache around.

The check was added only for global reclaim, but it is just as useful to
reduce swapping in memory cgroup reclaim:

200M-memcg-defconfig-j2

                                 vanilla                   patched
Real time              454.06 (  +0.00%)         453.71 (  -0.08%)
User time              668.57 (  +0.00%)         668.73 (  +0.02%)
System time            128.92 (  +0.00%)         129.53 (  +0.46%)
Swap in               1246.80 (  +0.00%)         814.40 ( -34.65%)
Swap out              1198.90 (  +0.00%)         827.00 ( -30.99%)
Pages allocated   16431288.10 (  +0.00%)    16434035.30 (  +0.02%)
Major faults           681.50 (  +0.00%)         593.70 ( -12.86%)
THP faults             237.20 (  +0.00%)         242.40 (  +2.18%)
THP collapse           241.20 (  +0.00%)         248.50 (  +3.01%)
THP splits             157.30 (  +0.00%)         161.40 (  +2.59%)

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Satoru Moriya <satoru.moriya@hds.com>
Cc: Simon Jeons <simon.jeons@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc.c:__alloc_contig_migrate_range(): cleanup

remove a test-n-branch in the wrapup code

Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

CMA: make putback_lru_pages() call conditional

As per documentation and other places calling putback_lru_pages(),
putback_lru_pages() is called on error only. Make the CMA code behave
consistently.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/hugetlb.c: convert to pr_foo()

Cc: Michal Hocko <mhocko@suse.cz>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/memcontrol.c: convert printk(KERN_FOO) to pr_foo()

Acked-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memcg, oom: provide more precise dump info while memcg oom happening

Currentlt when a memcg oom is happening the oom dump messages is still
global state and provides few useful info for users.  This patch prints
more pointed memcg page statistics for memcg-oom and take hierarchy into
consideration:

Based on Michal's advice, we take hierarchy into consideration:
supppose we trigger an OOM on A's limit
        root_memcg
            |
            A (use_hierachy=1)
           / \
          B   C
          |
          D
then the printed info will be:
Memory cgroup stats for /A:...
Memory cgroup stats for /A/B:...
Memory cgroup stats for /A/C:...
Memory cgroup stats for /A/B/D:...

Following are samples of oom output:
(1)Before change:
[  609.917309] mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0
[  609.917313] mal-80 cpuset=/ mems_allowed=0
[  609.917315] Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10
[  609.917316] Call Trace:
[  609.917327]  [<ffffffff8167fbfb>] dump_header+0x83/0x1ca
                ..... (call trace)
[  609.917389]  [<ffffffff8168a818>] page_fault+0x28/0x30
<<<<<<<<<<<<<<<<<<<<< memcg specific information
[  609.917391] Task in /A/B/D killed as a result of limit of /A
[  609.917393] memory: usage 101376kB, limit 101376kB, failcnt 57
[  609.917394] memory+swap: usage 101376kB, limit 101376kB, failcnt 0
[  609.917395] kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
<<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat
[  609.917396] Mem-Info:
[  609.917397] Node 0 DMA per-cpu:
[  609.917399] CPU    0: hi:    0, btch:   1 usd:   0
               ......
[  609.917402] CPU    3: hi:    0, btch:   1 usd:   0
[  609.917403] Node 0 DMA32 per-cpu:
[  609.917404] CPU    0: hi:  186, btch:  31 usd: 173
               ......
[  609.917407] CPU    3: hi:  186, btch:  31 usd: 130
<<<<<<<<<<<<<<<<<<<<< print global page state
[  609.917415] active_anon:92963 inactive_anon:40777 isolated_anon:0
[  609.917415]  active_file:33027 inactive_file:51718 isolated_file:0
[  609.917415]  unevictable:0 dirty:3 writeback:0 unstable:0
[  609.917415]  free:729995 slab_reclaimable:6897 slab_unreclaimable:6263
[  609.917415]  mapped:20278 shmem:35971 pagetables:5885 bounce:0
[  609.917415]  free_cma:0
<<<<<<<<<<<<<<<<<<<<< print per zone page state
[  609.917418] Node 0 DMA free:15836kB ... all_unreclaimable? no
[  609.917423] lowmem_reserve[]: 0 3175 3899 3899
[  609.917426] Node 0 DMA32 free:2888564kB ... all_unrelaimable? no
[  609.917430] lowmem_reserve[]: 0 0 724 724
[  609.917436] lowmem_reserve[]: 0 0 0 0
[  609.917438] Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB
[  609.917447] Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB
[  609.917466] 120710 total pagecache pages
[  609.917467] 0 pages in swap cache
<<<<<<<<<<<<<<<<<<<<< print global swap cache stat
[  609.917468] Swap cache stats: add 0, delete 0, find 0/0
[  609.917469] Free swap  = 499708kB
[  609.917470] Total swap = 499708kB
[  609.929057] 1040368 pages RAM
[  609.929059] 58678 pages reserved
[  609.929060] 169065 pages shared
[  609.929061] 173632 pages non-shared
[  609.929062] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  609.929101] [ 2693]     0  2693     6005     1324      17        0             0 god
[  609.929103] [ 2754]     0  2754     6003     1320      16        0             0 god
[  609.929105] [ 2811]     0  2811     5992     1304      18        0             0 god
[  609.929107] [ 2874]     0  2874     6005     1323      18        0             0 god
[  609.929109] [ 2935]     0  2935     8720     7742      21        0             0 mal-30
[  609.929111] [ 2976]     0  2976    21520    17577      42        0             0 mal-80
[  609.929112] Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child
[  609.929114] Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB

We can see that messages dumped by show_free_areas() are longsome and can
provide so limited info for memcg that just happen oom.

(2) After change
[  293.235042] mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[  293.235046] mal-80 cpuset=/ mems_allowed=0
[  293.235048] Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10
[  293.235049] Call Trace:
[  293.235058]  [<ffffffff8167fd0b>] dump_header+0x83/0x1d1
.......(call trace)
[  293.235108]  [<ffffffff8168a918>] page_fault+0x28/0x30
[  293.235110] Task in /A/B/D killed as a result of limit of /A
<<<<<<<<<<<<<<<<<<<<< memcg specific information
[  293.235111] memory: usage 102400kB, limit 102400kB, failcnt 140
[  293.235112] memory+swap: usage 102400kB, limit 102400kB, failcnt 0
[  293.235114] kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
[  293.235114] Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB
[  293.235122] Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[  293.235127] Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[  293.235132] Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB
[  293.235137] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  293.235153] [ 2260]     0  2260     6006     1325      18        0             0 god
[  293.235155] [ 2383]     0  2383     6003     1319      17        0             0 god
[  293.235156] [ 2503]     0  2503     6004     1321      18        0             0 god
[  293.235158] [ 2622]     0  2622     6004     1321      16        0             0 god
[  293.235159] [ 2695]     0  2695     8720     7741      22        0             0 mal-30
[  293.235160] [ 2704]     0  2704    21520    17839      43        0             0 mal-80
[  293.235161] Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child
[  293.235163] Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB

This version provides more pointed info for memcg in "Memory cgroup stats
for XXX" section.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

watchdog: trigger all-cpu backtrace when locked up and going to panic

Send an NMI to all CPUs when a lockup is detected and the lockup watchdog
code is configured to panic. This gives us a fairly uptodate snapshot of
all CPUs in the system.

This lets us get stack trace of all CPUs which makes life easier trying to
debug a deadlock, and the NMI doesn't change anything since the next step
is a kernel panic.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/block_dev.c: no need to check inode->i_bdev in bd_forget()

Its only caller evict() has promised a non-NULL inode->i_bdev.

Signed-off-by: Yan Hong <clouds.yan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs: change return values from -EACCES to -EPERM

According to SUSv3:

[EACCES] Permission denied. An attempt was made to access a file in a way
forbidden by its file access permissions.

[EPERM] Operation not permitted. An attempt was made to perform an operation
limited to processes with appropriate privileges or to the owner of a file
or other resource.

So -EPERM should be returned if capability checks fails.

Strictly speaking this is an API change since the error code user sees is
altered.

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

block: restore /proc/partitions to not display non-partitionable removable devices

We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before.

Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduces this change in behavior. It's not
clear to me from the commit's changelog if this change was intentional or
not. This comment still remains: /* Don't show non-partitionable
removeable devices or empty devices */ so I've decided to send a patch to
restore the behavior of not printing unpartitionable removable devices.

Signed-off-by: Josh Hunt <johunt@akamai.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

loopdev: ignore negative offset when calculate loop device size

Negative offset may cause loop device size larger than backing file
size.

$ fallocate -l 1M a
$ losetup --offset 0xffffffffffff0000 /dev/loop0 a
$ blockdev --getsize64 /dev/loop0
1114112
$ ls -l a
-rw-r--r-- 1 root root 1048576 Jan 23 12:46 a
$ cat /dev/loop0
cat: /dev/loop0: Input/output error

It makes no sense to do that. Only apply offset when it's positive.

Fix a typo in the comment by the way.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: M. Hindess <hindessm@uk.ibm.com>
Cc: Nikanth Karthikesan <knikanth@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

loopdev: remove an user triggerable oops

When loopdev is built as module and we pass an invalid parameter,
loop_init() will return directly without deregister misc device, which
will cause an oops when insert loop module next time because we left some
garbage in the misc device list.

Test case:
sudo modprobe loop max_part=1024
(failed due to invalid parameter)
sudo modprobe loop
(oops)

Clean up nicely to avoid such oops.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: M. Hindess <hindessm@uk.ibm.com>
Cc: Nikanth Karthikesan <knikanth@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

loopdev: move common code into loop_figure_size()

Update block device size in accord with gendisk size and let userspace
know the change in loop_figure_size(). This is a clean up to remove
common code of loop_figure_size()'s two callers.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: M. Hindess <hindessm@uk.ibm.com>
Cc: Nikanth Karthikesan <knikanth@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

loopdev: update block device size in loop_set_status()

Loop device driver sometimes fails to impose the size limit on the
device. Keep issuing following two commands:

losetup --offset 7517244416 --sizelimit 3224971264 /dev/loop0 backed_file
blockdev --getsize64 /dev/loop0

blockdev reports file size instead of sizelimit several out of 100 times.

The problems are:

- losetup set up the device in two ioctl:
LOOP_SET_FD and LOOP_SET_STATUS64.

- LOOP_SET_STATUS64 only update size of gendisk.

Block device size will be updated lazily when device comes to use. If udev
rushes in between the two ioctl, it will bring in a block device whose
size is backing file size. If the device is not released after
LOOP_SET_STATUS64 ioctl, blockdev will not see the updated size.

Update block size in LOOP_SET_STATUS64 ioctl.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Reported-by: M. Hindess <hindessm@uk.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Nikanth Karthikesan <knikanth@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

loopdev: fix a deadlock

bd_mutex and lo_ctl_mutex can be held in different order.

Path #1:

blkdev_open
blkdev_get
  __blkdev_get (hold bd_mutex)
   lo_open (hold lo_ctl_mutex)

Path #2:

blkdev_ioctl
lo_ioctl (hold lo_ctl_mutex)
  lo_set_capacity (hold bd_mutex)

Lockdep does not report it, because path #2 actually holds a subclass of
lo_ctl_mutex.  This subclass seems creep into the code by mistake.  The
patch author actually just mentioned it in the changelog, see commit
f028f3b2 ("loop: fix circular locking in loop_clr_fd()"), also see:

http://marc.info/?l=linux-kernel&m=123806169129727&w=2

Path #2 hold bd_mutex to call bd_set_size(), I've protected it
with i_mutex in a previous patch, so drop bd_mutex at this site.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: M. Hindess <hindessm@uk.ibm.com>
Cc: Nikanth Karthikesan <knikanth@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

block: remove redundant check to bd_openers()

bd_openers is stable under bd_mutex, no need to check it twice.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: M. Hindess <hindessm@uk.ibm.com>
Cc: Nikanth Karthikesan <knikanth@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

block: use i_size_write() in bd_set_size()

blkdev_ioctl(GETBLKSIZE) uses i_size_read() to read size of block device.
If we update block size directly, reader may see intermediate result in
some machines and configurations. Use i_size_write() instead.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Guo Chao <yan@linux.vnet.ibm.com>
Cc: M. Hindess <hindessm@uk.ibm.com>
Cc: Nikanth Karthikesan <knikanth@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cfq: fix lock imbalance with failed allocations

While stress-running very-small container scenarios with the Kernel Memory
Controller, I've run into a lockdep-detected lock imbalance in
cfq-iosched.c.

I'll apologize beforehand for not posting a backlog: I didn't anticipate
it would be so hard to reproduce, so I didn't save my serial output and
went directly on debugging.  Turns out that it did not happen again in
more than 20 runs, making it a quite rare pattern.

But here is my analysis:

When we are in very low-memory situations, we will arrive at
cfq_find_alloc_queue and may not find a queue, having to resort to the oom
queue, in an rcu-locked condition:

  if (!cfqq || cfqq == &cfqd->oom_cfqq)
      [ ... ]

Next, we will release the rcu lock, and try to allocate a queue, retrying
if we succeed:

  rcu_read_unlock();
  spin_unlock_irq(cfqd->queue->queue_lock);
  new_cfqq = kmem_cache_alloc_node(cfq_pool,
                  gfp_mask | __GFP_ZERO,
                  cfqd->queue->node);
   spin_lock_irq(cfqd->queue->queue_lock);
   if (new_cfqq)
       goto retry;

We are unlocked at this point, but it should be fine, since we will
reacquire the rcu_read_lock when we retry.

Except of course, that we may not retry: the allocation may very well fail
and we'll keep on going through the flow:

The next branch is:

    if (cfqq) {
[ ... ]
    } else
        cfqq = &cfqd->oom_cfqq;

And right before exiting, we'll issue rcu_read_unlock().

Being already unlocked, this is the likely source of our imbalance.  Since
cfqq is either already NULL or made NULL in the first statement of the
outter branch, the only viable alternative here seems to be to return the
oom queue right away in case of allocation failure.

Please review the following patch and apply if you agree with my analysis.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/block/swim3.c: fix null pointer dereference

The use of pointer fs should be after the null check.

Signed-off-by: Cong Ding <dinggnu@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

block: don't select PERCPU_RWSEM

The block device doesn't use percpu rw-semaphore anymore, so don't select
it for compilation.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

lockdep: make lockdep_assert_held() not have a return value

I recently made the mistake of writing:

foo = lockdep_dereference_protected(..., lockdep_assert_held(...));

which is clearly bogus. If lockdep is disabled in the config this would
cause a compile failure, if it is enabled then it compiles and causes a
puzzling warning about dereferencing without the correct protection.

Wrap the macro in "do { ... } while (0)" to also fail compile for this
when lockdep is enabled.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

sched-proc-sched_debug-fails-on-very-very-large-machines-v2-fix

fix spello in comment

Cc: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

sched-proc-sched_debug-fails-on-very-very-large-machines-v2

v2: Took suggetion for adding comments

Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

sched-proc-sched_debug-fails-on-very-very-large-machines-fix

whitespace fixlet

Cc: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>)
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

sched: /proc/sched_debug fails on very very large machines

On systems with 4096 cores attemping to read /proc/sched_debug fails. We
are trying to push all the data into a single kmalloc buffer. The issue
is on these very large machines all the data will not fit in 4mb.

A better solution is to not us the single_open mechanism but to provide
our own seq_operations and treat each cpu as an individual record.

The output should be identical to previous version.

Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>)
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

sched-proc-sched_stat-fails-on-very-very-large-machines-v2-fix-fix

fix warnings

Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

sched-proc-sched_stat-fails-on-very-very-large-machines-v2-fix

fix spello in comment

Cc: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

sched-proc-sched_stat-fails-on-very-very-large-machines-v2

v2: Took Andrew's suggestion to add comments, fix memleak

Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

sched-proc-sched_stat-fails-on-very-very-large-machines-fix

fix memleak

Cc: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

sched: /proc/sched_stat fails on very very large machines

On systems with 4096 cores doing a cat /proc/sched_stat fails. We are
trying to push all the data into a single kmalloc buffer. The issue is on
these very large machines all the data will not fit in 4mb.

A better solution is to not use the single_open() mechanism but to provide
our own seq_operations.

The output should be identical to previous version and thus not need the
version number.

Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: use vm_unmapped_area() on parisc architecture

Update the parisc arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

scripts/tags.sh: add ctags magic for declarations of popular kernel type

- Add magic for declarations of variables of popular kernel type like
spinlock_t, list_head, wait_queue_head_t and other.

- Add a set of specially handled declaration extentions like __attribute,
__aligned and other.

- Simplify pci_bus_* magic

Signed-off-by: Kirill V Tkhai <tkhai@yandex.ru>
Cc: Michal Marek <mmarek@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: use vm_unmapped_area() in hugetlbfs on ia64 architecture

Update the ia64 hugetlb_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: use vm_unmapped_area() on ia64 architecture

Update the ia64 arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ia64/mm: fix a bad_page bug when crash kernel booting

On ia64 platform, I set "crashkernel=1024M-:600M", and dmesg shows
128M-728M memory is reserved for crash kernel.  Then "echo c >
/proc/sysrq-trigger" to test kdump.

When crash kernel booting, efi_init() will aligns the memory address in
IA64_GRANULE_SIZE(16M), so 720M-728M memory will be dropped, It means
crash kernel only manage 128M-720M memory.

But initrd start and end are fixed in boot loader, it is before
efi_init(), so initrd size maybe overflow when free_initrd_mem().

Here is the dmesg when crash kernel booting:

...
Ignoring memory below 128MB
Ignoring memory above 720MB
    // aligns the address in IA64_GRANULE_SIZE(16M)
...
Initial ramdisk at: 0xe00000002c3f0000 (20176579 bytes)
    // initrd uses 707M-726M memory
...
Kernel command line:  root=/dev/disk/by-id/ata-STEC_MACH16_M16ISD2-100UCT_STM000142A2D-part3
console=ttyS0,115200n8 console=tty0 initcall_debug elevator=deadline sysrq=1 reset_devices
irqpoll maxcpus=1 initcall_debug linuxrc=trace elfcorehdr=745216K max_addr=728M min_addr=128M
    // show crash kernel parameters
...
Unpacking initramfs...
    // called by populate_rootfs()
Freeing initrd memory: 19648kB freed
    // called by free_initrd()->free_initrd_mem()
BUG: Bad page state in process swapper  pfn:02d00
    // it is a mistake to free over 720M memory to OS (ia64's page size is 64KB)
page:e0000000102dd800 flags:(null) count:0 mapcount:1 mapping:(null) index:0

Call Trace:
[<a000000100018dc0>] show_stack+0x80/0xa0
                                sp=e000000021e8fbd0 bsp=e000000021e81360
[<a00000010090fcc0>] dump_stack+0x30/0x50
                                sp=e000000021e8fda0 bsp=e000000021e81348
[<a0000001001a3180>] bad_page+0x280/0x380
                                sp=e000000021e8fda0 bsp=e000000021e81308
[<a0000001001a8740>] free_hot_cold_page+0x3a0/0x5c0
                                sp=e000000021e8fda0 bsp=e000000021e812a0
[<a0000001001a8a50>] free_hot_page+0x30/0x60
                                sp=e000000021e8fda0 bsp=e000000021e81280
[<a0000001001a8b30>] __free_pages+0xb0/0xe0
                                sp=e000000021e8fda0 bsp=e000000021e81258
[<a0000001001a8c00>] free_pages+0xa0/0xc0
                                sp=e000000021e8fda0 bsp=e000000021e81230
[<a000000100bb40c0>] free_initrd_mem+0x230/0x290
                                sp=e000000021e8fda0 bsp=e000000021e811d8
[<a000000100ba6620>] populate_rootfs+0x1c0/0x280
                                sp=e000000021e8fdb0 bsp=e000000021e811a0
[<a00000010000ac30>] do_one_initcall+0x3b0/0x3e0
                                sp=e000000021e8fdb0 bsp=e000000021e81158
[<a000000100ba0a90>] kernel_init+0x3f0/0x4b0
                                sp=e000000021e8fdb0 bsp=e000000021e81108
[<a000000100016890>] kernel_thread_helper+0xd0/0x100
                                sp=e000000021e8fe30 bsp=e000000021e810e0
[<a00000010000a4c0>] start_kernel_thread+0x20/0x40
                                sp=e000000021e8fe30 bsp=e000000021e810e0
Disabling lock debugging due to kernel taint
BUG: Bad page state in process swapper  pfn:02d01
page:e0000000102dd838 flags:(null) count:0 mapcount:1 mapping:(null) index:0

Call Trace:
[<a000000100018dc0>] show_stack+0x80/0xa0
                                sp=e000000021e8fbd0 bsp=e000000021e81360
[<a00000010090fcc0>] dump_stack+0x30/0x50
                                sp=e000000021e8fda0 bsp=e000000021e81348
[<a0000001001a3180>] bad_page+0x280/0x380
                                sp=e000000021e8fda0 bsp=e000000021e81308
[<a0000001001a8740>] free_hot_cold_page+0x3a0/0x5c0
                                sp=e000000021e8fda0 bsp=e000000021e812a0
[<a0000001001a8a50>] free_hot_page+0x30/0x60
                                sp=e000000021e8fda0 bsp=e000000021e81280
[<a0000001001a8b30>] __free_pages+0xb0/0xe0
                                sp=e000000021e8fda0 bsp=e000000021e81258
[<a0000001001a8c00>] free_pages+0xa0/0xc0
                                sp=e000000021e8fda0 bsp=e000000021e81230
[<a000000100bb40c0>] free_initrd_mem+0x230/0x290
                                sp=e000000021e8fda0 bsp=e000000021e811d8
[<a000000100ba6620>] populate_rootfs+0x1c0/0x280
                                sp=e000000021e8fdb0 bsp=e000000021e811a0
[<a00000010000ac30>] do_one_initcall+0x3b0/0x3e0
                                sp=e000000021e8fdb0 bsp=e000000021e81158
[<a000000100ba0a90>] kernel_init+0x3f0/0x4b0
                                sp=e000000021e8fdb0 bsp=e000000021e81108
[<a000000100016890>] kernel_thread_helper+0xd0/0x100
                                sp=e000000021e8fe30 bsp=e000000021e810e0
[<a00000010000a4c0>] start_kernel_thread+0x20/0x40
                                sp=e000000021e8fe30 bsp=e000000021e810e0
...

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

timer_list-convert-timer-list-to-be-a-proper-seq_file-fix-fix

kernel/time/timer_list.c: In function 'timer_list_show':
kernel/time/timer_list.c:262: warning: cast to pointer from integer of different size
kernel/time/timer_list.c:267: warning: cast to pointer from integer of different size
kernel/time/timer_list.c: In function 'timer_list_start':
kernel/time/timer_list.c:297: warning: cast to pointer from integer of different size

This code is revolting :(

Cc: Dave Jones <davej@redhat.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

timer_list-convert-timer-list-to-be-a-proper-seq_file-v2-fix

fix up comment

Cc: Dave Jones <davej@redhat.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Nathan Zimmer <nzimmer@sgi.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

timer_list-convert-timer-list-to-be-a-proper-seq_file-v2

v2: Added comments on the iteration.

Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

timer_list-convert-timer-list-to-be-a-proper-seq_file-fix

whitespace fixlet

Cc: Nathan Zimmer <nzimmer@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

timer_list: convert timer list to be a proper seq_file

When running with 4096 cores attemping to read /proc/timer_list will fail
with an ENOMEM condition.  On a sufficantly large systems the total amount
of data is more then 4mb, so it won't fit into a single buffer.  The
failure can also occur on smaller systems when memory fragmentation is
high as reported by Dave Jones.

Convert /proc/timer_list to a proper seq_file with its own iterator.  This
is a little more complex given that we have to make two passes with two
separate headers.

Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

timer_list: split timer_list_show_tickdevices()

Split timer_list_show_tickdevices() out the header and just pull the rest up
to timer_list_show. Also tweak the location of the whitespace. This is all
to prep for the fix.

Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Reported-by: Dave Jones <davej@redhat.com>
Cc: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

time: don't inline EXPORT_SYMBOL functions

How is the compiler even handling exported functions that are marked
inline? Anyway, these shouldn't be inline because of that, so remove that
marking.

Based on a larger patch by Mark Charlebois to get LLVM to build the
kernel.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Mark Charlebois <mcharleb@qualcomm.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: hank <pyu@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cyber2000fb: avoid palette corruption at higher clocks

When 1280x1024@75Hz mode is set, console palette is not set properly -
sometimes the background is white, sometimes yellow and text colors are
also messed up.  This does not happen at 1280x1024@60Hz and below.

It seems that the HW needs some time before setting the palette - maybe
the PLL needs more time to lock at higher speeds.  This patch fixes the
problem but without knowing what register to check for PLL lock(?), the
delay might be excessive.

On Fri, 28 Jan 2011 18:15:37 +0000
Russell King <rmk@arm.linux.org.uk> wrote:

> On Tue, Jan 18, 2011 at 01:14:24PM -0800, Andrew Morton wrote:
> > Russell, I have an (old) note here that this is awaiting an ack from
> > yourself?
>
> Well, I can reproduce this problem on the Netwinders here.  I'm not sure
> that we should delay all mode switches by one second - and any attempt
> to reduce this value does result in the palette not being set correctly.
>
> For 1280x1024-75, the dotclock is 135MHz, which gives a PLL values of
> 0x41 and 0x06.  That's: M=0x41+1, N=0x06+1, P=0x00 (top 2 bits of 0x06)
> -> Q=1
>
> Fpll = 14.31818MHz * M / N
> Fout = Fpll / Q
>
> The PLL itself is formed by dividing the 14-ish MHz frequency by N and
> phase comparing the output of the VCO, divided by M, and adjusting the
> VCO until the two correlate.  As VCOs typically tend to have a limited
> range, it's normal to divide the output frequency to produce a greater
> range - and in this case that's done by Q.
>
> For the 800x600-100 copied from /etc/fb.modes, this has a dotclock of
> 67.5MHz, which is exactly half this rate.  The PLL values for this are:
> M=0x41+1, N=0x06+1, P=0x01, giving PLL values of 0x41 and 0x46.
>
> Booting with 800x600-100 does not suffer the problem.  So it's not
> related to PLL lock time.  There's something else going on.
>
> Another experiment I tried was forcing the PLL values to produce 108MHz
> instead of 135MHz.  108MHz is the dotclock for 1280x1024-60.  This too
> doesn't suffer the problem.
>
> I've also tried chosing other delay values.  100ms is too short and
> produces the problem, but 1s works.  1s for a PLL to lock is a hell of
> a time, especially for a PLL operating in the MHz range.
>
> I've tried setting the PLL to a known good freqency, and then switching
> to 135MHz - the problem persists.  It's not like 135MHz is reaching the
> limits - it'll go up to 206MHz.
>
> So, I don't think this has anything to do with PLL locking.  I think
> there's something else going on which isn't immediately obvious - maybe
> bandwidth starvation preventing us from writing properly to the palette?
> As it's a horrible VGA, where you write the same register multiple times
> I wouldn't be surprised if some writes were going missing.
>
> I'll see if I can play around with it some more this evening, but I've
> spent an awful long time on just this issue already this afternoon...
>
> I think further investigation needs to happen on this patch before it's
> acceptable.  Or maybe we should prevent the cyberpro coming up in

Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/video/Kconfig: specify the SoCs that make use of FB_IMX

FB_IMX is the framebuffer driver used by MX1, MX21, MX25 and MX27 processors.

Pass this information to the Kconfig text to make it clear.

Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Acked-by: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ARM: mmp: add display and fb support in pxa910 defconfig

Add display and fb support in pxa910 defconfig.
Add tpohvga panel, spi support.
Add logo support.

Signed-off-by: Zhou Zhu <zzhu3@marvell.com>
Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com>
Cc: Lisa Du <cldu@marvell.com>
Cc: Guoqing Li <ligq@marvell.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ARM: mmp: enable display in ttc_dkb

Enable display in ttc_dkb.

Signed-off-by: Zhou Zhu <zzhu3@marvell.com>
Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com>
Cc: Lisa Du <cldu@marvell.com>
Cc: Guoqing Li <ligq@marvell.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ARM: mmp: added device for display controller

Add device for display controller and fb support

Signed-off-by: Zhou Zhu <zzhu3@marvell.com>
Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com>
Cc: Lisa Du <cldu@marvell.com>
Cc: Guoqing Li <ligq@marvell.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

video: mmpdisp: add spi port in display controller

Add spi port support in mmp display controller. This port is from display
controller and for panel usage. This driver implemented and registered as
a spi master.

Signed-off-by: Zhou Zhu <zzhu3@marvell.com>
Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com>
Cc: Lisa Du <cldu@marvell.com>
Cc: Guoqing Li <ligq@marvell.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

video: mmp: add tpo hvga panel supported

Add tpo hvga panel support in marvell display framework. This panel
driver implements modes query and power on/off.

This panel driver gets panel config/ plat power on/off/ connected path
name from machine-info and registered as a spi device. This panel driver
uses mmp_disp supplied register_panel function to register panel to path
as machine-info defined.

Signed-off-by: Lisa Du <cldu@marvell.com>
Signed-off-by: Zhou Zhu <zzhu3@marvell.com>
Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com>
Cc: Guoqing Li <ligq@marvell.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

video: mmp display controller support

Marvell mmp series display controller support in mmpdisp subsystem. This
driver focus on implementation of hardware operations of path/overlay,
which is defined in mmp display subsystem interface. This driver
registers all pathes to mmp display framework.

Signed-off-by: Guoqing Li <ligq@marvell.com>
Signed-off-by: Lisa Du <cldu@marvell.com>
Signed-off-by: Zhou Zhu <zzhu3@marvell.com>
Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

video: mmp fb support

Add fb support for Marvell mmp display subsystem. This driver is
configured using "buffer driver mach info". With configured name of path,
this driver get path using using exported interface of mmp display driver.
Then this driver get overlay using configured id and operates on this
overlay to show buffers on display devices.

Signed-off-by: Zhou Zhu <zzhu3@marvell.com>
Signed-off-by: Lisa Du <cldu@marvell.com>
Cc: Guoqing Li <ligq@marvell.com>
Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

video: mmp display subsystem

Add mmp display subsystem to support Marvell MMP display controllers.

This subsystem contains 4 parts:
--fb folder
--core.c
--hw folder
--panel folder

1. fb folder contains implementation of fb.  fb get path and overlay
   from common interface and operates on these structures.

2. core.c provides common interface for a hardware abstraction.  Major
   parts of this interface are:

   a) Path: path is a output device connected to a panel or HDMI TV.  Main
      operations of the path is set/get timing/output color.  fb operates
      output device through path structure.

   b) Ovly: Ovly is a buffer shown on the path.

      Ovly describes frame buffer and its source/destination size, offset,
      input color, buffer address, z-order, and so on.  Each fb device maps
      to one overlay.

3. hw folder contains implementation of hardware operations defined by
   core.c.  It registers paths for fb use.

4. panel folder contains implementation of panels.  It's connected to
   path.  Panel drivers would also regiester panels and linked to path
   when probe.

Signed-off-by: Zhou Zhu <zzhu3@marvell.com>
Signed-off-by: Lisa Du <cldu@marvell.com>
Cc: Guoqing Li <ligq@marvell.com>
Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

goldfish: framebuffer driver

Framebuffer support for the Goldfish emulator. This takes the Google
emulator and applies the x86 cleanups as well as moving the blank methods
to the usual Linux place and dropping the Android early suspend logic (for
now at least, that can be looked at as Android and upstream converge).
Dropped various oddities like setting MTRRs on a virtual frame buffer
emulation...

With the drivers so far you can now boot a Linux initrd and have fun.

[sheng@linux.intel.com: cleaned up to handle x86]
[thomas.keel@intel.com: ported to 3.4]
[alan@linux.intel.com: cleaned up for style and 3.7, moved blank methods]
Signed-off-by: Mike A. Chan <mikechan@google.com>
Signed-off-by: Arve Hjønnevåg <arve@android.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Xiaohui Xin <xiaohui.xin@intel.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Bruce Beare <bruce.j.beare@intel.com>
Signed-off-by: Tom Keel <thomas.keel@intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fbcon: clear the logo bitmap from the margin area

Explicitly clear_margins when clearing the logo, in case the font dimensions
are non-integral to the framebuffer dimensions.

Signed-off-by: Kamal Mostafa <kamal@whence.com>
Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: use vm_unmapped_area() on powerpc architecture

Update the powerpc slice_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse <walken@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: remove free_area_cache use in powerpc architecture

As all other architectures have been converted to use vm_unmapped_area(),
we are about to retire the free_area_cache.

This change simply removes the use of that cache in
slice_get_unmapped_area(), which will most certainly have a
performance cost. Next one will convert that function to use the
vm_unmapped_area() infrastructure and regain the performance.

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

pcmcia: move unbind/rebind into dev_pm_ops.complete

Move the device rebind procedures for cardbus devices from the pm.resume
into the pm.complete callback.

The reason for moving the code is: "[...] The PM code needs to send
suspend and resume messages to every device in the right order, and it
can't do that if new devices are being added at the same time.  [...]"

However the situation really isn't quite that rigid.  In particular,
adding new children during a resume callback shouldn't cause much of
problem because the children don't need to be resumed anyway (since they
were never suspended).  On the other hand, if you do it you will get a
dev_warn() from the PM core, something like 'parent should not be
sleeping'.

Still, it is considered bad form and should be avoided if possible."

(Alan Stern's full comment about the topic can
be found here: <https://lkml.org/lkml/2012/7/10/254>)

Signed-off-by: Christian Lamparter <chunkeey@googlemail.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Greg KH <greg@kroah.com>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cris: Use "int" for ssize_t to match size_t

On cris-linux-gcc, __SIZE_TYPE__ expands to "unsigned int", as
gcc-4.6.3-nolibc/cris-linux/lib/gcc/cris-linux/4.6.3/plugin/include/config/cris/linux.h
has

    #define SIZE_TYPE "unsigned int"

Hence __kernel_size_t is also "unsigned int".  But __kernel_ssize_t is
"long", which has a different base type, causing compiler warnings like:

    fs/quota/quota_tree.c:372:4: warning: format '%zd' expects argument of type 'signed size_t', but argument 4 has type 'ssize_t' [-Wformat]

To fix this, __kernel_ssize_t should be changed to "int". Hence cris can
just use the generic 32-bit versions from include/asm-generic/posix_types.h
for all size-related types.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Mikael Starvik <starvik@axis.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Hans-Peter Nilsson <hans-peter.nilsson@axis.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/md/persistent-data/dm-transaction-manager.c: rename HASH_SIZE

drivers/md/persistent-data/dm-transaction-manager.c:28:1: warning: "HASH_SIZE" redefined
In file included from include/linux/elevator.h:5,
                 from include/linux/blkdev.h:216,
                 from drivers/md/persistent-data/dm-block-manager.h:11,
                 from drivers/md/persistent-data/dm-transaction-manager.h:10,
                 from drivers/md/persistent-data/dm-transaction-manager.c:6:
include/linux/hashtable.h:22:1: warning: this is the location of the previous definition
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

x86: make 'mem=' option to work for efi platform

Current mem boot option only can work for non efi environment.  If the
user specifies add_efi_memmap, it cannot work for efi environment.  In the
efi environment, we call e820_add_region() to add the memory map.  So we
can modify __e820_add_region() and the mem boot option can work for efi
environment.

Note: Only E820_RAM is limited, and BOOT_SERVICES_{CODE,DATA} are always
mapped(If its address >= mem_limit, the memory won't be freed in
efi_free_boot_services()).

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Matt Fleming <matt.fleming@intel.com>
Cc: Rob Landley <rob@landley.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yasuaki ISIMATU <isimatu.yasuaki@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

pageattr: prevent PSE and GLOABL leftovers to confuse pmd/pte_present and pmd_huge

Without this patch any kernel code that reads kernel memory in non
present kernel pte/pmds (as set by pageattr.c) will crash.

With this kernel code:

static struct page *crash_page;
static unsigned long *crash_address;
[..]
crash_page = alloc_pages(GFP_KERNEL, 9);
crash_address = page_address(crash_page);
if (set_memory_np((unsigned long)crash_address, 1))
printk("set_memory_np failure\n");
[..]

The kernel will crash if inside the "crash tool" one would try to read
the memory at the not present address.

crash> p crash_address
crash_address = $8 = (long unsigned int *) 0xffff88023c000000
crash> rd 0xffff88023c000000
[ *lockup* ]

The lockup happens because _PAGE_GLOBAL and _PAGE_PROTNONE shares the
same bit, and pageattr leaves _PAGE_GLOBAL set on a kernel pte which
is then mistaken as _PAGE_PROTNONE (so pte_present returns true by
mistake and the kernel fault then gets confused and loops).

With THP the same can happen after we taught pmd_present to check
_PAGE_PROTNONE and _PAGE_PSE in commit 027ef6c87853b0a9df5317 ("mm: thp:
fix pmd_present for split_huge_page and PROT_NONE with THP"). THP has the
same problem with _PAGE_GLOBAL as the 4k pages, but it also has a problem
with _PAGE_PSE, which must be cleared too.

After the patch is applied copy_user correctly returns -EFAULT and
doesn't lockup anymore.

crash> p crash_address
crash_address = $9 = (long unsigned int *) 0xffff88023c000000
crash> rd 0xffff88023c000000
rd: read error: kernel virtual address: ffff88023c000000 type: "64-bit KVADDR"

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Revert "x86, mm: Make spurious_fault check explicitly check the PRESENT bit"

I got a report for a minor regression introduced by commit 027ef6c87853b
("mm: thp: fix pmd_present for split_huge_page and PROT_NONE with THP").

So the problem is, pageattr creates kernel pagetables (pte and pmds) that
breaks pte_present/pmd_present and the patch above exposed this invariant
breakage for pmd_present.

The same problem already existed for the pte and pte_present and it was
fixed by commit 660a293ea9be709 ("x86, mm: Make spurious_fault check
explicitly check the PRESENT bit") (if it wasn't for that commit, it
wouldn't even be a regression).  That fix avoids the pagefault to use
pte_present.  I could follow through by stopping using
pmd_present/pmd_huge too.

However I think it's more robust to fix pageattr and to clear the
PSE/GLOBAL bitflags too in addition to the present bitflag.  So the kernel
page fault can keep using the regular pte_present/pmd_present/pmd_huge.

The confusion arises because _PAGE_GLOBAL and _PAGE_PROTNONE are sharing
the same bit, and in the pmd case we pretend _PAGE_PSE to be set only in
present pmds (to facilitate split_huge_page final tlb flush).

This patch:

Revert commit 660a293ea9be709 ("x86, mm: Make spurious_fault check
explicitly check the PRESENT bit").

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

x86 numa: don't check if node is NUMA_NO_NODE

If we aren't debugging per_cpu maps, the cpu's node is stored in per_cpu
variable numa_node. If `node' is NUMA_NO_NODE, it means the caller wants
to clear the cpu's node. So we should also call set_cpu_numa_node() in
this case.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/block_dev.c: page cache wrongly left invalidated after revalidate_disk()

We found that bdev->bd_invalidated was left set once revalidate_disk() is
called, which results in page cache flush every time that device is open.

Specifically, we found this problem in MD block device. Once we resize a
MD device, mdadm --monitor periodically flush all page cache for that
device every 60 or 1000 seconds when it opens the device.

This bug lies since at least 3.2.0 till the latest kernel(3.6.2).
Patch is attached.

The following steps will reproduce the problem.

1. prepair a block device(ex. /dev/sdb).
2. create two partitions.

sudo parted /dev/sdb
mklabel gpt
mkpart primary 0% 50%
mkpart primary 50% 100%

3. create a md device.

sudo mdadm -C /dev/md/hoge -l 1 -n 2 -e 1.2 --assume-clean --auto=md \
--symlink=no /dev/sdb1 /dev/sdb2

4. create file system and mount it

sudo mkfs.ext3 /dev/md/hoge
sudo mkdir /mnt/test
sudo mount /dev/md/hoge /mnt/test

5. try to resize the device

sudo mdadm -G /dev/md/hoge --size=max

6. create a file to fill file cache.

sudo dd if=/dev/urandom of=/mnt/test/data bs=1M count=10
and verity the current status of file by free command.

7. mdadm monitor will open the md device every 1000 seconds
and you will find all file cache on the device are cleared.

The timing can be reduced by the following steps.

a) kill mdadm and restart it with --delay option
/sbin/mdadm --monitor --delay=30 --pid-file /var/run/mdadm/monitor.pid \
--daemonise --scan --syslog

or open the md device directly.

sudo dd if=/dev/md/hoge of=/dev/null bs=4096 count=1

Signed-off-by: MITSUNARI Shigeo <herumi@nifty.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

inotify: remove broken mask checks causing unmount to be EINVAL

Running the command:

inotifywait -e unmount /mnt/disk

immediately aborts with a -EINVAL return code.  This is however a valid
parameter.  This abort occurs only if unmount is the sole event parameter.
If other event parameters are supplied, then the unmount event wait will
work.

The problem was introduced by commit 44b350fc23e ("inotify: Fix mask
checks").  In that commit, it states:

The mask checks in inotify_update_existing_watch() and
inotify_new_watch() are useless because inotify_arg_to_mask()
sets FS_IN_IGNORED and FS_EVENT_ON_CHILD bits anyway.

But instead of removing the useless checks, it did this:

        mask = inotify_arg_to_mask(arg);
-       if (unlikely(!mask))
+       if (unlikely(!(mask & IN_ALL_EVENTS)))
                return -EINVAL;

The problem is that IN_ALL_EVENTS doesn't include IN_UNMOUNT, and other
parts of the code keep IN_UNMOUNT separate from IN_ALL_EVENTS.  So the
check should be:

if (unlikely(!(mask & (IN_ALL_EVENTS | IN_UNMOUNT))))

But inotify_arg_to_mask(arg) always sets the IN_UNMOUNT bit in the mask
anyway, so the check is always going to pass and thus should simply be
removed.  Also note that inotify_arg_to_mask completely controls what mask
bits get set from arg, there's no way for invalid bits to get enabled
there.

Lets fix it by simply removing the useless broken checks.

Signed-off-by: Jim Somerville <Jim.Somerville@windriver.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Eric Paris <eparis@parisplace.org>
Cc: <stable@vger.kernel.org> [2.6.37+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

compat: return -EFAULT on error in waitid()

The copy_to_user() call returns the number of bytes remaining but we want
to return -EFAULT on error.

Fixes "x32: fix waitid()"

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

proc: avoid extra pde_put() in proc_fill_super()

If proc_get_inode() succeeded, but d_make_root() failed, pde_put() for
proc_root will be called twice: the first time due to iput() called from
d_make_root() and the second time directly in the end of
proc_fill_super().

Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

bugh-compilerh-introduce-compiletime_assert-build_bug_on_msg-checkpatch-fixes

WARNING: please, no space before tabs
#56: FILE: include/linux/bug.h:45:
+ * ^I^I error message.$

total: 0 errors, 1 warnings, 88 lines checked

./patches/bugh-compilerh-introduce-compiletime_assert-build_bug_on_msg.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Daniel Santos <daniel.santos@pobox.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

bug.h, compiler.h: Introduce compiletime_assert & BUILD_BUG_ON_MSG

Introduce compiletime_assert to compiler.h, which moves the details of how
to break a build and emit an error message for a specific compiler to the
headers where these details should be. Following in the tradition of the
POSIX assert macro, compiletime_assert creates a build-time error when the
supplied condition is *false*.

Next, we add BUILD_BUG_ON_MSG to bug.h which simply wraps
compiletime_assert, inverting the logic, so that it fails when the
condition is *true*, consistent with the language "build bug on." This
macro allows you to specify the error message you want emitted when the
supplied condition is true.

Finally, we remove all other code from bug.h that mucks with these details
(BUILD_BUG & BUILD_BUG_ON), and have them all call BUILD_BUG_ON_MSG. This
not only reduces source code bloat, but also prevents the possibility of
code being changed for one macro and not for the other (which was
previously the case for BUILD_BUG and BUILD_BUG_ON).

Since __compiletime_error_fallback is now only used in compiler.h, I'm
considering it a private macro and removing the double negation that's now
extraneous.

Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

compiler.h, bug.h: Prevent double error messages with BUILD_BUG{,_ON}

Prior to the introduction of __attribute__((error("msg"))) in gcc 4.3,
creating compile-time errors required a little trickery.  BUILD_BUG{,_ON}
uses this attribute when available to generate compile-time errors, but
also uses the negative-sized array trick for older compilers, resulting in
two error messages in some cases.  The reason it's "some" cases is that as
of gcc 4.4, the negative-sized array will not create an error in some
situations, like inline functions.

This patch replaces the negative-sized array code with the new
__compiletime_error_fallback() macro which expands to the same thing
unless the the error attribute is available, in which case it expands to
do{}while(0), resulting in exactly one compile-time error on all versions
of gcc.

Note that we are not changing the negative-sized array code for the
unoptimized version of BUILD_BUG_ON, since it has the potential to catch
problems that would be disabled in later versions of gcc were
__compiletime_error_fallback used.  The reason is that that an unoptimized
build can't always remove calls to an error-attributed function call (like
we are using) that should effectively become dead code if it were
optimized.  However, using a negative-sized array with a similar value
will not result in an false-positive (error).  The only caveat being that
it will also fail to catch valid conditions, which we should be expecting
in an unoptimized build anyway.

Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

bug.h: Make BUILD_BUG_ON generate compile-time error

Negative sized arrays wont create a compile-time error in some cases
starting with gcc 4.4 (e.g., inlined functions), but gcc 4.3 introduced
the error function attribute that will. This patch modifies BUILD_BUG_ON
to behave like BUILD_BUG already does, using the error function attribute
so that you don't have to build the entire kernel to discover that you
have a problem, and then enjoy trying to track it down from a link-time
error.

Also, we are only including asm/bug.h and then expecting that
linux/compiler.h will eventually be included to define __linktime_error
(used in BUILD_BUG_ON). This patch includes it directly for clarity and
to avoid the possibility of changes in <arch>/*/include/asm/bug.h being
changed or not including linux/compiler.h for some reason.

Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

bugh-prevent-double-evaulation-of-in-build_bug_on-fix

tweak code layout

Cc: Daniel Santos <daniel.santos@pobox.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

bug.h: Prevent double evaulation of in BUILD_BUG_ON

When calling BUILD_BUG_ON in an optimized build using gcc 4.3 and later,
the condition will be evaulated twice, possibily with side-effects. This
patch eliminates that error.

Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

bug.h: Fix BUILD_BUG_ON macro in __CHECKER__

When __CHECKER__ is defined, we disable all of the BUILD_BUG.* macros.
However, both BUILD_BUG_ON_NOT_POWER_OF_2 and BUILD_BUG_ON was evaluating
to nothing in this case, and we want (0) since this is a function-like
macro that will be followed by a semicolon.

Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

compiler{,-gcc4}.h, bug.h: Remove duplicate macros

__linktime_error() does the same thing as __compiletime_error() and is
only used in bug.h. Since the macro defines a function attribute that
will cause a failure at compile-time (not link-time), it makes more sense
to keep __compiletime_error(), which is also neatly mated with
__compiletime_warning().

Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

compiler-gcc{3,4}.h: Use GCC_VERSION macro

Using GCC_VERSION reduces complexity, is easier to read and is GCC's
recommended mechanism for doing version checks. (Just don't ask me why
they didn't define it in the first place.) This also makes it easy to
merge compiler-gcc{,3,4}.h should somebody want to.

Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

compiler-gcc.h: Add gcc-recommended GCC_VERSION macro

Throughout compiler*.h, many version checks are made. These can be
simplified by using the macro that gcc's documentation recommends.
However, my primary reason for adding this is that I need bug-check macros
that are enabled at certain gcc versions and it's cleaner to use this
macro than the tradition method:

if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ => 2)

If you add patch level, it gets this ugly:

if __GNUC__ > 4 || (__GNUC__ == 4 && (__GNUC_MINOR__ > 2 || \
__GNUC_MINOR__ == 2 __GNUC_PATCHLEVEL__ >= 1))

As opposed to:

if GCC_VERSION >= 40201

While having separate headers for gcc 3 & 4 eliminates some of this
verbosity, they can still be cleaned up by this.

See also:
http://gcc.gnu.org/onlinedocs/cpp/Common-Predefined-Macros.html

Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

compiler-gcc4.h: Reorder macros based upon gcc ver

This helps to keep the file from getting confusing, removes one duplicate
version check and should encourage future editors to put new macros where
they belong.

Signed-off-by: Daniel Santos <daniel.santos@pobox.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Merge branch 'akpm-current/current'

Conflicts:
drivers/tty/vt/vt.c

Merge remote-tracking branch 'modem_shm/remoteproc-next'