git.karo-electronics.de Git - karo-tx-linux.git/log

mmap: arch_get_unmapped_area(): use proper mmap base for bottom up direction

This is more or less the generic variant of 41aacc1eea ("x86
get_unmapped_area: Access mmap_legacy_base through mm_struct member").

So effectively architectures which use an own arch_pick_mmap_layout()
implementation but call the generic arch_get_unmapped_area() now can also
randomize their mmap_base.

All architectures which have an own arch_pick_mmap_layout() and call the
generic arch_get_unmapped_area() (arm64, s390, tile) currently set
mmap_base to TASK_UNMAPPED_BASE. This is also true for the generic
arch_pick_mmap_layout() function. So this change is a no-op currently.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Radu Caragea <sinaelgl@gmail.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/zswap: avoid unnecessary page scanning

Add SetPageReclaim() before __swap_writepage() so that page can be moved
to the tail of the inactive list, which can avoid unnecessary page
scanning as this page was reclaimed by swap subsystem before.

Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

writeback-do-not-sync-data-dirtied-after-sync-start-fix-3

Fixup sync_inodes_sb() comment

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

writeback-do-not-sync-data-dirtied-after-sync-start-fix-2.txt

On Wed 09-10-13 14:21:25, Andrew Morton wrote:
> On Wed, 9 Oct 2013 17:03:25 +0200 Jan Kara <jack@suse.cz> wrote:
>
> > From: Jan Kara <jack@suse.cz>
> > Date: Wed, 9 Oct 2013 15:41:50 +0200
> > Subject: [PATCH] writeback: Use older_than_this_is_set instead of magic
> >  older_than_this == 0
> >
> > Currently we use 0 as a special value of work->older_than_this to
> > indicate that wb_writeback() should set work->older_that_this to current
> > time. This works but it is a bit magic. So use a special flag in
> > work_struct for that.
>
> OK.
>
> > - if (!work->older_than_this)
> > + if (!work->older_than_this_is_set)
> > work->older_than_this = jiffies;
>
> It would be logical although presumably unneeded to set
> older_than_this_is_set here?
  Yes. Updated.

> > Also fixup writeback from workqueue rescuer to include all inodes.
>
> There's nothing in the patch which matches this sentence?
  The sentence is about the hunk below. writeback_inodes_wb() is special in
that it directly calls queue_io() (everything else goes through
wb_writeback()) and my previous patch thus resulted in using 0 as an
older_than_this value => likely we wouldn't queue any inodes for writeback.

I've added WARN_ON_ONCE into move_expired_inodes() to increase a chance of
catching such mistakes in future (although in this particular case it
wouldn't really help because writeback_inodes_wb() gets hardly ever
called).

Currently we use 0 as a special value of work->older_than_this to
indicate that wb_writeback() should set work->older_that_this to current
time. This works but it is a bit magic. So use a special flag in
work_struct for that.

Also fixup writeback from workqueue rescuer (writeback_inodes_wb()) to
include all inodes. Currently it would use 0 as an older_than_this value
thus queue_io() would likely not queue any inodes for writeback.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

writeback: use older_than_this_is_set instead of magic older_than_this == 0

Currently we use 0 as a special value of work->older_than_this to
indicate that wb_writeback() should set work->older_that_this to current
time. This works but it is a bit magic. So use a special flag in
work_struct for that.

Also fixup writeback from workqueue rescuer to include all inodes.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

writeback: do not sync data dirtied after sync start

When there are processes heavily creating small files while sync(2) is
running, it can easily happen that quite some new files are created
between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2).  That can happen
especially if there are several busy filesystems (remember that sync
traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
fs before starting it on another fs).  Because WB_SYNC_ALL pass is slow
(e.g.  causes a transaction commit and cache flush for each inode in
ext3), resulting sync(2) times are rather large.

The following script reproduces the problem:

function run_writers
{
  for (( i = 0; i < 10; i++ )); do
    mkdir $1/dir$i
    for (( j = 0; j < 40000; j++ )); do
      dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
    done &
  done
}

for dir in "$@"; do
  run_writers $dir
done

sleep 40
time sync
======

Fix the problem by disregarding inodes dirtied after sync(2) was called in
the WB_SYNC_ALL pass.  To allow for this, sync_inodes_sb() now takes a
time stamp when sync has started which is used for setting up work for
flusher threads.

To give some numbers, when above script is run on two ext4 filesystems on
simple SATA drive, the average sync time from 10 runs is 267.549 seconds
with standard deviation 104.799426.  With the patched kernel, the average
sync time from 10 runs is 2.995 seconds with standard deviation 0.096.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

tools/vm/page-types.c: support KPF_SOFTDIRTY bit

Soft dirty bit allows us to track which pages are written since the last
clear_ref (by "echo 4 > /proc/pid/clear_refs".) This is useful for
userspace applications to know their memory footprints.

Note that the kernel exposes this flag via bit[55] of /proc/pid/pagemap,
and the semantics is not a default one (scheduled to be the default in the
near future.) However, it shifts to the new semantics at the first
clear_ref, and the users of soft dirty bit always do it before utilizing
the bit, so that's not a big deal. Users must avoid relying on the bit in
page-types before the first clear_ref.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

smaps-show-vm_softdirty-flag-in-vmflags-line-fix

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

/proc/pid/smaps: show VM_SOFTDIRTY flag in VmFlags line

This flag shows that the VMA is "newly created" and thus represents
"dirty" in the task's VM.

You can clear it by "echo 4 > /proc/pid/clear_refs."

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc.c: remove unused marco LONG_ALIGN

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

frontswap: enable call to invalidate area on swapoff

During swapoff the frontswap_map was NULL-ified before calling
frontswap_invalidate_area().  However the frontswap_invalidate_area()
exits early if frontswap_map is NULL.  Invalidate was never called during
swapoff.

This patch moves frontswap_map_set() in swapoff just after calling
frontswap_invalidate_area() so outside of locks (swap_lock and
swap_info_struct->lock).  This shouldn't be a problem as during swapon the
frontswap_map_set() is called also outside of any locks.

Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Reviewed-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/swapfile.c: fix comment typos

Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: kmemleak: avoid false negatives on vmalloc'ed objects

Commit 248ac0e1 ("mm/vmalloc: remove guard page from between vmap blocks")
had the side effect of making vmap_area.va_end member point to the next
vmap_area.va_start. This was creating an artificial reference to
vmalloc'ed objects and kmemleak was rarely reporting vmalloc() leaks.

This patch marks the vmap_area containing pointers explicitly and reduces
the min ref_count to 2 as vm_struct still contains a reference to the
vmalloc'ed object. The kmemleak add_scan_area() function has been
improved to allow a SIZE_MAX argument covering the rest of the object (for
simpler calling sites).

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm-sparsemem-fix-a-bug-in-free_map_bootmem-when-config_sparsemem_vmemmap-v2

v2: Fix a bug introduced in v1 patch. Thanks wanpeng!

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/sparsemem: fix a bug in free_map_bootmem when CONFIG_SPARSEMEM_VMEMMAP

We pass the number of pages which hold page structs of a memory section to
free_map_bootmem().  This is right when !CONFIG_SPARSEMEM_VMEMMAP but
wrong when CONFIG_SPARSEMEM_VMEMMAP.  When CONFIG_SPARSEMEM_VMEMMAP, we
should pass the number of pages of a memory section to free_map_bootmem.

So the fix is removing the nr_pages parameter.  When
CONFIG_SPARSEMEM_VMEMMAP, we directly use the prefined marco
PAGES_PER_SECTION in free_map_bootmem.  When !CONFIG_SPARSEMEM_VMEMMAP, we
calculate page numbers needed to hold the page structs for a memory
section and use the value in free_map_bootmem().

This was found by reading the code.  And I have no machine that support
memory hot-remove to test the bug now.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/sparsemem: use PAGES_PER_SECTION to remove redundant nr_pages parameter

For below functions,

- sparse_add_one_section()
- kmalloc_section_memmap()
- __kmalloc_section_memmap()
- __kfree_section_memmap()

they are always invoked to operate on one memory section, so it is
redundant to always pass a nr_pages parameter, which is the page numbers
in one section. So we can directly use predefined macro PAGES_PER_SECTION
instead of passing the parameter.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memcg: support hierarchical memory.numa_stats

The memory.numa_stat file was not hierarchical.  Memory charged to the
children was not shown in parent's numa_stat.

This change adds the "hierarchical_" stats to the existing stats.  The new
hierarchical stats include the sum of all children's values in addition to
the value of the memcg.

Tested: Create cgroup a, a/b and run workload under b.  The values of
b are included in the "hierarchical_*" under a.

$ cd /sys/fs/cgroup
$ echo 1 > memory.use_hierarchy
$ mkdir a a/b

Run workload in a/b:
$ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) &

The hierarchical_ fields in parent (a) show use of workload in a/b:
$ cat a/memory.numa_stat
total=0 N0=0 N1=0 N2=0 N3=0
file=0 N0=0 N1=0 N2=0 N3=0
anon=0 N0=0 N1=0 N2=0 N3=0
unevictable=0 N0=0 N1=0 N2=0 N3=0
hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

$ cat a/b/memory.numa_stat
total=908 N0=552 N1=317 N2=39 N3=0
file=850 N0=549 N1=301 N2=0 N3=0
anon=58 N0=3 N1=16 N2=39 N3=0
unevictable=0 N0=0 N1=0 N2=0 N3=0
hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memcg: refactor mem_control_numa_stat_show()

Refactor mem_control_numa_stat_show() to use a new stats structure for
smaller and simpler code. This consolidates nearly identical code.

text data bss dec hex filename
8,137,679 1,703,496 1,896,448 11,737,623 b31a17 vmlinux.before
8,136,911 1,703,496 1,896,448 11,736,855 b31717 vmlinux.after

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Ying Han <yinghan@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/mempolicy: use NUMA_NO_NODE

Use more appropriate NUMA_NO_NODE instead of -1

Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm-thp-khugepaged-add-policy-for-finding-target-node-fix

make last_khugepaged_target_node local to khugepaged_find_target_node()

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Davidoff <davidoff@qedmf.net>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Bob Liu <lliubbo@gmail.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: thp: khugepaged: add policy for finding target node

Khugepaged will scan/free HPAGE_PMD_NR normal pages and replace with a
hugepage which is allocated from the node of the first scanned normal
page, but this policy is too rough and may end with unexpected result to
upper users.

The problem is the original page-balancing among all nodes will be broken
after hugepaged started.  Thinking about the case if the first scanned
normal page is allocated from node A, most of other scanned normal pages
are allocated from node B or C..  But hugepaged will always allocate
hugepage from node A which will cause extra memory pressure on node A
which is not the situation before khugepaged started.

This patch try to fix this problem by making khugepaged allocate hugepage
from the node which have max record of scaned normal pages hit, so that
the effect to original page-balancing can be minimized.

The other problem is if normal scanned pages are equally allocated from
Node A,B and C, after khugepaged started Node A will still suffer extra
memory pressure.

Andrew Davidoff reported a related issue several days ago.  He wanted his
application interleaving among all nodes and "numactl --interleave=all
./test" was used to run the testcase, but the result wasn't not as
expected.

cat /proc/2814/numa_maps:
7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435
N3=50098
The end result showed that most pages are from Node3 instead of interleave
among node0-3 which was unreasonable.

This patch also fix this issue by allocating hugepage round robin from all
nodes have the same record, after this patch the result was as expected:
7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723
N2=13235 N3=12722

The simple testcase is like this:

int main() {
char *p;
int i;
int j;

for (i=0; i < 200; i++) {
p = (char *)malloc(1048576);
printf("malloc done\n");

if (p == 0) {
printf("Out of memory\n");
return 1;
}
for (j=0; j < 1048576; j++) {
p[j] = 'A';
}
printf("touched memory\n");

sleep(1);
}
printf("enter sleep\n");
while(1) {
sleep(100);
}
}

Reported-by: Andrew Davidoff <davidoff@qedmf.net>
Tested-by: Andrew Davidoff <davidoff@qedmf.net>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: thp: cleanup: mv alloc_hugepage to better place

Move alloc_hugepage() to a better place, no need for a seperate #ifndef
CONFIG_NUMA

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andrew Davidoff <davidoff@qedmf.net>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Documentation/vm/zswap.txt: fix typos

Signed-off-by: Christian Hesse <mail@eworm.de>
Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

revert mm/vmalloc.c: emit the failure message before return

Don't warn twice in __vmalloc_area_node and __vmalloc_node_range if
__vmalloc_area_node allocation failure. This patch reverts commit
46c001a2 ("mm/vmalloc.c: emit the failure message before return").

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/vmalloc: revert "mm/vmalloc.c: check VM_UNINITIALIZED flag in s_show instead of show_numa_info"

The VM_UNINITIALIZED/VM_UNLIST flag introduced by f5252e00 ("mm:
avoid null pointer access in vm_struct via /proc/vmallocinfo") is used to
avoid accessing the pages field with unallocated page when
show_numa_info() is called. This patch move the check just before
show_numa_info in order that some messages still can be dumped via
/proc/vmallocinfo. This patch revert commit d157a558 ("mm/vmalloc.c: check
VM_UNINITIALIZED flag in s_show instead of show_numa_info");

Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/vmalloc: fix show vmap_area information race with vmap_area tear down

There is a race window between vmap_area tear down and show vmap_area
information.

A                                                B

remove_vm_area
spin_lock(&vmap_area_lock);
va->vm = NULL;
va->flags &= ~VM_VM_AREA;
spin_unlock(&vmap_area_lock);
spin_lock(&vmap_area_lock);
if (va->flags & (VM_LAZY_FREE | VM_LAZY_FREEZING))
return 0;
if (!(va->flags & VM_VM_AREA)) {
seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
(void *)va->va_start, (void *)va->va_end,
va->va_end - va->va_start);
return 0;
}
free_unmap_vmap_area(va);
flush_cache_vunmap
free_unmap_vmap_area_noflush
unmap_vmap_area
free_vmap_area_noflush
va->flags |= VM_LAZY_FREE

The assumption !VM_VM_AREA represents vm_map_ram allocation is introduced
by d4033afd ("mm, vmalloc: iterate vmap_area_list, instead of vmlist, in
vmallocinfo()").  However, !VM_VM_AREA also represents vmap_area is being
tear down in race window mentioned above.  This patch fix it by don't dump
any information for !VM_VM_AREA case and also remove (VM_LAZY_FREE |
VM_LAZY_FREEING) check since they are not possible for !VM_VM_AREA case.

Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/vmalloc: don't set area->caller twice

The caller address has already been set in set_vmalloc_vm(), there's no
need to set it again in __vmalloc_area_node.

Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, mempolicy: make mpol_to_str robust and always succeed

mpol_to_str() should not fail. Currently, it either fails because the
string buffer is too small or because a string hasn't been defined for a
mempolicy mode.

If a new mempolicy mode is introduced and no string is defined for it,
just warn and return "unknown".

If the buffer is too small, just truncate the string and return, the same
behavior as snprintf().

This also fixes a bug where there was no NULL-byte termination when doing
*p++ = '=' and *p++ ':' and maxlen has been reached.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Chen Gang <gang.chen@asianux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/arch: use NUMA_NO_NODE

Use more appropriate NUMA_NO_NODE instead of -1 in all archs' module_alloc()

Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/memory-failure.c: move set_migratetype_isolate() outside get_any_page()

Chen Gong pointed out that set/unset_migratetype_isolate() was done in
different functions in mm/memory-failure.c, which makes the code less
readable/maintainable. So this patch does it in soft_offline_page().

With this patch, we get to hold lock_memory_hotplug() longer but it's not
a problem because races between memory hotplug and soft offline are very
rare.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Chen, Gong <gong.chen@linux.intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cpu/mem hotplug: add try_online_node() for cpu_up()

cpu_up() has #ifdef CONFIG_MEMORY_HOTPLUG code blocks, which call
mem_online_node() to put its node online if offlined and then call
build_all_zonelists() to initialize the zone list. These steps are
specific to memory hotplug, and should be managed in mm/memory_hotplug.c.
lock_memory_hotplug() should also be held for the whole steps.

For this reason, this patch replaces mem_online_node() with
try_online_node(), which performs the whole steps with
lock_memory_hotplug() held. try_online_node() is named after
try_offline_node() as they have similar purpose.

There is no functional change in this patch.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/nobootmem.c: have __free_pages_memory() free in larger chunks.

On large memory machines it can take a few minutes to get through
free_all_bootmem().

Currently, when free_all_bootmem() calls __free_pages_memory(), the number
of contiguous pages that __free_pages_memory() passes to the buddy
allocator is limited to BITS_PER_LONG.  BITS_PER_LONG was originally
chosen to keep things similar to mm/nobootmem.c.  But it is more efficient
to limit it to MAX_ORDER.

       base   new  change
8TB    202s  172s   30s
16TB   401s  351s   50s

That is around 1%-3% improvement on total boot time.

This patch was spun off from the boot time rfc Robin and I had been
working on.

Signed-off-by: Robin Holt <robin.m.holt@gmail.com>
Signed-off-by: Nathan Zimmer <nzimmer@sgi.com>
Cc: Robin Holt <robinmholt@linux.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mike Travis <travis@sgi.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: add a helper function to check may oom condition

Use helper function to check if we need to deal with oom condition.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/memory_hotplug.c: use pfn_to_nid() instead of page_to_nid(pfn_to_page())

Use "pfn_to_nid(pfn)" instead of "page_to_nid(pfn_to_page(pfn))".

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Acked-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/memory_hotplug.c: rename the function is_memblock_offlined_cb()

A is_memblock_offlined() return or 1 means memory block is offlined, but
is_memblock_offlined_cb() returning 1 means memory block is not offlined,
this will confuse somebody, so rename the function.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Acked-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: use populated_zone() instead of if(zone->present_pages)

Use "if (zone->present_pages)" instead of "if (zone->present_pages)".
Simplify the code, no functional change.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: use pgdat_end_pfn() to simplify the code in others

Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn +
pgdat->node_spanned_pages". Simplify the code, no functional change.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: use pgdat_end_pfn() to simplify the code in arch

Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn +
pgdat->node_spanned_pages". Simplify the code, no functional change.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/huge_memory.c: fix stale comments of transparent_hugepage_flags

Since commit 13ece886d9 ("thp: transparent hugepage config choice"),
transparent hugepage support is disabled by default, and
TRANSPARENT_HUGEPAGE_ALWAYS is configured when TRANSPARENT_HUGEPAGE=y.

And since commit d39d33c332 ("thp: enable direct defrag"), defrag is
enable for all transparent hugepage page faults by default, not only in
MADV_HUGEPAGE regions.

Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: remove obsolete comments about page table lock

The callers of free_pgd_range() and hugetlb_free_pgd_range() don't hold
page table locks. The comments seems to be obsolete, so let's remove
them.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/video/acornfb.c: use __free_reserved_page() to simplify the code

Use __free_reserved_page() to simplify the code in the others.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/arch: use __free_reserved_page() to simplify the code

Use __free_reserved_page() to simplify the code in arch.

It used split_page() in consistent_alloc()/__dma_alloc_coherent()/dma_alloc_coherent(),
so page->_count == 1, and we can free it safely.

__free_reserved_page()
ClearPageReserved()
init_page_count() // it won't change the value
__free_page()

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/compaction.c: update comment about zone lock in isolate_freepages_block

Since commit f40d1e4 ("mm: compaction: acquire the zone->lock as late as
possible"), isolate_freepages_block() takes the zone->lock itself. The
function description however still states that the zone->lock must be
held.

This patch removes this outdated statement.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/vmalloc: use NUMA_NO_NODE

Use more appropriate "if (node == NUMA_NO_NODE)" instead of "if (node < 0)"

Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ksm: Remove redundant __GFP_ZERO from kcalloc

kcalloc returns zeroed memory. There's no need to use this flag.

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

watchdog: trigger all-cpu backtrace when locked up and going to panic

Send an NMI to all CPUs when a lockup is detected and the lockup watchdog
code is configured to panic. This gives us a fairly uptodate snapshot of
all CPUs in the system.

This lets us get stack trace of all CPUs which makes life easier trying to
debug a deadlock, and the NMI doesn't change anything since the next step
is a kernel panic.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

posix_acl: uninlining

Uninline vast tracts of nested inline functions in
include/linux/posix_acl.h.

This reduces the text+data+bss size of x86_64 allyesconfig vmlinux by 8026
bytes.

The patch also regularises the positioning of the EXPORT_SYMBOLs in
posix_acl.c.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: J. Bruce Fields <bfields@fieldses.org>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Benny Halevy <bhalevy@primarydata.com>
Cc: Benny Halevy <bhalevy@panasas.com>
Cc: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

anon_inodefs: forbid open via /proc

open("/proc/pid/$anon-fd") should fail, we can't create the new file with
correct f_op/etc correctly. Currently this creates the bogus file with
the empty anon_inode_fops, this is harmless but still wrong and
misleading.

Add anon_inode_fops->anon_open() which simply returns ENXIO like
sock_no_open() does in this case.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

block: restore /proc/partitions to not display non-partitionable removable devices

We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before.

Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduces this change in behavior. It's not
clear to me from the commit's changelog if this change was intentional or
not. This comment still remains: /* Don't show non-partitionable
removeable devices or empty devices */ so I've decided to send a patch to
restore the behavior of not printing unpartitionable removable devices.

Signed-off-by: Josh Hunt <johunt@akamai.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

block: do not call sector_div() with a 64-bit divisor

do_div() (called by sector_div() if CONFIG_LBDAF=y) is meant for divisions
of 64-bit number by 32-bit numbers.  Passing 64-bit divisor types caused
issues in the past on 32-bit platforms, cfr.  commit ea077b1b96e073ea
("m68k: Truncate base in do_div()").

As queue_limits.max_discard_sectors and .discard_granularity are unsigned
int, max_discard_sectors and granularity should be unsigned int.  As
bdev_discard_alignment() returns int, alignment should be int.  Now 2
calls to sector_div() can be replaced by 32-bit arithmetic:

  - The 64-bit modulo operation can become a 32-bit modulo operation,
  - The 64-bit division and multiplication can be replaced by a 32-bit
    modulo operation and a subtraction.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/block/cciss.c:cciss_init_one(): use proper errnos

pci_driver.probe should return a meaningful errno, not -1.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/block/cciss.c: return 0 from driver probe function on success, not 1

A return value of 1 is interpreted as an error.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

hpsa: return 0 from driver probe function on success, not 1

A return value of 1 is interpreted as an error.  See pci_driver.  in
local_pci_probe().  If you're wondering how this ever could have worked,
it's because it used to be the case that only return values less than zero
were interpreted as failure.  But even in the current kernel if the driver
registers its various entry points with the kernel, and then returns a
value which is interpreted as failure, those registrations aren't undone,
so the driver still mostly works.  However, the driver's remove function
wouldn't be called on rmmod, and pci power management functions wouldn't
work.  In the case of Smart Array, since it has a battery backed cache (or
else no cache) even if the driver is not shut down properly as long as
there is no outstanding i/o, nothing too bad happens, which is why it took
so long to notice.

Requesting backport to stable because the change to pci-driver.c which
requires driver probe functions to return 0 occurred between 2.6.35 and
2.6.36 (the pci power management breakage) and again between 3.7 and 3.8
(pci_dev->driver getting set to NULL in local_pci_probe() preventing
driver remove function from being called on rmmod.)

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/block/paride/pg.c: underflow bug in pg_write()

The test here can underflow so we pass bogus lengths to the hardware.
It's a static checker fix and I don't know the impact.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/block/sx8.c: remove unnecessary pci_set_drvdata()

The driver core clears the driver data to NULL after device_release or on
probe failure. Thus, it is not needed to manually clear the device driver
data to NULL.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/block/sx8.c: use module_pci_driver()

Use module_pci_driver() macro which makes the code smaller and simpler.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/bio-integrity.c: remove duplicated code

Most code of function bio_integrity_verify and bio_integrity_generate is
the same, so introduce a common function bio_integrity_generate_verify()
to remove the reduplicate code.

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/cdrom/gdrom.c: remove deprecated IRQF_DISABLED

Remove the IRQF_DISABLED flag from drivers/cdrom/gdrom.c.

It's a NOOP since 2.6.35 and it will be removed one day.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

scsi: do not call do_div() with a 64-bit divisor

do_div() is meant for divisions of 64-bit number by 32-bit numbers.
Passing 64-bit divisor types caused issues in the past on 32-bit
platforms, cfr. commit ea077b1b96e073eac5c ("m68k: Truncate base in
do_div()").

As scsi_device.sector_size is unsigned (int), factor should be unsigned
int, too.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/readahead.c:do_readhead(): don't check for ->readpage

The callee force_page_cache_readahead() already does this and unlike
do_readahead(), force_page_cache_readahead() remembers to check for
->readpages() as well.

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/pci/pci-driver.c: warn on driver probe return value greater than zero

Ages ago, drivers could return values greater than zero from their probe
function and this would be regarded as success. Commit f3ec4f87d607f40497
"PCI: change device runtime PM settings for probe and remove" slightly
altered this in 2010, and commit 967577b062417b4e4b8e27b ("PCI/PM: Keep
runtime PM enabled for unbound PCI devices") in late 2012 altered it more
signficantly, setting pci_dev->driver to NULL if the driver's probe
function returned a value greater than zero, which would for example
prevent the driver's remove function from being called on rmmod.

Neither of those changes would necessarily make the driver fail in an
obvious way though, and so at least a couple drivers (cciss, hpsa) fell
into this hole since they were returning 1, and this situation went
unnoticed for quite some time.

If a driver's probe function returns a value greater than zero, issue a
warning, but otherwise treat this as success.

Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: update inode size after zeroing the hole

fs-writeback will release the dirty pages without page lock whose offset
are over inode size, the release happens at block_write_full_page_endio().
If not update, dirty pages in file holes may be released before flushed
to the disk, then file holes will contain some non-zero data, this will
cause sparse file md5sum error.

To reproduce the bug, find a big sparse file with many holes, like vm
image file, its actual size should be bigger than available mem size to
make writeback work more frequently, tar it with -S option, then keep
untar it and check its md5sum again and again until you get a wrong
md5sum.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Younger Liu <younger.liu@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: fix issue that ocfs2_setattr() does not deal with new_i_size==i_size

The issue scenario is as following:

- Create a small file and fallocate a large disk space for a file with
  FALLOC_FL_KEEP_SIZE option.

- ftruncate the file back to the original size again.  but the disk free
  space is not changed back.  This is a real bug that be fixed in this
  patch.

In order to solve the issue above, we modified ocfs2_setattr(), if
attr->ia_size != i_size_read(inode), It calls ocfs2_truncate_file(), and
truncate disk space to attr->ia_size.

Signed-off-by: Younger Liu <younger.liu@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Tested-by: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Reviewed-by: Jensen <shencanquan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: llseek requires ocfs2 inode lock for the file in SEEK_END

llseek requires ocfs2 inode lock for updating the file size in SEEK_END.
because the file size maybe update on another node.

This bug can be reproduce the following scenario: at first, we dd a test
fileA, the file size is 10k.

on NodeA:
---------
1) open the test fileA, lseek the end of file. and print the position.
2) close the test fileA

on NodeB:
1) open the test fileA, append the 5k data to test FileA.
2) lseek the end of file. and print the position.
3) close file.

At first we run the test program1 on NodeA , the result is 10k. And then
run the test program2 on NodeB, the result is 15k. At last, we run the
test program1 on NodeA again, the result is 10k.

After applying this patch the three step result is 15k.

Signed-off-by: Jensen <shencanquan@huawei.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Sunil Mushran <sunil.mushran@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: should call ocfs2_journal_access_di() before ocfs2_delete_entry() in ocfs2_orphan_del()

While deleting a file into orphan dir in ocfs2_orphan_del(), it calls
ocfs2_delete_entry() before ocfs2_journal_access_di(). If
ocfs2_delete_entry() succeeded and ocfs2_journal_access_di() failed, there
would be a inconsistency: the file is deleted from orphan dir, but orphan
dir dinode is not updated.

So we need to call ocfs2_journal_access_di() before ocfs2_orphan_del().

Signed-off-by: Younger Liu <younger.liu@huawei.com>
Reviewed-by: Jensen <shencanquan@huawei.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: use the new DLM operation callbacks while requesting new lockspace

Attempt to use the new DLM operations.  If it is not supported, use the
traditional ocfs2_controld.

To exchange ocfs2 versioning, we use the LVB of the version dlm lock.  It
first attempts to take the lock in EX mode (non-blocking).  If successful
(which means it is the first mount), it writes the version number and
downconverts to PR lock.  If it is unsuccessful, it reads the version from
the lock.

If this becomes the standard (with o2cb as well), it could simplify
userspace tools to check if the filesystem is mounted on other nodes.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: framework for version LVB

Use the native DLM locks for version control negotiation. Most of the
framework is taken from gfs2/lock_dlm.c

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: pass ocfs2_cluster_connection to ocfs2_this_node

This is done to differentiate between using and not using controld and use
the connection information accordingly. We need to be backward
compatible. So, we use a new enum ocfs2_connection_type to identify when
controld is used and when it is not.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: shift allocation ocfs2_live_connection to user_connect()

We perform this because the DLM recovery callbacks will require the
ocfs2_live_connection structure to record the node information when
dlm_new_lockspace() is updated.

[AKPM] rc initialization is not required because it assigned in case of
errors. It will be cleared by compiler anyways.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: add DLM recovery callbacks

These are the callbacks called by the fs/dlm code in case the membership
changes.  If there is a failure while/during calling any of these, the DLM
creates a new membership and relays to the rest of the nodes.

recover_prep() is called when DLM understands a node is down.
recover_slot() is called once all nodes have acknowledged recover_prep and
recovery can begin.  recover_done() is called once the recovery is
complete.  It returns the new membership.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: add clustername to cluster connection

This is an effort of removing ocfs2_controld.pcmk and getting ocfs2 DLM
handling up to the times with respect to DLM (>=4.0.1) and corosync
(2.3.x).  AFAIK, cman also is being phased out for a unified corosync
cluster stack.

fs/dlm performs all the functions with respect to fencing and node
management and provides the API's to do so for ocfs2.  For all future
references, DLM stands for fs/dlm code.

The advantages are:
+ No need to run an additional userspace daemon (ocfs2_controld)
+ No contrrold devince handling and controld protocol
+ Shifting responsibilities of node management to DLM layer

For backward compatibility, we are keeping the controld handling code.
Once enough time has passed we can remove a significant portion of the
code.

This feature requires modification in the userspace ocfs2-tools.  The
changes can be found at: https://github.com/goldwynr/ocfs2-tools branch:
nocontrold Currently, not many checks are present in the userspace code,
but that would change soon.

These changes were developed on linux-stable 3.11.y.  However, the changes
are applicable to the current upstream as well.  If you wish to give the
entire kernel a spin, the link is:

https://github.com/goldwynr/linux-stable branch: nocontrold

This patch (of 6):

Add clustername to cluster connection.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: fix possible double free in ocfs2_write_begin_nolock

When ocfs2_write_cluster_by_desc() failed in ocfs2_write_begin_nolock()
because of ENOSPC, it goes to out_quota, freeing data_ac(meta_ac).  Then
it calls ocfs2_try_to_free_truncate_log() to free space.  If enough space
freed, it will try to write again.  Unfortunately, some error happenes
before ocfs2_lock_allocators(), it goes to out and free data_ac(meta_ac)
again.

Signed-off-by: joyce <xuejiufei@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: add missing errno in ocfs2_ioctl_move_extents()

If the file is not regular or writeable, it should return errno(EPERM).

This patch is based on 85a258b70d ("ocfs2: fix error handling in
ocfs2_ioctl_move_extents()").

Signed-off-by: Younger Liu <younger.liu@huawei.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: do not call brelse() if group_bh is not initialized in ocfs2_group_add()

If group_bh is not initialized, there is no need to release. This problem
does not cause anything wrong, but the patch would make the code more
logical.

Signed-off-by: Younger Liu <younger.liu@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: rollback transaction in ocfs2_group_add()

If ocfs2_journal_access_di() fails, group->bg_next_group should rollback.
Otherwise, there would be a inconsistency between group_bh and main_bm_bh.

Signed-off-by: Younger Liu <younger.liu@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: break useless while loop

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: use find_last_bit()

We already have find_last_bit(). So just use it as described in the
comment.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: delay migration when the lockres is in migration state

We trigger a bug in __dlm_lockres_reserve_ast() when we parallel umount 4
nodes.  The situation is as follows:

1) Node A migrate all lockres it owned(eg.  lockres A) to other nodes
   say node B when it umounts.

2) Receiving MIG_LOCKRES message from A, Node B masters the lockres A
   with DLM_LOCK_RES_MIGRATING state set.

3) Then we umount ocfs2 on node B.  It also should migrate lockres A to
   another node, say node C.  But now, DLM_LOCK_RES_MIGRATING state of
   lockers A is not cleared.  Node B triggered the BUG on lockres with
   state DLM_LOCK_RES_MIGRATING.

Signed-off-by: Xuejiufei <xuejiufei@huawei.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Tariq Saeed <tariq.x.saeed@oracle.com>
Cc: Srinivas Eeda <srinivas.eeda@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: skip locks in the blocked list

A parallel umount on 4 nodes triggered a bug in
dlm_process_recovery_date().  Here's the situation:

Receiving MIG_LOCKRES message, A node processes the locks in migratable
lockres.  It copys lvb from migratable lockres when processing the first
valid lock.

If there is a lock in the blocked list with the EX level, it triggers the
BUG.  Since valid lvbs are set when locks are granted with EX or PR
levels, locks in the blocked list cannot have valid lvbs.  Therefore I
think we should skip the locks in the blocked list.

Signed-off-by: Xuejiufei <xuejiufei@huawei.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: use bitmap_weight()

Use bitmap_weight() instead of reinventing the wheel.

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: don't spam on -EDQUOT

-EDQUOT is a user-visible error, not a logic problem. Teach mlog_errno()
to ignore it like it ignores -ENOSPC, etc.

Signed-off-by: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Marek Królikowski <admin@wset.edu.pl>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2-add-necessary-check-in-case-sb_getblk-fails-update

Need also add a check after calling sb_getblk in ocfs2_create_xattr_block.

Cc: Rui Xiang <rui.xiang@huawei.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: add necessary check in case sb_getblk() fails

sb_getblk() may return an err, so add a check for bh.

Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2-return-enomem-when-sb_getblk-fails-update

ocfs2_symlink_get_block in aops.c, and ocfs2_read_blocks_sync and
ocfs2_read_blocks in buffer_head_io.c need do the same change for
consistency.

Cc: Rui Xiang <rui.xiang@huawei.com>
Cc: Jie Liu <jeff.liu@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: return ENOMEM when sb_getblk() fails

The only reason for sb_getblk() failing is if it can't allocate the
buffer_head. So return ENOMEM instead when it fails.

Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Reviewed-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/ocfs2/file.c: fix wrong comment

Unwritten extent only exists for file systems which support holes. But
the comment said was opposite meaning and also the comment is not very
clear, so rephase it.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/ocfs2: remove unnecessary variable bits_wanted from ocfs2_calc_extend_credits

Code cleanup to remove unnecessary variable passed but never used
to ocfs2_calc_extend_credits.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/net/irda/donauboe.c: convert to module_pci_driver

Signed-off-by: Libo Chen <libo.chen@huawei.com>
Cc: Samuel Ortiz <samuel@sortiz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

scripts/sortextable: support objects with more than 64K sections.

Building with a large config and -ffunction-sections results in a large
number of sections and sortextable needs to be able to handle that.
Implement support for > 64K sections as modpost does.

Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Michal Marek <mmarek@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

input-route-kbd-leds-through-the-generic-leds-layer-fix-3

Link input/leds.c along input/input.c instead of separate module

input.c needs to call leds.c and vice-versa, so it is simpler to stuff
them together. INPUT_LEDS thus now depends on LEDS_CLASS being available
enough for input.ko.

This also documents the new leds field.

Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

input: CONFIG_INPUT_LEDS stubs should be static inline

The following commit breaks compile when CONFIG_INPUT_LEDS is not selected.

commit 0eb6ee2db914ebb594cb9bc8ae97e3c362c3475a
Author: Samuel Thibault <samuel.thibault@ens-lyon.org>
Date: Wed Oct 30 11:45:16 2013 +1100
input: route kbd LEDs through the generic LEDs layer

Signed-off-by: John Crispin <blogic@openwrt.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Samuel Thibault <samuel.thibault@ens-lyon.org>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

input: route kbd LEDs through the generic LEDs layer

This permits to reassign keyboard LEDs to something else than keyboard
"leds" state, by adding keyboard led and modifier triggers connected to a
series of VT input LEDs, themselves connected to VT input triggers, which
per-input device LEDs use by default. Userland can thus easily change the
LED behavior of (a priori) all input devices, or of particular input
devices.

This also permits to fix #7063 from userland by using a modifier to
implement proper CapsLock behavior and have the keyboard caps lock led
show that modifier state.

[ebroder@mokafive.com: Rebased to 3.2-rc1 or so, cleaned up some includes, and fixed some constants]
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Signed-off-by: Evan Broder <evan@ebroder.net>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Tested-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Bryan Wu <cooloney@gmail.com>
Cc: Arnaud Patard <arnaud.patard@rtp-net.org>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Matt Sealey <matt@genesi-usa.com>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Niels de Vos <devos@fedoraproject.org>
Cc: Steev Klimaszewski <steev@genesi-usa.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/input/touchscreen/cyttsp4_core.c: : remove unnecessary work pending test

Remove unnecessary work pending test before calling schedule_work(). It
has been tested in queue_work_on() already.

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Cc: Ferruh Yigit <fery@cypress.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/infiniband/core/cm.c: convert to using idr_alloc_cyclic()

commit 3e6628c4b347 ("idr: introduce idr_alloc_cyclic()") adds a new
idr_alloc_cyclic routine and converts several of these users to it. This
is just a missed one - add it.

Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Cc: Roland Dreier <roland@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Sean Hefty <sean.hefty@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

kernel/time/tick-common.c: document tick_do_timer_cpu

Taken straight from a tglx email ;)

Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

kernel/timer.c: convert kmalloc_node(...GFP_ZERO...) to kzalloc_node(...)

Use the helper function instead of __GFP_ZERO.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

sched_clock: document 4Mhz vs 1Mhz decision

Bo Shen sent a patch to change this to 1Mhz instead of 4Mhz but according
to Russell King the use of 4Mhz was intentional. Add a comment to this
effect so that others don't try to change the code as well.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Bo Shen <voice.shen@atmel.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

genirq: correct fuzzy and fragile IRQ_RETVAL() definition

commit bedd30d986a0 ("genirq: make irqreturn_t an enum") blindly replaced
"0" by "IRQ_NONE" in the "IRQ_RETVAL(x)" macro definition.

However, as "x" is a condition, "0" meant "boolean false", not an
irqreturn_t value.

All of this worked, and kept working after the addition of IRQ_WAKE_THREAD,
as
  - both "boolean false" and "IRQ_NONE" are "0" (for the comparison),
  - "boolean true" and "boolean false" nicely map to the correct values of
    "IRQ_HANDLED" and "IRQ_NONE" (for the return value).

Correct the macro definition for clarity and future-proofness.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/iommu/omap-iopgtable.h: remove unneeded cast of void*

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Tony Lindgren <tony@atomide.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

nouveau: fix build eror when VGA_SWITCHEROO is not enabled

Fix nouveau build error on x86, when ACPI is enabled but VGA_SWITCHEROO is
not enabled, by providing a stub function.

drivers/built-in.o: In function `nouveau_pmops_runtime_suspend':
nouveau_drm.c:(.text+0x3aac89): undefined reference to `nouveau_switcheroo_optimus_dsm'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: David Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>