git.karo-electronics.de Git - karo-tx-linux.git/log

mm: huge_memory: use GFP_TRANSHUGE when charging huge pages

Transparent huge page charges prefer falling back to regular pages rather
than spending a lot of time in direct reclaim.

Desired reclaim behavior is usually declared in the gfp mask, but THP
charges use GFP_KERNEL and then rely on the fact that OOM is disabled for
THP charges, and that OOM-disabled charges don't retry reclaim. Needless
to say, this is anything but obvious and quite error prone.

Convert THP charges to use GFP_TRANSHUGE instead, which implies
__GFP_NORETRY, to indicate the low-latency requirement.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: memcontrol: reclaim at least once for __GFP_NORETRY

Currently, __GFP_NORETRY tries charging once and gives up before even
trying to reclaim. Bring the behavior on par with the page allocator and
reclaim at least once before giving up.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: memcontrol: rearrange charging fast path

The charging path currently starts out with OOM condition checks when OOM
is the rarest possible case.

Rearrange this code to run OOM/task dying checks only after trying the
percpu charge and the res_counter charge and bail out before entering
reclaim. Attempting a charge does not hurt an (oom-)killed task as much
as every charge attempt having to check OOM conditions. Also, only check
__GFP_NOFAIL when the charge would actually fail.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: memcontrol: fold mem_cgroup_do_charge()

These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages.  This drastically simplifies the code and
reduces charging and uncharging overhead.  The most expensive part of
charging and uncharging is the page_cgroup bit spinlock, which is removed
entirely after this series.

Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
executing in the root memcg).  Before:

    15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
    13.31%              cat  [kernel.kallsyms]   [k] memset
    11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
     4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
     2.38%              cat  [kernel.kallsyms]   [k] put_page
     2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
     2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
     1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
     1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
     1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn

After:

    15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
    13.48%           cat  [kernel.kallsyms]   [k] memset
    11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
     3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
     2.46%           cat  [kernel.kallsyms]   [k] put_page
     2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
     1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
     1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
     1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
     1.30%           cat  [kernel.kallsyms]   [k] kfree

As you can see, the memcg footprint has shrunk quite a bit.

   text    data     bss     dec     hex filename
  37970    9892     400   48262    bc86 mm/memcontrol.o.old
  35239    9892     400   45531    b1db mm/memcontrol.o

This patch (of 13):

This function was split out because mem_cgroup_try_charge() got too big.
But having essentially one sequence of operations arbitrarily split in
half is not good for reworking the code.  Fold it back in.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: page-flags: clean up the page flag test, set, clear macros

- PAGEFLAG_FALSE only defines TEST, make it define SET and CLEAR as
well, analogous to PAGEFLAG.

- Define TESTSETFLAG_FALSE, analogous to TESTSETFLAG.

- Define TESTSCFLAG_FALSE, analogous to TESTSCFLAG

- Make PG_mlocked accessors the same on both MMU and !MMU setups

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, thp: replace smp_mb after atomic_add by smp_mb__after_atomic

In some architectures like x86, atomic_add() is a full memory barrier.  In
that case, an additional smp_mb() is just a waste of time.  This patch
replaces that smp_mb() by smp_mb__after_atomic() which will avoid the
redundant memory barrier in some architectures.

With a 3.16-rc1 based kernel, this patch reduced the execution time of
breaking 1000 transparent huge pages from 38,245us to 30,964us.  A
reduction of 19% which is quite sizeable.  It also reduces the %cpu time
of the __split_huge_page_refcount function in the perf profile from 2.18%
to 1.15%.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Scott J Norton <scott.norton@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, thp: move invariant bug check out of loop in __split_huge_page_map

In __split_huge_page_map(), the check for page_mapcount(page) is invariant
within the for loop.  Because of the fact that the macro is implemented
using atomic_read(), the redundant check cannot be optimized away by the
compiler leading to unnecessary read to the page structure.

This patch moves the invariant bug check out of the loop so that it will
be done only once.  On a 3.16-rc1 based kernel, the execution time of a
microbenchmark that broke up 1000 transparent huge pages using munmap()
had an execution time of 38,245us and 38,548us with and without the patch
respectively.  The performance gain is about 1%.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Scott J Norton <scott.norton@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v3-fix

tweak code comment

Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v3

- add comment on page_size_order()
- use compound_order(compound_head(page)) instead of huge_page_size_order()
- use page_pgoff() in rmap_walk_file() too
- use page_size_order() in kill_proc()
- fix space indent

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm-hugetlbfs-fix-rmapping-for-anonymous-hugepages-with-page_pgoff-v2

- fix wrong shift direction
- introduce page_size_order() and huge_page_size_order()
- move the declaration of PageHuge() to include/linux/hugetlb_inline.h
to avoid macro definition.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()

page->index stores pagecache index when the page is mapped into file
mapping region, and the index is in pagecache size unit, so it depends on
the page size.  Some of users of reverse mapping obviously assumes that
page->index is in PAGE_CACHE_SHIFT unit, so they don't work for anonymous
hugepage.

For example, consider that we have 3-hugepage vma and try to mbind the 2nd
hugepage to migrate to another node.  Then the vma is split and
migrate_page() is called for the 2nd hugepage (belonging to the middle
vma.) In migrate operation, rmap_walk_anon() tries to find the relevant
vma to which the target hugepage belongs, but here we miscalculate pgoff.
So anon_vma_interval_tree_foreach() grabs invalid vma, which fires
VM_BUG_ON.

This patch introduces a new API that is usable both for normal page and
hugepage to get PAGE_SIZE offset from page->index.  Users should clearly
distinguish page_index for pagecache index and page_pgoff for page offset.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org> [3.12+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, CMA: clean-up log message

We don't need explicit 'CMA:' prefix, since we already define prefix
'cma:' in pr_fmt. So remove it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, CMA: change cma_declare_contiguous() to obey coding convention

Conventionally, we put output param to the end of param list and put the
'base' ahead of 'size', but cma_declare_contiguous() doesn't look like
that, so change it.

Additionally, move down cma_areas reference code to the position
where it is really needed.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, CMA: clean-up CMA allocation error path

We can remove one call sites for clear_cma_bitmap() if we first
call it before checking error number.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

PPC, KVM: fix build failure due to removed file

Commit ('PPC, KVM, CMA: use general CMA reserved area management
framework') removes book3s_hv_cma.c, but, missed to remove entry
in Makefile. Fix it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

PPC, KVM, CMA: use general CMA reserved area management framework

Now, we have general CMA reserved area management framework, so use it for
future maintainabilty. There is no functional change.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

CMA: fix ARM build failure related to MAX_CMA_AREAS definition

If CMA is disabled, CONFIG_CMA_AREAS isn't defined so compile error
happens. To fix it, define MAX_CMA_AREAS if CONFIG_CMA_AREAS
isn't defined.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

CMA: generalize CMA reserved area management functionality

Currently, there are two users on CMA functionality, one is the DMA
subsystem and the other is the KVM on powerpc.  They have their own code
to manage CMA reserved area even if they looks really similar.  From my
guess, it is caused by some needs on bitmap management.  KVM side wants to
maintain bitmap not for 1 page, but for more size.  Eventually it use
bitmap where one bit represents 64 pages.

When I implement CMA related patches, I should change those two places to
apply my change and it seem to be painful to me.  I want to change this
situation and reduce future code management overhead through this patch.

This change could also help developer who want to use CMA in their new
feature development, since they can use CMA easily without copying &
pasting this reserved area management code.

In previous patches, we have prepared some features to generalize CMA
reserved area management and now it's time to do it.  This patch moves
core functions to mm/cma.c and change DMA APIs to use these functions.

There is no functional change in DMA APIs.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

dma-cma-support-arbitrary-bitmap-granularity-fix

s/1/1UL/

Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

DMA, CMA: support arbitrary bitmap granularity

PPC KVM's CMA area management requires arbitrary bitmap granularity, since
they want to reserve very large memory and manage this region with bitmap
that one bit for several pages to reduce management overheads. So support
arbitrary bitmap granularity for following generalization.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

DMA, CMA: support alignment constraint on CMA region

PPC KVM's CMA area management needs alignment constraint on CMA region.
So support it to prepare generalization of CMA area management
functionality.

Additionally, add some comments which tell us why alignment
constraint is needed on CMA region.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

DMA, CMA: separate core CMA management codes from DMA APIs

To prepare future generalization work on CMA area management code, we need
to separate core CMA management codes from DMA APIs. We will extend these
core functions to cover requirements of PPC KVM's CMA area management
functionality in following patches. This separation helps us not to touch
DMA APIs while extending core functions.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/internal.h: use nth_page

Use nth_page instead of pfn_to_page(page_to_pfn

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slab: set free_limit for dead caches to 0

We mustn't keep empty slabs on dead memcg caches, because otherwise they
will never be destroyed.

Currently, we check if the cache is dead in free_block and drop empty slab
if so irrespective of the node's free_limit.  This can be pretty
expensive.  So let's avoid this additional check by zeroing nodes'
free_limit for dead caches on kmem_cache_shrink.  This way no additional
overhead is added to free hot path.

Note, since ->free_limit can be updated on cpu/memory hotplug, we must
handle it properly there.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slab: do not keep free objects/slabs on dead memcg caches

Since a dead memcg cache is destroyed only after the last slab allocated
to it is freed, we must disable caching of free objects/slabs for such
caches, otherwise they will be hanging around forever.

For SLAB that means we must disable per cpu free object arrays and make
free_block always discard empty slabs irrespective of node's free_limit.

To disable per cpu arrays, we free them on kmem_cache_shrink (see
drain_cpu_caches -> do_drain) and make __cache_free fall back to
free_block if there is no per cpu array. Also, we have to disable
allocation of per cpu arrays on cpu hotplug for dead caches (see
cpuup_prepare, __do_tune_cpucache).

After we disabled free objects/slabs caching, there is no need to reap
those caches periodically. Moreover, it will only result in slowdown. So
we also make cache_reap skip then.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slub: kmem_cache_shrink: check if partial list is empty under list_lock

SLUB's implementation of kmem_cache_shrink skips nodes that have
nr_partial=0, because they surely don't have any empty slabs to free.
This check is done w/o holding any locks, therefore it can race with
concurrent kfree adding an empty slab to a partial list. As a result, a
just shrinked cache can have empty slabs.

This is unacceptable for kmemcg, which needs to be sure that there will be
no empty slabs on dead memcg caches after kmem_cache_shrink was called,
because otherwise we may leak a dead cache.

Let's fix this race by checking if node partial list is empty under
node->list_lock. Since the nr_partial!=0 branch of kmem_cache_shrink does
nothing if the list is empty, we can simply remove the nr_partial=0 check.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slub: make dead memcg caches discard free slabs immediately

Since a dead memcg cache is destroyed only after the last slab allocated
to it is freed, we must disable caching of empty slabs for such caches,
otherwise they will be hanging around forever.

This patch makes SLUB discard dead memcg caches' slabs as soon as they
become empty. To achieve that, it disables per cpu partial lists for dead
caches (see put_cpu_partial) and forbids keeping empty slabs on per node
partial lists by setting cache's min_partial to 0 on kmem_cache_shrink,
which is always called on memcg offline (see memcg_unregister_all_caches).

Thanks to Joonsoo Kim.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memcg: wait for kfree's to finish before destroying cache

kmem_cache_free doesn't expect that the cache can be destroyed as soon as
the object is freed, e.g. SLUB's implementation may want to update cache
stats after putting the object to the free list.

Therefore we should wait for all kmem_cache_free's to finish before
proceeding to cache destruction. Since both SLAB and SLUB versions of
kmem_cache_free are non-preemptable, we wait for rcu-sched grace period to
elapse.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slub: make slab_free non-preemptable

Since per memcg cache destruction is scheduled when the last slab is
freed, to avoid use-after-free in kmem_cache_free we should either
rearrange code in kmem_cache_free so that it won't dereference the cache
ptr after freeing the object, or wait for all kmem_cache_free's to
complete before proceeding to cache destruction.

The former approach isn't a good option from the future development point
of view, because every modifications to kmem_cache_free must be done with
great care then. Hence we should provide a method to wait for all
currently executing kmem_cache_free's to finish.

This patch makes SLUB's implementation of kmem_cache_free non-preemptable.
As a result, synchronize_sched() will work as a barrier against
kmem_cache_free's in flight, so that issuing it before cache destruction
will protect us against the use-after-free.

This won't affect performance of kmem_cache_free, because we already
disable preemption there, and this patch only moves preempt_enable to the
end of the function. Neither should it affect the system latency, because
kmem_cache_free is extremely short, even in its slow path.

SLAB's version of kmem_cache_free already proceeds with irqs disabled, so
we only add a comment explaining why it's necessary for kmemcg there.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slub: don't fail kmem_cache_shrink if slab placement optimization fails

SLUB's kmem_cache_shrink not only removes empty slabs from the cache, but
also sorts slabs by the number of objects in-use to cope with
fragmentation. To achieve that, it tries to allocate a temporary array.
If it fails, it will abort the whole procedure.

This is unacceptable for kmemcg, where we want to be sure that all empty
slabs are removed from the cache on memcg offline, so let's just skip the
slab placement optimization step if the allocation fails, but still get
rid of empty slabs.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memcg: mark caches that belong to offline memcgs as dead

This will be used by the next patches.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memcg: destroy kmem caches when last slab is freed

When the memcg_cache_params->refcnt goes to 0, schedule the worker that
will unregister the cache.  To prevent this from happening when the owner
memcg is alive, keep the refcnt incremented during memcg lifetime.

Note, this doesn't guarantee that the cache that belongs to a dead memcg
will go away as soon as the last object is freed, because SL[AU]B
implementation can cache empty slabs for performance reasons.  Hence the
cache may be hanging around indefinitely after memcg offline.  This is to
be resolved by the next patches.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memcg: cleanup memcg_cache_params refcnt usage

When a memcg is turned offline, some of its kmem caches can still have
active objects and therefore cannot be destroyed immediately.  Currently,
we simply leak such caches along with the owner memcg, which is bad and
should be resolved.

It would be perfect if we could move all slab pages of such dead caches to
the root/parent cache on memcg offline.  However, when I tried to
implement such re-parenting, I was pointed out by Christoph that the
overhead of this would be unacceptable, at least for SLUB (see
https://lkml.org/lkml/2014/5/13/446)

The problem with re-parenting of individual slabs is that it requires
tracking of all slabs allocated to a cache, but SLUB doesn't track full
slabs if !debug.  Changing this behavior would result in significant
performance degradation of regular alloc/free paths, because it would make
alloc/free take per node list locks more often.

After pondering about this problem for some time, I think we should return
to dead caches self-destruction, i.e.  scheduling cache destruction work
when the last slab page is freed.

This is the behavior we had before commit 5bd93da9917f ("memcg, slab:
simplify synchronization scheme").  The reason why it was removed was that
it simply didn't work, because SL[AU]B are implemented in such a way that
they don't discard empty slabs immediately, but prefer keeping them cached
for indefinite time to speed up further allocations.

However, we can change this w/o noticeable performance impact for both
SLAB and SLUB by making them drop free slabs as soon as they become empty.
Since dead caches should never be allocated from, removing empty slabs
from them shouldn't result in noticeable performance degradation.

So, this patch set reintroduces dead cache self-destruction and adds some
tweaks to SL[AU]B to prevent dead caches from hanging around indefinitely.
It is organized as follows:

- patches 1-3 reintroduce dead memcg cache self-destruction;
- patch 4 makes SLUB's version of kmem_cache_shrink always drop empty
   slabs, even if it fails to allocate a temporary array;
- patches 5 and 6 fix possible use-after-free connected with
   asynchronous cache destruction;
- patches 7 and 8 disable caching of empty slabs for dead memcg caches
   for SLUB and SLAB respectively.

Note, this doesn't resolve the problem of memcgs pinned by dead kmem
caches. I'm planning to solve this by re-parenting dead kmem caches to
the parent memcg.

This patch (of 8):

Currently, we count the number of pages allocated to a per memcg cache in
memcg_cache_params->nr_pages.  We only use this counter to find out if the
cache is empty and can be destroyed.  So let's rename it to refcnt and
make it count not pages, but slabs so that we can use atomic_inc/dec
instead of atomic_add/sub in memcg_charge/uncharge_slab.

Also, as the number of slabs theoretically can be greater than INT_MAX,
let's use atomic_long for the counter.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: page_alloc: simplify drain_zone_pages by using min()

Instead of open-coding getting minimal value of two, just use min macro.
That is why it is there for. While changing the function also change type
of batch local variable to match type of per_cpu_pages::batch (which is
int).

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mem-hotplug: introduce MMOP_OFFLINE to replace the hard coding -1

In store_mem_state(), we have:
......
334         else if (!strncmp(buf, "offline", min_t(int, count, 7)))
335                 online_type = -1;
......
355         case -1:
356                 ret = device_offline(&mem->dev);
357                 break;
......

Here, "offline" is hard coded as -1.

This patch does the following renaming:
ONLINE_KEEP     ->  MMOP_ONLINE_KEEP
ONLINE_KERNEL   ->  MMOP_ONLINE_KERNEL
ONLINE_MOVABLE  ->  MMOP_ONLINE_MOVABLE

and introduce MMOP_OFFLINE = -1 to avoid hard coding.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Hu Tao <hutao@cn.fujitsu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mem-hotplug: avoid illegal state prefixed with legal state when changing state of memory_block

We use the following command to online a memory_block:

echo online|online_kernel|online_movable > /sys/devices/system/memory/memoryXXX/state

But, if we do the following:

echo online_fhsjkghfkd > /sys/devices/system/memory/memoryXXX/state

the block will also be onlined.

This is because the following code in store_mem_state() does not compare
the whole string, but only the prefix of the string.

store_mem_state()
{
......
328         if (!strncmp(buf, "online_kernel", min_t(int, count, 13)))

Here, only compare the first 13 letters of the string. If we give "online_kernelXXXXXX",
it will be recognized as online_kernel, which is incorrect.

329                 online_type = ONLINE_KERNEL;
330         else if (!strncmp(buf, "online_movable", min_t(int, count, 14)))

We have the same problem here,

331                 online_type = ONLINE_MOVABLE;
332         else if (!strncmp(buf, "online", min_t(int, count, 6)))

here,

(Here is more problematic. If we give online_movalbe, which is a typo of online_movable,
it will be recognized as online without noticing the author.)

333                 online_type = ONLINE_KEEP;
334         else if (!strncmp(buf, "offline", min_t(int, count, 7)))

and here.

335                 online_type = -1;
336         else {
337                 ret = -EINVAL;
338                 goto err;
339         }
......
}

This patch fix this problem by using sysfs_streq() to compare the whole string.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reported-by: Hu Tao <hutao@cn.fujitsu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/memory.c: use entry = ACCESS_ONCE(*pte) in handle_pte_fault()

Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or orig_pte
upon which all subsequent decisions and pte_same() tests will be made.

I have no evidence that its lack is responsible for the mm/filemap.c:202
BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by trinity,
and I am not optimistic that it will fix it. But I have found no other
explanation, and ACCESS_ONCE() here will surely not hurt.

If gcc does re-access the pte before passing it down, then that would be
disastrous for correct page fault handling, and certainly could explain
the page_mapped() BUGs seen (concurrent fault causing page to be mapped in
a second time on top of itself: mapcount 2 for a single pte).

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

vmalloc: use rcu list iterator to reduce vmap_area_lock contention

Richard Yao reported a month ago that his system have a trouble with
vmap_area_lock contention during performance analysis by /proc/meminfo.
Andrew asked why his analysis checks /proc/meminfo stressfully, but he
didn't answer it.

https://lkml.org/lkml/2014/4/10/416

Although I'm not sure that this is right usage or not, there is a solution
reducing vmap_area_lock contention with no side-effect.  That is just to
use rcu list iterator in get_vmalloc_info().

rcu can be used in this function because all RCU protocol is already
respected by writers, since Nick Piggin commit db64fe02258f1 ("mm: rewrite
vmap layer") back in linux-2.6.28

Specifically :
   insertions use list_add_rcu(),
   deletions use list_del_rcu() and kfree_rcu().

Note the rb tree is not used from rcu reader (it would not be safe),
only the vmap_area_list has full RCU protection.

Note that __purge_vmap_area_lazy() already uses this rcu protection.

        rcu_read_lock();
        list_for_each_entry_rcu(va, &vmap_area_list, list) {
                if (va->flags & VM_LAZY_FREE) {
                        if (va->va_start < *start)
                                *start = va->va_start;
                        if (va->va_end > *end)
                                *end = va->va_end;
                        nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
                        list_add_tail(&va->purge_list, &valist);
                        va->flags |= VM_LAZY_FREEING;
                        va->flags &= ~VM_LAZY_FREE;
                }
        }
        rcu_read_unlock();

Peter:

: While rcu list traversal over the vmap_area_list is safe, this may
: arrive at different results than the spinlocked version. The rcu list
: traversal version will not be a 'snapshot' of a single, valid instant
: of the entire vmap_area_list, but rather a potential amalgam of
: different list states.

Joonsoo:

: Yes, you are right, but I don't think that we should be strict here.
: Meminfo is already not a 'snapshot' at specific time.  While we try to get
: certain stats, the other stats can change.  And, although we may arrive at
: different results than the spinlocked version, the difference would not be
: large and would not make serious side-effect.

[edumazet@google.com: add more commit description]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Richard Yao <ryao@gentoo.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

include/linux/memblock.h: add __init to memblock_set_bottom_up()

memblock_set_bottom_up() is only called by __init
cmdline_parse_movable_node() and __init numa_init().

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

hwpoison: fix the handling path of the victimized page frame that belong to non-LRU

Until now, the kernel has the same policy to handle victimized page frames
that belong to kernel-space(reserved/slab-subsystem) or non-LRU(unknown
page state). In other word, the result of handling either of these
victimized page frames is (IGNORED | FAILED), and the return value of
memory_failure() is -EBUSY.

This patch is to avoid that memory_failure() returns very soon due to the
"true" value of (!PageLRU(p)), and it also ensures that action_result()
can report more precise information("reserved kernel", "kernel slab", and
"unknown page state") instead of "non LRU", especially for memory errors
which are detected by memory-scrubbing.

Signed-off-by: Chen Yucong <slaoub@gmail.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc.c: unexport alloc_pages_exact_nid()

It is only called by mm/page_cgroup.c whcih cannot be modular.

Reported-by: David Rientjes <rientjes@google.com>
Cc: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc.c: add __meminit to alloc_pages_exact_nid()

alloc_pages_exact_nid() is only called by __meminit alloc_page_cgroup()

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/memory_hotplug.c: add __meminit to grow_zone_span/grow_pgdat_span

grow_zone_span and grow_pgdat_span are only called by
__meminit __add_zone

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Toshi Kani <toshi.kani@hp.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/readahead.c: remove unused file_ra_state from count_history_pages

count_history_pages does only call page_cache_prev_hole in rcu_lock
context using address_space mapping. There's no need to have
file_ra_state here.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slub: reduce duplicate creation on the first object

When a kmem_cache is created with ctor, each object in the kmem_cache will
be initialized before use. In the slub implementation, the first object
will be initialized twice.

This patch avoids the duplication of initialization of the first object.

Fixes commit 7656c72b5a63: ("SLUB: add macros for scanning objects in a
slab").

Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: slub: SLUB_DEBUG=n: use the same alloc/free hooks as for SLUB_DEBUG=y

There are two versions of alloc/free hooks now - one for
CONFIG_SLUB_DEBUG=y and another one for CONFIG_SLUB_DEBUG=n.

I see no reason why calls to other debugging subsystems (LOCKDEP,
DEBUG_ATOMIC_SLEEP, KMEMCHECK and FAILSLAB) are hidden under SLUB_DEBUG.
All this features should work regardless of SLUB_DEBUG config, as all of
them already have own Kconfig options.

This also fixes failslab for CONFIG_SLUB_DEBUG=n configuration.  It simply
has not worked before because should_failslab() call was in a hook hidden
under "#ifdef CONFIG_SLUB_DEBUG #else".

Note: There is one concealed change in allocation path for SLUB_DEBUG=n
and all other debugging features disabled.  The might_sleep_if() call can
generate some code even if DEBUG_ATOMIC_SLEEP=n.  For PREEMPT_VOLUNTARY=y
might_sleep() inserts _cond_resched() call, but I think it should be ok.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm, slub: mark resiliency_test as init text

resiliency_test() is only called for bootstrap, so it may be moved to
init.text and freed after boot.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: slab.h: wrap the whole file with guarding macro

Guarding section:
#ifndef MM_SLAB_H
#define MM_SLAB_H
...
#endif
currently doesn't cover the whole mm/slab.h. It seems like it was
done unintentionally.

Wrap the whole file by moving closing #endif to the end of it.

Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slab-use-get_node-and-kmem_cache_node-functions-fix-2-fix

Cc: Christoph Lameter <cl@linux.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slab-use-get_node-and-kmem_cache_node-functions-fix-2

mm/slab.c: In function 'slab_set_debugobj_lock_classes':
mm/slab.c:524: error: 'h' undeclared (first use in this function)
mm/slab.c:524: error: (Each undeclared identifier is reported only once
mm/slab.c:524: error: for each function it appears in.)
mm/slab.c:524: warning: left-hand operand of comma expression has no effect
mm/slab.c: In function 'cpuup_prepare':
mm/slab.c:1308: warning: passing argument 2 of 'slab_set_debugobj_lock_classes_node' makes pointer from integer without a cast
mm/slab.c:513: note: expected 'struct kmem_cache_node *' but argument is of type 'int'

Cc: Christoph Lameter <cl@gentwo.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slab-use-get_node-and-kmem_cache_node-functions-fix

On Thu, 12 Jun 2014, Joonsoo Kim wrote:

> > @@ -3759,8 +3746,8 @@ fail:
> >   /* Cache is not active yet. Roll back what we did */
> >   node--;
> >   while (node >= 0) {
> > - if (cachep->node[node]) {
> > - n = cachep->node[node];
> > + if (get_node(cachep, node)) {
> > + n = get_node(cachep, node);
>
> Could you do this as following?
>
> n = get_node(cachep, node);
> if (n) {
>         ...
> }

Sure....

Subject: slab: Fixes to earlier patch

Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slab: use get_node() and kmem_cache_node() functions

Use the two functions to simplify the code avoiding numerous explicit
checks coded checking for a certain node to be online.

Get rid of various repeated calculations of kmem_cache_node structures.

Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slub-use-new-node-functions-fix

On Wed, 11 Jun 2014, David Rientjes wrote:

> > + for_each_kmem_cache_node(s, node, n) {
> >
> >   free_partial(s, n);
> >   if (n->nr_partial || slabs_node(s, node))
>
> Newline not removed?

Ok got through the file and removed all the lines after
for_each_kmem_cache_node.

>
> > @@ -3407,11 +3401,7 @@ int __kmem_cache_shrink(struct kmem_cach
> >   return -ENOMEM;
> >
> >   flush_all(s);
> > - for_each_node_state(node, N_NORMAL_MEMORY) {
> > - n = get_node(s, node);
> > -
> > - if (!n->nr_partial)
> > - continue;
> > + for_each_kmem_cache_node(s, node, n) {
> >
> >   for (i = 0; i < objects; i++)
> >   INIT_LIST_HEAD(slabs_by_inuse + i);
>
> Is there any reason not to keep the !n->nr_partial check to avoid taking
> n->list_lock unnecessarily?

No this was simply a mistake the check needs to be preserved.

Subject: slub: Fix up earlier patch

Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slub-use-new-node-functions-checkpatch-fixes

ERROR: space required before the open parenthesis '('
#189: FILE: mm/slub.c:4350:
+ for(node = 0; node < nr_node_ids; node++)

total: 1 errors, 0 warnings, 192 lines checked

./patches/slub-use-new-node-functions.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slub: use new node functions

Make use of the new node functions in mm/slab.h to reduce code size and
simplify.

Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

SLOB has no node specific management structures.

Do not provide the defintions for node management structures for SLOB.

Signed-off-by: Christoph Lameter <cl@linux.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slab common: add functions for kmem_cache_node access

The patchset provides two new functions in mm/slab.h and modifies SLAB and
SLUB to use these. The kmem_cache_node structure is shared between both
allocators and the use of common accessors will allow us to move more code
into slab_common.c in the future.

This patch (of 3):

These functions allow to eliminate repeatedly used code in both SLAB and
SLUB and also allow for the insertion of debugging code that may be needed
in the development process.

Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/slab.c: add __init to init_lock_keys

init_lock_keys is only called by __init kmem_cache_init_late

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

kernel/watchdog.c: convert printk/pr_warning to pr_foo()

Replace some obsolete functions.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

block: restore /proc/partitions to not display non-partitionable removable devices

We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before.

Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduces this change in behavior. It's not
clear to me from the commit's changelog if this change was intentional or
not. This comment still remains: /* Don't show non-partitionable
removeable devices or empty devices */ so I've decided to send a patch to
restore the behavior of not printing unpartitionable removable devices.

Signed-off-by: Josh Hunt <johunt@akamai.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

bio: modify __bio_add_page() to accept pages that don't start a new segment

The original behaviour is to refuse to add a new page if the maximum
number of segments has been reached, regardless of the fact the page we
are going to add can be merged into the last segment or not.

Unfortunately, when the system runs under heavy memory fragmentation
conditions, a driver may try to add multiple pages to the last segment.
The original code won't accept them and EBUSY will be reported to
userspace.

This patch modifies the function so it refuses to add a page only in case
the latter starts a new segment and the maximum number of segments has
already been reached.

The bug can be easily reproduced with the st driver:

1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16
2) modprobe st buffer_kbs=1024
3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10
dd: error writing `/dev/st0': Device or resource busy

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

MAINTAINERS: update IBM ServeRAID RAID info

- Invalid maintainer e-mail address:
Mail server reply:
Recipient address rejected: User unknown in virtual alias table
- Remove no longer working webpage URL
- Remove obsolete "Person" field
- Move status to "Orphan"
- Add Dave Jeffery and Jack Hammer to the CREDITS file

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
Reviewed-by: Jean Delvare <jdelvare@suse.de>
Cc: David Jeffery <dhjeffery@gmail.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Paul Bolle <pebolle@tiscali.nl>
Reviewed-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/ocfs2/slot_map.c: replace count*size kzalloc by kcalloc

kcalloc manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: call ocfs2_journal_access_di() before ocfs2_journal_dirty() in ocfs2_write_end_nolock()

After we call ocfs2_journal_access_di() in ocfs2_write_begin(),
jbd2_journal_restart() may also be called, in this function transaction
A's t_updates-- and obtains a new transaction B.  If
jbd2_journal_commit_transaction() is happened to commit transaction A,
when t_updates==0, it will continue to complete commit and unfile buffer.

So when jbd2_journal_dirty_metadata(), the handle is pointed a new
transaction B, and the buffer head's journal head is already freed,
jh->b_transaction == NULL, jh->b_next_transaction == NULL, it returns
EINVAL, So it triggers the BUG_ON(status).

thread 1:                             jbd2:
ocfs2_write_begin                     jbd2_journal_commit_transaction
ocfs2_write_begin_nolock
  ocfs2_start_trans
    jbd2__journal_start(t_updates+1,
                       transaction A)
    ocfs2_journal_access_di
    ocfs2_write_cluster_by_desc
      ocfs2_mark_extent_written
        ocfs2_change_extent_flag
          ocfs2_split_extent
            ocfs2_extend_rotate_transaction
              jbd2_journal_restart
              (t_updates-1,transaction B) t_updates==0
                                        __jbd2_journal_refile_buffer

ocfs2_write_end
ocfs2_write_end_nolock
    ocfs2_journal_dirty
        jbd2_journal_dirty_metadata(bug)
   ocfs2_commit_trans

In ext4, I found that: jbd2_journal_get_write_access() called by

ext4_write_end.
ext4_write_begin
    ext4_journal_start
        __ext4_journal_start_sb
            ext4_journal_check_start
            jbd2__journal_start

ext4_write_end
    ext4_mark_inode_dirty
        ext4_reserve_inode_write
            ext4_journal_get_write_access
                jbd2_journal_get_write_access
        ext4_mark_iloc_dirty
            ext4_do_update_inode
                ext4_handle_dirty_metadata
                    jbd2_journal_dirty_metadata

So I think we should put ocfs2_journal_access_di before
  ocfs2_journal_dirty in the ocfs2_write_end.  and it works well after my
  modification.

Signed-off-by: vicky <vicky.yangwenfang@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: quorum: add a log for node not fenced

For debug use, we can see from the log whether the fence decision is made
and why it is not fenced.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: o2net: set tcp user timeout to max value

When tcp retransmit timeout(15mins), the connection will be closed.
Pending messages may be lost during this time. So we set tcp user timeout
to override the retransmit timeout to the max value. This is OK for ocfs2
since we have disk heartbeat, if peer crash, the disk heartbeat will
timeout and it will be evicted, if disk heartbeat not timeout and
connection idle for a long time, then this means the cluster enters
split-brain state, since fence can't happen, we'd better keep the
connection and wait network recover.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: o2net: don't shutdown connection when idle timeout

This patch series is to fix a possible message lost bug in ocfs2 when
network go bad.  This bug will cause ocfs2 hung forever even network
become good again.

The messages may lost in this case.  After the tcp connection is
established between two nodes, an idle timer will be set to check its
state periodically, if no messages are received during this time, idle
timer will timeout, it will shutdown the connection and try to reconnect,
so pending messages in tcp queues will be lost.  This messages may be from
dlm.  Dlm may get hung in this case.  This may cause the whole ocfs2
cluster hung.

This is very possible to happen when network state goes bad.  Do the
reconnect is useless, it will fail if network state is still bad.  Just
waiting there for network recovering may be a good idea, it will not lost
messages and some node will be fenced until cluster goes into split-brain
state, for this case, Tcp user timeout is used to override the tcp
retransmit timeout.  It will timeout after 25 days, user should have
notice this through the provided log and fix the network, if they don't,
ocfs2 will fall back to original reconnect way.

This patch (of 3):

Some messages in the tcp queue maybe lost if we shutdown the connection
and reconnect when idle timeout.  If packets lost and reconnect success,
then the ocfs2 cluster maybe hung.

To fix this, we can leave the connection there and do the fence decision
when idle timeout, if network recover before fence dicision is made, the
connection survive without lost any messages.

This bug can be saw when network state go bad.  It may cause ocfs2 hung
forever if some packets lost.  With this fix, ocfs2 will recover from hung
if network becomes good again.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2-free-inode-when-i_count-becomes-zero-checkpatch-fixes

ERROR: trailing whitespace
#41: FILE: fs/ocfs2/inode.c:1197:
+^Ireturn 1; $

total: 1 errors, 0 warnings, 18 lines checked

NOTE: whitespace errors detected, you may wish to use scripts/cleanpatch or
scripts/cleanfile

./patches/ocfs2-free-inode-when-i_count-becomes-zero.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Xue jiufei <xuejiufei@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: free inode when i_count becomes zero

Disk inode deletion may be heavily delayed when one node unlink a file
after the same dentry is freed on another node(say N1) because of memory
shrink but inode is left in memory.  This inode can only be freed while N1
doing the orphan scan work.

However, N1 may skip orphan scan for several times because other nodes may
do the work earlier.  In our tests, it may take 1 hour on 4 nodes cluster
and this will cause bad user experience.  So we think the inode should be
freed when i_count becomes zero to avoid such circumstances.

Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: do not write error flag to user structure we cannot copy from/to

If we failed to copy from the structure, writing back the flags leaks 31
bits of kernel memory (the rest of the ir_flags field).

In any case, if we cannot copy from/to the structure, why should we expect
putting just the flags to work?

Also make sure ocfs2_info_handle_freeinode() returns the right error code
if the copy_to_user() fails.

Fixes: ddee5cdb70e6 ('Ocfs2: Add new OCFS2_IOC_INFO ioctl for ocfs2 v8.')
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: remove conversion of total_backoff in dlm_join_domain()

The unit of total_backoff is msecs not jiffies, so no need to do the
conversion. Otherwise, the join timeout is not 90 sec.

Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ocfs2: correctly check the return value of ocfs2_search_extent_list

ocfs2_search_extent_list may return -1, so we should check the return
value in ocfs2_split_and_insert, otherwise it may cause array index out of
bound.

And ocfs2_search_extent_list can only return value less than
el->l_next_free_rec, so check if it is equal or larger than
le16_to_cpu(el->l_next_free_rec) is meaningless.

Signed-off-by: Yingtai Xie <xieyingtai@huawei.com>
Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/ext4/fsync.c: generic_file_fsync call based on barrier flag

generic_file_fsync has been updated to issue a flush for older
filesystems.

This patch tests for barrier flag in ext4 mount flags and calls the right
function.

Lukas said:

: Note that the actual generic_file_fsync change fixes a real bug in ext4
: where we would _not_ send a flush on sync if we have file system
: without journal.
:
: Ted, it would be useful to mention that in the commit description
: along with the commit id:
:
: ac13a829f6adb674015ab399594c089990104af7 fs/libfs.c: add generic
: data flush to fsync

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

drivers/net/irda/donauboe.c: convert to module_pci_driver

Signed-off-by: Libo Chen <libo.chen@huawei.com>
Cc: Samuel Ortiz <samuel@sortiz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/squashfs/super.c: logging cleanup

- Convert printk to pr_foo()
- Add pr_fmt for future logging entries
- Coalesce formats

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/squashfs/file_direct.c: replace count*size kmalloc by kmalloc_array

kmalloc_array() manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

sh: remove CPU_SUBTYPE_SH7764

The symbol is an orphan, get rid of it.

Submitted by Richard a few months ago as "[PATCH 21/28] Remove
CPU_SUBTYPE_SH7764".

[pebolle@tiscali.nl: re-added dropped ||]
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

input-route-kbd-leds-through-the-generic-leds-layer-fix

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

input: route kbd LEDs through the generic LEDs layer

This permits to reassign keyboard LEDs to something else than keyboard
"leds" state, by adding keyboard led and modifier triggers connected to a
series of VT input LEDs, themselves connected to VT input triggers, which
per-input device LEDs use by default. Userland can thus easily change the
LED behavior of (a priori) all input devices, or of particular input
devices.

This also permits to fix #7063 from userland by using a modifier to
implement proper CapsLock behavior and have the keyboard caps lock led
show that modifier state.

[ebroder@mokafive.com: Rebased to 3.2-rc1 or so, cleaned up some includes, and fixed some constants]
[blogic@openwrt.org: CONFIG_INPUT_LEDS stubs should be static inline]
[akpm@linux-foundation.org: remove unneeded `extern', fix comment layout]
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Signed-off-by: Evan Broder <evan@ebroder.net>
Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
Tested-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Peter Korsgaard <jacmet@sunsite.dk>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Bryan Wu <cooloney@gmail.com>
Cc: Arnaud Patard <arnaud.patard@rtp-net.org>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Matt Sealey <matt@genesi-usa.com>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Niels de Vos <devos@fedoraproject.org>
Cc: Steev Klimaszewski <steev@genesi-usa.com>
Signed-off-by: John Crispin <blogic@openwrt.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

kernel-posix-timersc-code-clean-up-checkpatch-fixes

WARNING: space prohibited between function name and open parenthesis '('
#55: FILE: kernel/posix-timers.c:345:
+ sizeof (struct k_itimer), 0,

ERROR: do not use assignment in if condition
#70: FILE: kernel/posix-timers.c:504:
+ if ((event->sigev_notify & SIGEV_THREAD_ID) &&

total: 1 errors, 1 warnings, 192 lines checked

./patches/kernel-posix-timersc-code-clean-up.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Fabian Frederick <fabf@skynet.be>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

kernel/posix-timers.c: code clean-up

Fixing some checkpatch warnings:
-Convert printk to pr_foo()
-Remove spaces between function and (
-Split lines > 80 characters

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/cifs/smb2file.c: replace count*size kzalloc by kcalloc

kcalloc manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/cifs/file.c: replace count*size kzalloc by kcalloc

kcalloc manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/cifs: remove obsolete __constant

Replace all __constant_foo to foo() except in smb2status.h (1700 lines to
update).

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Steve French <sfrench@samba.org>
Cc: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

kernel/auditfilter.c: replace count*size kmalloc by kcalloc

kcalloc manages count*sizeof overflow.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Eric Paris <eparis@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

x86,mem-hotplug: modify PGD entry when removing memory

When hot-adding/removing memory, sync_global_pgds() is called for
synchronizing PGD to PGD entries of all processes MM.  But when
hot-removing memory, sync_global_pgds() does not work correctly.

At first, sync_global_pgds() checks whether target PGD is none or not.
And if PGD is none, the PGD is skipped.  But when hot-removing memory, PGD
may be none since PGD may be cleared by free_pud_table().  So when
sync_global_pgds() is called after hot-removing memory, sync_global_pgds()
should not skip PGD even if the PGD is none.  And sync_global_pgds() must
clear PGD entries of all processes MM.

Currently sync_global_pgds() does not clear PGD entries of all processes
MM when hot-removing memory.  So when hot adding memory which is same
memory range as removed memory after hot-removing memory, following call
traces are shown:

kernel BUG at arch/x86/mm/init_64.c:206!
...
[<ffffffff815e0c80>] kernel_physical_mapping_init+0x1b2/0x1d2
[<ffffffff815ced94>] init_memory_mapping+0x1d4/0x380
[<ffffffff8104aebd>] arch_add_memory+0x3d/0xd0
[<ffffffff815d03d9>] add_memory+0xb9/0x1b0
[<ffffffff81352415>] acpi_memory_device_add+0x1af/0x28e
[<ffffffff81325dc4>] acpi_bus_device_attach+0x8c/0xf0
[<ffffffff813413b9>] acpi_ns_walk_namespace+0xc8/0x17f
[<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
[<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
[<ffffffff813418ed>] acpi_walk_namespace+0x95/0xc5
[<ffffffff81326b4c>] acpi_bus_scan+0x9a/0xc2
[<ffffffff81326bff>] acpi_scan_bus_device_check+0x8b/0x12e
[<ffffffff81326cb5>] acpi_scan_device_check+0x13/0x15
[<ffffffff81320122>] acpi_os_execute_deferred+0x25/0x32
[<ffffffff8107e02b>] process_one_work+0x17b/0x460
[<ffffffff8107edfb>] worker_thread+0x11b/0x400
[<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
[<ffffffff81085aef>] kthread+0xcf/0xe0
[<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
[<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
[<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140

This patch clears PGD entries of all processes MM when sync_global_pgds()
is called after hot-removing memory

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

x86,mem-hotplug: pass sync_global_pgds() a correct argument in remove_pagetable()

When hot-adding memory after hot-removing memory, following call traces
are shown:

kernel BUG at arch/x86/mm/init_64.c:206!
...
[<ffffffff815e0c80>] kernel_physical_mapping_init+0x1b2/0x1d2
[<ffffffff815ced94>] init_memory_mapping+0x1d4/0x380
[<ffffffff8104aebd>] arch_add_memory+0x3d/0xd0
[<ffffffff815d03d9>] add_memory+0xb9/0x1b0
[<ffffffff81352415>] acpi_memory_device_add+0x1af/0x28e
[<ffffffff81325dc4>] acpi_bus_device_attach+0x8c/0xf0
[<ffffffff813413b9>] acpi_ns_walk_namespace+0xc8/0x17f
[<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
[<ffffffff81325d38>] ? acpi_bus_type_and_status+0xb7/0xb7
[<ffffffff813418ed>] acpi_walk_namespace+0x95/0xc5
[<ffffffff81326b4c>] acpi_bus_scan+0x9a/0xc2
[<ffffffff81326bff>] acpi_scan_bus_device_check+0x8b/0x12e
[<ffffffff81326cb5>] acpi_scan_device_check+0x13/0x15
[<ffffffff81320122>] acpi_os_execute_deferred+0x25/0x32
[<ffffffff8107e02b>] process_one_work+0x17b/0x460
[<ffffffff8107edfb>] worker_thread+0x11b/0x400
[<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
[<ffffffff81085aef>] kthread+0xcf/0xe0
[<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
[<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
[<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140

The patch-sets fix the issue.

This patch (of 2):

remove_pagetable() gets start argument and passes the argument to
sync_global_pgds(). In this case, the argument must not be modified. If
the argument is modified and passed to sync_global_pgds(),
sync_global_pgds() does not correctly synchronize PGD to PGD entries of
all processes MM since synchronized range of memory [start, end] is wrong.

Unfortunately the start argument is modified in remove_pagetable(). So
this patch fixes the issue.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

fs/seq_file: fallback to vmalloc allocation

There are a couple of seq_files which use the single_open() interface.
This interface requires that the whole output must fit into a single
buffer.

E.g. for /proc/stat allocation failures have been observed because an
order-4 memory allocation failed due to memory fragmentation. In such
situations reading /proc/stat is not possible anymore.

Therefore change the seq_file code to fallback to vmalloc allocations
which will usually result in a couple of order-0 allocations and hence
also work if memory is fragmented.

For reference a call trace where reading from /proc/stat failed:

[62129.701569] sadc: page allocation failure: order:4, mode:0x1040d0
[62129.701573] CPU: 1 PID: 192063 Comm: sadc Not tainted 3.10.0-123.el7.s390x #1
[...]
[62129.701586] Call Trace:
[62129.701588] ([<0000000000111fbe>] show_trace+0xe6/0x130)
[62129.701591] [<0000000000112074>] show_stack+0x6c/0xe8
[62129.701593] [<000000000020d356>] warn_alloc_failed+0xd6/0x138
[62129.701596] [<00000000002114d2>] __alloc_pages_nodemask+0x9da/0xb68
[62129.701598] [<000000000021168e>] __get_free_pages+0x2e/0x58
[62129.701599] [<000000000025a05c>] kmalloc_order_trace+0x44/0xc0
[62129.701602] [<00000000002f3ffa>] stat_open+0x5a/0xd8
[62129.701604] [<00000000002e9aaa>] proc_reg_open+0x8a/0x140
[62129.701606] [<0000000000273b64>] do_dentry_open+0x1bc/0x2c8
[62129.701608] [<000000000027411e>] finish_open+0x46/0x60
[62129.701610] [<000000000028675a>] do_last+0x382/0x10d0
[62129.701612] [<0000000000287570>] path_openat+0xc8/0x4f8
[62129.701614] [<0000000000288bde>] do_filp_open+0x46/0xa8
[62129.701616] [<000000000027541c>] do_sys_open+0x114/0x1f0
[62129.701618] [<00000000005b1c1c>] sysc_tracego+0x14/0x1a

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Tested-by: David Rientjes <rientjes@google.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

proc/stat: convert to single_open_size()

These two patches are supposed to "fix" failed order-4 memory allocations
which have been observed when reading /proc/stat.  The problem has been
observed on s390 as well as on x86.

To address the problem change the seq_file memory allocations to fallback
to use vmalloc, so that allocations also work if memory is fragmented.

This approach seems to be simpler and less intrusive than changing
/proc/stat to use an interator.  Also it "fixes" other users as well,
which use seq_file's single_open() interface.

This patch (of 2):

Use seq_file's single_open_size() to preallocate a buffer that is large
enough to hold the whole output, instead of open coding it.  Also
calculate the requested size using the number of online cpus instead of
possible cpus, since the size of the output only depends on the number of
online cpus.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Ian Kent <raven@themaw.net>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com>
Cc: Andrea Righi <andrea@betterlinux.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

tools: memory-hotplug fix unexpected operator error

on-off-test is a bash script and invoked from /bin/sh
This results in the following error:

./on-off-test.sh: 9: [: !=: unexpected operator

Chang Makefile to use bash instead.

Signed-off-by: Shuah Khan <shuah.kh@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

tools: cpu-hotplug fix unexpected operator error

on-off-test is a bash script and invoked from /bin/sh
This results in the following error:

./on-off-test.sh: 9: [: !=: unexpected operator

Chang Makefile to use bash instead.

Signed-off-by: Shuah Khan <shuah.kh@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

autofs4: fix false positive compile error

On strict build environments we can see:

fs/autofs4/inode.c: In function 'autofs4_fill_super':
fs/autofs4/inode.c:312: error: 'pgrp' may be used uninitialized in this
function
make[2]: *** [fs/autofs4/inode.o] Error 1
make[1]: *** [fs/autofs4] Error 2
make: *** [fs] Error 2
make: *** Waiting for unfinished jobs....

This is due to the use of pgrp_set being used to indicate pgrp has
has been set rather than initializing pgrp itself.

Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

slub: fix off by one in number of slab tests

min_partial means minimum number of slab cached in node partial list.  So,
if nr_partial is less than it, we keep newly empty slab on node partial
list rather than freeing it.  But if nr_partial is equal or greater than
it, it means that we have enough partial slabs so should free newly empty
slab.  Current implementation missed the equal case so if we set
min_partial is 0, then, at least one slab could be cached.  This is
critical problem to kmemcg destroying logic because it doesn't works
properly if some slabs is cached.  This patch fixes this problem.

Fixes 91cb69620284 ("slub: make dead memcg caches discard free slabs
immediately").

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: page_alloc: fix CMA area initialisation when pageblock > MAX_ORDER

With a kernel configured with ARM64_64K_PAGES && !TRANSPARENT_HUGEPAGE,
the following is triggered at early boot:

  SMP: Total of 8 processors activated.
  devtmpfs: initialized
  Unable to handle kernel NULL pointer dereference at virtual address 00000008
  pgd = fffffe0000050000
  [00000008] *pgd=00000043fba00003, *pmd=00000043fba00003, *pte=00e0000078010407
  Internal error: Oops: 96000006 [#1] SMP
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc864k+ #44
  task: fffffe03bc040000 ti: fffffe03bc080000 task.ti: fffffe03bc080000
  PC is at __list_add+0x10/0xd4
  LR is at free_one_page+0x270/0x638
  ...
  Call trace:
  [<fffffe00003ee970>] __list_add+0x10/0xd4
  [<fffffe000019c478>] free_one_page+0x26c/0x638
  [<fffffe000019c8c8>] __free_pages_ok.part.52+0x84/0xbc
  [<fffffe000019d5e8>] __free_pages+0x74/0xbc
  [<fffffe0000c01350>] init_cma_reserved_pageblock+0xe8/0x104
  [<fffffe0000c24de0>] cma_init_reserved_areas+0x190/0x1e4
  [<fffffe0000090418>] do_one_initcall+0xc4/0x154
  [<fffffe0000bf0a50>] kernel_init_freeable+0x204/0x2a8
  [<fffffe00007520a0>] kernel_init+0xc/0xd4

This happens because init_cma_reserved_pageblock() calls __free_one_page()
with pageblock_order as page order but it is bigger than MAX_ORDER.  This
in turn causes accesses past zone->free_list[].

Fix the problem by changing init_cma_reserved_pageblock() such that it
splits pageblock into individual MAX_ORDER pages if pageblock is bigger
than a MAX_ORDER page.

In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all
architectures expect for ia64, powerpc and tile at the moment, the
“pageblock_order > MAX_ORDER” condition will be optimised out since
both sides of the operator are constants.  In cases where pageblock size
is variable, the performance degradation should not be significant anyway
since init_cma_reserved_pageblock() is called only at boot time at most
MAX_CMA_AREAS times which by default is eight.

Signed-off-by: Michal Nazarewicz <mina86@mina86.com>
Reported-by: Mark Salter <msalter@redhat.com>
Tested-by: Mark Salter <msalter@redhat.com>
Tested-by: Christopher Covington <cov@codeaurora.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: <stable@vger.kernel.org> [3.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc

Pull powerpc fixes and cleanups from Ben Herrenschmidt:
"Here are a handful or two of powerpc fixes and simple/trivial
  cleanups.  A bunch of them fix ftrace with the new ABI v2 in Little
  Endian, the rest is a scattering of fairly simple things"

* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
  powerpc: Don't skip ePAPR spin-table CPUs
  powerpc/module: Fix TOC symbol CRC
  powerpc/powernv: Remove OPAL v1 takeover
  powerpc/kmemleak: Do not scan the DART table
  selftests/powerpc: Use the test harness for the TM DSCR test
  powerpc/cell: cbe_thermal.c: Cleaning up a variable is of the wrong type
  powerpc/kprobes: Fix jprobes on ABI v2 (LE)
  powerpc/ftrace: Use pr_fmt() to namespace error messages
  powerpc/ftrace: Fix nop of modules on 64bit LE (ABIv2)
  powerpc/ftrace: Fix inverted check of create_branch()
  powerpc/ftrace: Fix typo in mask of opcode
  powerpc: Add ppc_global_function_entry()
  powerpc/macintosh/smu.c: Fix closing brace followed by if
  powerpc: Remove __arch_swab*
  powerpc: Remove ancient DEBUG_SIG code
  powerpc/kerenl: Enable EEH for IO accessors

Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost

Pull vhost cleanups from Michael S Tsirkin:
"Two cleanup patches removing code duplication that got introduced by
  changes in rc1.  Not fixing crashes, but I'd rather not carry the
  duplicate code until the next merge window"

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  vhost-scsi: don't open-code kvfree
  vhost-net: don't open-code kvfree

Merge tag 'trace-fixes-v3.16-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing cleanups and fixes from Steven Rostedt:
"This includes three patches from Oleg Nesterov.  The first is a fix to
  a race condition that happens between enabling/disabling syscall
  tracepoints and new process creations (the check to go into the ptrace
  path for a process can be set when it shouldn't, or not set when it
  should).  Not a major bug but one that should be fixed and even
  applied to stable.

  The other two patches are cleanup/fixes that are not that critical,
  but for an -rc1 release would be nice to have.  They both deal with
  syscall tracepoints.

  It also includes a patch to introduce a new macro for the
  TRACE_EVENT() format called __field_struct().  Originally, __field()
  was used to record any variable into a trace event, but with the
  addition of setting the "is signed" attribute, the check causes
  anything but a primitive variable to fail to compile.  That is,
  structs and unions can't be used as they once were.  When the "is
  signed" check was introduce there were only primitive variables being
  recorded.  But that will change soon and it was reported that
  __field() causes build failures.

  To solve the __field() issue, __field_struct() is introduced to allow
  trace_events to be able to record complex types too"

* tag 'trace-fixes-v3.16-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Add __field_struct macro for TRACE_EVENT()
  tracing: syscall_regfunc() should not skip kernel threads
  tracing: Change syscall_*regfunc() to check PF_KTHREAD and use for_each_process_thread()
  tracing: Fix syscall_*regfunc() vs copy_process() race

powerpc: Don't skip ePAPR spin-table CPUs

Commit 59a53afe70fd530040bdc69581f03d880157f15a "powerpc: Don't setup
CPUs with bad status" broke ePAPR SMP booting. ePAPR says that CPUs
that aren't presently running shall have status of disabled, with
enable-method being used to determine whether the CPU can be enabled.

Fix by checking for spin-table, which is currently the only supported
enable-method.

Signed-off-by: Scott Wood <scottwood@freescale.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Emil Medve <Emilian.Medve@Freescale.com>
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/module: Fix TOC symbol CRC

The commit 71ec7c55ed91 introduced the magic symbol ".TOC." for ELFv2 ABI.
This symbol is built manually and has no CRC value computed. A zero value
is put in the CRC section to avoid modpost complaining about a missing CRC.
Unfortunately, this breaks the kernel module loading when the kernel is
relocated (kdump case for instance) because of the relocation applied to
the kcrctab values.

This patch compute a CRC value for the TOC symbol which will match the one
compute by the kernel when it is relocated - aka '0 - relocate_start' done in
maybe_relocated called by check_version (module.c).

Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Cc: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

powerpc/powernv: Remove OPAL v1 takeover

In commit 27f4488872d9 "Add OPAL takeover from PowerVM" we added support
for "takeover" on OPAL v1 machines.

This was a mode of operation where we would boot under pHyp, and query
for the presence of OPAL. If detected we would then do a special
sequence to take over the machine, and the kernel would end up running
in hypervisor mode.

OPAL v1 was never a supported product, and was never shipped outside
IBM. As far as we know no one is still using it.

Newer versions of OPAL do not use the takeover mechanism. Although the
query for OPAL should be harmless on machines with newer OPAL, we have
seen a machine where it causes a crash in Open Firmware.

The code in early_init_devtree() to copy boot_command_line into cmd_line
was added in commit 817c21ad9a1f "Get kernel command line accross OPAL
takeover", and AFAIK is only used by takeover, so should also be
removed.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>