Johannes Weiner [Thu, 26 Jun 2014 00:42:36 +0000 (10:42 +1000)]
mm: huge_memory: use GFP_TRANSHUGE when charging huge pages
Transparent huge page charges prefer falling back to regular pages rather
than spending a lot of time in direct reclaim.
Desired reclaim behavior is usually declared in the gfp mask, but THP
charges use GFP_KERNEL and then rely on the fact that OOM is disabled for
THP charges, and that OOM-disabled charges don't retry reclaim. Needless
to say, this is anything but obvious and quite error prone.
Convert THP charges to use GFP_TRANSHUGE instead, which implies
__GFP_NORETRY, to indicate the low-latency requirement.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:36 +0000 (10:42 +1000)]
mm: memcontrol: reclaim at least once for __GFP_NORETRY
Currently, __GFP_NORETRY tries charging once and gives up before even
trying to reclaim. Bring the behavior on par with the page allocator and
reclaim at least once before giving up.
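
The intended behavior can be modeled in user space. The following is only a sketch under stated assumptions: the flag value, the counter structure and all sketch_* names are hypothetical and are not the memcg implementation.

```c
#include <stdbool.h>

/* Hypothetical flag bit for illustration; not the kernel's real value. */
#define SKETCH_GFP_NORETRY 0x1u
#define SKETCH_MAX_RETRIES 5

struct sketch_counter {
    long usage;
    long limit;
    long reclaimable;    /* pages that reclaim could still free */
    int reclaim_passes;  /* how many reclaim passes were run */
};

static bool sketch_try_add(struct sketch_counter *c, long nr)
{
    if (c->usage + nr > c->limit)
        return false;
    c->usage += nr;
    return true;
}

static void sketch_reclaim(struct sketch_counter *c, long nr)
{
    long freed = nr < c->reclaimable ? nr : c->reclaimable;

    c->usage -= freed;
    c->reclaimable -= freed;
    c->reclaim_passes++;
}

/* Returns 0 on success, -1 if the charge fails. */
static int sketch_charge(struct sketch_counter *c, unsigned int gfp, long nr)
{
    int retries = SKETCH_MAX_RETRIES;

    while (!sketch_try_add(c, nr)) {
        if (retries-- == 0)
            return -1;
        sketch_reclaim(c, nr);
        /* NORETRY callers get one reclaim pass, then one last attempt. */
        if (gfp & SKETCH_GFP_NORETRY)
            return sketch_try_add(c, nr) ? 0 : -1;
    }
    return 0;
}
```

In this model a NORETRY charge against a full counter runs exactly one reclaim pass before it succeeds or gives up, while other charges keep retrying up to the assumed retry limit.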
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:36 +0000 (10:42 +1000)]
mm: memcontrol: rearrange charging fast path
The charging path currently starts out with OOM condition checks when OOM
is the rarest possible case.
Rearrange this code to run OOM/task dying checks only after trying the
percpu charge and the res_counter charge and bail out before entering
reclaim. Attempting a charge does not hurt an (oom-)killed task as much
as every charge attempt having to check OOM conditions. Also, only check
__GFP_NOFAIL when the charge would actually fail.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Thu, 26 Jun 2014 00:42:36 +0000 (10:42 +1000)]
mm: memcontrol: fold mem_cgroup_do_charge()
These patches rework memcg charge lifetime to integrate more naturally
with the lifetime of user pages. This drastically simplifies the code and
reduces charging and uncharging overhead. The most expensive part of
charging and uncharging is the page_cgroup bit spinlock, which is removed
entirely after this series.
Here are the top-10 profile entries of a stress test that reads a 128G
sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
executing in the root memcg). Before:
As you can see, the memcg footprint has shrunk quite a bit.
   text    data     bss     dec     hex filename
  37970    9892     400   48262    bc86 mm/memcontrol.o.old
  35239    9892     400   45531    b1db mm/memcontrol.o
This patch (of 13):
This function was split out because mem_cgroup_try_charge() got too big.
But having essentially one sequence of operations arbitrarily split in
half is not good for reworking the code. Fold it back in.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Waiman Long [Thu, 26 Jun 2014 00:42:35 +0000 (10:42 +1000)]
mm, thp: replace smp_mb after atomic_add by smp_mb__after_atomic
On some architectures, such as x86, atomic_add() is a full memory barrier.
In that case, an additional smp_mb() is just a waste of time. This patch
replaces that smp_mb() with smp_mb__after_atomic(), which avoids the
redundant memory barrier on those architectures.
With a 3.16-rc1 based kernel, this patch reduced the execution time of
breaking 1000 transparent huge pages from 38,245us to 30,964us, a
reduction of 19%, which is quite sizeable. It also reduces the %cpu time
of the __split_huge_page_refcount function in the perf profile from 2.18%
to 1.15%.
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Waiman Long [Thu, 26 Jun 2014 00:42:35 +0000 (10:42 +1000)]
mm, thp: move invariant bug check out of loop in __split_huge_page_map
In __split_huge_page_map(), the check for page_mapcount(page) is invariant
within the for loop. Because the macro is implemented using atomic_read(),
the compiler cannot optimize away the redundant check, which leads to
unnecessary reads of the page structure.
This patch moves the invariant bug check out of the loop so that it is
done only once. On a 3.16-rc1 based kernel, a microbenchmark that broke up
1000 transparent huge pages using munmap() took 38,245us with the patch
and 38,548us without it. The performance gain is about 1%.
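
As a rough user-space illustration of the hoisting, the sketch below counts how many atomic loads each loop shape performs. The structure, the `!= 1` condition and the subpage count are assumptions made for the example; C11 atomics stand in for atomic_read().

```c
#include <stdatomic.h>

#define SKETCH_NR_SUBPAGES 512  /* assumed: 2 MiB hugepage, 4 KiB base pages */

struct sketch_page {
    atomic_int mapcount;
};

/* Counts atomic loads so the two loop shapes can be compared. */
static int sketch_loads;

static int sketch_mapcount(struct sketch_page *page)
{
    sketch_loads++;
    return atomic_load_explicit(&page->mapcount, memory_order_relaxed);
}

/* Before: the invariant check is re-evaluated for every subpage. */
static int sketch_split_check_in_loop(struct sketch_page *page)
{
    int bad = 0;

    sketch_loads = 0;
    for (int i = 0; i < SKETCH_NR_SUBPAGES; i++) {
        if (sketch_mapcount(page) != 1)
            bad = 1;
        /* ... per-subpage work would go here ... */
    }
    return bad ? -1 : sketch_loads;
}

/* After: the check runs once, ahead of the loop. */
static int sketch_split_check_hoisted(struct sketch_page *page)
{
    sketch_loads = 0;
    if (sketch_mapcount(page) != 1)
        return -1;
    for (int i = 0; i < SKETCH_NR_SUBPAGES; i++) {
        /* ... per-subpage work would go here ... */
    }
    return sketch_loads;
}
```

Both functions return the number of loads they issued, so the first reports 512 and the second reports 1 for a page whose mapcount is 1.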
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Scott J Norton <scott.norton@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- add comment on page_size_order()
- use compound_order(compound_head(page)) instead of huge_page_size_order()
- use page_pgoff() in rmap_walk_file() too
- use page_size_order() in kill_proc()
- fix space indent
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- fix wrong shift direction
- introduce page_size_order() and huge_page_size_order()
- move the declaration of PageHuge() to include/linux/hugetlb_inline.h
to avoid macro definition.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Thu, 26 Jun 2014 00:42:34 +0000 (10:42 +1000)]
mm, hugetlbfs: fix rmapping for anonymous hugepages with page_pgoff()
page->index stores the pagecache index when the page is mapped into a file
mapping region, and the index is in units of the pagecache page size, so
it depends on the page size. Some users of reverse mapping assume that
page->index is in PAGE_CACHE_SHIFT units, so they don't work for anonymous
hugepages.
For example, consider a vma of 3 hugepages in which we mbind the 2nd
hugepage to migrate it to another node. The vma is then split and
migrate_page() is called for the 2nd hugepage (belonging to the middle
vma). During migration, rmap_walk_anon() tries to find the relevant vma to
which the target hugepage belongs, but here we miscalculate pgoff, so
anon_vma_interval_tree_foreach() grabs an invalid vma, which fires
VM_BUG_ON.
This patch introduces a new API, usable for both normal pages and
hugepages, that gets the offset in PAGE_SIZE units from page->index. Users
should clearly distinguish page_index for the pagecache index from
page_pgoff for the page offset.
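
The unit conversion described here can be sketched as a shift by the page's size order. This is only a model: the structure and the sketch_* names are hypothetical, and the order value of 9 assumes 2 MiB hugepages on 4 KiB base pages.

```c
/* Hypothetical stand-in for the fields that matter in this example. */
struct sketch_page {
    unsigned long index;  /* offset in units of this page's own size */
    unsigned int order;   /* 0 for a base page, 9 for an assumed 2 MiB hugepage */
};

/* Offset in base-page (PAGE_SIZE) units, whatever the page size is. */
static unsigned long sketch_page_pgoff(const struct sketch_page *page)
{
    return page->index << page->order;
}
```

For a base page the shift is a no-op. For the 2nd hugepage of a mapping (index 1) the result is 512, which is the value an interval tree keyed in PAGE_SIZE units expects.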
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:34 +0000 (10:42 +1000)]
mm, CMA: change cma_declare_contiguous() to obey coding convention
Conventionally, we put the output param at the end of the param list and
put 'base' ahead of 'size', but cma_declare_contiguous() doesn't follow
that convention, so change it.
Additionally, move the cma_areas reference code down to the position
where it is really needed.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:34 +0000 (10:42 +1000)]
PPC, KVM: fix build failure due to removed file
Commit ('PPC, KVM, CMA: use general CMA reserved area management
framework') removed book3s_hv_cma.c but did not remove its entry
from the Makefile. Fix it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: kbuild test robot <fengguang.wu@intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:33 +0000 (10:42 +1000)]
CMA: fix ARM build failure related to MAX_CMA_AREAS definition
If CMA is disabled, CONFIG_CMA_AREAS isn't defined, so a compile error
occurs. To fix it, define MAX_CMA_AREAS even when CONFIG_CMA_AREAS
isn't defined.
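
One possible shape for such a fallback is shown below. The concrete values, including the `1 +` term and the fallback of 0, are assumptions for illustration rather than a quote of the patch.

```c
/*
 * CONFIG_CMA_AREAS normally comes from Kconfig and is absent when CMA is
 * disabled. The values chosen here are assumptions for this sketch.
 */
#ifdef CONFIG_CMA_AREAS
#define MAX_CMA_AREAS (1 + CONFIG_CMA_AREAS)
#else
#define MAX_CMA_AREAS (0)
#endif
```

With this guard, code that only references MAX_CMA_AREAS still compiles when the config symbol does not exist.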
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:33 +0000 (10:42 +1000)]
CMA: generalize CMA reserved area management functionality
Currently, there are two users of CMA functionality: one is the DMA
subsystem and the other is KVM on powerpc. Each has its own code to
manage the CMA reserved area, even though the two look really similar. My
guess is that this is caused by different needs in bitmap management. The
KVM side wants to maintain a bitmap not per page but at a coarser
granularity; it ends up using a bitmap where one bit represents 64 pages.
When I implement CMA related patches, I have to change both places to
apply my change, which is painful. I want to change this situation and
reduce future code management overhead through this patch.
This change could also help developers who want to use CMA in their new
feature development, since they can use CMA easily without copying and
pasting this reserved area management code.
In previous patches, we prepared some features to generalize CMA
reserved area management, and now it's time to do it. This patch moves
the core functions to mm/cma.c and changes the DMA APIs to use these
functions. There is no functional change in the DMA APIs.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:32 +0000 (10:42 +1000)]
DMA, CMA: support arbitrary bitmap granularity
PPC KVM's CMA area management requires arbitrary bitmap granularity, since
it wants to reserve a very large memory region and manage it with a bitmap
in which one bit covers several pages, to reduce management overhead. So
support arbitrary bitmap granularity in preparation for the following
generalization.
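
A minimal sketch of what bitmap granularity means in arithmetic terms follows. The parameter name order_per_bit and both helper names are assumptions made for the example: each bit covers 2^order_per_bit pages.

```c
/* Number of bitmap bits needed to cover nr_pages pages. */
static unsigned long sketch_bitmap_bits(unsigned long nr_pages,
                                        unsigned int order_per_bit)
{
    return nr_pages >> order_per_bit;
}

/* Bit that covers a given page frame number inside the reserved area. */
static unsigned long sketch_pfn_to_bit(unsigned long base_pfn,
                                       unsigned long pfn,
                                       unsigned int order_per_bit)
{
    return (pfn - base_pfn) >> order_per_bit;
}
```

With order_per_bit set to 6, one bit stands for 64 pages, so an area of 2^20 pages needs 16384 bits instead of 1048576.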
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:32 +0000 (10:42 +1000)]
DMA, CMA: separate core CMA management codes from DMA APIs
To prepare for future generalization work on the CMA area management code,
we need to separate the core CMA management code from the DMA APIs. We
will extend these core functions to cover the requirements of PPC KVM's
CMA area management functionality in following patches. This separation
lets us avoid touching the DMA APIs while extending the core functions.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:32 +0000 (10:42 +1000)]
slab: set free_limit for dead caches to 0
We mustn't keep empty slabs on dead memcg caches, because otherwise they
will never be destroyed.
Currently, we check if the cache is dead in free_block and drop empty slab
if so irrespective of the node's free_limit. This can be pretty
expensive. So let's avoid this additional check by zeroing nodes'
free_limit for dead caches on kmem_cache_shrink. This way no additional
overhead is added to free hot path.
Note, since ->free_limit can be updated on cpu/memory hotplug, we must
handle it properly there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:31 +0000 (10:42 +1000)]
slab: do not keep free objects/slabs on dead memcg caches
Since a dead memcg cache is destroyed only after the last slab allocated
to it is freed, we must disable caching of free objects/slabs for such
caches, otherwise they will be hanging around forever.
For SLAB that means we must disable per cpu free object arrays and make
free_block always discard empty slabs irrespective of node's free_limit.
To disable per cpu arrays, we free them on kmem_cache_shrink (see
drain_cpu_caches -> do_drain) and make __cache_free fall back to
free_block if there is no per cpu array. Also, we have to disable
allocation of per cpu arrays on cpu hotplug for dead caches (see
cpuup_prepare, __do_tune_cpucache).
After we disabled free objects/slabs caching, there is no need to reap
those caches periodically. Moreover, it will only result in slowdown. So
we also make cache_reap skip them.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:31 +0000 (10:42 +1000)]
slub: kmem_cache_shrink: check if partial list is empty under list_lock
SLUB's implementation of kmem_cache_shrink skips nodes that have
nr_partial=0, because they surely don't have any empty slabs to free.
This check is done w/o holding any locks, therefore it can race with a
concurrent kfree adding an empty slab to a partial list. As a result, a
cache that has just been shrunk can still have empty slabs.
This is unacceptable for kmemcg, which needs to be sure that there will be
no empty slabs on dead memcg caches after kmem_cache_shrink was called,
because otherwise we may leak a dead cache.
Let's fix this race by checking if node partial list is empty under
node->list_lock. Since the nr_partial!=0 branch of kmem_cache_shrink does
nothing if the list is empty, we can simply remove the nr_partial=0 check.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:31 +0000 (10:42 +1000)]
slub: make dead memcg caches discard free slabs immediately
Since a dead memcg cache is destroyed only after the last slab allocated
to it is freed, we must disable caching of empty slabs for such caches,
otherwise they will be hanging around forever.
This patch makes SLUB discard dead memcg caches' slabs as soon as they
become empty. To achieve that, it disables per cpu partial lists for dead
caches (see put_cpu_partial) and forbids keeping empty slabs on per node
partial lists by setting cache's min_partial to 0 on kmem_cache_shrink,
which is always called on memcg offline (see memcg_unregister_all_caches).
Thanks to Joonsoo Kim.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:31 +0000 (10:42 +1000)]
memcg: wait for kfree's to finish before destroying cache
kmem_cache_free doesn't expect that the cache can be destroyed as soon as
the object is freed, e.g. SLUB's implementation may want to update cache
stats after putting the object to the free list.
Therefore we should wait for all kmem_cache_free's to finish before
proceeding to cache destruction. Since both SLAB and SLUB versions of
kmem_cache_free are non-preemptable, we wait for rcu-sched grace period to
elapse.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:31 +0000 (10:42 +1000)]
slub: make slab_free non-preemptable
Since per memcg cache destruction is scheduled when the last slab is
freed, to avoid use-after-free in kmem_cache_free we should either
rearrange code in kmem_cache_free so that it won't dereference the cache
ptr after freeing the object, or wait for all kmem_cache_free's to
complete before proceeding to cache destruction.
The former approach isn't a good option from the future development point
of view, because every modification to kmem_cache_free would then have to
be done with great care. Hence we should provide a method to wait for all
currently executing kmem_cache_free's to finish.
This patch makes SLUB's implementation of kmem_cache_free non-preemptable.
As a result, synchronize_sched() will work as a barrier against
kmem_cache_free's in flight, so that issuing it before cache destruction
will protect us against the use-after-free.
This won't affect performance of kmem_cache_free, because we already
disable preemption there, and this patch only moves preempt_enable to the
end of the function. Neither should it affect the system latency, because
kmem_cache_free is extremely short, even in its slow path.
SLAB's version of kmem_cache_free already proceeds with irqs disabled, so
we only add a comment explaining why it's necessary for kmemcg there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:30 +0000 (10:42 +1000)]
slub: don't fail kmem_cache_shrink if slab placement optimization fails
SLUB's kmem_cache_shrink not only removes empty slabs from the cache, but
also sorts slabs by the number of objects in-use to cope with
fragmentation. To achieve that, it tries to allocate a temporary array.
If it fails, it will abort the whole procedure.
This is unacceptable for kmemcg, where we want to be sure that all empty
slabs are removed from the cache on memcg offline, so let's just skip the
slab placement optimization step if the allocation fails, but still get
rid of empty slabs.
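
The control flow can be sketched as "always drop empty slabs, sort only if possible". The model below is hypothetical throughout: slabs are represented by their in-use counts, and the can_sort flag stands in for whether the temporary array could be allocated.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Orders in-use counts from fullest to emptiest. */
static int sketch_cmp_desc(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;

    return (x < y) - (x > y);
}

/*
 * inuse[i] is the number of live objects in slab i. Empty slabs are always
 * removed; the placement optimization (sorting) is skipped when can_sort is
 * false. Returns the number of slabs kept.
 */
static size_t sketch_shrink(int *inuse, size_t n, bool can_sort)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++)
        if (inuse[i] > 0)
            inuse[kept++] = inuse[i];

    if (can_sort)
        qsort(inuse, kept, sizeof(inuse[0]), sketch_cmp_desc);

    return kept;
}
```

Either way the caller ends up with no empty slabs; only the ordering of the remaining ones depends on whether the optimization step ran.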
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:30 +0000 (10:42 +1000)]
memcg: mark caches that belong to offline memcgs as dead
This will be used by the next patches.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:30 +0000 (10:42 +1000)]
memcg: destroy kmem caches when last slab is freed
When the memcg_cache_params->refcnt goes to 0, schedule the worker that
will unregister the cache. To prevent this from happening when the owner
memcg is alive, keep the refcnt incremented during memcg lifetime.
Note, this doesn't guarantee that the cache that belongs to a dead memcg
will go away as soon as the last object is freed, because SL[AU]B
implementation can cache empty slabs for performance reasons. Hence the
cache may be hanging around indefinitely after memcg offline. This is to
be resolved by the next patches.
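
The reference counting scheme can be modeled with C11 atomics. This is a sketch only: the structure, the sketch_* names and the destroy_scheduled flag (standing in for scheduling the worker) are assumptions, not the memcg code.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_cache {
    atomic_long refcnt;      /* slabs in use, plus one while the owner is online */
    bool destroy_scheduled;  /* stands in for scheduling the unregister worker */
};

static void sketch_cache_init(struct sketch_cache *c)
{
    atomic_init(&c->refcnt, 1);  /* reference held for the owner's lifetime */
    c->destroy_scheduled = false;
}

/* Called when a slab is allocated to the cache. */
static void sketch_charge_slab(struct sketch_cache *c)
{
    atomic_fetch_add(&c->refcnt, 1);
}

/* Drops one reference; the last put triggers destruction. */
static void sketch_put(struct sketch_cache *c)
{
    if (atomic_fetch_sub(&c->refcnt, 1) == 1)
        c->destroy_scheduled = true;
}
```

Freeing a slab and taking the owner offline both call sketch_put(). Because the owner's reference is only dropped at offline time, freeing the last slab of a live cache never schedules destruction.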
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:30 +0000 (10:42 +1000)]
memcg: cleanup memcg_cache_params refcnt usage
When a memcg is turned offline, some of its kmem caches can still have
active objects and therefore cannot be destroyed immediately. Currently,
we simply leak such caches along with the owner memcg, which is bad and
should be resolved.
It would be perfect if we could move all slab pages of such dead caches to
the root/parent cache on memcg offline. However, when I tried to
implement such re-parenting, Christoph pointed out to me that the
overhead of this would be unacceptable, at least for SLUB (see
https://lkml.org/lkml/2014/5/13/446)
The problem with re-parenting of individual slabs is that it requires
tracking of all slabs allocated to a cache, but SLUB doesn't track full
slabs if !debug. Changing this behavior would result in significant
performance degradation of regular alloc/free paths, because it would make
alloc/free take per node list locks more often.
After pondering about this problem for some time, I think we should return
to dead caches self-destruction, i.e. scheduling cache destruction work
when the last slab page is freed.
This is the behavior we had before commit 5bd93da9917f ("memcg, slab:
simplify synchronization scheme"). The reason why it was removed was that
it simply didn't work, because SL[AU]B are implemented in such a way that
they don't discard empty slabs immediately, but prefer keeping them cached
for indefinite time to speed up further allocations.
However, we can change this w/o noticeable performance impact for both
SLAB and SLUB by making them drop free slabs as soon as they become empty.
Since dead caches should never be allocated from, removing empty slabs
from them shouldn't result in noticeable performance degradation.
So, this patch set reintroduces dead cache self-destruction and adds some
tweaks to SL[AU]B to prevent dead caches from hanging around indefinitely.
It is organized as follows:
- patches 1-3 reintroduce dead memcg cache self-destruction;
- patch 4 makes SLUB's version of kmem_cache_shrink always drop empty
slabs, even if it fails to allocate a temporary array;
- patches 5 and 6 fix possible use-after-free connected with
asynchronous cache destruction;
- patches 7 and 8 disable caching of empty slabs for dead memcg caches
for SLUB and SLAB respectively.
Note, this doesn't resolve the problem of memcgs pinned by dead kmem
caches. I'm planning to solve this by re-parenting dead kmem caches to
the parent memcg.
This patch (of 8):
Currently, we count the number of pages allocated to a per memcg cache in
memcg_cache_params->nr_pages. We only use this counter to find out if the
cache is empty and can be destroyed. So let's rename it to refcnt and
make it count not pages, but slabs so that we can use atomic_inc/dec
instead of atomic_add/sub in memcg_charge/uncharge_slab.
Also, as the number of slabs theoretically can be greater than INT_MAX,
let's use atomic_long for the counter.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: page_alloc: simplify drain_zone_pages by using min()
Instead of open-coding the minimum of two values, just use the min macro.
That is what it is there for. While changing the function, also change the
type of the batch local variable to match the type of
per_cpu_pages::batch (which is int).
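
A small model of the simplification, assuming a plain (not type-checked) min macro and a hypothetical two-field structure in place of per_cpu_pages:

```c
/* Simplified macro for illustration; the kernel's min() also checks types. */
#define sketch_min(a, b) ((a) < (b) ? (a) : (b))

struct sketch_pcp {
    int count;  /* pages currently on the per-cpu list */
    int batch;  /* chunk size for draining */
};

/* Drain at most one batch, and never more than is actually on the list. */
static int sketch_to_drain(const struct sketch_pcp *pcp)
{
    return sketch_min(pcp->count, pcp->batch);
}
```

Keeping both operands as int avoids a signed/unsigned comparison inside the macro.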
Signed-off-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Thu, 26 Jun 2014 00:42:29 +0000 (10:42 +1000)]
mm/memory.c: use entry = ACCESS_ONCE(*pte) in handle_pte_fault()
Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or orig_pte
upon which all subsequent decisions and pte_same() tests will be made.
I have no evidence that its lack is responsible for the mm/filemap.c:202
BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by trinity,
and I am not optimistic that it will fix it. But I have found no other
explanation, and ACCESS_ONCE() here will surely not hurt.
If gcc does re-access the pte before passing it down, then that would be
disastrous for correct page fault handling, and certainly could explain
the page_mapped() BUGs seen (concurrent fault causing page to be mapped in
a second time on top of itself: mapcount 2 for a single pte).
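
The idea of taking one snapshot and basing every later decision on it can be sketched in user space. The macro below follows the classic volatile-cast shape and uses the GCC/Clang __typeof__ extension; the pte type, the present bit and the function are hypothetical.

```c
/* Forces exactly one load of x through a volatile-qualified access. */
#define SKETCH_ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

typedef unsigned long sketch_pte_t;

#define SKETCH_PTE_PRESENT 0x1UL  /* assumed bit layout for the example */

/*
 * Returns 1 if the snapshot says the pte is present, 0 otherwise. Every
 * decision below uses 'entry', never a second read of *pte.
 */
static int sketch_handle_pte_fault(sketch_pte_t *pte)
{
    sketch_pte_t entry = SKETCH_ACCESS_ONCE(*pte);

    if (!(entry & SKETCH_PTE_PRESENT))
        return 0;
    return 1;
}
```

Without the volatile access, a compiler is in principle free to reload *pte between uses, which is the situation the commit message worries about.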
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:29 +0000 (10:42 +1000)]
vmalloc: use rcu list iterator to reduce vmap_area_lock contention
Richard Yao reported a month ago that his system has trouble with
vmap_area_lock contention during performance analysis via /proc/meminfo.
Andrew asked why his analysis reads /proc/meminfo so heavily, but he
didn't answer.
https://lkml.org/lkml/2014/4/10/416
Although I'm not sure whether this is the right usage, there is a solution
that reduces vmap_area_lock contention with no side effect: just use the
rcu list iterator in get_vmalloc_info().
rcu can be used in this function because the RCU protocol has already been
respected by writers since Nick Piggin's commit db64fe02258f1 ("mm: rewrite
vmap layer") back in linux-2.6.28.
Specifically :
insertions use list_add_rcu(),
deletions use list_del_rcu() and kfree_rcu().
Note the rb tree is not used from rcu reader (it would not be safe),
only the vmap_area_list has full RCU protection.
Note that __purge_vmap_area_lazy() already uses this rcu protection.
: While rcu list traversal over the vmap_area_list is safe, this may
: arrive at different results than the spinlocked version. The rcu list
: traversal version will not be a 'snapshot' of a single, valid instant
: of the entire vmap_area_list, but rather a potential amalgam of
: different list states.
Joonsoo:
: Yes, you are right, but I don't think that we should be strict here.
: Meminfo is already not a 'snapshot' at specific time. While we try to get
: certain stats, the other stats can change. And, although we may arrive at
: different results than the spinlocked version, the difference would not be
: large and would not make serious side-effect.
[edumazet@google.com: add more commit description] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Richard Yao <ryao@gentoo.org> Acked-by: Eric Dumazet <edumazet@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chen Yucong [Thu, 26 Jun 2014 00:42:28 +0000 (10:42 +1000)]
hwpoison: fix the handling path of the victimized page frame that belong to non-LRU
Until now, the kernel has applied the same policy to victimized page
frames that belong to kernel space (reserved/slab subsystem) and to non-LRU
pages (unknown page state). In other words, the result of handling either
kind of victimized page frame is (IGNORED | FAILED), and the return value
of memory_failure() is -EBUSY.
This patch prevents memory_failure() from returning early just because
(!PageLRU(p)) is true, and it also ensures that action_result() can report
more precise information ("reserved kernel", "kernel slab", and "unknown
page state") instead of "non LRU", especially for memory errors that are
detected by memory scrubbing.
Wei Yang [Thu, 26 Jun 2014 00:42:27 +0000 (10:42 +1000)]
slub: reduce duplicate creation on the first object
When a kmem_cache is created with ctor, each object in the kmem_cache will
be initialized before use. In the slub implementation, the first object
will be initialized twice.
This patch avoids the duplication of initialization of the first object.
Fixes commit 7656c72b5a63: ("SLUB: add macros for scanning objects in a
slab").
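
How a "trailing pointer" loop can initialize the first object twice is easy to see in a small model. The loop shapes below are my simplified assumption of the pattern, not a copy of the SLUB source; objects are array slots and the constructor is a counter increment.

```c
#define SKETCH_OBJECTS 4

/*
 * Old shape: set up 'last', then advance. Because the iteration starts at
 * the first object and 'last' also starts there, slot 0 is set up twice.
 */
static void sketch_init_old(int ctor_calls[SKETCH_OBJECTS])
{
    int last = 0;

    for (int p = 0; p < SKETCH_OBJECTS; p++) {
        ctor_calls[last]++;
        last = p;
    }
    ctor_calls[last]++;
}

/* New shape: each object is set up exactly once. */
static void sketch_init_new(int ctor_calls[SKETCH_OBJECTS])
{
    for (int p = 0; p < SKETCH_OBJECTS; p++)
        ctor_calls[p]++;
}
```

Starting from zeroed counters, the old shape leaves slot 0 at 2 and the others at 1, while the new shape leaves every slot at 1.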
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Ryabinin [Thu, 26 Jun 2014 00:42:27 +0000 (10:42 +1000)]
mm: slub: SLUB_DEBUG=n: use the same alloc/free hooks as for SLUB_DEBUG=y
There are two versions of alloc/free hooks now - one for
CONFIG_SLUB_DEBUG=y and another one for CONFIG_SLUB_DEBUG=n.
I see no reason why calls to other debugging subsystems (LOCKDEP,
DEBUG_ATOMIC_SLEEP, KMEMCHECK and FAILSLAB) are hidden under SLUB_DEBUG.
All these features should work regardless of SLUB_DEBUG config, as all of
them already have their own Kconfig options.
This also fixes failslab for CONFIG_SLUB_DEBUG=n configuration. It simply
has not worked before because should_failslab() call was in a hook hidden
under "#ifdef CONFIG_SLUB_DEBUG #else".
Note: There is one concealed change in allocation path for SLUB_DEBUG=n
and all other debugging features disabled. The might_sleep_if() call can
generate some code even if DEBUG_ATOMIC_SLEEP=n. For PREEMPT_VOLUNTARY=y
might_sleep() inserts _cond_resched() call, but I think it should be ok.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Thu, 26 Jun 2014 00:42:27 +0000 (10:42 +1000)]
mm, slub: mark resiliency_test as init text
resiliency_test() is only called for bootstrap, so it may be moved to
init.text and freed after boot.
Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Ryabinin [Thu, 26 Jun 2014 00:42:27 +0000 (10:42 +1000)]
mm: slab.h: wrap the whole file with guarding macro
Guarding section:
#ifndef MM_SLAB_H
#define MM_SLAB_H
...
#endif
currently doesn't cover the whole mm/slab.h. It seems like it was
done unintentionally.
Wrap the whole file by moving closing #endif to the end of it.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
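As a rough illustration of why the closing #endif belongs at the very end of a header, the sketch below pastes the same hypothetical header twice into one translation unit, which is what a double #include amounts to. DEMO_SLAB_H, struct demo_cache and demo_object_size() are invented for the example and are not the contents of mm/slab.h.

```c
/* ---- first inclusion of the hypothetical header ---- */
#ifndef DEMO_SLAB_H
#define DEMO_SLAB_H

struct demo_cache {
	unsigned int object_size;
};

/* Sits below the struct; with a too-early #endif this would be unguarded. */
static inline unsigned int demo_object_size(const struct demo_cache *c)
{
	return c->object_size;
}

#endif /* DEMO_SLAB_H: closes at the very end of the header */

/* ---- second inclusion: the guard is already defined, so all of it is skipped ---- */
#ifndef DEMO_SLAB_H
#define DEMO_SLAB_H

struct demo_cache {
	unsigned int object_size;
};

static inline unsigned int demo_object_size(const struct demo_cache *c)
{
	return c->object_size;
}

#endif /* DEMO_SLAB_H */
```

If the first #endif were moved above demo_object_size(), the second copy would still be skipped but the function in the first copy would sit outside the guard, so a real double inclusion would define it twice.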
mm/slab.c: In function 'slab_set_debugobj_lock_classes':
mm/slab.c:524: error: 'h' undeclared (first use in this function)
mm/slab.c:524: error: (Each undeclared identifier is reported only once
mm/slab.c:524: error: for each function it appears in.)
mm/slab.c:524: warning: left-hand operand of comma expression has no effect
mm/slab.c: In function 'cpuup_prepare':
mm/slab.c:1308: warning: passing argument 2 of 'slab_set_debugobj_lock_classes_node' makes pointer from integer without a cast
mm/slab.c:513: note: expected 'struct kmem_cache_node *' but argument is of type 'int'
Cc: Christoph Lameter <cl@gentwo.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > @@ -3759,8 +3746,8 @@ fail:
> > /* Cache is not active yet. Roll back what we did */
> > node--;
> > while (node >= 0) {
> > - if (cachep->node[node]) {
> > - n = cachep->node[node];
> > + if (get_node(cachep, node)) {
> > + n = get_node(cachep, node);
>
> Could you do this as following?
>
> n = get_node(cachep, node);
> if (n) {
> ...
> }
Sure....
Subject: slab: Fixes to earlier patch
Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
slab: use get_node() and kmem_cache_node() functions
Use the two functions to simplify the code, avoiding numerous open-coded
checks for whether a certain node is online.
Get rid of various repeated calculations of kmem_cache_node structures.
Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ok got through the file and removed all the lines after
for_each_kmem_cache_node.
>
> > @@ -3407,11 +3401,7 @@ int __kmem_cache_shrink(struct kmem_cach
> > return -ENOMEM;
> >
> > flush_all(s);
> > - for_each_node_state(node, N_NORMAL_MEMORY) {
> > - n = get_node(s, node);
> > -
> > - if (!n->nr_partial)
> > - continue;
> > + for_each_kmem_cache_node(s, node, n) {
> >
> > for (i = 0; i < objects; i++)
> > INIT_LIST_HEAD(slabs_by_inuse + i);
>
> Is there any reason not to keep the !n->nr_partial check to avoid taking
> n->list_lock unnecessarily?
No, this was simply a mistake. The check needs to be preserved.
Subject: slub: Fix up earlier patch
Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Make use of the new node functions in mm/slab.h to reduce code size and
simplify.
Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
slab common: add functions for kmem_cache_node access
The patchset provides two new functions in mm/slab.h and modifies SLAB and
SLUB to use these. The kmem_cache_node structure is shared between both
allocators and the use of common accessors will allow us to move more code
into slab_common.c in the future.
This patch (of 3):
These functions eliminate code that is repeated in both SLAB and SLUB and
also allow for the insertion of debugging code that may be needed in the
development process.
Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
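A minimal userspace sketch of the accessor idea, assuming a fixed node count and simplified structures. All demo_* names and DEMO_MAX_NODES are hypothetical stand-ins, not the definitions added to mm/slab.h.

```c
#include <stddef.h>

#define DEMO_MAX_NODES 4	/* hypothetical node count for this sketch */

/* Simplified stand-ins for the kernel structures; field names are illustrative. */
struct demo_cache_node {
	unsigned long nr_partial;
};

struct demo_cache {
	/* NULL where a node has no per-node structure */
	struct demo_cache_node *node[DEMO_MAX_NODES];
};

/* Single accessor instead of open-coded cache->node[nid] lookups. */
static inline struct demo_cache_node *demo_get_node(struct demo_cache *s, int nid)
{
	return s->node[nid];
}

/* Visit every node that has a per-node structure, skipping NULL entries. */
#define demo_for_each_cache_node(s, nid, n)				\
	for ((nid) = 0; (nid) < DEMO_MAX_NODES; (nid)++)		\
		if (((n) = demo_get_node((s), (nid))) != NULL)

/* Example user: sum the partial-slab counts over all populated nodes. */
static unsigned long demo_count_partial(struct demo_cache *s)
{
	struct demo_cache_node *n;
	unsigned long total = 0;
	int nid;

	demo_for_each_cache_node(s, nid, n)
		total += n->nr_partial;

	return total;
}
```

Callers no longer repeat the lookup and the NULL check at every loop, and a debugging check could later be added in demo_get_node() alone.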
Fabian Frederick [Thu, 26 Jun 2014 00:42:25 +0000 (10:42 +1000)]
mm/slab.c: add __init to init_lock_keys
init_lock_keys is only called by __init kmem_cache_init_late
Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Josh Hunt [Thu, 26 Jun 2014 00:42:24 +0000 (10:42 +1000)]
block: restore /proc/partitions to not display non-partitionable removable devices
We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before.
Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduces this change in behavior. It's not
clear to me from the commit's changelog if this change was intentional or
not. This comment still remains: /* Don't show non-partitionable
removeable devices or empty devices */ so I've decided to send a patch to
restore the behavior of not printing unpartitionable removable devices.
Signed-off-by: Josh Hunt <johunt@akamai.com> Cc: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
bio: modify __bio_add_page() to accept pages that don't start a new segment
The original behaviour is to refuse to add a new page if the maximum
number of segments has been reached, regardless of whether the page we
are going to add can be merged into the last segment.
Unfortunately, when the system runs under heavy memory fragmentation
conditions, a driver may try to add multiple pages to the last segment.
The original code won't accept them and EBUSY will be reported to
userspace.
This patch modifies the function so that it refuses to add a page only if
the page starts a new segment and the maximum number of segments has
already been reached.
The bug can be easily reproduced with the st driver:
1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16
2) modprobe st buffer_kbs=1024
3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10
dd: error writing `/dev/st0': Device or resource busy
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
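The changed acceptance rule can be sketched in userspace as follows. This is a toy model, not __bio_add_page(): struct demo_bio, DEMO_MAX_SEGMENTS and the "starts where the last segment ends" merge test are assumptions made for illustration, whereas the real code consults the queue limits.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_SEGMENTS 2	/* hypothetical hardware limit */

struct demo_bio {
	size_t nr_segments;	/* segments used so far */
	size_t last_end;	/* address one past the end of the last segment */
};

/* A page continues the last segment if it starts exactly where that segment ends. */
static bool demo_page_merges(const struct demo_bio *bio, size_t start)
{
	return bio->nr_segments > 0 && start == bio->last_end;
}

/* Returns true if the page was added, false if it was refused. */
static bool demo_bio_add_page(struct demo_bio *bio, size_t start, size_t len)
{
	bool merges = demo_page_merges(bio, start);

	/* Refuse only when the page needs a new segment and none is left. */
	if (!merges && bio->nr_segments >= DEMO_MAX_SEGMENTS)
		return false;

	if (!merges)
		bio->nr_segments++;
	bio->last_end = start + len;
	return true;
}
```

With the limit already reached, a page that extends the last segment is still accepted, while a page that would open a third segment is refused.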
- Invalid maintainer e-mail address:
Mail server reply:
Recipient address rejected: User unknown in virtual alias table
- Remove no longer working webpage URL
- Remove obsolete "Person" field
- Move status to "Orphan"
- Add Dave Jeffery and Jack Hammer to the CREDITS file
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com> Reviewed-by: Jean Delvare <jdelvare@suse.de> Cc: David Jeffery <dhjeffery@gmail.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Paul Bolle <pebolle@tiscali.nl> Reviewed-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fabian Frederick [Thu, 26 Jun 2014 00:42:23 +0000 (10:42 +1000)]
fs/ocfs2/slot_map.c: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.
Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
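For readers unfamiliar with the overflow concern, here is a userspace sketch of what an overflow-checked, zeroing array allocation looks like. demo_zalloc_array() is a hypothetical analogue written with malloc(), not the kernel's kcalloc() implementation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace analogue of an overflow-checked, zeroing array allocator. */
static void *demo_zalloc_array(size_t n, size_t size)
{
	void *p;

	/* n * size would wrap around SIZE_MAX: fail instead of under-allocating. */
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;

	p = malloc(n * size);
	if (p)
		memset(p, 0, n * size);
	return p;
}
```

An open-coded count * size passed to a plain zeroing allocator would silently wrap and return a buffer that is too small. The checked form fails the allocation instead.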
yangwenfang [Thu, 26 Jun 2014 00:42:23 +0000 (10:42 +1000)]
ocfs2: call ocfs2_journal_access_di() before ocfs2_journal_dirty() in ocfs2_write_end_nolock()
After we call ocfs2_journal_access_di() in ocfs2_write_begin(),
jbd2_journal_restart() may also be called. That function decrements
transaction A's t_updates and obtains a new transaction B. If
jbd2_journal_commit_transaction() happens to be committing transaction A,
then once t_updates == 0 it will go on to complete the commit and unfile
the buffer.
So by the time jbd2_journal_dirty_metadata() is called, the handle points
to the new transaction B and the buffer head's journal head has already
been freed (jh->b_transaction == NULL, jh->b_next_transaction == NULL).
It returns EINVAL, which triggers the BUG_ON(status).
So I think we should call ocfs2_journal_access_di() before
ocfs2_journal_dirty() in ocfs2_write_end(). It works well after my
modification.
Signed-off-by: vicky <vicky.yangwenfang@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Thu, 26 Jun 2014 00:42:23 +0000 (10:42 +1000)]
ocfs2: quorum: add a log for node not fenced
For debug use, we can see from the log whether the fence decision is made
and why it is not fenced.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Thu, 26 Jun 2014 00:42:23 +0000 (10:42 +1000)]
ocfs2: o2net: set tcp user timeout to max value
When the tcp retransmit timeout (15 minutes) expires, the connection is
closed and pending messages may be lost. So we set the tcp user timeout
to the max value to override the retransmit timeout. This is OK for ocfs2
since we have the disk heartbeat: if the peer crashes, its disk heartbeat
will time out and it will be evicted. If the disk heartbeat does not time
out and the connection stays idle for a long time, the cluster has
entered a split-brain state. Since fencing can't happen in that case,
we'd better keep the connection and wait for the network to recover.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Thu, 26 Jun 2014 00:42:23 +0000 (10:42 +1000)]
ocfs2: o2net: don't shutdown connection when idle timeout
This patch series fixes a possible message loss bug in ocfs2 when the
network goes bad. This bug can cause ocfs2 to hang forever even after the
network becomes good again.
Messages may be lost in the following case. After the tcp connection is
established between two nodes, an idle timer is set to check its state
periodically. If no messages are received during this time, the idle
timer times out, shuts down the connection and tries to reconnect, so
pending messages in the tcp queues are lost. These messages may be from
dlm, so dlm may hang in this case, which may hang the whole ocfs2
cluster.
This is very likely to happen when the network goes bad. Reconnecting is
useless, since it will fail while the network is still bad. Just waiting
for the network to recover may be a good idea: no messages are lost, and
some node will be fenced unless the cluster goes into a split-brain
state. For that case, the tcp user timeout is used to override the tcp
retransmit timeout. It will time out after 25 days; the user should have
noticed the problem through the provided log and fixed the network by
then. If they don't, ocfs2 falls back to the original reconnect behavior.
This patch (of 3):
Some messages in the tcp queue may be lost if we shut down the connection
and reconnect on idle timeout. If packets are lost and the reconnect
succeeds, the ocfs2 cluster may hang.
To fix this, we can leave the connection in place and make the fence
decision on idle timeout. If the network recovers before the fence
decision is made, the connection survives without losing any messages.
This bug can be seen when the network goes bad. It may cause ocfs2 to
hang forever if some packets are lost. With this fix, ocfs2 will recover
from the hang if the network becomes good again.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xue jiufei [Thu, 26 Jun 2014 00:42:22 +0000 (10:42 +1000)]
ocfs2: free inode when i_count becomes zero
Disk inode deletion may be heavily delayed when one node unlinks a file
after the same dentry has been freed on another node (say N1) because of
memory shrinking, while the inode is left in memory. This inode can only
be freed when N1 does the orphan scan work.
However, N1 may skip the orphan scan several times because other nodes
may do the work earlier. In our tests, it may take 1 hour on a 4-node
cluster, which causes a bad user experience. So we think the inode should
be freed when i_count becomes zero to avoid such circumstances.
Signed-off-by: joyce.xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ben Hutchings [Thu, 26 Jun 2014 00:42:22 +0000 (10:42 +1000)]
ocfs2: do not write error flag to user structure we cannot copy from/to
If we failed to copy from the structure, writing back the flags leaks 31
bits of kernel memory (the rest of the ir_flags field).
In any case, if we cannot copy from/to the structure, why should we expect
putting just the flags to work?
Also make sure ocfs2_info_handle_freeinode() returns the right error code
if the copy_to_user() fails.
Fixes: ddee5cdb70e6 ('Ocfs2: Add new OCFS2_IOC_INFO ioctl for ocfs2 v8.') Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yingtai Xie [Thu, 26 Jun 2014 00:42:22 +0000 (10:42 +1000)]
ocfs2: correctly check the return value of ocfs2_search_extent_list
ocfs2_search_extent_list may return -1, so we should check the return
value in ocfs2_split_and_insert, otherwise it may cause an out-of-bounds
array index.
And ocfs2_search_extent_list can only return a value less than
el->l_next_free_rec, so checking whether it is equal to or larger than
le16_to_cpu(el->l_next_free_rec) is meaningless.
Signed-off-by: Yingtai Xie <xieyingtai@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
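The shape of the check can be sketched with a toy extent list. The demo_* structures and functions below are hypothetical simplifications and do not reproduce the ocfs2 on-disk layout or the error handling of ocfs2_split_and_insert.

```c
#include <stdint.h>

#define DEMO_MAX_RECS 8

struct demo_extent_rec {
	uint32_t cpos;		/* first cluster covered */
	uint32_t clusters;	/* number of clusters covered */
};

struct demo_extent_list {
	uint16_t next_free_rec;	/* number of records in use */
	struct demo_extent_rec recs[DEMO_MAX_RECS];
};

/* Returns the index of the record covering cpos, or -1 if none does. */
static int demo_search_extent_list(const struct demo_extent_list *el, uint32_t cpos)
{
	int i;

	for (i = 0; i < el->next_free_rec; i++) {
		const struct demo_extent_rec *rec = &el->recs[i];

		if (cpos >= rec->cpos && cpos - rec->cpos < rec->clusters)
			return i;
	}
	return -1;
}

/* Caller: reject -1 before indexing. Returns 0 on success, -1 on a miss. */
static int demo_extent_len(const struct demo_extent_list *el, uint32_t cpos,
			   uint32_t *len)
{
	int index = demo_search_extent_list(el, cpos);

	if (index < 0)
		return -1;	/* a miss must never be used as an array index */

	*len = el->recs[index].clusters;
	return 0;
}
```

Because the search loop stops at next_free_rec, a successful result is always below that bound, so the only value the caller has to guard against is -1.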
Fabian Frederick [Thu, 26 Jun 2014 00:42:21 +0000 (10:42 +1000)]
fs/ext4/fsync.c: generic_file_fsync call based on barrier flag
generic_file_fsync has been updated to issue a flush for older
filesystems.
This patch tests for barrier flag in ext4 mount flags and calls the right
function.
Lukas said:
: Note that the actual generic_file_fsync change fixes a real bug in ext4
: where we would _not_ send a flush on sync if we have file system
: without journal.
:
: Ted, it would be useful to mention that in the commit description
: along with the commit id:
:
: ac13a829f6adb674015ab399594c089990104af7 fs/libfs.c: add generic
: data flush to fsync
Signed-off-by: Fabian Frederick <fabf@skynet.be> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Samuel Thibault [Thu, 26 Jun 2014 00:42:20 +0000 (10:42 +1000)]
input: route kbd LEDs through the generic LEDs layer
This makes it possible to reassign keyboard LEDs to something other than
the keyboard "leds" state, by adding keyboard led and modifier triggers
connected to a series of VT input LEDs, themselves connected to VT input
triggers, which per-input device LEDs use by default. Userland can thus
easily change the LED behavior of (a priori) all input devices, or of
particular input devices.
This also makes it possible to fix #7063 from userland by using a
modifier to implement proper CapsLock behavior and have the keyboard caps
lock led show that modifier state.
[ebroder@mokafive.com: Rebased to 3.2-rc1 or so, cleaned up some includes, and fixed some constants]
[blogic@openwrt.org: CONFIG_INPUT_LEDS stubs should be static inline]
[akpm@linux-foundation.org: remove unneeded `extern', fix comment layout] Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Signed-off-by: Evan Broder <evan@ebroder.net> Reviewed-by: David Herrmann <dh.herrmann@gmail.com> Tested-by: Pavel Machek <pavel@ucw.cz> Acked-by: Peter Korsgaard <jacmet@sunsite.dk> Cc: Pavel Machek <pavel@ucw.cz> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Bryan Wu <cooloney@gmail.com> Cc: Arnaud Patard <arnaud.patard@rtp-net.org> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Matt Sealey <matt@genesi-usa.com> Cc: Rob Clark <robdclark@gmail.com> Cc: Niels de Vos <devos@fedoraproject.org> Cc: Steev Klimaszewski <steev@genesi-usa.com> Signed-off-by: John Crispin <blogic@openwrt.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fabian Frederick [Thu, 26 Jun 2014 00:42:19 +0000 (10:42 +1000)]
fs/cifs: remove obsolete __constant
Replace all __constant_foo() with foo() except in smb2status.h (1700
lines to update).
Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Steve French <sfrench@samba.org> Cc: Jeff Layton <jlayton@poochiereds.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
x86,mem-hotplug: modify PGD entry when removing memory
When hot-adding/removing memory, sync_global_pgds() is called to
synchronize the PGD with the PGD entries of every process's MM. But when
hot-removing memory, sync_global_pgds() does not work correctly.
First, sync_global_pgds() checks whether the target PGD is none. If the
PGD is none, it is skipped. But when hot-removing memory, the PGD may be
none since it may have been cleared by free_pud_table(). So when
sync_global_pgds() is called after hot-removing memory, it should not
skip the PGD even if it is none, and it must clear the PGD entries of
every process's MM.
Currently sync_global_pgds() does not clear the PGD entries of every
process's MM when hot-removing memory. So when hot-adding memory in the
same range as previously hot-removed memory, the following call traces
are shown:
remove_pagetable() gets a start argument and passes it to
sync_global_pgds(). In this case, the argument must not be modified. If
the argument is modified before being passed, sync_global_pgds() does not
correctly synchronize the PGD with the PGD entries of every process's MM
since the synchronized memory range [start, end] is wrong.
Unfortunately the start argument is modified in remove_pagetable(). So
this patch fixes the issue.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Acked-by: Toshi Kani <toshi.kani@hp.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Heiko Carstens [Thu, 26 Jun 2014 00:42:18 +0000 (10:42 +1000)]
fs/seq_file: fallback to vmalloc allocation
There are a couple of seq_files which use the single_open() interface.
This interface requires that the whole output must fit into a single
buffer.
E.g. for /proc/stat allocation failures have been observed because an
order-4 memory allocation failed due to memory fragmentation. In such
situations reading /proc/stat is not possible anymore.
Therefore change the seq_file code to fallback to vmalloc allocations
which will usually result in a couple of order-0 allocations and hence
also work if memory is fragmented.
For reference a call trace where reading from /proc/stat failed:
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Tested-by: David Rientjes <rientjes@google.com> Cc: Ian Kent <raven@themaw.net> Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com> Cc: Andrea Righi <andrea@betterlinux.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Stefan Bader <stefan.bader@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Heiko Carstens [Thu, 26 Jun 2014 00:42:18 +0000 (10:42 +1000)]
proc/stat: convert to single_open_size()
These two patches are supposed to "fix" failed order-4 memory allocations
which have been observed when reading /proc/stat. The problem has been
observed on s390 as well as on x86.
To address the problem change the seq_file memory allocations to fallback
to use vmalloc, so that allocations also work if memory is fragmented.
This approach seems to be simpler and less intrusive than changing
/proc/stat to use an iterator. Also it "fixes" other users as well,
which use seq_file's single_open() interface.
This patch (of 2):
Use seq_file's single_open_size() to preallocate a buffer that is large
enough to hold the whole output, instead of open coding it. Also
calculate the requested size using the number of online cpus instead of
possible cpus, since the size of the output only depends on the number of
online cpus.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Ian Kent <raven@themaw.net> Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com> Cc: Andrea Righi <andrea@betterlinux.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Stefan Bader <stefan.bader@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
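A small sketch of the sizing idea: derive the preallocated buffer size from the number of online cpus. The function name and both constants are hypothetical and chosen for illustration only; they are not taken from fs/proc/stat.c.

```c
#include <stddef.h>

#define DEMO_BASE_BYTES    1024	/* hypothetical fixed part of the output */
#define DEMO_PER_CPU_BYTES  128	/* hypothetical per-cpu line budget */

/* Size the single output buffer from the cpus that will actually be printed. */
static size_t demo_stat_buffer_size(unsigned int online_cpus)
{
	return DEMO_BASE_BYTES + DEMO_PER_CPU_BYTES * (size_t)online_cpus;
}
```

On a machine with many possible but few online cpus, sizing by the online count asks for a much smaller buffer, which makes a high-order allocation less likely in the first place.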
Ian Kent [Thu, 26 Jun 2014 00:42:17 +0000 (10:42 +1000)]
autofs4: fix false positive compile error
On strict build environments we can see:
fs/autofs4/inode.c: In function 'autofs4_fill_super':
fs/autofs4/inode.c:312: error: 'pgrp' may be used uninitialized in this
function
make[2]: *** [fs/autofs4/inode.o] Error 1
make[1]: *** [fs/autofs4] Error 2
make: *** [fs] Error 2
make: *** Waiting for unfinished jobs....
This is due to pgrp_set being used to indicate that pgrp has been set
rather than initializing pgrp itself.
Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
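The pattern behind this kind of warning can be sketched as below. The demo_* functions are hypothetical and only mirror the "value plus separate was-set flag" shape. Initializing the variable is one common way to make such a warning go away, and the sketch assumes that approach.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical option parser: writes *pgrp and sets *pgrp_set only if an option was given. */
static void demo_parse_options(const int *pgrp_opt, int *pgrp, bool *pgrp_set)
{
	if (pgrp_opt) {
		*pgrp = *pgrp_opt;
		*pgrp_set = true;
	}
}

/* Returns the process group to use: the option value, or the caller's default. */
static int demo_pick_pgrp(const int *pgrp_opt, int default_pgrp)
{
	int pgrp = 0;		/* initialized, so no path can read an indeterminate value */
	bool pgrp_set = false;

	demo_parse_options(pgrp_opt, &pgrp, &pgrp_set);

	return pgrp_set ? pgrp : default_pgrp;
}
```

Without the initializer the code is still correct, because pgrp is only read when pgrp_set is true, but the compiler may not be able to prove that and can report a false positive.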
Joonsoo Kim [Thu, 26 Jun 2014 00:42:17 +0000 (10:42 +1000)]
slub: fix off by one in number of slab tests
min_partial means the minimum number of slabs cached in the node partial
list. So, if nr_partial is less than it, we keep a newly empty slab on
the node partial list rather than freeing it. But if nr_partial is equal
to or greater than it, we have enough partial slabs and should free the
newly empty slab. The current implementation missed the equal case, so if
we set min_partial to 0, at least one slab could still be cached. This is
a critical problem for the kmemcg destroying logic because it doesn't
work properly if some slabs are cached. This patch fixes this problem.
Fixes 91cb69620284 ("slub: make dead memcg caches discard free slabs
immediately").
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
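The off-by-one comes down to a single comparison, sketched here with two hypothetical helpers that stand in for the check made when a slab becomes empty. The names are illustrative and not taken from mm/slub.c.

```c
#include <stdbool.h>

/* Form described as buggy: misses the case nr_partial == min_partial. */
static bool demo_should_free_old(unsigned long nr_partial, unsigned long min_partial)
{
	return nr_partial > min_partial;
}

/* Fixed form: free once the node already holds min_partial partial slabs. */
static bool demo_should_free(unsigned long nr_partial, unsigned long min_partial)
{
	return nr_partial >= min_partial;
}
```

With min_partial set to 0 and an empty partial list, the old comparison keeps the newly empty slab cached while the fixed one frees it.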
This happens because init_cma_reserved_pageblock() calls __free_one_page()
with pageblock_order as page order but it is bigger than MAX_ORDER. This
in turn causes accesses past zone->free_list[].
Fix the problem by changing init_cma_reserved_pageblock() such that it
splits pageblock into individual MAX_ORDER pages if pageblock is bigger
than a MAX_ORDER page.
In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all
architectures except for ia64, powerpc and tile at the moment, the
“pageblock_order > MAX_ORDER” condition will be optimised out since
both sides of the operator are constants. In cases where pageblock size
is variable, the performance degradation should not be significant anyway
since init_cma_reserved_pageblock() is called only at boot time at most
MAX_CMA_AREAS times which by default is eight.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com> Reported-by: Mark Salter <msalter@redhat.com> Tested-by: Mark Salter <msalter@redhat.com> Tested-by: Christopher Covington <cov@codeaurora.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: <stable@vger.kernel.org> [3.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
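The splitting step can be modelled in userspace by counting free operations. This sketch assumes the largest order the allocator accepts is max_order - 1. demo_free_pageblock() is a hypothetical helper, not init_cma_reserved_pageblock().

```c
/*
 * Count how many free operations release a pageblock of 2^pageblock_order
 * pages when the allocator accepts at most order (max_order - 1) at a time.
 */
static unsigned long demo_free_pageblock(unsigned int pageblock_order,
					 unsigned int max_order)
{
	unsigned long remaining = 1UL << pageblock_order;
	unsigned long chunk = 1UL << (max_order - 1);
	unsigned long calls = 0;

	if (pageblock_order < max_order)
		return 1;	/* fits: a single free at pageblock_order */

	/* Too big: hand the block back in largest-allowed chunks. */
	while (remaining) {
		remaining -= chunk;
		calls++;
	}
	return calls;
}
```

Since both sizes are powers of two and the block is at least as large as one chunk, the loop always ends with remaining at exactly zero.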
Linus Torvalds [Wed, 25 Jun 2014 12:44:17 +0000 (05:44 -0700)]
Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc fixes and cleanups from Ben Herrenschmidt:
"Here are a handful or two of powerpc fixes and simple/trivial
cleanups. A bunch of them fix ftrace with the new ABI v2 in Little
Endian, the rest is a scattering of fairly simple things"
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Don't skip ePAPR spin-table CPUs
powerpc/module: Fix TOC symbol CRC
powerpc/powernv: Remove OPAL v1 takeover
powerpc/kmemleak: Do not scan the DART table
selftests/powerpc: Use the test harness for the TM DSCR test
powerpc/cell: cbe_thermal.c: Cleaning up a variable is of the wrong type
powerpc/kprobes: Fix jprobes on ABI v2 (LE)
powerpc/ftrace: Use pr_fmt() to namespace error messages
powerpc/ftrace: Fix nop of modules on 64bit LE (ABIv2)
powerpc/ftrace: Fix inverted check of create_branch()
powerpc/ftrace: Fix typo in mask of opcode
powerpc: Add ppc_global_function_entry()
powerpc/macintosh/smu.c: Fix closing brace followed by if
powerpc: Remove __arch_swab*
powerpc: Remove ancient DEBUG_SIG code
powerpc/kerenl: Enable EEH for IO accessors
Linus Torvalds [Wed, 25 Jun 2014 12:30:20 +0000 (05:30 -0700)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull vhost cleanups from Michael S Tsirkin:
"Two cleanup patches removing code duplication that got introduced by
changes in rc1. Not fixing crashes, but I'd rather not carry the
duplicate code until the next merge window"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost-scsi: don't open-code kvfree
vhost-net: don't open-code kvfree
Linus Torvalds [Wed, 25 Jun 2014 12:08:09 +0000 (05:08 -0700)]
Merge tag 'trace-fixes-v3.16-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing cleanups and fixes from Steven Rostedt:
"This includes three patches from Oleg Nesterov. The first is a fix to
a race condition that happens between enabling/disabling syscall
tracepoints and new process creations (the check to go into the ptrace
path for a process can be set when it shouldn't, or not set when it
should). Not a major bug but one that should be fixed and even
applied to stable.
The other two patches are cleanup/fixes that are not that critical,
but for an -rc1 release would be nice to have. They both deal with
syscall tracepoints.
It also includes a patch to introduce a new macro for the
TRACE_EVENT() format called __field_struct(). Originally, __field()
was used to record any variable into a trace event, but with the
addition of setting the "is signed" attribute, the check causes
anything but a primitive variable to fail to compile. That is,
structs and unions can't be used as they once were. When the "is
signed" check was introduce there were only primitive variables being
recorded. But that will change soon and it was reported that
__field() causes build failures.
To solve the __field() issue, __field_struct() is introduced to allow
trace_events to be able to record complex types too"
* tag 'trace-fixes-v3.16-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add __field_struct macro for TRACE_EVENT()
tracing: syscall_regfunc() should not skip kernel threads
tracing: Change syscall_*regfunc() to check PF_KTHREAD and use for_each_process_thread()
tracing: Fix syscall_*regfunc() vs copy_process() race
Scott Wood [Wed, 25 Jun 2014 01:15:51 +0000 (20:15 -0500)]
powerpc: Don't skip ePAPR spin-table CPUs
Commit 59a53afe70fd530040bdc69581f03d880157f15a "powerpc: Don't setup
CPUs with bad status" broke ePAPR SMP booting. ePAPR says that CPUs
that aren't presently running shall have status of disabled, with
enable-method being used to determine whether the CPU can be enabled.
Fix by checking for spin-table, which is currently the only supported
enable-method.
Signed-off-by: Scott Wood <scottwood@freescale.com> Cc: Michael Neuling <mikey@neuling.org> Cc: Emil Medve <Emilian.Medve@Freescale.com> Cc: stable@vger.kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Laurent Dufour [Tue, 24 Jun 2014 08:53:59 +0000 (10:53 +0200)]
powerpc/module: Fix TOC symbol CRC
The commit 71ec7c55ed91 introduced the magic symbol ".TOC." for ELFv2 ABI.
This symbol is built manually and has no CRC value computed. A zero value
is put in the CRC section to avoid modpost complaining about a missing CRC.
Unfortunately, this breaks the kernel module loading when the kernel is
relocated (kdump case for instance) because of the relocation applied to
the kcrctab values.
This patch computes a CRC value for the TOC symbol which will match the one
computed by the kernel when it is relocated - aka '0 - relocate_start' done in
maybe_relocated called by check_version (module.c).
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Ellerman [Tue, 24 Jun 2014 07:17:47 +0000 (17:17 +1000)]
powerpc/powernv: Remove OPAL v1 takeover
In commit 27f4488872d9 "Add OPAL takeover from PowerVM" we added support
for "takeover" on OPAL v1 machines.
This was a mode of operation where we would boot under pHyp, and query
for the presence of OPAL. If detected we would then do a special
sequence to take over the machine, and the kernel would end up running
in hypervisor mode.
OPAL v1 was never a supported product, and was never shipped outside
IBM. As far as we know no one is still using it.
Newer versions of OPAL do not use the takeover mechanism. Although the
query for OPAL should be harmless on machines with newer OPAL, we have
seen a machine where it causes a crash in Open Firmware.
The code in early_init_devtree() to copy boot_command_line into cmd_line
was added in commit 817c21ad9a1f "Get kernel command line accross OPAL
takeover", and AFAIK is only used by takeover, so should also be
removed.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>