Joonsoo Kim [Thu, 26 Jun 2014 00:42:33 +0000 (10:42 +1000)]
CMA: fix ARM build failure related to MAX_CMA_AREAS definition
If CMA is disabled, CONFIG_CMA_AREAS isn't defined so compile error
happens. To fix it, define MAX_CMA_AREAS if CONFIG_CMA_AREAS
isn't defined.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:33 +0000 (10:42 +1000)]
CMA: generalize CMA reserved area management functionality
Currently, there are two users on CMA functionality, one is the DMA
subsystem and the other is the KVM on powerpc. They have their own code
to manage CMA reserved area even if they looks really similar. From my
guess, it is caused by some needs on bitmap management. KVM side wants to
maintain bitmap not for 1 page, but for more size. Eventually it use
bitmap where one bit represents 64 pages.
When I implement CMA related patches, I should change those two places to
apply my change and it seem to be painful to me. I want to change this
situation and reduce future code management overhead through this patch.
This change could also help developer who want to use CMA in their new
feature development, since they can use CMA easily without copying &
pasting this reserved area management code.
In previous patches, we have prepared some features to generalize CMA
reserved area management and now it's time to do it. This patch moves
core functions to mm/cma.c and change DMA APIs to use these functions.
There is no functional change in DMA APIs.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:32 +0000 (10:42 +1000)]
DMA, CMA: support arbitrary bitmap granularity
PPC KVM's CMA area management requires arbitrary bitmap granularity, since
they want to reserve very large memory and manage this region with bitmap
that one bit for several pages to reduce management overheads. So support
arbitrary bitmap granularity for following generalization.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:32 +0000 (10:42 +1000)]
DMA, CMA: separate core CMA management codes from DMA APIs
To prepare future generalization work on CMA area management code, we need
to separate core CMA management codes from DMA APIs. We will extend these
core functions to cover requirements of PPC KVM's CMA area management
functionality in following patches. This separation helps us not to touch
DMA APIs while extending core functions.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Acked-by: Marek Szyprowski <m.szyprowski@samsung.com> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:32 +0000 (10:42 +1000)]
slab: set free_limit for dead caches to 0
We mustn't keep empty slabs on dead memcg caches, because otherwise they
will never be destroyed.
Currently, we check if the cache is dead in free_block and drop empty slab
if so irrespective of the node's free_limit. This can be pretty
expensive. So let's avoid this additional check by zeroing nodes'
free_limit for dead caches on kmem_cache_shrink. This way no additional
overhead is added to free hot path.
Note, since ->free_limit can be updated on cpu/memory hotplug, we must
handle it properly there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:31 +0000 (10:42 +1000)]
slab: do not keep free objects/slabs on dead memcg caches
Since a dead memcg cache is destroyed only after the last slab allocated
to it is freed, we must disable caching of free objects/slabs for such
caches, otherwise they will be hanging around forever.
For SLAB that means we must disable per cpu free object arrays and make
free_block always discard empty slabs irrespective of node's free_limit.
To disable per cpu arrays, we free them on kmem_cache_shrink (see
drain_cpu_caches -> do_drain) and make __cache_free fall back to
free_block if there is no per cpu array. Also, we have to disable
allocation of per cpu arrays on cpu hotplug for dead caches (see
cpuup_prepare, __do_tune_cpucache).
After we disabled free objects/slabs caching, there is no need to reap
those caches periodically. Moreover, it will only result in slowdown. So
we also make cache_reap skip then.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:31 +0000 (10:42 +1000)]
slub: kmem_cache_shrink: check if partial list is empty under list_lock
SLUB's implementation of kmem_cache_shrink skips nodes that have
nr_partial=0, because they surely don't have any empty slabs to free.
This check is done w/o holding any locks, therefore it can race with
concurrent kfree adding an empty slab to a partial list. As a result, a
just shrinked cache can have empty slabs.
This is unacceptable for kmemcg, which needs to be sure that there will be
no empty slabs on dead memcg caches after kmem_cache_shrink was called,
because otherwise we may leak a dead cache.
Let's fix this race by checking if node partial list is empty under
node->list_lock. Since the nr_partial!=0 branch of kmem_cache_shrink does
nothing if the list is empty, we can simply remove the nr_partial=0 check.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:31 +0000 (10:42 +1000)]
slub: make dead memcg caches discard free slabs immediately
Since a dead memcg cache is destroyed only after the last slab allocated
to it is freed, we must disable caching of empty slabs for such caches,
otherwise they will be hanging around forever.
This patch makes SLUB discard dead memcg caches' slabs as soon as they
become empty. To achieve that, it disables per cpu partial lists for dead
caches (see put_cpu_partial) and forbids keeping empty slabs on per node
partial lists by setting cache's min_partial to 0 on kmem_cache_shrink,
which is always called on memcg offline (see memcg_unregister_all_caches).
Thanks to Joonsoo Kim.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:31 +0000 (10:42 +1000)]
memcg: wait for kfree's to finish before destroying cache
kmem_cache_free doesn't expect that the cache can be destroyed as soon as
the object is freed, e.g. SLUB's implementation may want to update cache
stats after putting the object to the free list.
Therefore we should wait for all kmem_cache_free's to finish before
proceeding to cache destruction. Since both SLAB and SLUB versions of
kmem_cache_free are non-preemptable, we wait for rcu-sched grace period to
elapse.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:31 +0000 (10:42 +1000)]
slub: make slab_free non-preemptable
Since per memcg cache destruction is scheduled when the last slab is
freed, to avoid use-after-free in kmem_cache_free we should either
rearrange code in kmem_cache_free so that it won't dereference the cache
ptr after freeing the object, or wait for all kmem_cache_free's to
complete before proceeding to cache destruction.
The former approach isn't a good option from the future development point
of view, because every modifications to kmem_cache_free must be done with
great care then. Hence we should provide a method to wait for all
currently executing kmem_cache_free's to finish.
This patch makes SLUB's implementation of kmem_cache_free non-preemptable.
As a result, synchronize_sched() will work as a barrier against
kmem_cache_free's in flight, so that issuing it before cache destruction
will protect us against the use-after-free.
This won't affect performance of kmem_cache_free, because we already
disable preemption there, and this patch only moves preempt_enable to the
end of the function. Neither should it affect the system latency, because
kmem_cache_free is extremely short, even in its slow path.
SLAB's version of kmem_cache_free already proceeds with irqs disabled, so
we only add a comment explaining why it's necessary for kmemcg there.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:30 +0000 (10:42 +1000)]
slub: don't fail kmem_cache_shrink if slab placement optimization fails
SLUB's kmem_cache_shrink not only removes empty slabs from the cache, but
also sorts slabs by the number of objects in-use to cope with
fragmentation. To achieve that, it tries to allocate a temporary array.
If it fails, it will abort the whole procedure.
This is unacceptable for kmemcg, where we want to be sure that all empty
slabs are removed from the cache on memcg offline, so let's just skip the
slab placement optimization step if the allocation fails, but still get
rid of empty slabs.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:30 +0000 (10:42 +1000)]
memcg: mark caches that belong to offline memcgs as dead
This will be used by the next patches.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:30 +0000 (10:42 +1000)]
memcg: destroy kmem caches when last slab is freed
When the memcg_cache_params->refcnt goes to 0, schedule the worker that
will unregister the cache. To prevent this from happening when the owner
memcg is alive, keep the refcnt incremented during memcg lifetime.
Note, this doesn't guarantee that the cache that belongs to a dead memcg
will go away as soon as the last object is freed, because SL[AU]B
implementation can cache empty slabs for performance reasons. Hence the
cache may be hanging around indefinitely after memcg offline. This is to
be resolved by the next patches.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Davydov [Thu, 26 Jun 2014 00:42:30 +0000 (10:42 +1000)]
memcg: cleanup memcg_cache_params refcnt usage
When a memcg is turned offline, some of its kmem caches can still have
active objects and therefore cannot be destroyed immediately. Currently,
we simply leak such caches along with the owner memcg, which is bad and
should be resolved.
It would be perfect if we could move all slab pages of such dead caches to
the root/parent cache on memcg offline. However, when I tried to
implement such re-parenting, I was pointed out by Christoph that the
overhead of this would be unacceptable, at least for SLUB (see
https://lkml.org/lkml/2014/5/13/446)
The problem with re-parenting of individual slabs is that it requires
tracking of all slabs allocated to a cache, but SLUB doesn't track full
slabs if !debug. Changing this behavior would result in significant
performance degradation of regular alloc/free paths, because it would make
alloc/free take per node list locks more often.
After pondering about this problem for some time, I think we should return
to dead caches self-destruction, i.e. scheduling cache destruction work
when the last slab page is freed.
This is the behavior we had before commit 5bd93da9917f ("memcg, slab:
simplify synchronization scheme"). The reason why it was removed was that
it simply didn't work, because SL[AU]B are implemented in such a way that
they don't discard empty slabs immediately, but prefer keeping them cached
for indefinite time to speed up further allocations.
However, we can change this w/o noticeable performance impact for both
SLAB and SLUB by making them drop free slabs as soon as they become empty.
Since dead caches should never be allocated from, removing empty slabs
from them shouldn't result in noticeable performance degradation.
So, this patch set reintroduces dead cache self-destruction and adds some
tweaks to SL[AU]B to prevent dead caches from hanging around indefinitely.
It is organized as follows:
- patches 1-3 reintroduce dead memcg cache self-destruction;
- patch 4 makes SLUB's version of kmem_cache_shrink always drop empty
slabs, even if it fails to allocate a temporary array;
- patches 5 and 6 fix possible use-after-free connected with
asynchronous cache destruction;
- patches 7 and 8 disable caching of empty slabs for dead memcg caches
for SLUB and SLAB respectively.
Note, this doesn't resolve the problem of memcgs pinned by dead kmem
caches. I'm planning to solve this by re-parenting dead kmem caches to
the parent memcg.
This patch (of 8):
Currently, we count the number of pages allocated to a per memcg cache in
memcg_cache_params->nr_pages. We only use this counter to find out if the
cache is empty and can be destroyed. So let's rename it to refcnt and
make it count not pages, but slabs so that we can use atomic_inc/dec
instead of atomic_add/sub in memcg_charge/uncharge_slab.
Also, as the number of slabs theoretically can be greater than INT_MAX,
let's use atomic_long for the counter.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: page_alloc: simplify drain_zone_pages by using min()
Instead of open-coding getting minimal value of two, just use min macro.
That is why it is there for. While changing the function also change type
of batch local variable to match type of per_cpu_pages::batch (which is
int).
Signed-off-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Thu, 26 Jun 2014 00:42:29 +0000 (10:42 +1000)]
mm/memory.c: use entry = ACCESS_ONCE(*pte) in handle_pte_fault()
Use ACCESS_ONCE() in handle_pte_fault() when getting the entry or orig_pte
upon which all subsequent decisions and pte_same() tests will be made.
I have no evidence that its lack is responsible for the mm/filemap.c:202
BUG_ON(page_mapped(page)) in __delete_from_page_cache() found by trinity,
and I am not optimistic that it will fix it. But I have found no other
explanation, and ACCESS_ONCE() here will surely not hurt.
If gcc does re-access the pte before passing it down, then that would be
disastrous for correct page fault handling, and certainly could explain
the page_mapped() BUGs seen (concurrent fault causing page to be mapped in
a second time on top of itself: mapcount 2 for a single pte).
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:29 +0000 (10:42 +1000)]
vmalloc: use rcu list iterator to reduce vmap_area_lock contention
Richard Yao reported a month ago that his system have a trouble with
vmap_area_lock contention during performance analysis by /proc/meminfo.
Andrew asked why his analysis checks /proc/meminfo stressfully, but he
didn't answer it.
https://lkml.org/lkml/2014/4/10/416
Although I'm not sure that this is right usage or not, there is a solution
reducing vmap_area_lock contention with no side-effect. That is just to
use rcu list iterator in get_vmalloc_info().
rcu can be used in this function because all RCU protocol is already
respected by writers, since Nick Piggin commit db64fe02258f1 ("mm: rewrite
vmap layer") back in linux-2.6.28
Specifically :
insertions use list_add_rcu(),
deletions use list_del_rcu() and kfree_rcu().
Note the rb tree is not used from rcu reader (it would not be safe),
only the vmap_area_list has full RCU protection.
Note that __purge_vmap_area_lazy() already uses this rcu protection.
: While rcu list traversal over the vmap_area_list is safe, this may
: arrive at different results than the spinlocked version. The rcu list
: traversal version will not be a 'snapshot' of a single, valid instant
: of the entire vmap_area_list, but rather a potential amalgam of
: different list states.
Joonsoo:
: Yes, you are right, but I don't think that we should be strict here.
: Meminfo is already not a 'snapshot' at specific time. While we try to get
: certain stats, the other stats can change. And, although we may arrive at
: different results than the spinlocked version, the difference would not be
: large and would not make serious side-effect.
[edumazet@google.com: add more commit description] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Richard Yao <ryao@gentoo.org> Acked-by: Eric Dumazet <edumazet@google.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chen Yucong [Thu, 26 Jun 2014 00:42:28 +0000 (10:42 +1000)]
hwpoison: fix the handling path of the victimized page frame that belong to non-LRU
Until now, the kernel has the same policy to handle victimized page frames
that belong to kernel-space(reserved/slab-subsystem) or non-LRU(unknown
page state). In other word, the result of handling either of these
victimized page frames is (IGNORED | FAILED), and the return value of
memory_failure() is -EBUSY.
This patch is to avoid that memory_failure() returns very soon due to the
"true" value of (!PageLRU(p)), and it also ensures that action_result()
can report more precise information("reserved kernel", "kernel slab", and
"unknown page state") instead of "non LRU", especially for memory errors
which are detected by memory-scrubbing.
Wei Yang [Thu, 26 Jun 2014 00:42:27 +0000 (10:42 +1000)]
slub: reduce duplicate creation on the first object
When a kmem_cache is created with ctor, each object in the kmem_cache will
be initialized before use. In the slub implementation, the first object
will be initialized twice.
This patch avoids the duplication of initialization of the first object.
Fixes commit 7656c72b5a63: ("SLUB: add macros for scanning objects in a
slab").
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Ryabinin [Thu, 26 Jun 2014 00:42:27 +0000 (10:42 +1000)]
mm: slub: SLUB_DEBUG=n: use the same alloc/free hooks as for SLUB_DEBUG=y
There are two versions of alloc/free hooks now - one for
CONFIG_SLUB_DEBUG=y and another one for CONFIG_SLUB_DEBUG=n.
I see no reason why calls to other debugging subsystems (LOCKDEP,
DEBUG_ATOMIC_SLEEP, KMEMCHECK and FAILSLAB) are hidden under SLUB_DEBUG.
All this features should work regardless of SLUB_DEBUG config, as all of
them already have own Kconfig options.
This also fixes failslab for CONFIG_SLUB_DEBUG=n configuration. It simply
has not worked before because should_failslab() call was in a hook hidden
under "#ifdef CONFIG_SLUB_DEBUG #else".
Note: There is one concealed change in allocation path for SLUB_DEBUG=n
and all other debugging features disabled. The might_sleep_if() call can
generate some code even if DEBUG_ATOMIC_SLEEP=n. For PREEMPT_VOLUNTARY=y
might_sleep() inserts _cond_resched() call, but I think it should be ok.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Thu, 26 Jun 2014 00:42:27 +0000 (10:42 +1000)]
mm, slub: mark resiliency_test as init text
resiliency_test() is only called for bootstrap, so it may be moved to
init.text and freed after boot.
Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Ryabinin [Thu, 26 Jun 2014 00:42:27 +0000 (10:42 +1000)]
mm: slab.h: wrap the whole file with guarding macro
Guarding section:
#ifndef MM_SLAB_H
#define MM_SLAB_H
...
#endif
currently doesn't cover the whole mm/slab.h. It seems like it was
done unintentionally.
Wrap the whole file by moving closing #endif to the end of it.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/slab.c: In function 'slab_set_debugobj_lock_classes':
mm/slab.c:524: error: 'h' undeclared (first use in this function)
mm/slab.c:524: error: (Each undeclared identifier is reported only once
mm/slab.c:524: error: for each function it appears in.)
mm/slab.c:524: warning: left-hand operand of comma expression has no effect
mm/slab.c: In function 'cpuup_prepare':
mm/slab.c:1308: warning: passing argument 2 of 'slab_set_debugobj_lock_classes_node' makes pointer from integer without a cast
mm/slab.c:513: note: expected 'struct kmem_cache_node *' but argument is of type 'int'
Cc: Christoph Lameter <cl@gentwo.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > @@ -3759,8 +3746,8 @@ fail:
> > /* Cache is not active yet. Roll back what we did */
> > node--;
> > while (node >= 0) {
> > - if (cachep->node[node]) {
> > - n = cachep->node[node];
> > + if (get_node(cachep, node)) {
> > + n = get_node(cachep, node);
>
> Could you do this as following?
>
> n = get_node(cachep, node);
> if (n) {
> ...
> }
Sure....
Subject: slab: Fixes to earlier patch
Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
slab: use get_node() and kmem_cache_node() functions
Use the two functions to simplify the code avoiding numerous explicit
checks coded checking for a certain node to be online.
Get rid of various repeated calculations of kmem_cache_node structures.
Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ok got through the file and removed all the lines after
for_each_kmem_cache_node.
>
> > @@ -3407,11 +3401,7 @@ int __kmem_cache_shrink(struct kmem_cach
> > return -ENOMEM;
> >
> > flush_all(s);
> > - for_each_node_state(node, N_NORMAL_MEMORY) {
> > - n = get_node(s, node);
> > -
> > - if (!n->nr_partial)
> > - continue;
> > + for_each_kmem_cache_node(s, node, n) {
> >
> > for (i = 0; i < objects; i++)
> > INIT_LIST_HEAD(slabs_by_inuse + i);
>
> Is there any reason not to keep the !n->nr_partial check to avoid taking
> n->list_lock unnecessarily?
No this was simply a mistake the check needs to be preserved.
Subject: slub: Fix up earlier patch
Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Make use of the new node functions in mm/slab.h to reduce code size and
simplify.
Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
slab common: add functions for kmem_cache_node access
The patchset provides two new functions in mm/slab.h and modifies SLAB and
SLUB to use these. The kmem_cache_node structure is shared between both
allocators and the use of common accessors will allow us to move more code
into slab_common.c in the future.
This patch (of 3):
These functions allow to eliminate repeatedly used code in both SLAB and
SLUB and also allow for the insertion of debugging code that may be needed
in the development process.
Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fabian Frederick [Thu, 26 Jun 2014 00:42:25 +0000 (10:42 +1000)]
mm/slab.c: add __init to init_lock_keys
init_lock_keys is only called by __init kmem_cache_init_late
Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Christoph Lameter <cl@linux.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Josh Hunt [Thu, 26 Jun 2014 00:42:24 +0000 (10:42 +1000)]
block: restore /proc/partitions to not display non-partitionable removable devices
We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before.
Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduces this change in behavior. It's not
clear to me from the commit's changelog if this change was intentional or
not. This comment still remains: /* Don't show non-partitionable
removeable devices or empty devices */ so I've decided to send a patch to
restore the behavior of not printing unpartitionable removable devices.
Signed-off-by: Josh Hunt <johunt@akamai.com> Cc: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
bio: modify __bio_add_page() to accept pages that don't start a new segment
The original behaviour is to refuse to add a new page if the maximum
number of segments has been reached, regardless of the fact the page we
are going to add can be merged into the last segment or not.
Unfortunately, when the system runs under heavy memory fragmentation
conditions, a driver may try to add multiple pages to the last segment.
The original code won't accept them and EBUSY will be reported to
userspace.
This patch modifies the function so it refuses to add a page only in case
the latter starts a new segment and the maximum number of segments has
already been reached.
The bug can be easily reproduced with the st driver:
1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16
2) modprobe st buffer_kbs=1024
3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10
dd: error writing `/dev/st0': Device or resource busy
Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- Invalid maintainer e-mail address:
Mail server reply:
Recipient address rejected: User unknown in virtual alias table
- Remove no longer working webpage URL
- Remove obsolete "Person" field
- Move status to "Orphan"
- Add Dave Jeffery and Jack Hammer to the CREDITS file
Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com> Reviewed-by: Jean Delvare <jdelvare@suse.de> Cc: David Jeffery <dhjeffery@gmail.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Paul Bolle <pebolle@tiscali.nl> Reviewed-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fabian Frederick [Thu, 26 Jun 2014 00:42:23 +0000 (10:42 +1000)]
fs/ocfs2/slot_map.c: replace count*size kzalloc by kcalloc
kcalloc manages count*sizeof overflow.
Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
yangwenfang [Thu, 26 Jun 2014 00:42:23 +0000 (10:42 +1000)]
ocfs2: call ocfs2_journal_access_di() before ocfs2_journal_dirty() in ocfs2_write_end_nolock()
After we call ocfs2_journal_access_di() in ocfs2_write_begin(),
jbd2_journal_restart() may also be called, in this function transaction
A's t_updates-- and obtains a new transaction B. If
jbd2_journal_commit_transaction() is happened to commit transaction A,
when t_updates==0, it will continue to complete commit and unfile buffer.
So when jbd2_journal_dirty_metadata(), the handle is pointed a new
transaction B, and the buffer head's journal head is already freed,
jh->b_transaction == NULL, jh->b_next_transaction == NULL, it returns
EINVAL, So it triggers the BUG_ON(status).
So I think we should put ocfs2_journal_access_di before
ocfs2_journal_dirty in the ocfs2_write_end. and it works well after my
modification.
Signed-off-by: vicky <vicky.yangwenfang@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Thu, 26 Jun 2014 00:42:23 +0000 (10:42 +1000)]
ocfs2: quorum: add a log for node not fenced
For debug use, we can see from the log whether the fence decision is made
and why it is not fenced.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Thu, 26 Jun 2014 00:42:23 +0000 (10:42 +1000)]
ocfs2: o2net: set tcp user timeout to max value
When tcp retransmit timeout(15mins), the connection will be closed.
Pending messages may be lost during this time. So we set tcp user timeout
to override the retransmit timeout to the max value. This is OK for ocfs2
since we have disk heartbeat, if peer crash, the disk heartbeat will
timeout and it will be evicted, if disk heartbeat not timeout and
connection idle for a long time, then this means the cluster enters
split-brain state, since fence can't happen, we'd better keep the
connection and wait network recover.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Thu, 26 Jun 2014 00:42:23 +0000 (10:42 +1000)]
ocfs2: o2net: don't shutdown connection when idle timeout
This patch series is to fix a possible message lost bug in ocfs2 when
network go bad. This bug will cause ocfs2 hung forever even network
become good again.
The messages may lost in this case. After the tcp connection is
established between two nodes, an idle timer will be set to check its
state periodically, if no messages are received during this time, idle
timer will timeout, it will shutdown the connection and try to reconnect,
so pending messages in tcp queues will be lost. This messages may be from
dlm. Dlm may get hung in this case. This may cause the whole ocfs2
cluster hung.
This is very possible to happen when network state goes bad. Do the
reconnect is useless, it will fail if network state is still bad. Just
waiting there for network recovering may be a good idea, it will not lost
messages and some node will be fenced until cluster goes into split-brain
state, for this case, Tcp user timeout is used to override the tcp
retransmit timeout. It will timeout after 25 days, user should have
notice this through the provided log and fix the network, if they don't,
ocfs2 will fall back to original reconnect way.
This patch (of 3):
Some messages in the tcp queue maybe lost if we shutdown the connection
and reconnect when idle timeout. If packets lost and reconnect success,
then the ocfs2 cluster maybe hung.
To fix this, we can leave the connection there and do the fence decision
when idle timeout, if network recover before fence dicision is made, the
connection survive without lost any messages.
This bug can be saw when network state go bad. It may cause ocfs2 hung
forever if some packets lost. With this fix, ocfs2 will recover from hung
if network becomes good again.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xue jiufei [Thu, 26 Jun 2014 00:42:22 +0000 (10:42 +1000)]
ocfs2: free inode when i_count becomes zero
Disk inode deletion may be heavily delayed when one node unlink a file
after the same dentry is freed on another node(say N1) because of memory
shrink but inode is left in memory. This inode can only be freed while N1
doing the orphan scan work.
However, N1 may skip orphan scan for several times because other nodes may
do the work earlier. In our tests, it may take 1 hour on 4 nodes cluster
and this will cause bad user experience. So we think the inode should be
freed when i_count becomes zero to avoid such circumstances.
Signed-off-by: joyce.xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ben Hutchings [Thu, 26 Jun 2014 00:42:22 +0000 (10:42 +1000)]
ocfs2: do not write error flag to user structure we cannot copy from/to
If we failed to copy from the structure, writing back the flags leaks 31
bits of kernel memory (the rest of the ir_flags field).
In any case, if we cannot copy from/to the structure, why should we expect
putting just the flags to work?
Also make sure ocfs2_info_handle_freeinode() returns the right error code
if the copy_to_user() fails.
Fixes: ddee5cdb70e6 ('Ocfs2: Add new OCFS2_IOC_INFO ioctl for ocfs2 v8.') Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yingtai Xie [Thu, 26 Jun 2014 00:42:22 +0000 (10:42 +1000)]
ocfs2: correctly check the return value of ocfs2_search_extent_list
ocfs2_search_extent_list may return -1, so we should check the return
value in ocfs2_split_and_insert, otherwise it may cause array index out of
bound.
And ocfs2_search_extent_list can only return value less than
el->l_next_free_rec, so check if it is equal or larger than
le16_to_cpu(el->l_next_free_rec) is meaningless.
Signed-off-by: Yingtai Xie <xieyingtai@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fabian Frederick [Thu, 26 Jun 2014 00:42:21 +0000 (10:42 +1000)]
fs/ext4/fsync.c: generic_file_fsync call based on barrier flag
generic_file_fsync has been updated to issue a flush for older
filesystems.
This patch tests for barrier flag in ext4 mount flags and calls the right
function.
Lukas said:
: Note that the actual generic_file_fsync change fixes a real bug in ext4
: where we would _not_ send a flush on sync if we have file system
: without journal.
:
: Ted, it would be useful to mention that in the commit description
: along with the commit id:
:
: ac13a829f6adb674015ab399594c089990104af7 fs/libfs.c: add generic
: data flush to fsync
Signed-off-by: Fabian Frederick <fabf@skynet.be> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Samuel Thibault [Thu, 26 Jun 2014 00:42:20 +0000 (10:42 +1000)]
input: route kbd LEDs through the generic LEDs layer
This permits to reassign keyboard LEDs to something else than keyboard
"leds" state, by adding keyboard led and modifier triggers connected to a
series of VT input LEDs, themselves connected to VT input triggers, which
per-input device LEDs use by default. Userland can thus easily change the
LED behavior of (a priori) all input devices, or of particular input
devices.
This also permits to fix #7063 from userland by using a modifier to
implement proper CapsLock behavior and have the keyboard caps lock led
show that modifier state.
[ebroder@mokafive.com: Rebased to 3.2-rc1 or so, cleaned up some includes, and fixed some constants]
[blogic@openwrt.org: CONFIG_INPUT_LEDS stubs should be static inline]
[akpm@linux-foundation.org: remove unneeded `extern', fix comment layout] Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Signed-off-by: Evan Broder <evan@ebroder.net> Reviewed-by: David Herrmann <dh.herrmann@gmail.com> Tested-by: Pavel Machek <pavel@ucw.cz> Acked-by: Peter Korsgaard <jacmet@sunsite.dk> Cc: Pavel Machek <pavel@ucw.cz> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Bryan Wu <cooloney@gmail.com> Cc: Arnaud Patard <arnaud.patard@rtp-net.org> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Matt Sealey <matt@genesi-usa.com> Cc: Rob Clark <robdclark@gmail.com> Cc: Niels de Vos <devos@fedoraproject.org> Cc: Steev Klimaszewski <steev@genesi-usa.com> Signed-off-by: John Crispin <blogic@openwrt.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fabian Frederick [Thu, 26 Jun 2014 00:42:19 +0000 (10:42 +1000)]
fs/cifs: remove obsolete __constant
Replace all __constant_foo to foo() except in smb2status.h (1700 lines to
update).
Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Steve French <sfrench@samba.org> Cc: Jeff Layton <jlayton@poochiereds.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
x86,mem-hotplug: modify PGD entry when removing memory
When hot-adding/removing memory, sync_global_pgds() is called for
synchronizing PGD to PGD entries of all processes MM. But when
hot-removing memory, sync_global_pgds() does not work correctly.
At first, sync_global_pgds() checks whether target PGD is none or not.
And if PGD is none, the PGD is skipped. But when hot-removing memory, PGD
may be none since PGD may be cleared by free_pud_table(). So when
sync_global_pgds() is called after hot-removing memory, sync_global_pgds()
should not skip PGD even if the PGD is none. And sync_global_pgds() must
clear PGD entries of all processes MM.
Currently sync_global_pgds() does not clear PGD entries of all processes
MM when hot-removing memory. So when hot adding memory which is same
memory range as removed memory after hot-removing memory, following call
traces are shown:
remove_pagetable() gets start argument and passes the argument to
sync_global_pgds(). In this case, the argument must not be modified. If
the argument is modified and passed to sync_global_pgds(),
sync_global_pgds() does not correctly synchronize PGD to PGD entries of
all processes MM since synchronized range of memory [start, end] is wrong.
Unfortunately the start argument is modified in remove_pagetable(). So
this patch fixes the issue.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Acked-by: Toshi Kani <toshi.kani@hp.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Heiko Carstens [Thu, 26 Jun 2014 00:42:18 +0000 (10:42 +1000)]
fs/seq_file: fallback to vmalloc allocation
There are a couple of seq_files which use the single_open() interface.
This interface requires that the whole output must fit into a single
buffer.
E.g. for /proc/stat allocation failures have been observed because an
order-4 memory allocation failed due to memory fragmentation. In such
situations reading /proc/stat is not possible anymore.
Therefore change the seq_file code to fallback to vmalloc allocations
which will usually result in a couple of order-0 allocations and hence
also work if memory is fragmented.
For reference a call trace where reading from /proc/stat failed:
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Tested-by: David Rientjes <rientjes@google.com> Cc: Ian Kent <raven@themaw.net> Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com> Cc: Andrea Righi <andrea@betterlinux.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Stefan Bader <stefan.bader@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Heiko Carstens [Thu, 26 Jun 2014 00:42:18 +0000 (10:42 +1000)]
proc/stat: convert to single_open_size()
These two patches are supposed to "fix" failed order-4 memory allocations
which have been observed when reading /proc/stat. The problem has been
observed on s390 as well as on x86.
To address the problem change the seq_file memory allocations to fallback
to use vmalloc, so that allocations also work if memory is fragmented.
This approach seems to be simpler and less intrusive than changing
/proc/stat to use an interator. Also it "fixes" other users as well,
which use seq_file's single_open() interface.
This patch (of 2):
Use seq_file's single_open_size() to preallocate a buffer that is large
enough to hold the whole output, instead of open coding it. Also
calculate the requested size using the number of online cpus instead of
possible cpus, since the size of the output only depends on the number of
online cpus.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Ian Kent <raven@themaw.net> Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Cc: Thorsten Diehl <thorsten.diehl@de.ibm.com> Cc: Andrea Righi <andrea@betterlinux.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Stefan Bader <stefan.bader@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ian Kent [Thu, 26 Jun 2014 00:42:17 +0000 (10:42 +1000)]
autofs4: fix false positive compile error
On strict build environments we can see:
fs/autofs4/inode.c: In function 'autofs4_fill_super':
fs/autofs4/inode.c:312: error: 'pgrp' may be used uninitialized in this
function
make[2]: *** [fs/autofs4/inode.o] Error 1
make[1]: *** [fs/autofs4] Error 2
make: *** [fs] Error 2
make: *** Waiting for unfinished jobs....
This is due to the use of pgrp_set being used to indicate pgrp has
has been set rather than initializing pgrp itself.
Signed-off-by: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Thu, 26 Jun 2014 00:42:17 +0000 (10:42 +1000)]
slub: fix off by one in number of slab tests
min_partial means minimum number of slab cached in node partial list. So,
if nr_partial is less than it, we keep newly empty slab on node partial
list rather than freeing it. But if nr_partial is equal or greater than
it, it means that we have enough partial slabs so should free newly empty
slab. Current implementation missed the equal case so if we set
min_partial is 0, then, at least one slab could be cached. This is
critical problem to kmemcg destroying logic because it doesn't works
properly if some slabs is cached. This patch fixes this problem.
Fixes 91cb69620284 ("slub: make dead memcg caches discard free slabs
immediately").
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This happens because init_cma_reserved_pageblock() calls __free_one_page()
with pageblock_order as page order but it is bigger than MAX_ORDER. This
in turn causes accesses past zone->free_list[].
Fix the problem by changing init_cma_reserved_pageblock() such that it
splits pageblock into individual MAX_ORDER pages if pageblock is bigger
than a MAX_ORDER page.
In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all
architectures expect for ia64, powerpc and tile at the moment, the
“pageblock_order > MAX_ORDER” condition will be optimised out since
both sides of the operator are constants. In cases where pageblock size
is variable, the performance degradation should not be significant anyway
since init_cma_reserved_pageblock() is called only at boot time at most
MAX_CMA_AREAS times which by default is eight.
Signed-off-by: Michal Nazarewicz <mina86@mina86.com> Reported-by: Mark Salter <msalter@redhat.com> Tested-by: Mark Salter <msalter@redhat.com> Tested-by: Christopher Covington <cov@codeaurora.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: <stable@vger.kernel.org> [3.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Linus Torvalds [Wed, 25 Jun 2014 12:44:17 +0000 (05:44 -0700)]
Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc fixes and cleanups from Ben Herrenschmidt:
"Here are a handful or two of powerpc fixes and simple/trivial
cleanups. A bunch of them fix ftrace with the new ABI v2 in Little
Endian, the rest is a scattering of fairly simple things"
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Don't skip ePAPR spin-table CPUs
powerpc/module: Fix TOC symbol CRC
powerpc/powernv: Remove OPAL v1 takeover
powerpc/kmemleak: Do not scan the DART table
selftests/powerpc: Use the test harness for the TM DSCR test
powerpc/cell: cbe_thermal.c: Cleaning up a variable is of the wrong type
powerpc/kprobes: Fix jprobes on ABI v2 (LE)
powerpc/ftrace: Use pr_fmt() to namespace error messages
powerpc/ftrace: Fix nop of modules on 64bit LE (ABIv2)
powerpc/ftrace: Fix inverted check of create_branch()
powerpc/ftrace: Fix typo in mask of opcode
powerpc: Add ppc_global_function_entry()
powerpc/macintosh/smu.c: Fix closing brace followed by if
powerpc: Remove __arch_swab*
powerpc: Remove ancient DEBUG_SIG code
powerpc/kerenl: Enable EEH for IO accessors
Linus Torvalds [Wed, 25 Jun 2014 12:30:20 +0000 (05:30 -0700)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull vhost cleanups from Michael S Tsirkin:
"Two cleanup patches removing code duplication that got introduced by
changes in rc1. Not fixing crashes, but I'd rather not carry the
duplicate code until the next merge window"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost-scsi: don't open-code kvfree
vhost-net: don't open-code kvfree
Linus Torvalds [Wed, 25 Jun 2014 12:08:09 +0000 (05:08 -0700)]
Merge tag 'trace-fixes-v3.16-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing cleanups and fixes from Steven Rostedt:
"This includes three patches from Oleg Nesterov. The first is a fix to
a race condition that happens between enabling/disabling syscall
tracepoints and new process creations (the check to go into the ptrace
path for a process can be set when it shouldn't, or not set when it
should). Not a major bug but one that should be fixed and even
applied to stable.
The other two patches are cleanup/fixes that are not that critical,
but for an -rc1 release would be nice to have. They both deal with
syscall tracepoints.
It also includes a patch to introduce a new macro for the
TRACE_EVENT() format called __field_struct(). Originally, __field()
was used to record any variable into a trace event, but with the
addition of setting the "is signed" attribute, the check causes
anything but a primitive variable to fail to compile. That is,
structs and unions can't be used as they once were. When the "is
signed" check was introduce there were only primitive variables being
recorded. But that will change soon and it was reported that
__field() causes build failures.
To solve the __field() issue, __field_struct() is introduced to allow
trace_events to be able to record complex types too"
* tag 'trace-fixes-v3.16-rc1-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add __field_struct macro for TRACE_EVENT()
tracing: syscall_regfunc() should not skip kernel threads
tracing: Change syscall_*regfunc() to check PF_KTHREAD and use for_each_process_thread()
tracing: Fix syscall_*regfunc() vs copy_process() race
Scott Wood [Wed, 25 Jun 2014 01:15:51 +0000 (20:15 -0500)]
powerpc: Don't skip ePAPR spin-table CPUs
Commit 59a53afe70fd530040bdc69581f03d880157f15a "powerpc: Don't setup
CPUs with bad status" broke ePAPR SMP booting. ePAPR says that CPUs
that aren't presently running shall have status of disabled, with
enable-method being used to determine whether the CPU can be enabled.
Fix by checking for spin-table, which is currently the only supported
enable-method.
Signed-off-by: Scott Wood <scottwood@freescale.com> Cc: Michael Neuling <mikey@neuling.org> Cc: Emil Medve <Emilian.Medve@Freescale.com> Cc: stable@vger.kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Laurent Dufour [Tue, 24 Jun 2014 08:53:59 +0000 (10:53 +0200)]
powerpc/module: Fix TOC symbol CRC
The commit 71ec7c55ed91 introduced the magic symbol ".TOC." for ELFv2 ABI.
This symbol is built manually and has no CRC value computed. A zero value
is put in the CRC section to avoid modpost complaining about a missing CRC.
Unfortunately, this breaks the kernel module loading when the kernel is
relocated (kdump case for instance) because of the relocation applied to
the kcrctab values.
This patch compute a CRC value for the TOC symbol which will match the one
compute by the kernel when it is relocated - aka '0 - relocate_start' done in
maybe_relocated called by check_version (module.c).
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Anton Blanchard <anton@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Ellerman [Tue, 24 Jun 2014 07:17:47 +0000 (17:17 +1000)]
powerpc/powernv: Remove OPAL v1 takeover
In commit 27f4488872d9 "Add OPAL takeover from PowerVM" we added support
for "takeover" on OPAL v1 machines.
This was a mode of operation where we would boot under pHyp, and query
for the presence of OPAL. If detected we would then do a special
sequence to take over the machine, and the kernel would end up running
in hypervisor mode.
OPAL v1 was never a supported product, and was never shipped outside
IBM. As far as we know no one is still using it.
Newer versions of OPAL do not use the takeover mechanism. Although the
query for OPAL should be harmless on machines with newer OPAL, we have
seen a machine where it causes a crash in Open Firmware.
The code in early_init_devtree() to copy boot_command_line into cmd_line
was added in commit 817c21ad9a1f "Get kernel command line accross OPAL
takeover", and AFAIK is only used by takeover, so should also be
removed.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Linus Torvalds [Tue, 24 Jun 2014 21:00:13 +0000 (14:00 -0700)]
Merge git://git.kvack.org/~bcrl/aio-fixes
Pull aio fixes from Ben LaHaise:
"These fix a kernel memory disclosure issue (arbitrary kmap() &
copy_to_user()) revealed in CVE-2014-0206 by changes that were
introduced in v3.10"
* git://git.kvack.org/~bcrl/aio-fixes:
aio: fix kernel memory disclosure in io_getevents() introduced in v3.10
aio: fix aio request leak when events are reaped by userspace
Linus Torvalds [Tue, 24 Jun 2014 20:59:00 +0000 (13:59 -0700)]
Merge branch 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm
Pull ARM fixes from Russell King:
"A number of low impact fixes, the most noticable one is the thumb2
frame pointer fix. We also fix a regression caused during this merge
window with ARM925 CPUs running with caches disabled, and fix a number
of warnings"
* 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm:
ARM: arm925: ensure assembly sets up writethrough mapping
ARM: perf: fix compiler warning with gcc 4.6.4 (and tidy code)
ARM: l2c: fix dependencies on PL310 errata symbols
ARM: 8069/1: Make thread_save_fp macro aware of THUMB2 mode
ARM: 8068/1: scoop: Remove unused variable
Benjamin LaHaise [Tue, 24 Jun 2014 17:32:51 +0000 (13:32 -0400)]
aio: fix kernel memory disclosure in io_getevents() introduced in v3.10
A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10
by commit a31ad380bed817aa25f8830ad23e1a0480fef797. The changes made to
aio_read_events_ring() failed to correctly limit the index into
ctx->ring_pages[], allowing an attacked to cause the subsequent kmap() of
an arbitrary page with a copy_to_user() to copy the contents into userspace.
This vulnerability has been assigned CVE-2014-0206. Thanks to Mateusz and
Petr for disclosing this issue.
This patch applies to v3.12+. A separate backport is needed for 3.10/3.11.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: stable@vger.kernel.org
Benjamin LaHaise [Tue, 24 Jun 2014 17:12:55 +0000 (13:12 -0400)]
aio: fix aio request leak when events are reaped by userspace
The aio cleanups and optimizations by kmo that were merged into the 3.10
tree added a regression for userspace event reaping. Specifically, the
reference counts are not decremented if the event is reaped in userspace,
leading to the application being unable to submit further aio requests.
This patch applies to 3.12+. A separate backport is required for 3.10/3.11.
This issue was uncovered as part of CVE-2014-0206.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: stable@vger.kernel.org Cc: Kent Overstreet <kmo@daterainc.com> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Petr Matousek <pmatouse@redhat.com>
Catalin Marinas [Fri, 13 Jun 2014 08:44:21 +0000 (09:44 +0100)]
powerpc/kmemleak: Do not scan the DART table
The DART table allocation is registered to kmemleak via the
memblock_alloc_base() call. However, the DART table is later unmapped
and dart_tablebase VA no longer accessible. This patch tells kmemleak
not to scan this block and avoid an unhandled paging request.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Ellerman [Mon, 23 Jun 2014 03:23:31 +0000 (13:23 +1000)]
powerpc/kprobes: Fix jprobes on ABI v2 (LE)
In commit 721aeaa9 "Build little endian ppc64 kernel with ABIv2", we
missed some updates required in the kprobes code to make jprobes work
when the kernel is built with ABI v2.
Firstly update arch_deref_entry_point() to do the right thing. Now that
we have added ppc_global_function_entry() we can just always use that, it
will do the right thing for 32 & 64 bit and ABI v1 & v2.
Secondly we need to update the code that sets up the register state before
calling the jprobe handler. On ABI v1 we setup r2 to hold the TOC, on ABI
v2 we need to populate r12 with the function entry point address.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Ellerman [Tue, 17 Jun 2014 06:15:36 +0000 (16:15 +1000)]
powerpc/ftrace: Use pr_fmt() to namespace error messages
The printks() in our ftrace code have no prefix, so they appear on the
console with very little context, eg:
Branch out of range
Use pr_fmt() & pr_err() to add a prefix. While we're at it, collapse a
few split lines that don't need to be, and add a missing newline to one
message.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Ellerman [Tue, 17 Jun 2014 06:15:34 +0000 (16:15 +1000)]
powerpc/ftrace: Fix inverted check of create_branch()
In commit 24a1bdc35, "Fix ABIv2 issues with __ftrace_make_call", Anton
changed the logic that creates and patches the branch, and added a
thinko in the check of create_branch(). create_branch() returns the
instruction that was generated, so if we get zero then it succeeded.
The result is we can't ftrace modules:
Branch out of range
WARNING: at ../kernel/trace/ftrace.c:1638
ftrace failed to modify [<d000000004ba001c>] fuse_req_init_context+0x1c/0x90 [fuse]
We should probably fix patch_instruction() to do that check and make the
API saner, but that's a separate patch. For now just invert the test.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Ellerman [Tue, 17 Jun 2014 06:15:33 +0000 (16:15 +1000)]
powerpc/ftrace: Fix typo in mask of opcode
In commit 24a1bdc35, "Fix ABIv2 issues with __ftrace_make_call", Anton
changed the logic that checks for the expected code sequence when
patching a module.
We missed the typo in the mask, 0xffff00000 should be 0xffff0000, which
has the effect of making the test always true.
That makes it impossible to ftrace against modules, eg:
Unexpected call sequence: 48000008e8410018
WARNING: at ../kernel/trace/ftrace.c:1638
ftrace failed to modify [<d000000007cf001c>] rng_dev_open+0x1c/0x70 [rng_core]
Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Ellerman [Tue, 17 Jun 2014 06:15:32 +0000 (16:15 +1000)]
powerpc: Add ppc_global_function_entry()
ABIv2 has the concept of a global and local entry point to a function.
In most cases we are interested in the local entry point, and so that is
what ppc_function_entry() returns.
However we have a case in the ftrace code where we want the global entry
point, and there may be other places we need it too. Rather than special
casing each, add an accessor.
For ABIv1 and 32-bit there is only a single entry point, so we return
that. That means it's safe for the caller to use this without also
checking the ABI version.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>