git.karo-electronics.de Git - karo-tx-linux.git/log
9 years ago  thp: handle errors in hugepage_init() properly
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:32 +0000 (09:44 +1000)]
thp: handle errors in hugepage_init() properly

We miss error handling in a few cases in hugepage_init().  Let's fix that.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm, mempool: poison elements backed by page allocator fix fix
David Rientjes [Tue, 7 Apr 2015 23:44:32 +0000 (09:44 +1000)]
mm, mempool: poison elements backed by page allocator fix fix

Elements backed by the page allocator might not be directly mapped into
lowmem, so do k{,un}map_atomic() before poisoning and verifying the
contents: the mapping puts the page into lowmem and returns a usable
virtual address.
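
A minimal sketch of the idea (the helper name here is hypothetical; the
patch's actual code also handles multi-page, higher-order elements):

/* Poison one page of an element that may live in highmem.
 * kmap_atomic() temporarily maps the page into lowmem and returns a
 * usable virtual address; kunmap_atomic() drops that mapping.
 */
static void poison_element_page(struct page *page)      /* hypothetical */
{
        void *addr = kmap_atomic(page);

        memset(addr, POISON_FREE, PAGE_SIZE);
        kunmap_atomic(addr);
}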

Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm, mempool: use '%zu' for printing 'size_t' variable
Fabio Estevam [Tue, 7 Apr 2015 23:44:31 +0000 (09:44 +1000)]
mm, mempool: use '%zu' for printing 'size_t' variable

fix printk warning

Commit 8b65aaa9c53404 ("mm, mempool: poison elements backed by page allocator")
caused the following build warning on ARM:

mm/mempool.c:31:2: warning: format '%ld' expects argument of type 'long int', but argument 3 has type 'size_t' [-Wformat]

Use '%zu' for printing 'size_t' variable.
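
For reference, the corrected format in a minimal form (sketch; size being
the size_t variable in question):

        pr_warn("mempool: element size %zu\n", size);   /* %zu, not %ld */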

Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm, mempool: poison elements backed by page allocator
David Rientjes [Tue, 7 Apr 2015 23:44:31 +0000 (09:44 +1000)]
mm, mempool: poison elements backed by page allocator

Elements backed by the slab allocator are poisoned when added to a
mempool's reserved pool.

It is also possible to poison elements backed by the page allocator
because the mempool layer knows the allocation order.

This patch extends mempool element poisoning to include memory backed by
the page allocator.

This is only effective for configs with CONFIG_DEBUG_SLAB or
CONFIG_SLUB_DEBUG_ON.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm, mempool: poison elements backed by slab allocator
David Rientjes [Tue, 7 Apr 2015 23:44:31 +0000 (09:44 +1000)]
mm, mempool: poison elements backed by slab allocator

Mempools keep elements in a reserved pool for contexts in which allocation
may not be possible.  When an element is allocated from the reserved pool,
its memory contents are the same as when it was added to the reserved pool.

Because of this, elements lack any free poisoning to detect use-after-free
errors.

This patch adds free poisoning for elements backed by the slab allocator.
This is possible because the mempool layer knows the object size of each
element.

When an element is added to the reserved pool, it is poisoned with
POISON_FREE.  When it is removed from the reserved pool, the contents are
checked for POISON_FREE.  If there is a mismatch, a warning is emitted to
the kernel log.

This is only effective for configs with CONFIG_DEBUG_SLAB or
CONFIG_SLUB_DEBUG_ON.
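
A minimal sketch of the two halves (helper names are hypothetical; the
real patch is structured differently and also covers the page-allocator
case separately):

static void poison_slab_element(mempool_t *pool, void *element)
{
        size_t size = ksize(element);   /* slab knows the object size */

        memset(element, POISON_FREE, size);
}

static void check_slab_element(mempool_t *pool, void *element)
{
        size_t i, size = ksize(element);
        u8 *obj = element;

        for (i = 0; i < size; i++)
                if (obj[i] != POISON_FREE)
                        pr_err("mempool: POISON_FREE mismatch at offset %zu\n", i);
}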

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm, mempool: disallow mempools based on slab caches with constructors
David Rientjes [Tue, 7 Apr 2015 23:44:31 +0000 (09:44 +1000)]
mm, mempool: disallow mempools based on slab caches with constructors

All occurrences of mempools based on slab caches with object constructors
have been removed from the tree, so disallow creating them.

We can only dereference mem->ctor in mm/mempool.c, since doing so in
include/linux/mempool.h would require including mm/slab.h there.  So
simply note the restriction, just like the comment restricting usage of
__GFP_ZERO, and warn on kernels with CONFIG_DEBUG_VM if such a mempool is
allocated from.

We don't want to incur this check on every element allocation, so use
VM_BUG_ON().
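
A sketch of what such a check can look like in mm/mempool.c (where the
ctor field is visible); VM_BUG_ON() compiles away without CONFIG_DEBUG_VM:

        VM_BUG_ON(pool->alloc == mempool_alloc_slab &&
                  ((struct kmem_cache *)pool->pool_data)->ctor);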

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  fs, jfs: remove slab object constructor
David Rientjes [Tue, 7 Apr 2015 23:44:31 +0000 (09:44 +1000)]
fs, jfs: remove slab object constructor

Mempools based on slab caches with object constructors are risky because
element allocation can happen either from the slab cache itself, meaning
the constructor is properly called before returning, or from the mempool
reserve pool, meaning the constructor is not called before returning,
depending on the allocation context.

For this reason, we should disallow creating mempools based on slab caches
that have object constructors.  Callers of mempool_alloc() will be
responsible for properly initializing the returned element.

Then, it doesn't matter if the element came from the slab cache or the
mempool reserved pool.

The only occurrence of a mempool being based on a slab cache with an
object constructor in the tree is in fs/jfs/jfs_metapage.c.  Remove it and
properly initialize the element in alloc_metapage().

At the same time, META_free is never used, so remove it as well.
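
A sketch of the constructor-free initialization (details approximate; see
the actual patch for the exact fields initialized):

static struct metapage *alloc_metapage(gfp_t gfp_mask)
{
        struct metapage *mp = mempool_alloc(metapage_mempool, gfp_mask);

        if (mp)
                memset(mp, 0, sizeof(*mp));     /* was the slab ctor's job */
        return mp;
}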

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm: remove rest of ACCESS_ONCE() usages
Jason Low [Tue, 7 Apr 2015 23:44:30 +0000 (09:44 +1000)]
mm: remove rest of ACCESS_ONCE() usages

We converted some of the usages of ACCESS_ONCE to READ_ONCE in the mm/
tree, since ACCESS_ONCE doesn't work reliably on non-scalar types.

This patch removes the rest of the usages of ACCESS_ONCE and uses the new
READ_ONCE API for the read accesses.  This makes things cleaner: one API
instead of separate/multiple sets of APIs.
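
The conversion itself is mechanical, e.g.:

        /* before: unreliable for non-scalar types such as pmd_t */
        pmd_t pmd = ACCESS_ONCE(*pmdp);

        /* after */
        pmd_t pmd = READ_ONCE(*pmdp);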

Signed-off-by: Jason Low <jason.low2@hp.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm: use READ_ONCE() for non-scalar types
Jason Low [Tue, 7 Apr 2015 23:44:30 +0000 (09:44 +1000)]
mm: use READ_ONCE() for non-scalar types

Commit 38c5ce936a08 ("mm/gup: Replace ACCESS_ONCE with READ_ONCE")
converted ACCESS_ONCE usage in gup_pmd_range() to READ_ONCE, since
ACCESS_ONCE doesn't work reliably on non-scalar types.

This patch also fixes the other ACCESS_ONCE usages in gup_pte_range()
and __get_user_pages_fast() in mm/gup.c.

Signed-off-by: Jason Low <jason.low2@hp.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm/mremap.c: clean up goto just return ERR_PTR
Derek [Tue, 7 Apr 2015 23:44:30 +0000 (09:44 +1000)]
mm/mremap.c: clean up goto just return ERR_PTR

As suggested by Kirill, the "goto"s in vma_to_resize() aren't necessary;
just change them to explicit returns.
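
The shape of the cleanup (sketch):

        /* before */
        if (is_vm_hugetlb_page(vma))
                goto Einval;
        /* ... */
Einval:
        return ERR_PTR(-EINVAL);

        /* after */
        if (is_vm_hugetlb_page(vma))
                return ERR_PTR(-EINVAL);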

Signed-off-by: Derek Che <crquan@ymail.com>
Suggested-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mremap should return -ENOMEM when __vm_enough_memory fails
Derek [Tue, 7 Apr 2015 23:44:30 +0000 (09:44 +1000)]
mremap should return -ENOMEM when __vm_enough_memory fails

Recently I straced bash's behavior in this dd-zero-pipe-to-read test, as
part of testing under vm.overcommit_memory=2 (OVERCOMMIT_NEVER mode):

    # dd if=/dev/zero | read x

The bash subshell calls mremap to reallocate more and more memory until it
finally fails with -ENOMEM (as I expected), or gets killed by the system
OOM killer (which should not happen under OVERCOMMIT_NEVER mode).  But the
mremap system call actually failed with -EFAULT, which was a surprise to
me; I think it is supposed to be -ENOMEM.  I then wrote this piece of C
test code, which confirmed it: https://gist.github.com/crquan/326bde37e1ddda8effe5

    $ ./remap
    allocated one page @0x7f686bf71000, (PAGE_SIZE: 4096)
    grabbed 7680512000 bytes of memory (1875125 pages) @ 00007f6690993000.
    mremap failed Bad address (14).

The -EFAULT comes from the failure branch of security_vm_enough_memory_mm;
underneath it calls __vm_enough_memory, which returns only 0 for success
or -ENOMEM.  So why does vma_to_resize need to return -EFAULT in this
case?  That sounds like a mistake to me.

Some more digging into git history:

1) Before commit 119f657c7 ("RLIMIT_AS checking fix") on May 1 2005
   (pre-2.6.12 days) it was returning -ENOMEM for this failure;

2) but commit 119f657c7 changed it accidentally, to whatever was
   preserved in the local ret, which happened to be -EFAULT from a
   previous assignment;

3) then the code refactoring in commit 54f5de709 ("untangling
   do_mremap(), part 1") kept the -EFAULT, now explicitly, which appears
   to be wrong.
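
A sketch of the resulting check in vma_to_resize() (context approximate):

        if (vma->vm_flags & VM_ACCOUNT) {
                unsigned long charged = (new_len - old_len) >> PAGE_SHIFT;

                if (security_vm_enough_memory_mm(mm, charged))
                        return ERR_PTR(-ENOMEM);        /* was -EFAULT */
        }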

Signed-off-by: Derek Che <crquan@ymail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm/vmalloc: get rid of dirty bitmap inside vmap_block structure
Roman Pen [Tue, 7 Apr 2015 23:44:30 +0000 (09:44 +1000)]
mm/vmalloc: get rid of dirty bitmap inside vmap_block structure

In the original implementation of vm_map_ram, by Nick Piggin, there were
two bitmaps: alloc_map and dirty_map.  Neither was used as intended: to
find a suitable free hole for the next allocation in a block.  vm_map_ram
allocates space sequentially in a block and, on free, marks pages as
dirty, so freed space can't be reused anymore.

It would actually be very interesting to know the real intent of those
bitmaps; maybe the implementation was incomplete, etc.

But a long time ago Zhang Yanfei removed alloc_map in these two commits:

  3fcd76e8028e ("mm/vmalloc.c: remove dead code in vb_alloc")
  b8e748b6c329 ("mm/vmalloc.c: remove alloc_map from vmap_block")

In this patch I replace dirty_map with two range variables: dirty min and
max.  These variables store the minimum and maximum positions of dirty
space in a block, since we only need to know the dirty range, not the
exact positions of dirty pages.

Why was this done?  Several reasons: at first glance it seems that the
vm_map_ram allocator cares about fragmentation and thus uses bitmaps for
finding free holes, but that is not true.  To avoid complexity it seems
better to use something simple, like min/max range values.  Secondly, the
code also becomes simpler, comparing values in min/max macros instead of
iterating over a bitmap.  Thirdly, the bitmap occupies up to 1024 bits
(4MB is the maximum size of a block).  Here I replace the whole bitmap
with two longs.

Finally, vm_unmap_aliases should be slightly faster and the whole
vmap_block structure occupies less memory.
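
A sketch of the replacement (field names per this changelog; the exact
code is in the patch):

struct vmap_block {
        /* ... */
        unsigned long dirty_min;        /* first dirty page slot */
        unsigned long dirty_max;        /* one past the last dirty slot */
};

        /* On free, widen the dirty range instead of setting bitmap bits: */
        vb->dirty_min = min(vb->dirty_min, page_idx);
        vb->dirty_max = max(vb->dirty_max, page_idx + (1UL << order));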

Signed-off-by: Roman Pen <r.peniaev@gmail.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: WANG Chao <chaowang@redhat.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Christoph Lameter <cl@linux.com>
Cc: Gioh Kim <gioh.kim@lge.com>
Cc: Rob Jones <rob.jones@codethink.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm/vmalloc: occupy newly allocated vmap block just after allocation
Roman Pen [Tue, 7 Apr 2015 23:44:29 +0000 (09:44 +1000)]
mm/vmalloc: occupy newly allocated vmap block just after allocation

The previous implementation allocates a new vmap block and then repeats
the search for a free block from the very beginning, iterating over the
CPU free list.

Why can this be better?

1. Allocation can happen on one CPU, but the search can be done on another
   CPU.  In the worst case we preallocate as many vmap blocks as there are
   CPUs in the system.

2. The previous patch adds a newly allocated block to the tail of the free
   list, to avoid early exhaustion of virtual space and to give blocks
   allocated long ago a chance to be occupied.  Thus, to find the newly
   allocated block, the whole search sequence has to be repeated, which
   does not seem efficient.

In this patch the newly allocated block is occupied right away and the
virtual address is returned to the caller, so there is no need to repeat
the search sequence; the allocation job is done.
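
A sketch of the change (function and helper names approximate):

static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
{
        struct vmap_block *vb;
        void *vaddr;

        /* ... allocate vb and its vmap_area as before ... */

        /* take the first 2^order page slots for the current request */
        vaddr = vmap_block_vaddr(vb->va->va_start, 0);
        vb->free = VMAP_BBMAP_BITS - (1UL << order);

        /* ... publish vb on the per-CPU free list ... */
        return vaddr;   /* the caller does not search the free list again */
}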

Signed-off-by: Roman Pen <r.peniaev@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: WANG Chao <chaowang@redhat.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Christoph Lameter <cl@linux.com>
Cc: Gioh Kim <gioh.kim@lge.com>
Cc: Rob Jones <rob.jones@codethink.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm/vmalloc: fix possible exhaustion of vmalloc space caused by vm_map_ram allocator
Roman Pen [Tue, 7 Apr 2015 23:44:29 +0000 (09:44 +1000)]
mm/vmalloc: fix possible exhaustion of vmalloc space caused by vm_map_ram allocator

Recently I came across high fragmentation in the vm_map_ram allocator: a
vmap_block has free space, but new blocks continue to appear anyway.
Further investigation showed that certain mapping/unmapping sequences can
exhaust vmalloc space.  On small 32-bit systems that's not a big problem,
because purging will be called soon on the first allocation failure
(alloc_vmap_area), but on 64-bit machines, e.g. x86_64, which has 45 bits
of vmalloc space, it can be a disaster.

1) I came up with a simple allocation sequence, which exhausts virtual
   space very quickly:

  while (iters--) {
          /* Map/unmap big chunk */
          vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
          vm_unmap_ram(vaddr, 16);

          /*
           * Map/unmap small chunks.
           *
           * -1 for hole, which should be left at the end of each block
           * to keep it partially used, with some free space available.
           */
          for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
                  vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
                  vm_unmap_ram(vaddr, 8);
          }
  }

The idea behind it is simple:

 1. We have to map a big chunk, e.g. 16 pages.

 2. Then we have to occupy the remaining space with smaller chunks, i.e.
    8 pages.  At the end a small hole should remain, to keep the block in
    the free list but not let a big chunk occupy the remaining space.

 3. Goto 1 - the allocation request for 16 pages can't be completed (only
    8 slots were left free in the block in step #2), so a new block will
    be allocated, and all further requests will land in the newly
    allocated block.

To have some measurement numbers for all further tests I set up ftrace and
enabled 4 basic calls in a function profile:

echo vm_map_ram              > /sys/kernel/debug/tracing/set_ftrace_filter;
echo alloc_vmap_area        >> /sys/kernel/debug/tracing/set_ftrace_filter;
echo vm_unmap_ram           >> /sys/kernel/debug/tracing/set_ftrace_filter;
echo free_vmap_block        >> /sys/kernel/debug/tracing/set_ftrace_filter;

So for this scenario I got these results:

BEFORE (all new blocks are put to the head of a free list)
# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  vm_map_ram                          126000    30683.30 us     0.243 us        30819.36 us
  vm_unmap_ram                        126000    22003.24 us     0.174 us        340.886 us
  alloc_vmap_area                       1000    4132.065 us     4.132 us        0.903 us

AFTER (all new blocks are put to the tail of a free list)
# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  vm_map_ram                          126000    28713.13 us     0.227 us        24944.70 us
  vm_unmap_ram                        126000    20403.96 us     0.161 us        1429.872 us
  alloc_vmap_area                        993    3916.795 us     3.944 us        29.370 us
  free_vmap_block                        992    654.157 us      0.659 us        1.273 us

SUMMARY:

The most interesting numbers in those tables are the numbers of block
allocations and deallocations: the alloc_vmap_area and free_vmap_block
calls, which show that before the change blocks were not freed, and
virtual space and physical memory (vmap_block structure allocations, etc.)
were consumed.

The average time spent in vm_map_ram/vm_unmap_ram became slightly better.
That can be explained by the reasonable number of blocks in the free list,
which we need to iterate over to find a suitable free block.

2) Another scenario is a random allocation:

  while (iters--) {
          /* Randomly take a number from the range [1..32/64] */
          nr = rand(1, VMAP_MAX_ALLOC);
          vaddr = vm_map_ram(pages, nr, -1, PAGE_KERNEL);
          vm_unmap_ram(vaddr, nr);
  }

I chose the Mersenne twister PRNG, with a reproducible random state, to
guarantee that both runs see the same random sequence.  For each
vm_map_ram call a random number from [1..32/64] was taken to represent the
number of pages to map.

I did 10'000 vm_map_ram calls and got these two tables:

BEFORE (all new blocks are put to the head of a free list)

# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  vm_map_ram                           10000    10170.01 us     1.017 us        993.609 us
  vm_unmap_ram                         10000    5321.823 us     0.532 us        59.789 us
  alloc_vmap_area                        420    2150.239 us     5.119 us        3.307 us
  free_vmap_block                         37    159.587 us      4.313 us        134.344 us

AFTER (all new blocks are put to the tail of a free list)

# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  vm_map_ram                           10000    7745.637 us     0.774 us        395.229 us
  vm_unmap_ram                         10000    5460.573 us     0.546 us        67.187 us
  alloc_vmap_area                        414    2201.650 us     5.317 us        5.591 us
  free_vmap_block                        412    574.421 us      1.394 us        15.138 us

SUMMARY:

The 'BEFORE' table shows that 420 blocks were allocated and only 37 were
freed.  The remaining 383 blocks are still in the free list, consuming
virtual space and physical memory.

The 'AFTER' table shows that 414 blocks were allocated and 412 were
actually freed.  2 blocks remained in the free list.

So fragmentation was dramatically reduced.  Why?  Because when we put a
newly allocated block at the head, all further requests will occupy the
new block, regardless of the remaining space in other blocks.  In this
scenario all requests come randomly.  Eventually the remaining free space
will be less than the requested size, the free list will be iterated and
it is possible that nothing will be found there - finally a new block is
created.  So exhaustion in the random scenario happens for the maximum
possible allocation size: 32 pages for a 32-bit system and 64 pages for a
64-bit system.

Also, the average cost of vm_map_ram was reduced from 1.017 us to
0.774 us.  Again this can be explained by iterating through a smaller list
of free blocks.

3) The next simple scenario is sequential allocation, where the allocation
   order is increased for each block.  This scenario forces the allocator
   to reach the maximum number of partially free blocks in a free list:

  while (iters--) {
          /* Populate the free list with blocks with remaining space */
          for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
                  nr = VMAP_BBMAP_BITS / (1 << order);

                  /* Leave a hole */
                  nr -= 1;

                  for (i = 0; i < nr; i++) {
                          vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
                          vm_unmap_ram(vaddr, (1 << order));
                  }
          }

          /* Completely occupy blocks from the free list */
          for (order = 0; order <= ilog2(VMAP_MAX_ALLOC); order++) {
                  vaddr = vm_map_ram(pages, (1 << order), -1, PAGE_KERNEL);
                  vm_unmap_ram(vaddr, (1 << order));
          }
  }

The results I got:

BEFORE (all new blocks are put to the head of a free list)

# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  vm_map_ram                         2032000    399545.2 us     0.196 us        467123.7 us
  vm_unmap_ram                       2032000    363225.7 us     0.178 us        111405.9 us
  alloc_vmap_area                       7001    30627.76 us     4.374 us        495.755 us
  free_vmap_block                       6993    7011.685 us     1.002 us        159.090 us

AFTER (all new blocks are put to the tail of a free list)

# cat /sys/kernel/debug/tracing/trace_stat/function0
  Function                               Hit    Time            Avg             s^2
  --------                               ---    ----            ---             ---
  vm_map_ram                         2032000    394259.7 us     0.194 us        589395.9 us
  vm_unmap_ram                       2032000    292500.7 us     0.143 us        94181.08 us
  alloc_vmap_area                       7000    31103.11 us     4.443 us        703.225 us
  free_vmap_block                       7000    6750.844 us     0.964 us        119.112 us

SUMMARY:

No surprises here, almost all numbers are the same.

While fixing this fragmentation problem I also made some improvements to
the allocation logic for a new vmap block: occupy the block immediately
and get rid of the extra search through the free list.

I also replaced the dirty bitmap with min/max dirty range values, to make
the logic simpler and slightly faster, since comparing two longs costs
less than looping through a bitmap.

This patchset raises several questions:

 Q: I think the problem you comment on is already known, which is why I
    wrote the comment "it could consume lots of address space through
    fragmentation".  Could you tell me about your situation and the reason
    why it should be avoided?
                                                                     Gioh Kim

 A: Indeed, there was a commit 364376383 which adds an explicit comment about
    fragmentation.  But the fragmentation described in that comment is caused
    by mixing long-lived and short-lived objects, when a whole block is pinned
    in memory because some page slots are still in use.  But here I am talking
    about blocks which are free, nobody uses them, and the allocator keeps
    them alive forever, continuously allocating new blocks.

 Q: I think that if you put a newly allocated block at the tail of a free
    list, the example below would result in enormous performance degradation.

    new block: 1MB (256 pages)

    while (iters--) {
      vm_map_ram(3 or something else not dividable for 256) * 85
      vm_unmap_ram(3) * 85
    }

    On every iteration, it needs a newly allocated block, which is put at
    the tail of a free list, so finding it consumes a large amount of time.
                                                                    Joonsoo Kim

 A: The second patch in the current patchset gets rid of the extra search in
    a free list, so the new block will be occupied immediately.

    Also, the scenario above is impossible, because vm_map_ram allocates
    virtual ranges in orders, i.e. 2^n.  Passing 3 to vm_map_ram you will
    allocate 4 slots in a block, and 256 slots (the capacity of a block) is
    of course divisible by 4, so the block will be completely occupied.

    But there is a worst case which we can achieve: each free block has a
    hole equal to the order size.

    The maximum size of an allocation is 64 pages for a 64-bit system
    (if you try to map more, the original alloc_vmap_area will be called).

    So the maximum order is 6.  That means that the worst case, before the
    allocator decides to allocate a new block, is to iterate 7 blocks:

    HEAD
    1st block - has 1  page slot  free (order 0)
    2nd block - has 2  page slots free (order 1)
    3rd block - has 4  page slots free (order 2)
    4th block - has 8  page slots free (order 3)
    5th block - has 16 page slots free (order 4)
    6th block - has 32 page slots free (order 5)
    7th block - has 64 page slots free (order 6)
    TAIL

    So the worst scenario on a 64-bit system is that each CPU queue can have
    7 blocks in a free list.

    This can happen if and only if you allocate blocks with increasing order
    (as I did in the function written in the comment of the first patch).
    This is a weird and rare case, but it is still possible.  Afterwards you
    will get 7 blocks in a list.

    All further requests should be placed in the newly allocated block, or
    some free slots should be found in the free list.
    It does not look dramatically awful.

This patch (of 3):

If a suitable block can't be found, a new block is allocated and put at
the head of the free list, so on the next iteration this new block will be
found first.

That's bad, because old blocks in the free list will not get a chance to
be fully used, and fragmentation will grow.

Let's consider this simple example:

 #1 We have one block in the free list which is partially used, and where
    only one page is free:

    HEAD |xxxxxxxxx-| TAIL
                   ^
                   free space for 1 page, order 0

 #2 A new allocation request of order 1 (2 pages) comes in; a new block is
    allocated since we do not have free space to complete this request.
    The new block is put at the head of the free list:

    HEAD |----------|xxxxxxxxx-| TAIL

 #3 Two pages are occupied in the newly found block:

    HEAD |xx--------|xxxxxxxxx-| TAIL
          ^
          two pages mapped here

 #4 A new allocation request of order 0 (1 page) comes in.  The block
    created in step #2 is located at the beginning of the free list, so it
    will be found first:

    HEAD |xxX-------|xxxxxxxxx-| TAIL
            ^                 ^
            page mapped here, but better to use this hole

It is obvious that it is better to complete the request of step #4 using
the old block, where free space is left, because otherwise fragmentation
will be greatly increased.

But fragmentation is not the only problem.  The worst thing is that I can
easily create a scenario in which the whole vmalloc space is exhausted by
blocks which are not used, but are already dirty and have several free
pages.

Let's consider this function, whose execution should be pinned to one CPU:

static void exhaust_virtual_space(struct page *pages[16], int iters)
{
        /* Firstly we have to map a big chunk, e.g. 16 pages.
         * Then we have to occupy the remaining space with smaller
         * chunks, i.e. 8 pages. At the end small hole should remain.
         * So at the end of our allocation sequence block looks like
         * this:
         *                XX  big chunk
         * |XXxxxxxxx-|    x  small chunk
         *                 -  hole, which is enough for a small chunk,
         *                    but is not enough for a big chunk
         */
        while (iters--) {
                int i;
                void *vaddr;

                /* Map/unmap big chunk */
                vaddr = vm_map_ram(pages, 16, -1, PAGE_KERNEL);
                vm_unmap_ram(vaddr, 16);

                /* Map/unmap small chunks.
                 *
                 * -1 for hole, which should be left at the end of each block
                 * to keep it partially used, with some free space available */
                for (i = 0; i < (VMAP_BBMAP_BITS - 16) / 8 - 1; i++) {
                        vaddr = vm_map_ram(pages, 8, -1, PAGE_KERNEL);
                        vm_unmap_ram(vaddr, 8);
                }
        }
}

On every iteration a new block (1MB of vm area in my case) will be
allocated and then occupied, without any attempt to resolve the small
allocation request using previously allocated blocks in the free list.

In the case of random allocation (sizes randomly taken from the range
[1..64] in the 64-bit case or [1..32] in the 32-bit case) the situation is
the same: new blocks continue to appear if the maximum possible allocation
size (32 or 64) is passed to the allocator, because none of the remaining
blocks in the free list have enough free space to complete the allocation
request.

In summary, if new blocks are put at the head of the free list, virtual
space will eventually be exhausted.

In the current patch I simply put the newly allocated block at the tail of
the free list, thus reducing fragmentation and giving allocation requests
a chance to be resolved using older blocks with possible holes left.
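
The change itself is tiny; in diff form (context approximate):

-       list_add_rcu(&vb->free_list, &vbq->free);
+       list_add_tail_rcu(&vb->free_list, &vbq->free);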

Signed-off-by: Roman Pen <r.peniaev@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: WANG Chao <chaowang@redhat.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Christoph Lameter <cl@linux.com>
Cc: Gioh Kim <gioh.kim@lge.com>
Cc: Rob Jones <rob.jones@codethink.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  hugetlbfs: document min_size mount option and cleanup
Mike Kravetz [Tue, 7 Apr 2015 23:44:29 +0000 (09:44 +1000)]
hugetlbfs: document min_size mount option and cleanup

Add the min_size mount option to the hugetlbfs documentation.  Also, add
the missing pagesize option and mention that size can be specified as
bytes or as a percentage of the huge page pool.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  hugetlbfs: accept subpool min_size mount option and setup accordingly
Mike Kravetz [Tue, 7 Apr 2015 23:44:29 +0000 (09:44 +1000)]
hugetlbfs: accept subpool min_size mount option and setup accordingly

Make 'min_size=<value>' an option when mounting a hugetlbfs filesystem.
This option takes the same value as the 'size' option.  min_size can be
specified without specifying size.  If both are specified, min_size must
be less than or equal to size, else the mount will fail.  If min_size is
specified, then at mount time an attempt is made to reserve min_size
pages.  If the reservation fails, the mount fails.  At umount time, the
reserved pages are released.
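
For example, a mount that guarantees 512MB worth of huge pages up front
might look like this (sketch):

        mount -t hugetlbfs -o size=1G,min_size=512M none /mnt/huge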

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  hugetlbfs: add minimum size accounting to subpools
Mike Kravetz [Tue, 7 Apr 2015 23:44:29 +0000 (09:44 +1000)]
hugetlbfs: add minimum size accounting to subpools

The routines that perform subpool maximum size accounting,
hugepage_subpool_get/put_pages(), are modified to also perform minimum
size accounting.  When a delta value is passed to these routines,
calculate how global reservations must be adjusted to maintain the subpool
minimum size.  The routines now return this global reserve count
adjustment, which is then passed to the global accounting routine
hugetlb_acct_memory().
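
A sketch of the changed calling convention (per this changelog; error
handling and locking omitted):

        /* returns the adjustment to charge against the global reserve */
        long gbl_chg = hugepage_subpool_get_pages(spool, chg);

        if (gbl_chg < 0)
                return ERR_PTR(-ENOSPC);
        ret = hugetlb_acct_memory(h, gbl_chg);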

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  hugetlbfs: add minimum size tracking fields to subpool structure
Mike Kravetz [Tue, 7 Apr 2015 23:44:28 +0000 (09:44 +1000)]
hugetlbfs: add minimum size tracking fields to subpool structure

hugetlbfs allocates huge pages from the global pool as needed.  Even if
the global pool contains a sufficient number of pages for the filesystem
size at mount time, those global pages could be grabbed for some other
use.  As a result, filesystem huge page allocations may fail due to a lack
of pages.

Applications such as a database want to use huge pages for performance
reasons.  hugetlbfs filesystem semantics with ownership and modes work
well to manage access to a pool of huge pages.  However, the application
would like some reasonable assurance that allocations will not fail due to
a lack of huge pages.  At application startup time, the application would
like to configure itself to use a specific number of huge pages.  Before
starting, the application can check to make sure that enough huge pages
exist in the system global pools.  However, there are no guarantees that
those pages will be available when needed by the application.  What the
application wants is exclusive use of a subset of huge pages.

Add a new hugetlbfs mount option 'min_size=<value>' to indicate that the
specified number of pages will be available for use by the filesystem.  At
mount time, this number of huge pages will be reserved for exclusive use
of the filesystem.  If there is not a sufficient number of free pages, the
mount will fail.  As pages are allocated to and freed from the filesystem,
the number of reserved pages is adjusted so that the specified minimum is
maintained.

This patch (of 4):

Add a field to the subpool structure to indicate the minimum number of
huge pages to always be used by this subpool.  This minimum count includes
allocated pages as well as reserved pages.  If the minimum number of pages
for the subpool has not been allocated, pages are reserved up to this
minimum.  An additional field (rsv_hpages) is used to track the number of
pages reserved to meet this minimum size.  The hstate pointer in the
subpool is convenient to have when reserving and unreserving the pages.
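
A sketch of the resulting structure (comments are mine; layout
approximate):

struct hugepage_subpool {
        spinlock_t lock;
        long count;
        long max_hpages;        /* maximum size, -1 if no maximum */
        struct hstate *hstate;  /* for size/offset conversions */
        long min_hpages;        /* minimum size, guaranteed at mount */
        long rsv_hpages;        /* pages reserved to meet the minimum */
};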

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm/compaction: reset compaction scanner positions
Gioh Kim [Tue, 7 Apr 2015 23:44:28 +0000 (09:44 +1000)]
mm/compaction: reset compaction scanner positions

When compaction is activated via /proc/sys/vm/compact_memory it should
scan the whole zone.  Some platforms, for instance ARM, have the start_pfn
of a zone at zero, so the first attempt to compact via /proc doesn't work.
The compaction scanner positions need to be reset first.
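
A sketch of the reset (these zone fields are what the compaction scanners
cache between runs):

        zone->compact_cached_migrate_pfn[0] = zone->zone_start_pfn;
        zone->compact_cached_migrate_pfn[1] = zone->zone_start_pfn;
        zone->compact_cached_free_pfn = zone_end_pfn(zone);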

Signed-off-by: Gioh Kim <gioh.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm, memcg: sync allocation and memcg charge gfp flags for thp fix fix
David Rientjes [Tue, 7 Apr 2015 23:44:28 +0000 (09:44 +1000)]
mm, memcg: sync allocation and memcg charge gfp flags for thp fix fix

"mm, memcg: sync allocation and memcg charge gfp flags for THP" in -mm
introduces a formal to pass the gfp mask for khugepaged's hugepage
allocation.  This is just too ugly to live.

alloc_hugepage_gfpmask() cannot differ between NUMA and UMA configs by
anything in GFP_RECLAIM_MASK, which is the only thing that matters for
memcg reclaim, so just determine the gfp flags once in
collapse_huge_page() and avoid the complexity.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm-memcg-sync-allocation-and-memcg-charge-gfp-flags-for-thp-fix
Andrew Morton [Tue, 7 Apr 2015 23:44:28 +0000 (09:44 +1000)]
mm-memcg-sync-allocation-and-memcg-charge-gfp-flags-for-thp-fix

fix weird code layout while we're there

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm, memcg: sync allocation and memcg charge gfp flags for THP
Michal Hocko [Tue, 7 Apr 2015 23:44:27 +0000 (09:44 +1000)]
mm, memcg: sync allocation and memcg charge gfp flags for THP

memcg currently uses hardcoded GFP_TRANSHUGE gfp flags for all THP
charges.  THP allocations, however, might be using different flags
depending on /sys/kernel/mm/transparent_hugepage/{,khugepaged/}defrag and
the current allocation context.

The primary difference is that defrag configured to the "madvise" value
will clear the __GFP_WAIT flag from the core gfp mask, to make the
allocation lighter for all mappings which are not backed by VM_HUGEPAGE
vmas.  If the memcg charge path ignores this fact we will get a light
allocation, but a potential memcg reclaim would kill the whole point of
the configuration.

Fix the mismatch by providing the same gfp mask used for the allocation to
the charge functions.  This is quite easy for all paths except for the
khugepaged kernel thread with !CONFIG_NUMA, which is doing a
pre-allocation long before the allocated page is used in
collapse_huge_page via khugepaged_alloc_page.  To avoid cluttering the
whole code path from khugepaged_do_scan, we simply return the current
flags as per the khugepaged_defrag() value, which might have changed since
the preallocation.  If somebody changed the value of the knob we would
charge differently, but this shouldn't happen often and it is definitely
not critical because it would only lead to a reduced success rate of
one-off THP promotion.
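
A rough sketch of the idea on the fault path (function names as of this
kernel; details approximate):

        /* one gfp mask drives both the THP allocation and the charge */
        gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma), 0);
        page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
        if (unlikely(mem_cgroup_try_charge(page, mm, gfp, &memcg))) {
                put_page(page);
                return VM_FAULT_FALLBACK;       /* no mismatched charge */
        }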

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm: rename deactivate_page to deactivate_file_page
Minchan Kim [Tue, 7 Apr 2015 23:44:27 +0000 (09:44 +1000)]
mm: rename deactivate_page to deactivate_file_page

"deactivate_page" was created for file invalidation so it has too specific
logic for file-backed pages.  So, let's change the name of the function
and date to a file-specific one and yield the generic name.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Wang, Yalin <Yalin.Wang@sonymobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  document-interaction-between-compaction-and-the-unevictable-lru-fix
Andrew Morton [Tue, 7 Apr 2015 23:44:27 +0000 (09:44 +1000)]
document-interaction-between-compaction-and-the-unevictable-lru-fix

identify /proc/sys/vm/compact_unevictable directly

Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Eric B Munson <emunson@akamai.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  Documentation/vm/unevictable-lru.txt: document interaction between compaction and the unevictable LRU
Eric B Munson [Tue, 7 Apr 2015 23:44:27 +0000 (09:44 +1000)]
Documentation/vm/unevictable-lru.txt: document interaction between compaction and the unevictable LRU

The memory compaction code uses the migration code to do most of the work
in compaction.  However, the compaction code interacts with the
unevictable LRU differently than the migration code does, and this
difference should be noted in the documentation.

Signed-off-by: Eric B Munson <emunson@akamai.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm: allow compaction of unevictable pages
Eric B Munson [Tue, 7 Apr 2015 23:44:27 +0000 (09:44 +1000)]
mm: allow compaction of unevictable pages

Currently, pages which are marked as unevictable are protected from
compaction, but not from other types of migration.  The POSIX real time
extension explicitly states that mlock() will prevent a major page fault,
but the spirit of this is that mlock() should give a process the ability
to control sources of latency, including minor page faults.  However, the
mlock manpage only explicitly says that a locked page will not be written
to swap and this can cause some confusion.  The compaction code today does
not give a developer who wants to avoid swap but wants to have large
contiguous areas available any method to achieve this state.  This patch
introduces a sysctl for controlling compaction behavior with respect to
the unevictable lru.  Users who demand no page faults after a page is
present can set compact_unevictable_allowed to 0 and users who need the
large contiguous areas can enable compaction on locked memory by leaving
the default value of 1.

To illustrate this problem I wrote a quick test program that mmaps a large
number of 1MB files filled with random data.  These maps are created
locked and read only.  Then every other mmap is unmapped and I attempt to
allocate huge pages to the static huge page pool.  When the
compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages after
fragmenting memory.  When the value is set to 1, allocations succeed.
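
The resulting knob is used like any other vm sysctl, e.g.:

        # default: compaction may migrate mlocked pages
        echo 1 > /proc/sys/vm/compact_unevictable_allowed

        # latency-sensitive, mlock-heavy workloads can opt out
        echo 0 > /proc/sys/vm/compact_unevictable_allowed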

Signed-off-by: Eric B Munson <emunson@akamai.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  include/linux/page-flags.h: rename macros to avoid collisions
Andrew Morton [Tue, 7 Apr 2015 23:44:26 +0000 (09:44 +1000)]
include/linux/page-flags.h: rename macros to avoid collisions

Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm-sanitize-page-mapping-for-tail-pages-v2
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:26 +0000 (09:44 +1000)]
mm-sanitize-page-mapping-for-tail-pages-v2

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  mm: sanitize page->mapping for tail pages
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:26 +0000 (09:44 +1000)]
mm: sanitize page->mapping for tail pages

We don't define the meaning of page->mapping for tail pages.  Currently
it's always NULL, which can be inconsistent with the head page and
potentially lead to problems.

Let's poison the pointer to catch all illegal uses.

page_rmapping(), page_mapping() and page_anon_vma() are changed to look at
the head page.

The only illegal use I've caught so far is __GFP_COMP pages from the sound
subsystem, mapped with PTEs.  do_shared_fault() is changed to use
page_rmapping() instead of direct access to fault_page->mapping.
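
For illustration, this is roughly what a head-page-aware helper looks like
(sketch):

struct anon_vma *page_anon_vma(struct page *page)
{
        unsigned long mapping;

        page = compound_head(page);     /* never trust tail->mapping */
        mapping = (unsigned long)page->mapping;
        if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                return NULL;
        return (struct anon_vma *)(mapping - PAGE_MAPPING_ANON);
}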

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  page-flags: look at head page if the flag is encoded in page->mapping
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:26 +0000 (09:44 +1000)]
page-flags: look at head page if the flag is encoded in page->mapping

PageAnon() and PageKsm() look at the lower bits of page->mapping to check
whether the page is Anon or KSM.  page->mapping can be overloaded in tail
pages.

Let's always look at the head page to avoid false positives.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  page-flags: define PG_uptodate behavior on compound pages
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:26 +0000 (09:44 +1000)]
page-flags: define PG_uptodate behavior on compound pages

We use PG_uptodate on the head pages of transparent huge pages.
Let's use NO_TAIL.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  page-flags: define PG_uncached behavior on compound pages
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:25 +0000 (09:44 +1000)]
page-flags: define PG_uncached behavior on compound pages

So far, only IA64 uses PG_uncached and only on non-compound pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  page-flags: define PG_mlocked behavior on compound pages
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:25 +0000 (09:44 +1000)]
page-flags: define PG_mlocked behavior on compound pages

Transparent huge pages can be mlocked -- the whole compound page at once.
Something has gone wrong if we're trying to mlock() a tail page.
Let's use NO_TAIL.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  page-flags: define PG_swapcache behavior on compound pages
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:25 +0000 (09:44 +1000)]
page-flags: define PG_swapcache behavior on compound pages

Swap cannot handle compound pages so far. Transparent huge pages are
split on the way to swap.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  page-flags: define PG_swapbacked behavior on compound pages
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:25 +0000 (09:44 +1000)]
page-flags: define PG_swapbacked behavior on compound pages

PG_swapbacked is used for transparent huge pages, on head pages only.
Let's use the NO_TAIL policy.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  page-flags: define PG_reserved behavior on compound pages
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:24 +0000 (09:44 +1000)]
page-flags: define PG_reserved behavior on compound pages

As far as I can see there are no users of PG_reserved on compound pages.
Let's use NO_COMPOUND here.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  page-flags: define behavior of Xen-related flags on compound pages
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:24 +0000 (09:44 +1000)]
page-flags: define behavior of Xen-related flags on compound pages

PG_pinned and PG_savepinned are about page table pages, which are never
compound.

I'm not so sure about PG_foreign, but it seems we shouldn't see compound
pages there either.

Let's use NO_COMPOUND for all of them.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  page-flags: define behavior of SL*B-related flags on compound pages
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:24 +0000 (09:44 +1000)]
page-flags: define behavior of SL*B-related flags on compound pages

SL*B uses compound pages and marks head pages with PG_slab.
__SetPageSlab() and __ClearPageSlab() are never called for tail pages.

The situation is the same for PG_slob_free in the SLOB allocator.

NO_TAIL is appropriate for these flags.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  page-flags: define behavior of LRU-related flags on compound pages
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:24 +0000 (09:44 +1000)]
page-flags: define behavior of LRU-related flags on compound pages

Only head pages are ever on the LRU.  Let's use the HEAD policy to avoid
any confusion for all LRU-related flags.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago  page-flags: define behavior of FS/IO-related flags on compound pages
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:24 +0000 (09:44 +1000)]
page-flags: define behavior of FS/IO-related flags on compound pages

It seems we don't have compound pages on the FS/IO path currently.  Use
NO_COMPOUND to catch it if we do.

The odd exception is PG_dirty: sound uses compound pages and maps them
with PTEs.  NO_COMPOUND would trigger VM_BUG_ON() in set_page_dirty() when
handling a shared fault.  Let's use HEAD for PG_dirty.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agopage-flags: define PG_locked behavior on compound pages
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:23 +0000 (09:44 +1000)]
page-flags: define PG_locked behavior on compound pages

lock_page() must operate on the whole compound page.  It doesn't make much
sense to lock part of a compound page.  Change the code to use the head
page's PG_locked if a tail page is passed.

This patch also gets rid of custom helper functions -- __set_page_locked()
and __clear_page_locked().  They are replaced with helpers generated by
__SETPAGEFLAG/__CLEARPAGEFLAG.  Passing tail pages to these helpers would
trigger VM_BUG_ON().

SLUB uses PG_locked as a bit spin lock.  IIUC, tail pages should never
appear there.  A VM_BUG_ON() is added to make sure that this assumption is
correct.
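
An illustrative sketch of the head-page redirection (not the exact diff;
assumes the existing trylock/slow-path helpers):

static inline void lock_page(struct page *page)
{
	might_sleep();
	page = compound_head(page);	/* tail pages redirect to the head */
	if (!trylock_page(page))
		__lock_page(page);
}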

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agopage-flags: introduce page flags policies wrt compound pages
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:23 +0000 (09:44 +1000)]
page-flags: introduce page flags policies wrt compound pages

This patch adds a third argument to macros which create function
definitions for page flags.  This argument defines how page-flags helpers
behave on compound pages.

For now we define four policies:

- PF_ANY: the helper function operates on the page it gets, regardless
  of whether it's a non-compound, head, or tail page.

- PF_HEAD: the helper function operates on the head page of the compound
  page if it gets a tail page.

- PF_NO_TAIL: only head and non-compound pages are acceptable for this
  helper function.

- PF_NO_COMPOUND: only non-compound pages are acceptable for this helper
  function.

For now we use policy PF_ANY for all helpers, which matches current
behaviour.

We do not enforce the policy for TESTPAGEFLAG, because flags are checked
on random pages all over the kernel.  A noticeable exception to this is
PageTransHuge(), which triggers VM_BUG_ON() for tail pages.
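
A minimal sketch of how these policies might expand (hypothetical
simplification of the macro plumbing):

/* Resolve which struct page a helper actually operates on. */
#define PF_ANY(page, enforce)	(page)
#define PF_HEAD(page, enforce)	compound_head(page)
#define PF_NO_TAIL(page, enforce) ({					\
		VM_BUG_ON_PAGE((enforce) && PageTail(page), page);	\
		compound_head(page); })
#define PF_NO_COMPOUND(page, enforce) ({				\
		VM_BUG_ON_PAGE((enforce) && PageCompound(page), page);	\
		(page); })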

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agopage-flags: trivial cleanup for PageTrans* helpers
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:23 +0000 (09:44 +1000)]
page-flags: trivial cleanup for PageTrans* helpers

Use TESTPAGEFLAG_FALSE() to make it a bit cleaner.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm/page-writeback: check-before-clear PageReclaim
Naoya Horiguchi [Tue, 7 Apr 2015 23:44:23 +0000 (09:44 +1000)]
mm/page-writeback: check-before-clear PageReclaim

With the page flag sanitization patchset, an invalid usage of
ClearPageReclaim() is detected in set_page_dirty().  This can be called
from __unmap_hugepage_range(), so let's check PageReclaim() before trying
to clear it to avoid the misuse.
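
The fix boils down to the check-before-clear pattern (a sketch of the
intent in set_page_dirty()):

	/* only clear the flag when it is actually set */
	if (PageReclaim(page))
		ClearPageReclaim(page);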

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm/migrate: check-before-clear PageSwapCache
Naoya Horiguchi [Tue, 7 Apr 2015 23:44:23 +0000 (09:44 +1000)]
mm/migrate: check-before-clear PageSwapCache

With the page flag sanitization patchset, an invalid usage of
ClearPageSwapCache() is detected in migrate_page_copy().
migrate_page_copy() is shared by both the normal and hugepage (both thp
and hugetlb) code paths, so let's check PageSwapCache() and clear it only
if it's set, to avoid the invalid clear operation.
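
The same check-before-clear pattern, sketched for migrate_page_copy():

	/* only issue the clear when the flag is set on the source page */
	if (PageSwapCache(page))
		ClearPageSwapCache(page);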

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: avoid tail page refcounting on non-THP compound pages
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:22 +0000 (09:44 +1000)]
mm: avoid tail page refcounting on non-THP compound pages

THP uses tail page refcounting to be able to split huge pages at any
time.  Tail page refcounting is not needed for other users of compound
pages and it's harmful because of its overhead.

We try to exclude non-THP pages from tail page refcounting using the
__compound_tail_refcounted() check.  It excludes the most common non-THP
compound pages -- SL*B and hugetlb -- but it doesn't catch the rest of
the __GFP_COMP users: drivers.

And it's not only about overhead.

Drivers might want to use compound pages to get refcounting semantics
suitable for mapping high-order pages to userspace.  But tail page
refcounting breaks it.

Tail page refcounting uses ->_mapcount in tail pages to store GUP pins on
them.  It means GUP pins would affect page_mapcount() for tail pages.
It's not a problem for THP, because it never maps tail pages.  But unlike
THP, drivers map parts of compound pages with PTEs, which causes
page_mapcount() to be called for tail pages.

In particular, GUP pins would shift PSS up and affect /proc/kpagecount for
such pages.  But I'm not aware of anything that could lead to a crash or
other serious misbehaviour.

Since currently all THP pages are anonymous and all driver pages are not,
we can fix the __compound_tail_refcounted() check by requiring PageAnon()
to enable tail page refcounting.
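
A sketch of the tightened check (assuming the existing SL*B/hugetlb
exclusions become redundant once PageAnon() is required):

static inline bool __compound_tail_refcounted(struct page *page)
{
	/* all THP pages are anonymous; driver pages are not */
	return PageAnon(page);
}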

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: consolidate all page-flags helpers in <linux/page-flags.h>
Kirill A. Shutemov [Tue, 7 Apr 2015 23:44:22 +0000 (09:44 +1000)]
mm: consolidate all page-flags helpers in <linux/page-flags.h>

Currently we take a naive approach to page flags on compound pages -- we
set the flag on the page without considering whether the flag makes sense
for a tail page or for the compound page in general.  This patchset tries
to sort this out by defining a per-flag policy on what needs to be done
when a page-flag helper operates on a compound page.

The last patch in the patchset also sanitizes usage of page->mapping for
tail pages.  We don't define the meaning of page->mapping for tail pages.
Currently it's always NULL, which can be inconsistent with head page and
potentially lead to problems.

For now I caught one case of illegal usage of page flags or ->mapping:
the sound subsystem allocates pages with __GFP_COMP and maps them with
PTEs.  This leads to setting the dirty bit on tail pages and access to the
tail page's ->mapping.  I don't see any bad behaviour caused by this, but
it's worth fixing anyway.

This patchset makes more sense if you take my THP refcounting into
account: we will see more compound pages mapped with PTEs and we need to
define behaviour of flags on compound pages to avoid bugs.

This patch (of 16):

We have page-flags helper function declarations/definitions spread over
several header files.  Let's consolidate them in <linux/page-flags.h>.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Steve Capper <steve.capper@linaro.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm-memory-failurec-define-page-types-for-action_result-in-one-place-v3
Naoya Horiguchi [Tue, 7 Apr 2015 23:44:22 +0000 (09:44 +1000)]
mm-memory-failurec-define-page-types-for-action_result-in-one-place-v3

ChangeLog v2 -> v3:
- rename action_page_type to action_page_types
- rename enum page_type to enum action_page_type

ChangeLog v1 -> v2:
- fix DIRTY_UNEVICTABLE_LRU typo
- adding "MSG_" prefix to each enum value
- use declaration with type "enum page_type" instead of int
- define action_page_type as "static const char * const" (not "static const char *")

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm/memory-failure.c: define page types for action_result() in one place
Naoya Horiguchi [Tue, 7 Apr 2015 23:44:22 +0000 (09:44 +1000)]
mm/memory-failure.c: define page types for action_result() in one place

This cleanup patch moves all strings passed to action_result() into a
single array, action_page_type, so that a reader can easily find which
kinds of action results are possible.  This patch also fixes the odd
strings printed out, like "unknown page state page" or "free buddy,
2nd try page".

[akpm@linux-foundation.org: rename messages, per David]
[akpm@linux-foundation.org: s/DIRTY_UNEVICTABLE_LRU/CLEAN_UNEVICTABLE_LRU/, per Andi]
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: "Xie XiuQi" <xiexiuqi@huawei.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomemcg: remove obsolete comment
Vladimir Davydov [Tue, 7 Apr 2015 23:44:22 +0000 (09:44 +1000)]
memcg: remove obsolete comment

Low and high watermarks, as they are defined in the TODO for the
mem_cgroup struct, have already been implemented by Johannes, so remove
the stale comment.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomemcg: zap mem_cgroup_lookup()
Vladimir Davydov [Tue, 7 Apr 2015 23:44:22 +0000 (09:44 +1000)]
memcg: zap mem_cgroup_lookup()

mem_cgroup_lookup() is a wrapper around mem_cgroup_from_id(), which checks
that id != 0 before issuing the function call.  Today, there is no point
in this additional check apart from optimization: there is no css with
id <= 0, so css_from_id(), called by mem_cgroup_from_id(), will return
NULL for any id <= 0.

Since mem_cgroup_from_id is only called from mem_cgroup_lookup, let us zap
mem_cgroup_lookup, substituting calls to it with mem_cgroup_from_id and
moving the id > 0 check into css_from_id.
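
For reference, the zapped wrapper amounted to no more than this
(sketched from the description above):

static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
{
	/* the id != 0 check that now lives in css_from_id() */
	return id ? mem_cgroup_from_id(id) : NULL;
}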

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: refactor zone_movable_is_highmem()
Zhang Zhen [Tue, 7 Apr 2015 23:44:21 +0000 (09:44 +1000)]
mm: refactor zone_movable_is_highmem()

All callers of zone_movable_is_highmem are under #ifdef CONFIG_HIGHMEM,
so the else branch returning 0 is not needed.

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm/oom_kill.c: fix typo in comment
Yaowei Bai [Tue, 7 Apr 2015 23:44:21 +0000 (09:44 +1000)]
mm/oom_kill.c: fix typo in comment

Alter 'taks' -> 'task'

Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoDocumentation: update arch list in the 'memtest' entry
Vladimir Murzin [Tue, 7 Apr 2015 23:44:21 +0000 (09:44 +1000)]
Documentation: update arch list in the 'memtest' entry

Since arm64/arm now support the memtest command line option, update the
"memtest" entry.

Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoKconfig: memtest: update number of test patterns up to 17
Vladimir Murzin [Tue, 7 Apr 2015 23:44:21 +0000 (09:44 +1000)]
Kconfig: memtest: update number of test patterns up to 17

Additional test patterns for memtest were introduced by commit 63823126
("x86: memtest: add additional (regular) test patterns"), but it looks
like Kconfig was not updated at the time.

Update the Kconfig entry with the actual maximum number of test patterns.

Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoarm: add support for memtest
Vladimir Murzin [Tue, 7 Apr 2015 23:44:21 +0000 (09:44 +1000)]
arm: add support for memtest

Add support for the memtest command line option.

Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoarm64: add support for memtest
Vladimir Murzin [Tue, 7 Apr 2015 23:44:20 +0000 (09:44 +1000)]
arm64: add support for memtest

Add support for the memtest command line option.

Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomemtest: use phys_addr_t for physical addresses
Vladimir Murzin [Tue, 7 Apr 2015 23:44:20 +0000 (09:44 +1000)]
memtest: use phys_addr_t for physical addresses

Since memtest might be used by other architectures, pass input parameters
as phys_addr_t instead of long to prevent overflow.

Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm-move-memtest-under-mm-fix-fix
Vladimir Murzin [Tue, 7 Apr 2015 23:44:20 +0000 (09:44 +1000)]
mm-move-memtest-under-mm-fix-fix

It was noticed by Paul Bolle that the patch above simply disables MEMTEST
altogether.

Reported-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: MEMTEST depends on MEMBLOCK
Guenter Roeck [Tue, 7 Apr 2015 23:44:20 +0000 (09:44 +1000)]
mm: MEMTEST depends on MEMBLOCK

Building alpha:allmodconfig fails with

mm/memtest.c: In function 'reserve_bad_mem':
mm/memtest.c:38:2: error: implicit declaration of function 'memblock_reserve'
mm/memtest.c: In function 'do_one_pass':
mm/memtest.c:77:2: error: implicit declaration of function 'for_each_free_mem_range'
mm/memtest.c:77:73: error: expected ';' before '{' token

because it depends on MEMBLOCK which is not defined for the alpha
architecture.

Fixes: 420c89e6185d ("mm: move memtest under mm")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Acked-by: Vladimir Murzin <vladimir.murzin@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: move memtest under mm
Vladimir Murzin [Tue, 7 Apr 2015 23:44:20 +0000 (09:44 +1000)]
mm: move memtest under mm

Memtest is a simple feature which fills memory with a given set of
patterns and validates the memory contents; if bad memory regions are
detected, it reserves them via the memblock API.  Since the memblock API
is widely used by other architectures, this feature can be enabled
outside of the x86 world.
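
A rough sketch of one memtest pass (a hypothetical simplification of
mm/memtest.c; the helper name is illustrative, and the real code
coalesces bad ranges instead of reserving word by word):

static void __init memtest_one_pass(u64 pattern,
				    phys_addr_t start, phys_addr_t end)
{
	u64 *p, *s = __va(start), *e = __va(end);

	for (p = s; p < e; p++)		/* fill with the pattern */
		*p = pattern;
	for (p = s; p < e; p++)		/* verify, reserve bad memory */
		if (*p != pattern)
			memblock_reserve(__pa(p), sizeof(*p));
}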

This patch set promotes memtest to live under generic mm umbrella and
enables memtest feature for arm/arm64.

It was reported that this patch set was useful for tracking down an issue
with some errant DMA on an arm64 platform.

This patch (of 6):

There is nothing platform-dependent in the core memtest code, so other
platforms might benefit from this feature too.

Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm, hugetlb: abort __get_user_pages if current has been oom killed
David Rientjes [Tue, 7 Apr 2015 23:44:19 +0000 (09:44 +1000)]
mm, hugetlb: abort __get_user_pages if current has been oom killed

If __get_user_pages() is faulting a significant number of hugetlb pages,
usually as the result of mmap(MAP_LOCKED), it can potentially allocate a
very large amount of memory.

If the process has been oom killed, these allocations can needlessly
deplete memory reserves.

In the same way that commit 4779280d1ea4 ("mm: make get_user_pages()
interruptible") aborted for pending SIGKILLs when faulting non-hugetlb
memory, based on the premise of commit 462e00cc7151 ("oom: stop allocating
user memory if TIF_MEMDIE is set"), hugetlb page faults now terminate when
the process has been oom killed.
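
The termination is a pending-SIGKILL check in the hugetlb fault loop,
mirroring what was done for non-hugetlb memory (a sketch; 'remainder' is
the loop's page counter):

	/* in follow_hugetlb_page(): stop faulting if we were oom killed */
	if (fatal_signal_pending(current)) {
		remainder = 0;
		break;
	}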

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Greg Thelen <gthelen@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm-mempool-do-not-allow-atomic-resizing-checkpatch-fixes
Andrew Morton [Tue, 7 Apr 2015 23:44:19 +0000 (09:44 +1000)]
mm-mempool-do-not-allow-atomic-resizing-checkpatch-fixes

WARNING: Prefer kmalloc_array over kmalloc with multiply
#111: FILE: mm/mempool.c:149:
+ new_elements = kmalloc(new_min_nr * sizeof(*new_elements), GFP_KERNEL);

total: 0 errors, 1 warnings, 81 lines checked

./patches/mm-mempool-do-not-allow-atomic-resizing.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm, mempool: do not allow atomic resizing
David Rientjes [Tue, 7 Apr 2015 23:44:19 +0000 (09:44 +1000)]
mm, mempool: do not allow atomic resizing

Allocating a large number of elements in atomic context could quickly
deplete memory reserves, so just disallow atomic resizing entirely.

Nothing currently uses mempool_resize() with anything other than
GFP_KERNEL, so convert existing callers to drop the gfp_mask.
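
After this change the resize path is implicitly GFP_KERNEL, so the
prototype loses its gfp argument (a sketch of the interface change):

/* old: int mempool_resize(mempool_t *pool, int new_min_nr, gfp_t gfp_mask); */
int mempool_resize(mempool_t *pool, int new_min_nr);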

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Steffen Maier <maier@linux.vnet.ibm.com> [zfcp]
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Steve French <sfrench@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomemcg: print cgroup information when system panics due to panic_on_oom
Balasubramani Vivekanandan [Tue, 7 Apr 2015 23:44:19 +0000 (09:44 +1000)]
memcg: print cgroup information when system panics due to panic_on_oom

If the kernel panics due to an oom caused by a cgroup reaching its limit
while 'compulsory panic_on_oom' is enabled, we will only see that the OOM
happened because "compulsory panic_on_oom is enabled", but this doesn't
tell the difference between mempolicy and memcg.  And dumping system-wide
information is plain wrong and more confusing.  This patch provides the
information of the cgroup whose limit triggered the panic.

Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: numa: remove migrate_ratelimited
Mel Gorman [Tue, 7 Apr 2015 23:44:18 +0000 (09:44 +1000)]
mm: numa: remove migrate_ratelimited

This code is dead since commit 9e645ab6d089 ("sched/numa: Continue PTE
scanning even if migrate rate limited") so remove it.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
Kees Cook [Tue, 7 Apr 2015 23:44:18 +0000 (09:44 +1000)]
mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE

The arch_randomize_brk() function is used on several architectures,
even those that don't support ET_DYN ASLR. To avoid bulky extern/#define
tricks, consolidate the support under CONFIG_ARCH_HAS_ELF_RANDOMIZE for
the architectures that support it, while still handling CONFIG_COMPAT_BRK.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Russell King <linux@arm.linux.org.uk>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: "David A. Long" <dave.long@linaro.org>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Arun Chandran <achandran@mvista.com>
Cc: Yann Droneaud <ydroneaud@opteya.com>
Cc: Min-Hua Chen <orca.chen@gmail.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Alex Smith <alex@alex-smith.me.uk>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: Vineeth Vijayan <vvijayan@mvista.com>
Cc: Jeff Bailey <jeffbailey@google.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Behan Webster <behanw@converseincode.com>
Cc: Ismael Ripoll <iripoll@upv.es>
Cc: Jan-Simon Möller <dl9pf@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: split ET_DYN ASLR from mmap ASLR
Kees Cook [Tue, 7 Apr 2015 23:44:18 +0000 (09:44 +1000)]
mm: split ET_DYN ASLR from mmap ASLR

This fixes the "offset2lib" weakness in ASLR for arm, arm64, mips,
powerpc, and x86.  The problem is that if there is a leak of ASLR from the
executable (ET_DYN), it means a leak of shared library offset as well
(mmap), and vice versa.  Further details and a PoC of this attack is
available here:
http://cybersecurity.upv.es/attacks/offset2lib/offset2lib.html

With this patch, a PIE linked executable (ET_DYN) has its own ASLR region:

$ ./show_mmaps_pie
54859ccd6000-54859ccd7000 r-xp  ...  /tmp/show_mmaps_pie
54859ced6000-54859ced7000 r--p  ...  /tmp/show_mmaps_pie
54859ced7000-54859ced8000 rw-p  ...  /tmp/show_mmaps_pie
7f75be764000-7f75be91f000 r-xp  ...  /lib/x86_64-linux-gnu/libc.so.6
7f75be91f000-7f75beb1f000 ---p  ...  /lib/x86_64-linux-gnu/libc.so.6
7f75beb1f000-7f75beb23000 r--p  ...  /lib/x86_64-linux-gnu/libc.so.6
7f75beb23000-7f75beb25000 rw-p  ...  /lib/x86_64-linux-gnu/libc.so.6
7f75beb25000-7f75beb2a000 rw-p  ...
7f75beb2a000-7f75beb4d000 r-xp  ...  /lib64/ld-linux-x86-64.so.2
7f75bed45000-7f75bed46000 rw-p  ...
7f75bed46000-7f75bed47000 r-xp  ...
7f75bed47000-7f75bed4c000 rw-p  ...
7f75bed4c000-7f75bed4d000 r--p  ...  /lib64/ld-linux-x86-64.so.2
7f75bed4d000-7f75bed4e000 rw-p  ...  /lib64/ld-linux-x86-64.so.2
7f75bed4e000-7f75bed4f000 rw-p  ...
7fffb3741000-7fffb3762000 rw-p  ...  [stack]
7fffb377b000-7fffb377d000 r--p  ...  [vvar]
7fffb377d000-7fffb377f000 r-xp  ...  [vdso]

The change is to add a call to the newly created arch_mmap_rnd() in the
ELF loader for handling ET_DYN ASLR in a separate region from mmap ASLR,
as was already done on s390.  This removes CONFIG_BINFMT_ELF_RANDOMIZE_PIE,
which is no longer needed.
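
A sketch of the loader-side change (a hypothetical simplification of the
ET_DYN path in fs/binfmt_elf.c):

	/* give a PIE executable (ET_DYN) its own randomized base,
	 * independent of the mmap ASLR region */
	if (elf_interpreter) {
		load_bias = ELF_ET_DYN_BASE - vaddr;
		if (current->flags & PF_RANDOMIZE)
			load_bias += arch_mmap_rnd();
		load_bias = ELF_PAGESTART(load_bias);
	}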

Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Russell King <linux@arm.linux.org.uk>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: "David A. Long" <dave.long@linaro.org>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Arun Chandran <achandran@mvista.com>
Cc: Yann Droneaud <ydroneaud@opteya.com>
Cc: Min-Hua Chen <orca.chen@gmail.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Alex Smith <alex@alex-smith.me.uk>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: Vineeth Vijayan <vvijayan@mvista.com>
Cc: Jeff Bailey <jeffbailey@google.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Behan Webster <behanw@converseincode.com>
Cc: Ismael Ripoll <iripoll@upv.es>
Cc: Jan-Simon Möller <dl9pf@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agos390: redefine randomize_et_dyn for ELF_ET_DYN_BASE
Kees Cook [Tue, 7 Apr 2015 23:44:18 +0000 (09:44 +1000)]
s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE

In preparation for moving ET_DYN randomization into the ELF loader (which
requires a static ELF_ET_DYN_BASE), this redefines s390's existing ET_DYN
randomization in a call to arch_mmap_rnd(). This refactoring results in
the same ET_DYN randomization on s390.

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: expose arch_mmap_rnd when available
Kees Cook [Tue, 7 Apr 2015 23:44:18 +0000 (09:44 +1000)]
mm: expose arch_mmap_rnd when available

When an architecture fully supports randomizing the ELF load location,
a per-arch mmap_rnd() function is used to find a randomized mmap base.
In preparation for randomizing the location of ET_DYN binaries
separately from mmap, this renames and exports these functions as
arch_mmap_rnd(). Additionally introduces CONFIG_ARCH_HAS_ELF_RANDOMIZE
for describing this feature on architectures that support it
(which is a superset of ARCH_BINFMT_ELF_RANDOMIZE_PIE, since s390
already supports a separated ET_DYN ASLR from mmap ASLR without the
ARCH_BINFMT_ELF_RANDOMIZE_PIE logic).

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Russell King <linux@arm.linux.org.uk>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: "David A. Long" <dave.long@linaro.org>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Arun Chandran <achandran@mvista.com>
Cc: Yann Droneaud <ydroneaud@opteya.com>
Cc: Min-Hua Chen <orca.chen@gmail.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Alex Smith <alex@alex-smith.me.uk>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: Vineeth Vijayan <vvijayan@mvista.com>
Cc: Jeff Bailey <jeffbailey@google.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Behan Webster <behanw@converseincode.com>
Cc: Ismael Ripoll <iripoll@upv.es>
Cc: Jan-Simon Möller <dl9pf@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agos390: standardize mmap_rnd() usage
Kees Cook [Tue, 7 Apr 2015 23:44:18 +0000 (09:44 +1000)]
s390: standardize mmap_rnd() usage

In preparation for splitting out ET_DYN ASLR, this refactors the use of
mmap_rnd() to be used similarly to arm and x86, and extracts the checking
of PF_RANDOMIZE.

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agopowerpc-standardize-mmap_rnd-usage-v4
Kees Cook [Tue, 7 Apr 2015 23:44:17 +0000 (09:44 +1000)]
powerpc-standardize-mmap_rnd-usage-v4

fixed mmap_base argument convention to be the same on all archs

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agopowerpc: standardize mmap_rnd() usage
Kees Cook [Tue, 7 Apr 2015 23:44:17 +0000 (09:44 +1000)]
powerpc: standardize mmap_rnd() usage

In preparation for splitting out ET_DYN ASLR, this refactors the use of
mmap_rnd() to be used similarly to arm and x86.

(Can mmap ASLR be safely enabled in the legacy mmap case here?  Other
archs use "mm->mmap_base = TASK_UNMAPPED_BASE + random_factor".)

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomips-extract-logic-for-mmap_rnd-v4
Kees Cook [Tue, 7 Apr 2015 23:44:17 +0000 (09:44 +1000)]
mips-extract-logic-for-mmap_rnd-v4

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomips: extract logic for mmap_rnd()
Kees Cook [Tue, 7 Apr 2015 23:44:17 +0000 (09:44 +1000)]
mips: extract logic for mmap_rnd()

In preparation for splitting out ET_DYN ASLR, extract the mmap ASLR
selection into a separate function.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoarm64-standardize-mmap_rnd-usage-v4
Kees Cook [Tue, 7 Apr 2015 23:44:17 +0000 (09:44 +1000)]
arm64-standardize-mmap_rnd-usage-v4

fixed mmap_base argument convention to be the same on all archs

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoarm64: standardize mmap_rnd() usage
Kees Cook [Tue, 7 Apr 2015 23:44:16 +0000 (09:44 +1000)]
arm64: standardize mmap_rnd() usage

In preparation for splitting out ET_DYN ASLR, this refactors the use of
mmap_rnd() to be used similarly to arm and x86.  This additionally enables
mmap ASLR on legacy mmap layouts, which appeared to be missing on arm64
and was already supported on arm.  It also removes a copy/pasted
declaration of an unused function.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agox86: standardize mmap_rnd() usage
Kees Cook [Tue, 7 Apr 2015 23:44:16 +0000 (09:44 +1000)]
x86: standardize mmap_rnd() usage

In preparation for splitting out ET_DYN ASLR, this refactors the use of
mmap_rnd() to be used similarly to arm, and extracts the checking of
PF_RANDOMIZE.

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoarm: factor out mmap ASLR into mmap_rnd
Kees Cook [Tue, 7 Apr 2015 23:44:16 +0000 (09:44 +1000)]
arm: factor out mmap ASLR into mmap_rnd

To address the "offset2lib" ASLR weakness[1], this separates ET_DYN ASLR
from mmap ASLR, as already done on s390.  The architectures that are
already randomizing mmap (arm, arm64, mips, powerpc, s390, and x86), have
their various forms of arch_mmap_rnd() made available via the new
CONFIG_ARCH_HAS_ELF_RANDOMIZE.  For these architectures,
arch_randomize_brk() is collapsed as well.

This is an alternative to the solutions in:
https://lkml.org/lkml/2015/2/23/442

I've been able to test x86 and arm, and the buildbot (so far) seems happy
with building the rest.

[1] http://cybersecurity.upv.es/attacks/offset2lib/offset2lib.html

This patch (of 10):

In preparation for splitting out ET_DYN ASLR, this moves the ASLR
calculations for mmap on ARM into a separate routine, similar to x86.
This also removes the redundant check of personality (PF_RANDOMIZE is
already set before calling arch_pick_mmap_layout).
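
The extracted helper is small; a sketch assuming ARM keeps its 8 bits of
mmap randomness:

static unsigned long mmap_rnd(void)
{
	unsigned long rnd;

	/* 8 bits of randomness in 20 address-space bits */
	rnd = (unsigned long)get_random_int() % (1 << 8);

	return rnd << PAGE_SHIFT;
}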

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Russell King <linux@arm.linux.org.uk>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: "David A. Long" <dave.long@linaro.org>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Arun Chandran <achandran@mvista.com>
Cc: Yann Droneaud <ydroneaud@opteya.com>
Cc: Min-Hua Chen <orca.chen@gmail.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Alex Smith <alex@alex-smith.me.uk>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: Vineeth Vijayan <vvijayan@mvista.com>
Cc: Jeff Bailey <jeffbailey@google.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Behan Webster <behanw@converseincode.com>
Cc: Ismael Ripoll <iripoll@upv.es>
Cc: Jan-Simon Möller <dl9pf@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: memcontrol: let mem_cgroup_move_account() have effect only if MMU enabled
Chen Gang [Tue, 7 Apr 2015 23:44:16 +0000 (09:44 +1000)]
mm: memcontrol: let mem_cgroup_move_account() have effect only if MMU enabled

When !MMU, the build reports a warning.  The related warning with
allmodconfig under c6x:

    CC      mm/memcontrol.o
  mm/memcontrol.c:2802:12: warning: 'mem_cgroup_move_account' defined but not used [-Wunused-function]
   static int mem_cgroup_move_account(struct page *page,
              ^

Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agox86, mm: ioremap_pud_capable can be static
kbuild test robot [Tue, 7 Apr 2015 23:44:16 +0000 (09:44 +1000)]
x86, mm: ioremap_pud_capable can be static

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agox86, mm: support huge KVA mappings on x86
Toshi Kani [Tue, 7 Apr 2015 23:44:15 +0000 (09:44 +1000)]
x86, mm: support huge KVA mappings on x86

Implement huge KVA mapping interfaces on x86.

On x86, MTRRs can override PAT memory types with a 4KB granularity.  When
using a huge page, MTRRs can override the memory type of the huge page,
which may lead to a performance penalty.  The processor can also behave in
an undefined manner if a huge page is mapped to a memory range that MTRRs
have mapped with multiple different memory types.  Therefore, the mapping
code falls back to using a smaller page size toward 4KB when a mapping
range is covered by a non-WB type of MTRR.  The WB type of MTRR has no
effect on the PAT memory types.

pud_set_huge() and pmd_set_huge() call mtrr_type_lookup() to see if a
given range is covered by MTRRs.  MTRR_TYPE_WRBACK indicates that the
range is either covered by WB or not covered and the MTRR default value is
set to WB.  0xFF indicates that MTRRs are disabled.
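
A sketch of pud_set_huge() with the MTRR guard (simplified; the pmd
variant is analogous):

int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
{
	u8 mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE);

	/* refuse when covered by a non-WB MTRR (0xFF: MTRRs disabled) */
	if (mtrr != MTRR_TYPE_WRBACK && mtrr != 0xFF)
		return 0;

	set_pte((pte_t *)pud, pfn_pte((u64)addr >> PAGE_SHIFT,
				      __pgprot(pgprot_val(prot) | _PAGE_PSE)));
	return 1;
}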

HAVE_ARCH_HUGE_VMAP is selected when X86_64 or X86_32 with X86_PAE is set.
X86_32 without X86_PAE is not supported since such a config is unlikely to
benefit from this feature, and an issue was found in testing.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Robert Elliott <Elliott@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agox86, mm: support huge I/O mapping capability I/F
Toshi Kani [Tue, 7 Apr 2015 23:44:15 +0000 (09:44 +1000)]
x86, mm: support huge I/O mapping capability I/F

Implement huge I/O mapping capability interfaces for ioremap() on x86.

IOREMAP_MAX_ORDER is defined to PUD_SHIFT on x86/64 and PMD_SHIFT on
x86/32, which overrides the default value defined in <linux/vmalloc.h>.
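
The override reduces to a pair of defines (a sketch matching the
description):

#ifdef CONFIG_X86_64
#define IOREMAP_MAX_ORDER	(PUD_SHIFT)
#else	/* x86/32 with PAE */
#define IOREMAP_MAX_ORDER	(PMD_SHIFT)
#endif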

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Robert Elliott <Elliott@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm-change-vunmap-to-tear-down-huge-kva-mappings-fix
Andrew Morton [Tue, 7 Apr 2015 23:44:15 +0000 (09:44 +1000)]
mm-change-vunmap-to-tear-down-huge-kva-mappings-fix

use consistent code layout

Cc: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: change vunmap to tear down huge KVA mappings
Toshi Kani [Tue, 7 Apr 2015 23:44:15 +0000 (09:44 +1000)]
mm: change vunmap to tear down huge KVA mappings

Change vunmap_pmd_range() and vunmap_pud_range() to tear down huge KVA
mappings when they are set.  pud_clear_huge() and pmd_clear_huge() return
zero when no operation is performed, i.e. a huge page mapping was not used.

These changes are only enabled when CONFIG_HAVE_ARCH_HUGE_VMAP is defined
on the architecture.
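
A sketch of the teardown hook at the pmd level (the pud level is
analogous):

static void vunmap_pmd_range(pud_t *pud, unsigned long addr,
			     unsigned long end)
{
	pmd_t *pmd = pmd_offset(pud, addr);
	unsigned long next;

	do {
		next = pmd_addr_end(addr, end);
		if (pmd_clear_huge(pmd))	/* tore down a huge mapping */
			continue;
		if (pmd_none_or_clear_bad(pmd))
			continue;
		vunmap_pte_range(pmd, addr, next);
	} while (pmd++, addr = next, addr != end);
}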

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Robert Elliott <Elliott@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoFix build errors in asm-generic/pgtable.h
Toshi Kani [Tue, 7 Apr 2015 23:44:15 +0000 (09:44 +1000)]
Fix build errors in asm-generic/pgtable.h

Fix build errors in pud_set_huge() and pmd_set_huge() in
asm-generic/pgtable.h on some architectures in linux-next
and -mm trees.

C-style code needs to be placed under #ifndef __ASSEMBLY__.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Fengguang Wu <fengguang.wu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: change ioremap to set up huge I/O mappings
Toshi Kani [Tue, 7 Apr 2015 23:44:15 +0000 (09:44 +1000)]
mm: change ioremap to set up huge I/O mappings

ioremap_pud_range() and ioremap_pmd_range() are changed to create huge I/O
mappings when their capability is enabled and a request meets the required
conditions -- both virtual & physical addresses are aligned to their huge
page size, and the requested range fulfills their huge page size.  When
pud_set_huge() or pmd_set_huge() returns zero, i.e. no operation is
performed, the code simply falls back to the next level.

The changes are only enabled when CONFIG_HAVE_ARCH_HUGE_VMAP is defined on
the architecture.
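
Inside the walk loop of ioremap_pmd_range(), the huge-mapping attempt
might look like this (a sketch; the pud level is analogous):

	if (ioremap_pmd_enabled() &&
	    ((next - addr) == PMD_SIZE) &&
	    IS_ALIGNED(phys_addr + addr, PMD_SIZE)) {
		if (pmd_set_huge(pmd, phys_addr + addr, prot))
			continue;	/* mapped 2MB at once, skip ptes */
	}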

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Robert Elliott <Elliott@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoFix undefined ioremap_huge_init when CONFIG_MMU is not set
Toshi Kani [Tue, 7 Apr 2015 23:44:14 +0000 (09:44 +1000)]
Fix undefined ioremap_huge_init when CONFIG_MMU is not set

Fix a build error, undefined reference to ioremap_huge_init, when
CONFIG_MMU is not defined on linux-next and -mm tree.

lib/ioremap.o is not linked to the kernel when CONFIG_MMU is not
defined.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Fengguang Wu <fengguang.wu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolib/ioremap.c: add huge I/O map capability interfaces
Toshi Kani [Tue, 7 Apr 2015 23:44:14 +0000 (09:44 +1000)]
lib/ioremap.c: add huge I/O map capability interfaces

Add ioremap_pud_enabled() and ioremap_pmd_enabled(), which return 1 when
I/O mappings with pud/pmd are enabled on the kernel.

ioremap_huge_init() calls arch_ioremap_pud_supported() and
arch_ioremap_pmd_supported() to initialize the capabilities at boot-time.

A new kernel option "nohugeiomap" is also added so that the user can
disable the huge I/O map capabilities when necessary.
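
A sketch of the boot-time setup and the new option (the static flag
names are illustrative):

static int __read_mostly ioremap_pud_capable;
static int __read_mostly ioremap_pmd_capable;
static int __read_mostly ioremap_huge_disabled;

static int __init set_nohugeiomap(char *str)
{
	ioremap_huge_disabled = 1;
	return 0;
}
early_param("nohugeiomap", set_nohugeiomap);

void __init ioremap_huge_init(void)
{
	if (!ioremap_huge_disabled) {
		if (arch_ioremap_pud_supported())
			ioremap_pud_capable = 1;
		if (arch_ioremap_pmd_supported())
			ioremap_pmd_capable = 1;
	}
}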

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Robert Elliott <Elliott@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: change __get_vm_area_node() to use fls_long()
Toshi Kani [Tue, 7 Apr 2015 23:44:14 +0000 (09:44 +1000)]
mm: change __get_vm_area_node() to use fls_long()

ioremap() and its related interfaces are used to create I/O mappings to
memory-mapped I/O devices.  The mapping sizes of the traditional I/O
devices are relatively small.  Non-volatile memory (NVM), however, has
many GB and is going to have TB soon.  It is not very efficient to create
large I/O mappings with 4KB pages.

This patchset extends the ioremap() interfaces to transparently create I/O
mappings with huge pages whenever possible.  ioremap() continues to use
4KB mappings when a huge page does not fit into a requested range.  There
is no change necessary to the drivers using ioremap().  A requested
physical address must be aligned by a huge page size (1GB or 2MB on x86)
for using huge page mapping, though.  The kernel huge I/O mapping will
improve performance of NVM and other devices with large memory, and reduce
the time to create their mappings as well.

On x86, MTRRs can override PAT memory types with a 4KB granularity.  When
using a huge page, MTRRs can override the memory type of the huge page,
which may lead to a performance penalty.  The processor can also behave in
an undefined manner if a huge page is mapped to a memory range that MTRRs
have mapped with multiple different memory types.  Therefore, the mapping
code falls back to using a smaller page size toward 4KB when a mapping
range is covered by a non-WB type of MTRR.  The WB type of MTRR has no
effect on the PAT memory types.

The patchset introduces HAVE_ARCH_HUGE_VMAP, which indicates that the arch
supports huge KVA mappings for ioremap().  The user may specify a new kernel
option "nohugeiomap" to disable the huge I/O mapping capability of
ioremap() when necessary.

Patches 1-4 change common files to support huge I/O mappings.  There is no
change in functionality unless HAVE_ARCH_HUGE_VMAP is defined on the
architecture of the system.

Patches 5-6 implement the HAVE_ARCH_HUGE_VMAP functions on x86 and set
HAVE_ARCH_HUGE_VMAP on x86.

This patch (of 6):

__get_vm_area_node() takes unsigned long size, which is a 64-bit value on
a 64-bit kernel.  However, fls(size) simply ignores the upper 32 bits.
Change it to use fls_long() to handle the size properly.
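
The fix is essentially a one-identifier change (sketched; assumes the
surrounding clamp() expression is as in __get_vm_area_node()):

	/* before: fls() looks only at the low 32 bits of a 64-bit size */
	align = 1ul << clamp(fls(size), PAGE_SHIFT, IOREMAP_MAX_ORDER);

	/* after: fls_long() handles the full unsigned long */
	align = 1ul << clamp(fls_long(size), PAGE_SHIFT, IOREMAP_MAX_ORDER);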

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Robert Elliott <Elliott@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm/page_alloc.c: clean up comment
Yaowei Bai [Tue, 7 Apr 2015 23:44:14 +0000 (09:44 +1000)]
mm/page_alloc.c: clean up comment

Signed-off-by: Yaowei Bai <bywxiaobai@163.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agosparc: clarify __GFP_NOFAIL allocation
Michal Hocko [Tue, 7 Apr 2015 23:44:14 +0000 (09:44 +1000)]
sparc: clarify __GFP_NOFAIL allocation

920c3ed74134 ([SPARC64]: Add basic infrastructure for MD add/remove
notification.) has added __GFP_NOFAIL for the allocation request but it
hasn't mentioned why is this strict requirement really needed.  The code
was handling an allocation failure and propagated it properly up the
callchain so it is not clear why it is needed.

Dave has clarified the intention when I tried to remove the flag as not
being necessary:

: It is a serious failure.
:
: If we miss an MDESC update due to this allocation failure, the update
: is not an event which gets retransmitted so we will lose the updated
: machine description forever.
:
: We really need this allocation to succeed.

So add a comment to clarify the nofail flag and get rid of the failure
check because __GFP_NOFAIL allocation doesn't fail.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Vipul Pandya <vipul@chelsio.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm-clarify-__gfp_nofail-deprecation-status-checkpatch-fixes
Andrew Morton [Tue, 7 Apr 2015 23:44:13 +0000 (09:44 +1000)]
mm-clarify-__gfp_nofail-deprecation-status-checkpatch-fixes

WARNING: 'carefuly' may be misspelled - perhaps 'carefully'?
#40: FILE: include/linux/gfp.h:60:
+ * cannot handle allocation failures. New users should be evaluated carefuly

total: 0 errors, 1 warnings, 12 lines checked

./patches/mm-clarify-__gfp_nofail-deprecation-status.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: clarify __GFP_NOFAIL deprecation status
Michal Hocko [Tue, 7 Apr 2015 23:44:13 +0000 (09:44 +1000)]
mm: clarify __GFP_NOFAIL deprecation status

__GFP_NOFAIL is documented as a deprecated flag since 478352e789f5 (mm:
add comment about deprecation of __GFP_NOFAIL).  This has discouraged
people from using it but in some cases an opencoded endless loop around
allocator has been used instead.  So the allocator is not aware of the de
facto __GFP_NOFAIL allocation because this information was not
communicated properly.

Let's make it clear that if the allocation context really cannot afford
failure because there is no good failure policy, then using __GFP_NOFAIL
is preferable to opencoding the loop outside of the allocator.
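
The contrast, as a sketch:

	/* opencoded nofail loop -- the allocator cannot see the requirement */
	do {
		ptr = kmalloc(size, GFP_KERNEL);
	} while (!ptr);

	/* preferred -- the requirement is communicated to the allocator */
	ptr = kmalloc(size, GFP_KERNEL | __GFP_NOFAIL);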

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Vipul Pandya <vipul@chelsio.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: cma: constify and use correct signedness in mm/cma.c
Sasha Levin [Tue, 7 Apr 2015 23:44:13 +0000 (09:44 +1000)]
mm: cma: constify and use correct signedness in mm/cma.c

Constify function parameters and use correct signedness where needed.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Acked-by: Gregory Fong <gregory.0xf0@gmail.com>
Cc: Pintu Kumar <pintu.k@samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agokernel, cpuset: remove exception for __GFP_THISNODE
David Rientjes [Tue, 7 Apr 2015 23:44:13 +0000 (09:44 +1000)]
kernel, cpuset: remove exception for __GFP_THISNODE

Nothing calls __cpuset_node_allowed() with __GFP_THISNODE set anymore, so
remove the obscure comment about it and its special-case exception.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Jarno Rajahalme <jrajahalme@nicira.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm, thp: really limit transparent hugepage allocation to local node
David Rientjes [Tue, 7 Apr 2015 23:44:13 +0000 (09:44 +1000)]
mm, thp: really limit transparent hugepage allocation to local node

Commit 077fcf116c8c ("mm/thp: allocate transparent hugepages on local
node") restructured alloc_hugepage_vma() with the intent of only
allocating transparent hugepages locally when there was not an effective
interleave mempolicy.

alloc_pages_exact_node() does not limit the allocation to the single node,
however, but rather prefers it.  This is because __GFP_THISNODE is not set,
which would cause the node-local nodemask to be passed.  Without it, only
a nodemask that prefers the local node is passed.

Fix this by passing __GFP_THISNODE and falling back to small pages when
the allocation fails.
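
A sketch of the fix (the gfp mask and order here are illustrative):

	/* __GFP_THISNODE makes the allocation truly node-local */
	page = alloc_pages_exact_node(node, gfp | __GFP_THISNODE,
				      HPAGE_PMD_ORDER);
	if (!page)
		return NULL;	/* caller falls back to small pages */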

Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
node") suffers from a similar problem for khugepaged, which is also fixed.

Fixes: 077fcf116c8c ("mm/thp: allocate transparent hugepages on local node")
Fixes: 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node")
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Jarno Rajahalme <jrajahalme@nicira.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: remove GFP_THISNODE
David Rientjes [Tue, 7 Apr 2015 23:44:12 +0000 (09:44 +1000)]
mm: remove GFP_THISNODE

NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE.

GFP_THISNODE is a secret combination of gfp bits that have different
behavior than expected.  It is a combination of __GFP_THISNODE,
__GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page allocator
slowpath to fail without trying reclaim even though it may be used in
combination with __GFP_WAIT.

An example of the problem this creates: commit e97ca8e5b864 ("mm: fix
GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE
that really just wanted __GFP_THISNODE.  The problem doesn't end there,
however, because even that was a no-op for alloc_misplaced_dst_page(),
which also sets __GFP_NORETRY and __GFP_NOWARN, and for
migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT
are set in GFP_TRANSHUGE.
no-op in these cases since the page allocator special-cases __GFP_THISNODE
&& __GFP_NORETRY && __GFP_NOWARN.

It's time to just remove GFP_THISNODE entirely.  We leave __GFP_THISNODE
to restrict an allocation to a local node, but remove GFP_THISNODE and its
obscurity.  Instead, we require that a caller clear __GFP_WAIT if it wants
to avoid reclaim.

This allows the aforementioned functions to actually reclaim as they
should.  It also enables any future callers that want to do __GFP_THISNODE
but also __GFP_NORETRY && __GFP_NOWARN to reclaim.  The rule is simple: if
you don't want to reclaim, then don't set __GFP_WAIT.
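
In practice the conversion looks like this (a sketch; GFP_THISNODE used
to be __GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN):

	/* node-local and non-reclaiming: spell it out, clear __GFP_WAIT */
	gfp_t gfp = (GFP_KERNEL | __GFP_THISNODE |
		     __GFP_NORETRY | __GFP_NOWARN) & ~__GFP_WAIT;
	page = alloc_pages_node(nid, gfp, order);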

Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so
it is unchanged.

Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pravin Shelar <pshelar@nicira.com>
Cc: Jarno Rajahalme <jrajahalme@nicira.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm, mempolicy: migrate_to_node should only migrate to node
David Rientjes [Tue, 7 Apr 2015 23:44:12 +0000 (09:44 +1000)]
mm, mempolicy: migrate_to_node should only migrate to node

migrate_to_node() is intended to migrate a page from one source node to a
target node.

Today, migrate_to_node() could end up migrating to any node, not only the
target node.  This is because the page migration allocator,
new_node_page() does not pass __GFP_THISNODE to alloc_pages_exact_node().
This causes the target node to be preferred but allows fallback to any
other node in order of affinity.

Prevent this by allocating with __GFP_THISNODE.  If memory is not
available, -ENOMEM will be returned as appropriate.
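
A sketch of the fixed migration allocator (a hypothetical
simplification):

static struct page *new_node_page(struct page *page, unsigned long node,
				  int **result)
{
	/* __GFP_THISNODE forbids fallback to other nodes */
	return alloc_pages_exact_node(node,
			GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
}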

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agocleancache-remove-limit-on-the-number-of-cleancache-enabled-filesystems-fix
Vladimir Davydov [Tue, 7 Apr 2015 23:44:12 +0000 (09:44 +1000)]
cleancache-remove-limit-on-the-number-of-cleancache-enabled-filesystems-fix

I admit the synchronization between cleancache_register_ops and
cleancache_init_fs is far from obvious.  I should have updated the comment
instead of merely dropping it, sorry.  What about the following patch
proving correctness of register_ops-vs-init_fs synchronization?  It is
meant to be applied incrementally on top of patch #4.

Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Stefan Hengelein <ilendir@googlemail.com>
Cc: Florian Schmaus <fschmaus@gmail.com>
Cc: Andor Daam <andor.daam@googlemail.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Bob Liu <lliubbo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>