]> git.karo-electronics.de Git - karo-tx-linux.git/log
karo-tx-linux.git
11 years agomm/tile: use common help functions to free reserved pages
Jiang Liu [Wed, 19 Jun 2013 00:06:17 +0000 (10:06 +1000)]
mm/tile: use common help functions to free reserved pages

Use common help functions to free reserved pages.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/x86: use free_reserved_area() to simplify code
Jiang Liu [Wed, 19 Jun 2013 00:06:17 +0000 (10:06 +1000)]
mm/x86: use free_reserved_area() to simplify code

Use common help function free_reserved_area() to simplify code.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/ARM64: kill poison_init_mem()
Jiang Liu [Wed, 19 Jun 2013 00:06:17 +0000 (10:06 +1000)]
mm/ARM64: kill poison_init_mem()

Use free_reserved_area() to poison initmem memory pages and kill
poison_init_mem() on ARM64.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: enhance free_reserved_area() to support poisoning memory with zero
Jiang Liu [Wed, 19 Jun 2013 00:06:16 +0000 (10:06 +1000)]
mm: enhance free_reserved_area() to support poisoning memory with zero

Address more review comments from last round of code review.
1) Enhance free_reserved_area() to support poisoning freed memory with
   pattern '0'. This could be used to get rid of poison_init_mem()
   on ARM64.
2) A previous patch has disabled memory poison for initmem on s390
   by mistake, so restore to the original behavior.
3) Remove redundant PAGE_ALIGN() when calling free_reserved_area().

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: change signature of free_reserved_area() to fix building warnings
Jiang Liu [Wed, 19 Jun 2013 00:06:16 +0000 (10:06 +1000)]
mm: change signature of free_reserved_area() to fix building warnings

Change signature of free_reserved_area() according to Russell King's
suggestion to fix following build warnings:

arch/arm/mm/init.c: In function 'mem_init':
arch/arm/mm/init.c:603:2: warning: passing argument 1 of 'free_reserved_area' makes integer from pointer without a cast [enabled by default]
  free_reserved_area(__va(PHYS_PFN_OFFSET), swapper_pg_dir, 0, NULL);
  ^
In file included from include/linux/mman.h:4:0,
                 from arch/arm/mm/init.c:15:
include/linux/mm.h:1301:22: note: expected 'long unsigned int' but argument is of type 'void *'
 extern unsigned long free_reserved_area(unsigned long start, unsigned long end,

   mm/page_alloc.c: In function 'free_reserved_area':
>> mm/page_alloc.c:5134:3: warning: passing argument 1 of 'virt_to_phys' makes pointer from integer without a cast [enabled by default]
   In file included from arch/mips/include/asm/page.h:49:0,
                    from include/linux/mmzone.h:20,
                    from include/linux/gfp.h:4,
                    from include/linux/mm.h:8,
                    from mm/page_alloc.c:18:
   arch/mips/include/asm/io.h:119:29: note: expected 'const volatile void *' but argument is of type 'long unsigned int'
   mm/page_alloc.c: In function 'free_area_init_nodes':
   mm/page_alloc.c:5030:34: warning: array subscript is below array bounds [-Warray-bounds]

Also address some minor code review comments.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: <sworddragon2@aol.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michel Lespinasse <walken@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoswap-discard-while-swapping-only-if-swap_flag_discard_pages-fix
Andrew Morton [Wed, 19 Jun 2013 00:06:16 +0000 (10:06 +1000)]
swap-discard-while-swapping-only-if-swap_flag_discard_pages-fix

tweak comments

Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoswap: discard while swapping only if SWAP_FLAG_DISCARD_PAGES
Rafael Aquini [Wed, 19 Jun 2013 00:06:15 +0000 (10:06 +1000)]
swap: discard while swapping only if SWAP_FLAG_DISCARD_PAGES

Considering the use cases where the swap device supports discard:
a) and can do it quickly;
b) but it's slow to do in small granularities (or concurrent with other
   I/O);
c) but the implementation is so horrendous that you don't even want to
   send one down;

And assuming that the sysadmin considers it useful to send the discards down
at all, we would (probably) want the following solutions:

  i. do the fine-grained discards for freed swap pages, if device is
     capable of doing so optimally;
 ii. do single-time (batched) swap area discards, either at swapon
     or via something like fstrim (not implemented yet);
iii. allow doing both single-time and fine-grained discards; or
 iv. turn it off completely (default behavior)

As implemented today, one can only enable/disable discards for swap, but
one cannot select, for instance, solution (ii) on a swap device like (b)
even though the single-time discard is regarded to be interesting, or
necessary to the workload because it would imply (1), and the device is
not capable of performing it optimally.

This patch addresses the scenario depicted above by introducing a way to
ensure the (probably) wanted solutions (i, ii, iii and iv) can be flexibly
flagged through swapon(8) to allow a sysadmin to select the best suitable
swap discard policy accordingly to system constraints.

This patch introduces SWAP_FLAG_DISCARD_PAGES and SWAP_FLAG_DISCARD_ONCE
new flags to allow more flexibe swap discard policies being flagged
through swapon(8).  The default behavior is to keep both single-time, or
batched, area discards (SWAP_FLAG_DISCARD_ONCE) and fine-grained discards
for page-clusters (SWAP_FLAG_DISCARD_PAGES) enabled, in order to keep
consistentcy with older kernel behavior, as well as maintain compatibility
with older swapon(8).  However, through the new introduced flags the best
suitable discard policy can be selected accordingly to any given swap
device constraint.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Karel Zak <kzak@redhat.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm-tune-vm_committed_as-percpu_counter-batching-size-fix
Andrew Morton [Wed, 19 Jun 2013 00:06:15 +0000 (10:06 +1000)]
mm-tune-vm_committed_as-percpu_counter-batching-size-fix

fix section mismatch

Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: tune vm_committed_as percpu_counter batching size
Tim Chen [Wed, 19 Jun 2013 00:06:15 +0000 (10:06 +1000)]
mm: tune vm_committed_as percpu_counter batching size

Currently the per cpu counter's batch size for memory accounting is
configured as twice the number of cpus in the system.  However, for system
with very large memory, it is more appropriate to make it proportional to
the memory size per cpu in the system.

For example, for a x86_64 system with 64 cpus and 128 GB of memory, the
batch size is only 2*64 pages (0.5 MB).  So any memory accounting changes
of more than 0.5MB will overflow the per cpu counter into the global
counter.  Instead, for the new scheme, the batch size is configured to be
0.4% of the memory/cpu = 8MB (128 GB/64 /256), which is more inline with
the memory size.

I've done a repeated brk test of 800KB (from will-it-scale test suite)
with 80 concurrent processes on a 4 socket Westmere machine with a total
of 40 cores.  Without the patch, about 80% of cpu is spent on spin-lock
contention within the vm_committed_as counter.  With the patch, there's a
73x speedup on the benchmark and the lock contention drops off almost
entirely.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/hugetlb: use already existing interface huge_page_shift
Wanpeng Li [Wed, 19 Jun 2013 00:06:15 +0000 (10:06 +1000)]
mm/hugetlb: use already existing interface huge_page_shift

Use the already existing interface huge_page_shift instead of h->order +
PAGE_SHIFT.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/hugetlb: remove hugetlb_prefault
Wanpeng Li [Wed, 19 Jun 2013 00:06:14 +0000 (10:06 +1000)]
mm/hugetlb: remove hugetlb_prefault

hugetlb_prefault() is not used any more, this patch removes it.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/pageblock: remove get/set_pageblock_flags
Wanpeng Li [Wed, 19 Jun 2013 00:06:14 +0000 (10:06 +1000)]
mm/pageblock: remove get/set_pageblock_flags

get_pageblock_flags and set_pageblock_flags are not used any more, this
patch removes them.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm-memory-hotplug-fix-lowmem-count-overflow-when-offline-pages-fix
Michal Hocko [Wed, 19 Jun 2013 00:06:14 +0000 (10:06 +1000)]
mm-memory-hotplug-fix-lowmem-count-overflow-when-offline-pages-fix

Fix CONFIG_HIGHMEM=n build

Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/memory-hotplug: fix lowmem count overflow when offline pages
Wanpeng Li [Wed, 19 Jun 2013 00:06:13 +0000 (10:06 +1000)]
mm/memory-hotplug: fix lowmem count overflow when offline pages

Logic memory-remove code fails to correctly account the Total High Memory
when a memory block which contains High Memory is offlined as shown in the
example below.  The following patch fixes it.

Before logic memory remove:

MemTotal:        7603740 kB
MemFree:         6329612 kB
Buffers:           94352 kB
Cached:           872008 kB
SwapCached:            0 kB
Active:           626932 kB
Inactive:         519216 kB
Active(anon):     180776 kB
Inactive(anon):   222944 kB
Active(file):     446156 kB
Inactive(file):   296272 kB
Unevictable:           0 kB
Mlocked:               0 kB
HighTotal:       7294672 kB
HighFree:        5704696 kB
LowTotal:         309068 kB
LowFree:          624916 kB

After logic memory remove:

MemTotal:        7079452 kB
MemFree:         5805976 kB
Buffers:           94372 kB
Cached:           872000 kB
SwapCached:            0 kB
Active:           626936 kB
Inactive:         519236 kB
Active(anon):     180780 kB
Inactive(anon):   222944 kB
Active(file):     446156 kB
Inactive(file):   296292 kB
Unevictable:           0 kB
Mlocked:               0 kB
HighTotal:       7294672 kB
HighFree:        5181024 kB
LowTotal:       4294752076 kB
LowFree:          624952 kB

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [2.6.24+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/memory_hotplug.c: change normal message to use pr_debug
Toshi Kani [Wed, 19 Jun 2013 00:06:13 +0000 (10:06 +1000)]
mm/memory_hotplug.c: change normal message to use pr_debug

During early boot-up, iomem_resource is set up from the boot descriptor
table, such as EFI Memory Table and e820.  Later, acpi_memory_device_add()
calls add_memory() for each ACPI memory device object as it enumerates
ACPI namespace.  This add_memory() call is expected to fail in
register_memory_resource() at boot since iomem_resource has been set up
from EFI/e820.  As a result, add_memory() returns -EEXIST, which
acpi_memory_device_add() handles as the normal case.

This scheme works fine, but the following error message is
logged for every ACPI memory device object during boot-up.

  "System RAM resource %pR cannot be added\n"

This patch changes register_memory_resource() to use pr_debug() for the
message as it shows up under the normal case.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/memory-failure.c: fix memory leak in successful soft offlining
Naoya Horiguchi [Wed, 19 Jun 2013 00:06:13 +0000 (10:06 +1000)]
mm/memory-failure.c: fix memory leak in successful soft offlining

After a successful page migration by soft offlining, the source page is
not properly freed and it's never reusable even if we unpoison it
afterward.

This is caused by the race between freeing page and setting PG_hwpoison.
In successful soft offlining, the source page is put (and the refcount
becomes 0) by putback_lru_page() in unmap_and_move(), where it's linked to
pagevec and actual freeing back to buddy is delayed.  So if PG_hwpoison is
set for the page before freeing, the freeing does not functions as
expected (in such case freeing aborts in free_pages_prepare() check.)

This patch tries to make sure to free the source page before setting
PG_hwpoison on it.  To avoid reallocating, the page keeps MIGRATE_ISOLATE
until after setting PG_hwpoison.

This patch also removes obsolete comments about "keeping elevated
refcount" because what they say is not true.  Unlike memory_failure(),
soft_offline_page() uses no special page isolation code, and the
soft-offlined pages have no elevated.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/nommu.c: add additional check for vread() just like vwrite() has done
Chen Gang [Wed, 19 Jun 2013 00:06:12 +0000 (10:06 +1000)]
mm/nommu.c: add additional check for vread() just like vwrite() has done

vwrite() checks for overflow. vread() should do the same thing.

Since vwrite() checks the source buffer address, vread() should check
the destination buffer address.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/page_alloc.c: add additional checking and return value for the 'table->data'
Chen Gang [Wed, 19 Jun 2013 00:06:12 +0000 (10:06 +1000)]
mm/page_alloc.c: add additional checking and return value for the 'table->data'

- check the length of the procfs data before copying it into a fixed
  size array.

- when __parse_numa_zonelist_order() fails, save the error code for
  return.

- 'char*' --> 'char *' coding style fix

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: Clear page active before releasing pages
Mel Gorman [Wed, 19 Jun 2013 00:06:12 +0000 (10:06 +1000)]
mm: Clear page active before releasing pages

Active pages should not be freed to the page allocator as it triggers a
bad page state warning.  Fengguang Wu reported the following bug and
bisected it to the patch "mm: remove lru parameter from __lru_cache_add
and lru_cache_add_lru" which is currently in mmotm as
mm-remove-lru-parameter-from-__lru_cache_add-and-lru_cache_add_lru.patch

[   84.212960] BUG: Bad page state in process rm  pfn:0b0c9
[   84.214682] page:ffff88000d646240 count:0 mapcount:0 mapping:          (null) index:0x0
[   84.216883] page flags: 0x20000000004c(referenced|uptodate|active)
[   84.218697] CPU: 1 PID: 283 Comm: rm Not tainted 3.10.0-rc4-04361-geeb9bfc #49
[   84.220729]  ffff88000d646240 ffff88000d179bb8 ffffffff82562956 ffff88000d179bd8
[   84.223242]  ffffffff811333f1 000020000000004c ffff88000d646240 ffff88000d179c28
[   84.225387]  ffffffff811346a4 ffff880000270000 0000000000000000 0000000000000006
[   84.227294] Call Trace:
[   84.227867]  [<ffffffff82562956>] dump_stack+0x27/0x30
[   84.229045]  [<ffffffff811333f1>] bad_page+0x130/0x158
[   84.230261]  [<ffffffff811346a4>] free_pages_prepare+0x8b/0x1e3
[   84.231765]  [<ffffffff8113542a>] free_hot_cold_page+0x28/0x1cf
[   84.233171]  [<ffffffff82585830>] ? _raw_spin_unlock_irqrestore+0x6b/0xc6
[   84.234822]  [<ffffffff81135b59>] free_hot_cold_page_list+0x30/0x5a
[   84.236311]  [<ffffffff8113a4ed>] release_pages+0x251/0x267
[   84.237653]  [<ffffffff8112a88d>] ? delete_from_page_cache+0x48/0x9e
[   84.239142]  [<ffffffff8113ad93>] __pagevec_release+0x2b/0x3d
[   84.240473]  [<ffffffff8113b45a>] truncate_inode_pages_range+0x1b0/0x7ce
[   84.242032]  [<ffffffff810e76ab>] ? put_lock_stats.isra.20+0x1c/0x53
[   84.243480]  [<ffffffff810e77f5>] ? lock_release_holdtime+0x113/0x11f
[   84.244935]  [<ffffffff8113ba8c>] truncate_inode_pages+0x14/0x1d
[   84.246337]  [<ffffffff8119b3ef>] evict+0x11f/0x232
[   84.247501]  [<ffffffff8119c527>] iput+0x1a5/0x218
[   84.248607]  [<ffffffff8118f015>] do_unlinkat+0x19b/0x25a
[   84.249828]  [<ffffffff810ea993>] ? trace_hardirqs_on_caller+0x210/0x2ce
[   84.251382]  [<ffffffff8144372e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[   84.252879]  [<ffffffff8118f10d>] SyS_unlinkat+0x39/0x4c
[   84.254174]  [<ffffffff825874d6>] system_call_fastpath+0x1a/0x1f
[   84.255596] Disabling lock debugging due to kernel taint

The problem was that a page marked for activation was released via
pagevec. This patch clears the active bit before freeing in this case.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-and-Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: remove lru parameter from __lru_cache_add and lru_cache_add_lru
Mel Gorman [Wed, 19 Jun 2013 00:06:11 +0000 (10:06 +1000)]
mm: remove lru parameter from __lru_cache_add and lru_cache_add_lru

Similar to __pagevec_lru_add, this patch removes the LRU parameter from
__lru_cache_add and lru_cache_add_lru as the caller does not control the
exact LRU the page gets added to.  lru_cache_add_lru gets renamed to
lru_cache_add the name is silly without the lru parameter.  With the
parameter removed, it is required that the caller indicate if they want
the page added to the active or inactive list by setting or clearing
PageActive respectively.

[akpm@linux-foundation.org: Suggested the patch]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
Cc: Andrew Perepechko <anserper@ya.ru>
Cc: Robin Dong <sanbai@taobao.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
Cc: David Howells <dhowells@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: remove lru parameter from __pagevec_lru_add and remove parts of pagevec API
Mel Gorman [Wed, 19 Jun 2013 00:06:11 +0000 (10:06 +1000)]
mm: remove lru parameter from __pagevec_lru_add and remove parts of pagevec API

Now that the LRU to add a page to is decided at LRU-add time, remove the
misleading lru parameter from __pagevec_lru_add.  A consequence of this is
that the pagevec_lru_add_file, pagevec_lru_add_anon and similar helpers
are misleading as the caller no longer has direct control over what LRU
the page is added to.  Unused helpers are removed by this patch and
existing users of pagevec_lru_add_file() are converted to use
lru_cache_add_file() directly and use the per-cpu pagevecs instead of
creating their own pagevec.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
Cc: Andrew Perepechko <anserper@ya.ru>
Cc: Robin Dong <sanbai@taobao.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
Cc: David Howells <dhowells@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: activate !PageLRU pages on mark_page_accessed if page is on local pagevec
Mel Gorman [Wed, 19 Jun 2013 00:06:11 +0000 (10:06 +1000)]
mm: activate !PageLRU pages on mark_page_accessed if page is on local pagevec

If a page is on a pagevec then it is !PageLRU and mark_page_accessed() may
fail to move a page to the active list as expected.  Now that the LRU is
selected at LRU drain time, mark pages PageActive if they are on the local
pagevec so it gets moved to the correct list at LRU drain time.  Using a
debugging patch it was found that for a simple git checkout based workload
that pages were never added to the active file list in practice but with
this patch applied they are.

before   after
LRU Add Active File                  0      750583
LRU Add Active Anon            2640587     2702818
LRU Add Inactive File          8833662     8068353
LRU Add Inactive Anon              207         200

Note that only pages on the local pagevec are considered on purpose.  A
!PageLRU page could be in the process of being released, reclaimed,
migrated or on a remote pagevec that is currently being drained.  Marking
it PageActive is vunerable to races where PageLRU and Active bits are
checked at the wrong time.  Page reclaim will trigger VM_BUG_ONs but
depending on when the race hits, it could also free a PageActive page to
the page allocator and trigger a bad_page warning.  Similarly a potential
race exists between a per-cpu drain on a pagevec list and an activation on
a remote CPU.

lru_add_drain_cpu
__pagevec_lru_add
  lru = page_lru(page);
mark_page_accessed
  if (PageLRU(page))
    activate_page
  else
    SetPageActive
  SetPageLRU(page);
  add_page_to_lru_list(page, lruvec, lru);

In this case a PageActive page is added to the inactivate list and later
the inactive/active stats will get skewed.  While the PageActive checks in
vmscan could be removed and potentially dealt with, a skew in the
statistics would be very difficult to detect.  Hence this patch deals just
with the common case where a page being marked accessed has just been
added to the local pagevec.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
Cc: Andrew Perepechko <anserper@ya.ru>
Cc: Robin Dong <sanbai@taobao.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
Cc: David Howells <dhowells@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: pagevec: defer deciding which LRU to add a page to until pagevec drain time
Mel Gorman [Wed, 19 Jun 2013 00:06:10 +0000 (10:06 +1000)]
mm: pagevec: defer deciding which LRU to add a page to until pagevec drain time

mark_page_accessed() cannot activate an inactive page that is located on
an inactive LRU pagevec.  Hints from filesystems may be ignored as a
result.  In preparation for fixing that problem, this patch removes the
per-LRU pagevecs and leaves just one pagevec.  The final LRU the page is
added to is deferred until the pagevec is drained.

This means that fewer pagevecs are available and potentially there is
greater contention on the LRU lock.  However, this only applies in the
case where there is an almost perfect mix of file, anon, active and
inactive pages being added to the LRU.  In practice I expect that we are
adding stream of pages of a particular time and that the changes in
contention will barely be measurable.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
Cc: Andrew Perepechko <anserper@ya.ru>
Cc: Robin Dong <sanbai@taobao.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
Cc: David Howells <dhowells@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: add tracepoints for LRU activation and insertions
Mel Gorman [Wed, 19 Jun 2013 00:06:10 +0000 (10:06 +1000)]
mm: add tracepoints for LRU activation and insertions

Andrew Perepechko reported a problem whereby pages are being prematurely
evicted as the mark_page_accessed() hint is ignored for pages that are
currently on a pagevec --
http://www.spinics.net/lists/linux-ext4/msg37340.html .  Alexey Lyahkov
and Robin Dong have also reported problems recently that could be due to
hot pages reaching the end of the inactive list too quickly and be
reclaimed.

Rather than addressing this on a per-filesystem basis, this series aims to
fix the mark_page_accessed() interface by deferring what LRU a page is
added to pagevec drain time and allowing mark_page_accessed() to call
SetPageActive on a pagevec page.

Patch 1 adds two tracepoints for LRU page activation and insertion. Using
these processes it's possible to build a model of pages in the
LRU that can be processed offline.

Patch 2 defers making the decision on what LRU to add a page to until when
the pagevec is drained.

Patch 3 searches the local pagevec for pages to mark PageActive on
mark_page_accessed. The changelog explains why only the local
pagevec is examined.

Patches 4 and 5 tidy up the API.

postmark, a dd-based test and fs-mark both single and threaded mode were
run but none of them showed any performance degradation or gain as a
result of the patch.

Using patch 1, I built a *very* basic model of the LRU to examine offline
what the average age of different page types on the LRU were in
milliseconds.  Of course, capturing the trace distorts the test as it's
written to local disk but it does not matter for the purposes of this
test.  The average age of pages in milliseconds were

    vanilla deferdrain
Average age mapped anon:               1454       1250
Average age mapped file:             127841     155552
Average age unmapped anon:               85        235
Average age unmapped file:            73633      38884
Average age unmapped buffers:         74054     116155

The LRU activity was mostly files which you'd expect for a dd-based
workload.  Note that the average age of buffer pages is increased by the
series and it is expected this is due to the fact that the buffer pages
are now getting added to the active list when drained from the pagevecs.
Note that the average age of the unmapped file data is decreased as they
are still added to the inactive list and are reclaimed before the buffers.
 There is no guarantee this is a universal win for all workloads and it
would be nice if the filesystem people gave some thought as to whether
this decision is generally a win or a loss.

This patch:

Using these tracepoints it is possible to model LRU activity and the
average residency of pages of different types.  This can be used to debug
problems related to premature reclaim of pages of particular types.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Lyahkov <alexey.lyashkov@gmail.com>
Cc: Andrew Perepechko <anserper@ya.ru>
Cc: Robin Dong <sanbai@taobao.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Bernd Schubert <bernd.schubert@fastmail.fm>
Cc: David Howells <dhowells@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemcg: update TODO list in Documentation
Li Zefan [Wed, 19 Jun 2013 00:06:10 +0000 (10:06 +1000)]
memcg: update TODO list in Documentation

hugetlb cgroup has already been implemented.

Signed-off-by: Li Zefan <lizefan@huawei.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Rob Landley <rob@landley.net>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovmcore: disable mmap_vmcore() if CONFIG_MMU is not defined
HATAYAMA Daisuke [Wed, 19 Jun 2013 00:06:09 +0000 (10:06 +1000)]
vmcore: disable mmap_vmcore() if CONFIG_MMU is not defined

From Arnd's report of a link-time build error in vmcore.c, it turned out
that mmap_vmcore() work overlooked no-MMU configuraiton.

In the current design, it's impossible to implement mmap_vmcore() for
no-MMU configuration since MMU is essential in order to map physically
non-contiguous objects (ELF header, ELF note segment and memory regions in
the 1st kernel pointed to by PT_LOAD entries) into virtually contiguous
user-space in ELF layout.

Hence, this patch disables mmap_vmcore() if CONFIG_MMU is not defined,
returning -ENOSYS.

Another change is to fix the build error by using vmalloc_user() instead
of calling vzalloc() and find_vm_area() in order, by which we no longer
need to call find_vm_area() in vmcore.c that has no counterpart on non-MMU
configuration.

Also, on no-MMU configuration, because we don't export buffer for ELF note
segment to user-space, we use vzalloc() to allocate the buffer.
Therefore, we use differnet functions to allocate the buffer for ELF note
segment.  To avoid code duplication, introduce a helper
alloc_elfnotes_buf().

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Reported-by: Arnd Bergmann <arnd@arndb.de>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovmcore-support-mmap-on-proc-vmcore-fix
Andrew Morton [Wed, 19 Jun 2013 00:06:09 +0000 (10:06 +1000)]
vmcore-support-mmap-on-proc-vmcore-fix

use min(), switch to conventional error-unwinding approach

Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovmcore: support mmap() on /proc/vmcore
HATAYAMA Daisuke [Wed, 19 Jun 2013 00:06:09 +0000 (10:06 +1000)]
vmcore: support mmap() on /proc/vmcore

This patch introduces mmap_vmcore().

Don't permit writable nor executable mapping even with mprotect() because
this mmap() is aimed at reading crash dump memory.  Non-writable mapping
is also requirement of remap_pfn_range() when mapping linear pages on
non-consecutive physical pages; see is_cow_mapping().

Set VM_MIXEDMAP flag to remap memory by remap_pfn_range and by
remap_vmalloc_range_pertial at the same time for a single vma.
do_munmap() can correctly clean partially remapped vma with two functions
in abnormal case.  See zap_pte_range(), vm_normal_page() and their
comments for details.

On x86-32 PAE kernels, mmap() supports at most 16TB memory only.  This
limitation comes from the fact that the third argument of
remap_pfn_range(), pfn, is of 32-bit length on x86-32: unsigned long.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovmcore: calculate vmcore file size from buffer size and total size of vmcore objects
HATAYAMA Daisuke [Wed, 19 Jun 2013 00:06:09 +0000 (10:06 +1000)]
vmcore: calculate vmcore file size from buffer size and total size of vmcore objects

The previous patches newly added holes before each chunk of memory and the
holes need to be count in vmcore file size.  There are two ways to count
file size in such a way:

1) suppose m is a poitner to the last vmcore object in vmcore_list.
   Then file size is (m->offset + m->size), or

2) calculate sum of size of buffers for ELF header, program headers,
   ELF note segments and objects in vmcore_list.

Although 1) is more direct and simpler than 2), 2) seems better in that it
reflects internal object structure of /proc/vmcore.  Thus, this patch
changes get_vmcore_size_elf{64, 32} so that it calculates size in the way
of 2).

As a result, both get_vmcore_size_elf{64, 32} have the same definition.
Merge them as get_vmcore_size.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovmcore-allow-user-process-to-remap-elf-note-segment-buffer-fix
Andrew Morton [Wed, 19 Jun 2013 00:06:08 +0000 (10:06 +1000)]
vmcore-allow-user-process-to-remap-elf-note-segment-buffer-fix

use the conventional comment layout format

Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovmcore: allow user process to remap ELF note segment buffer
HATAYAMA Daisuke [Wed, 19 Jun 2013 00:06:08 +0000 (10:06 +1000)]
vmcore: allow user process to remap ELF note segment buffer

Now ELF note segment has been copied in the buffer on vmalloc memory.  To
allow user process to remap the ELF note segment buffer with
remap_vmalloc_page, the corresponding VM area object has to have
VM_USERMAP flag set.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovmcore-allocate-elf-note-segment-in-the-2nd-kernel-vmalloc-memory-fix
Andrew Morton [Wed, 19 Jun 2013 00:06:08 +0000 (10:06 +1000)]
vmcore-allocate-elf-note-segment-in-the-2nd-kernel-vmalloc-memory-fix

use min(), fix error-path vzalloc() leaks

Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovmcore: allocate ELF note segment in the 2nd kernel vmalloc memory
HATAYAMA Daisuke [Wed, 19 Jun 2013 00:06:07 +0000 (10:06 +1000)]
vmcore: allocate ELF note segment in the 2nd kernel vmalloc memory

The reasons why we don't allocate ELF note segment in the 1st kernel (old
memory) on page boundary is to keep backward compatibility for old
kernels, and that if doing so, we waste not a little memory due to
round-up operation to fit the memory to page boundary since most of the
buffers are in per-cpu area.

ELF notes are per-cpu, so total size of ELF note segments depends on
number of CPUs.  The current maximum number of CPUs on x86_64 is 5192, and
there's already system with 4192 CPUs in SGI, where total size amounts to
1MB.  This can be larger in the near future or possibly even now on
another architecture that has larger size of note per a single cpu.  Thus,
to avoid the case where memory allocation for large block fails, we
allocate vmcore objects on vmalloc memory.

This patch adds elfnotes_buf and elfnotes_sz variables to keep pointer to
the ELF note segment buffer and its size.  There's no longer the vmcore
object that corresponds to the ELF note segment in vmcore_list.
Accordingly, read_vmcore() has new case for ELF note segment and
set_vmcore_list_offsets_elf{64,32}() and other helper functions starts
calculating offset from sum of size of ELF headers and size of ELF note
segment.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovmalloc-introduce-remap_vmalloc_range_partial-fix
Andrew Morton [Wed, 19 Jun 2013 00:06:07 +0000 (10:06 +1000)]
vmalloc-introduce-remap_vmalloc_range_partial-fix

use PAGE_ALIGNED()

Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovmalloc: introduce remap_vmalloc_range_partial
HATAYAMA Daisuke [Wed, 19 Jun 2013 00:06:07 +0000 (10:06 +1000)]
vmalloc: introduce remap_vmalloc_range_partial

We want to allocate ELF note segment buffer on the 2nd kernel in vmalloc
space and remap it to user-space in order to reduce the risk that memory
allocation fails on system with huge number of CPUs and so with huge ELF
note segment that exceeds 11-order block size.

Although there's already remap_vmalloc_range for the purpose of remapping
vmalloc memory to user-space, we need to specify user-space range via vma.
 Mmap on /proc/vmcore needs to remap range across multiple objects, so the
interface that requires vma to cover full range is problematic.

This patch introduces remap_vmalloc_range_partial that receives user-space
range as a pair of base address and size and can be used for mmap on
/proc/vmcore case.

remap_vmalloc_range is rewritten using remap_vmalloc_range_partial.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovmalloc: make find_vm_area check in range
HATAYAMA Daisuke [Wed, 19 Jun 2013 00:06:06 +0000 (10:06 +1000)]
vmalloc: make find_vm_area check in range

Currently, __find_vmap_area searches for the kernel VM area starting at a
given address.  This patch changes this behavior so that it searches for
the kernel VM area to which the address belongs.  This change is needed by
remap_vmalloc_range_partial to be introduced in later patch that receives
any position of kernel VM area as target address.

This patch changes the condition (addr > va->va_start) to the equivalent
(addr >= va->va_end) by taking advantage of the fact that each kernel VM
area is non-overlapping.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovmcore: treat memory chunks referenced by PT_LOAD program header entries in page...
HATAYAMA Daisuke [Wed, 19 Jun 2013 00:06:06 +0000 (10:06 +1000)]
vmcore: treat memory chunks referenced by PT_LOAD program header entries in page-size boundary in vmcore_list

Treat memory chunks referenced by PT_LOAD program header entries in
page-size boundary in vmcore_list.  Formally, for each range [start, end],
we set up the corresponding vmcore object in vmcore_list to
[rounddown(start, PAGE_SIZE), roundup(end, PAGE_SIZE)].

This change affects layout of /proc/vmcore.  The gaps generated by the
rearrangement are newly made visible to applications as holes.
Concretely, they are two ranges [rounddown(start, PAGE_SIZE), start] and
[end, roundup(end, PAGE_SIZE)].

Suppose variable m points at a vmcore object in vmcore_list, and variable
phdr points at the program header of PT_LOAD type the variable m
corresponds to.  Then, pictorially:

  m->offset                    +---------------+
                               | hole          |
phdr->p_offset =               +---------------+
  m->offset + (paddr - start)  |               |\
                               | kernel memory | phdr->p_memsz
                               |               |/
                               +---------------+
                               | hole          |
  m->offset + m->size          +---------------+

where m->offset and m->offset + m->size are always page-size aligned.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovmcore-allocate-buffer-for-elf-headers-on-page-size-alignment-fix
Andrew Morton [Wed, 19 Jun 2013 00:06:06 +0000 (10:06 +1000)]
vmcore-allocate-buffer-for-elf-headers-on-page-size-alignment-fix

Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovmcore: allocate buffer for ELF headers on page-size alignment
HATAYAMA Daisuke [Wed, 19 Jun 2013 00:06:05 +0000 (10:06 +1000)]
vmcore: allocate buffer for ELF headers on page-size alignment

Allocate ELF headers on page-size boundary using __get_free_pages()
instead of kmalloc().

Later patch will merge PT_NOTE entries into a single unique one and
decrease the buffer size actually used.  Keep original buffer size in
variable elfcorebuf_sz_orig to kfree the buffer later and actually used
buffer size with rounded up to page-size boundary in variable
elfcorebuf_sz separately.

The size of part of the ELF buffer exported from /proc/vmcore is
elfcorebuf_sz.

The merged, removed PT_NOTE entries, i.e.  the range [elfcorebuf_sz,
elfcorebuf_sz_orig], is filled with 0.

Use size of the ELF headers as an initial offset value in
set_vmcore_list_offsets_elf{64,32} and
process_ptload_program_headers_elf{64,32} in order to indicate that the
offset includes the holes towards the page boundary.

As a result, both set_vmcore_list_offsets_elf{64,32} have the same
definition.  Merge them as set_vmcore_list_offsets.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovmcore: clean up read_vmcore()
HATAYAMA Daisuke [Wed, 19 Jun 2013 00:06:05 +0000 (10:06 +1000)]
vmcore: clean up read_vmcore()

Rewrite part of read_vmcore() that reads objects in vmcore_list in the
same way as part reading ELF headers, by which some duplicated and
redundant codes are removed.

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoinclude/linux/mm.h: add PAGE_ALIGNED() helper
Andrew Morton [Wed, 19 Jun 2013 00:06:05 +0000 (10:06 +1000)]
include/linux/mm.h: add PAGE_ALIGNED() helper

To test whether an address is aligned to PAGE_SIZE.

Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory_hotplug-use-pgdat_resize_lock-in-__offline_pages-fix
Andrew Morton [Wed, 19 Jun 2013 00:06:05 +0000 (10:06 +1000)]
memory_hotplug-use-pgdat_resize_lock-in-__offline_pages-fix

fix build

Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory_hotplug: use pgdat_resize_lock() in __offline_pages()
Cody P Schafer [Wed, 19 Jun 2013 00:06:04 +0000 (10:06 +1000)]
memory_hotplug: use pgdat_resize_lock() in __offline_pages()

mmzone.h documents node_size_lock (which pgdat_resize_lock() locks) as
follows:

        * Must be held any time you expect node_start_pfn, node_present_pages
        * or node_spanned_pages stay constant.  [...]

So actually hold it when we update node_present_pages in __offline_pages().

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory_hotplug: use pgdat_resize_lock() in online_pages()
Cody P Schafer [Wed, 19 Jun 2013 00:06:04 +0000 (10:06 +1000)]
memory_hotplug: use pgdat_resize_lock() in online_pages()

mmzone.h documents node_size_lock (which pgdat_resize_lock() locks) as
follows:

        * Must be held any time you expect node_start_pfn, node_present_pages
        * or node_spanned_pages stay constant.  [...]

So actually hold it when we update node_present_pages in online_pages().

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agommzone: note that node_size_lock should be manipulated via pgdat_resize_lock()
Cody P Schafer [Wed, 19 Jun 2013 00:06:04 +0000 (10:06 +1000)]
mmzone: note that node_size_lock should be manipulated via pgdat_resize_lock()

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: fix comment referring to non-existent size_seqlock, change to span_seqlock
Cody P Schafer [Wed, 19 Jun 2013 00:06:03 +0000 (10:06 +1000)]
mm: fix comment referring to non-existent size_seqlock, change to span_seqlock

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agofs: nfs: inform the VM about pages being committed or unstable
Mel Gorman [Wed, 19 Jun 2013 00:06:03 +0000 (10:06 +1000)]
fs: nfs: inform the VM about pages being committed or unstable

VM page reclaim uses dirty and writeback page states to determine if
flushers are cleaning pages too slowly and that page reclaim should stall
waiting on flushers to catch up.  Page state in NFS is a bit more complex
and a clean page can be unreclaimable due to being unstable which is
effectively "dirty" from the perspective of the VM from reclaim context.
Similarly, if the inode is currently being committed then it's similar to
being under writeback.

This patch adds a is_dirty_writeback() handled for NFS that checks if a
pages backing inode is being committed and should be accounted as
writeback and if a page has private state indicating that it is
effectively dirty.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: take page buffers dirty and locked state into account
Mel Gorman [Wed, 19 Jun 2013 00:06:03 +0000 (10:06 +1000)]
mm: vmscan: take page buffers dirty and locked state into account

Page reclaim keeps track of dirty and under writeback pages and uses it to
determine if wait_iff_congested() should stall or if kswapd should begin
writing back pages.  This fails to account for buffer pages that can be
under writeback but not PageWriteback which is the case for filesystems
like ext3 ordered mode.  Furthermore, PageDirty buffer pages can have all
the buffers clean and writepage does no IO so it should not be accounted
as congested.

This patch adds an address_space operation that filesystems may optionally
use to check if a page is really dirty or really under writeback.  An
implementation is provided for for buffer_heads is added and used for
block operations and ext3 in ordered mode.  By default the page flags are
obeyed.

Credit goes to Jan Kara for identifying that the page flags alone are not
sufficient for ext3 and sanity checking a number of ideas on how the
problem could be addressed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: treat pages marked for immediate reclaim as zone congestion
Mel Gorman [Wed, 19 Jun 2013 00:06:02 +0000 (10:06 +1000)]
mm: vmscan: treat pages marked for immediate reclaim as zone congestion

Currently a zone will only be marked congested if the underlying BDI is
congested but if dirty pages are spread across zones it is possible that
an individual zone is full of dirty pages without being congested.  The
impact is that zone gets scanned very quickly potentially reclaiming
really clean pages.  This patch treats pages marked for immediate reclaim
as congested for the purposes of marking a zone ZONE_CONGESTED and
stalling in wait_iff_congested.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: move direct reclaim wait_iff_congested into shrink_list
Mel Gorman [Wed, 19 Jun 2013 00:06:02 +0000 (10:06 +1000)]
mm: vmscan: move direct reclaim wait_iff_congested into shrink_list

shrink_inactive_list makes decisions on whether to stall based on the
number of dirty pages encountered.  The wait_iff_congested() call in
shrink_page_list does no such thing and it's arbitrary.

This patch moves the decision on whether to set ZONE_CONGESTED and the
wait_iff_congested call into shrink_page_list.  This keeps all the
decisions on whether to stall or not in the one place.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: set zone flags before blocking
Mel Gorman [Wed, 19 Jun 2013 00:06:02 +0000 (10:06 +1000)]
mm: vmscan: set zone flags before blocking

In shrink_page_list a decision may be made to stall and flag a zone as
ZONE_WRITEBACK so that if a large number of unqueued dirty pages are
encountered later then the reclaimer will stall.  Set ZONE_WRITEBACK
before potentially going to sleep so it is noticed sooner.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: stall page reclaim after a list of pages have been processed
Mel Gorman [Wed, 19 Jun 2013 00:06:01 +0000 (10:06 +1000)]
mm: vmscan: stall page reclaim after a list of pages have been processed

Commit "mm: vmscan: Block kswapd if it is encountering pages under
writeback" blocks page reclaim if it encounters pages under writeback
marked for immediate reclaim.  It blocks while pages are still isolated
from the LRU which is unnecessary.  This patch defers the blocking until
after the isolated pages have been processed and tidies up some of the
comments.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages...
Mel Gorman [Wed, 19 Jun 2013 00:06:01 +0000 (10:06 +1000)]
mm: vmscan: stall page reclaim and writeback pages based on dirty/writepage pages encountered

Further testing of the "Reduce system disruption due to kswapd" discovered
a few problems. First and foremost, it's possible for pages under writeback
to be freed which will lead to badness. Second, as pages were not being
swapped the file LRU was being scanned faster and clean file pages were
being reclaimed. In some cases this results in increased read IO to re-read
data from disk.  Third, more pages were being written from kswapd context
which can adversly affect IO performance. Lastly, it was observed that
PageDirty pages are not necessarily dirty on all filesystems (buffers can be
clean while PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g. ext3).
This disconnect confuses the reclaim stalling logic. This follow-up series
is aimed at these problems.

The tests were based on three kernels

vanilla: kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522 is mmotm as of 22nd May with "Reduce system disruption due to
kswapd" applied on top as per what should be in Andrew's tree
right now
lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel

The first test used memcached+memcachetest while some background IO
was in progress as implemented by the parallel IO tests implement in
MM Tests. memcachetest benchmarks how many operations/second memcached
can service. It starts with no background IO on a freshly created ext4
filesystem and then re-runs the test with larger amounts of IO in the
background to roughly simulate a large copy in progress. The expectation
is that the IO should have little or no impact on memcachetest which is
running entirely in memory.

parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23117.00 (  0.00%)          22780.00 ( -1.46%)          22763.00 ( -1.53%)
Ops memcachetest-715M           23774.00 (  0.00%)          23299.00 ( -2.00%)          22934.00 ( -3.53%)
Ops memcachetest-2385M           4208.00 (  0.00%)          24154.00 (474.00%)          23765.00 (464.76%)
Ops memcachetest-4055M           4104.00 (  0.00%)          25130.00 (512.33%)          24614.00 (499.76%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              6.00 ( 50.00%)
Ops io-duration-2385M             116.00 (  0.00%)             21.00 ( 81.90%)             21.00 ( 81.90%)
Ops io-duration-4055M             160.00 (  0.00%)             36.00 ( 77.50%)             35.00 ( 78.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140138.00 (  0.00%)             18.00 ( 99.99%)             18.00 ( 99.99%)
Ops swaptotal-2385M            385682.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            418029.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                   144.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               134227.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               125618.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1536429.00 (  0.00%)        1531632.00 (  0.31%)        1533541.00 (  0.19%)
Ops minorfaults-715M          1786996.00 (  0.00%)        1612148.00 (  9.78%)        1608832.00 (  9.97%)
Ops minorfaults-2385M         1757952.00 (  0.00%)        1614874.00 (  8.14%)        1613541.00 (  8.21%)
Ops minorfaults-4055M         1774460.00 (  0.00%)        1633400.00 (  7.95%)        1630881.00 (  8.09%)
Ops majorfaults-0M                  1.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              184.00 (  0.00%)            167.00 (  9.24%)            166.00 (  9.78%)
Ops majorfaults-2385M           24444.00 (  0.00%)            155.00 ( 99.37%)             93.00 ( 99.62%)
Ops majorfaults-4055M           21357.00 (  0.00%)            147.00 ( 99.31%)            134.00 ( 99.37%)

memcachetest is the transactions/second reported by memcachetest. In
        the vanilla kernel note that performance drops from around
        23K/sec to just over 4K/second when there is 2385M of IO going
        on in the background. With current mmotm, there is no collapse
in performance and with this follow-up series there is little
change.

swaptotal is the total amount of swap traffic. With mmotm and the follow-up
series, the total amount of swapping is much reduced.

                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  11160152    10706748    10622316
Major Faults                     46305         755         678
Swap Ins                        260249           0           0
Swap Outs                       683860          18          18
Direct pages scanned                 0         678        2520
Kswapd pages scanned           6046108     8814900     1639279
Kswapd pages reclaimed         1081954     1172267     1094635
Direct pages reclaimed               0         566        2304
Kswapd efficiency                  17%         13%         66%
Kswapd velocity               5217.560    7618.953    1414.879
Direct efficiency                 100%         83%         91%
Direct velocity                  0.000       0.586       2.175
Percentage direct scans             0%          0%          0%
Zone normal velocity          5105.086    6824.681     671.158
Zone dma32 velocity            112.473     794.858     745.896
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     1929612.000 6861768.000   32821.000
Page writes file               1245752     6861750       32803
Page writes anon                683860          18          18
Page reclaim immediate            7484          40         239
Sector Reads                   1130320       93996       86900
Sector Writes                 13508052    10823500    11804436
Page rescued immediate               0           0           0
Slabs scanned                    33536       27136       18560
Direct inode steals                  0           0           0
Kswapd inode steals               8641        1035           0
Kswapd skipped wait                  0           0           0
THP fault alloc                      8          37          33
THP collapse alloc                 508         552         515
THP splits                          24           1           1
THP fault fallback                   0           0           0
THP collapse fail                    0           0           0

There are a number of observations to make here

1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
   pages swapped were really unused anonymous pages. Related to that,
   major faults are much reduced.

2. kswapd efficiency was impacted by the initial series but with these
   follow-up patches, the efficiency is now at 66% indicating that far
   fewer pages were skipped during scanning due to dirty or writeback
   pages.

3. kswapd velocity is reduced indicating that fewer pages are being scanned
   with the follow-up series as kswapd now stalls when the tail of the
   LRU queue is full of unqueued dirty pages. The stall gives flushers a
   chance to catch-up so kswapd can reclaim clean pages when it wakes

4. In light of Zlatko's recent reports about zone scanning imbalances,
   mmtests now reports scanning velocity on a per-zone basis. With mainline,
   you can see that the scanning activity is dominated by the Normal
   zone with over 45 times more scanning in Normal than the DMA32 zone.
   With the series currently in mmotm, the ratio is slightly better but it
   is still the case that the bulk of scanning is in the highest zone. With
   this follow-up series, the ratio of scanning between the Normal and
   DMA32 zone is roughly equal.

5. As Dave Chinner observed, the current patches in mmotm increased the
   number of pages written from kswapd context which is expected to adversly
   impact IO performance. With the follow-up patches, far fewer pages are
   written from kswapd context than the mainline kernel

6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
   the follow-up series, there is less slab shrinking activity and no inodes
   were reclaimed.

7. Note that "Sectors Read" is drastically reduced implying that the source
   data being used for the IO is not being aggressively discarded due to
   page reclaim skipping over dirty pages and reclaiming clean pages. Note
   that the reducion in reads could also be due to inode data not being
   re-read from disk after a slab shrink.

                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz        166.99       32.09       33.44
Mean sda-await        853.64      192.76      185.43
Mean sda-r_await        6.31        9.24        5.97
Mean sda-w_await     2992.81      202.65      192.43
Max  sda-avgqz       1409.91      718.75      698.98
Max  sda-await       6665.74     3538.00     3124.23
Max  sda-r_await       58.96      111.95       58.00
Max  sda-w_await    28458.94     3977.29     3148.61

In light of the changes in writes from reclaim context, the number of
reads and Dave Chinner's concerns about IO performance I took a closer
look at the IO stats for the test disk. Few observations

1. The average queue size is reduced by the initial series and roughly
   the same with this follow up.

2. Average wait times for writes are reduced and as the IO
   is completing faster it at least implies that the gain is because
   flushers are writing the files efficiently instead of page reclaim
   getting in the way.

3. The reduction in maximum write latency is staggering. 28 seconds down
   to 3 seconds.

Jan Kara asked how NFS is affected by all of this. Unstable pages can
be taken into account as one of the patches in the series shows but it
is still the case that filesystems with unusual handling of dirty or
writeback could still be treated better.

Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.

I ran a longer-lived memcached test with IO going to NFS instead of a local disk

parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23323.00 (  0.00%)          23241.00 ( -0.35%)          23321.00 ( -0.01%)
Ops memcachetest-715M           25526.00 (  0.00%)          24763.00 ( -2.99%)          23242.00 ( -8.95%)
Ops memcachetest-2385M           8814.00 (  0.00%)          26924.00 (205.47%)          23521.00 (166.86%)
Ops memcachetest-4055M           5835.00 (  0.00%)          26827.00 (359.76%)          25560.00 (338.05%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               65.00 (  0.00%)             71.00 ( -9.23%)             11.00 ( 83.08%)
Ops io-duration-2385M             129.00 (  0.00%)             94.00 ( 27.13%)             53.00 ( 58.91%)
Ops io-duration-4055M             301.00 (  0.00%)            100.00 ( 66.78%)            108.00 ( 64.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M              14394.00 (  0.00%)            949.00 ( 93.41%)             63.00 ( 99.56%)
Ops swaptotal-2385M            401483.00 (  0.00%)          24437.00 ( 93.91%)          30118.00 ( 92.50%)
Ops swaptotal-4055M            554123.00 (  0.00%)          35688.00 ( 93.56%)          63082.00 ( 88.62%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                  4522.00 (  0.00%)            560.00 ( 87.62%)             63.00 ( 98.61%)
Ops swapin-2385M               169861.00 (  0.00%)           5026.00 ( 97.04%)          13917.00 ( 91.81%)
Ops swapin-4055M               192374.00 (  0.00%)          10056.00 ( 94.77%)          25729.00 ( 86.63%)
Ops minorfaults-0M            1445969.00 (  0.00%)        1520878.00 ( -5.18%)        1454024.00 ( -0.56%)
Ops minorfaults-715M          1557288.00 (  0.00%)        1528482.00 (  1.85%)        1535776.00 (  1.38%)
Ops minorfaults-2385M         1692896.00 (  0.00%)        1570523.00 (  7.23%)        1559622.00 (  7.87%)
Ops minorfaults-4055M         1654985.00 (  0.00%)        1581456.00 (  4.44%)        1596713.00 (  3.52%)
Ops majorfaults-0M                  0.00 (  0.00%)              1.00 (-99.00%)              0.00 (  0.00%)
Ops majorfaults-715M              763.00 (  0.00%)            265.00 ( 65.27%)             75.00 ( 90.17%)
Ops majorfaults-2385M           23861.00 (  0.00%)            894.00 ( 96.25%)           2189.00 ( 90.83%)
Ops majorfaults-4055M           27210.00 (  0.00%)           1569.00 ( 94.23%)           4088.00 ( 84.98%)

1. Performance does not collapse due to IO which is good. IO is also completing
   faster. Note with mmotm, IO completes in a third of the time and faster again
   with this series applied

2. Swapping is reduced, although not eliminated. The figures for the follow-up
   look bad but it does vary a bit as the stalling is not perfect for nfs
   or filesystems like ext3 with unusual handling of dirty and writeback
   pages

3. There are swapins, particularly with larger amounts of IO indicating
   that active pages are being reclaimed. However, the number of much
   reduced.

                                 3.9.0       3.9.0       3.9.0
                               vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Minor Faults                  36339175    35025445    35219699
Major Faults                    310964       27108       51887
Swap Ins                       2176399      173069      333316
Swap Outs                      3344050      357228      504824
Direct pages scanned              8972       77283       43242
Kswapd pages scanned          20899983     8939566    14772851
Kswapd pages reclaimed         6193156     5172605     5231026
Direct pages reclaimed            8450       73802       39514
Kswapd efficiency                  29%         57%         35%
Kswapd velocity               3929.743    1847.499    3058.840
Direct efficiency                  94%         95%         91%
Direct velocity                  1.687      15.972       8.954
Percentage direct scans             0%          0%          0%
Zone normal velocity          3721.907     939.103    2185.142
Zone dma32 velocity            209.522     924.368     882.651
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     4082185.000  526319.000  537114.000
Page writes file                738135      169091       32290
Page writes anon               3344050      357228      504824
Page reclaim immediate            9524         170     5595843
Sector Reads                   8909900      861192     1483680
Sector Writes                 13428980     1488744     2076800
Page rescued immediate               0           0           0
Slabs scanned                    38016       31744       28672
Direct inode steals                  0           0           0
Kswapd inode steals                424           0           0
Kswapd skipped wait                  0           0           0
THP fault alloc                     14          15         119
THP collapse alloc                1767        1569        1618
THP splits                          30          29          25
THP fault fallback                   0           0           0
THP collapse fail                    8           5           0
Compaction stalls                   17          41         100
Compaction success                   7          31          95
Compaction failures                 10          10           5
Page migrate success              7083       22157       62217
Page migrate failure                 0           0           0
Compaction pages isolated        14847       48758      135830
Compaction migrate scanned       18328       48398      138929
Compaction free scanned        2000255      355827     1720269
Compaction cost                      7          24          68

I guess the main takeaway again is the much reduced page writes
from reclaim context and reduced reads.

                       3.9.0       3.9.0       3.9.0
                     vanillamm1-mmotm-20130522mm1-lessdisrupt-v7r10
Mean sda-avgqz         23.58        0.35        0.44
Mean sda-await        133.47       15.72       15.46
Mean sda-r_await        4.72        4.69        3.95
Mean sda-w_await      507.69       28.40       33.68
Max  sda-avgqz        680.60       12.25       23.14
Max  sda-await       3958.89      221.83      286.22
Max  sda-r_await       63.86       61.23       67.29
Max  sda-w_await    11710.38      883.57     1767.28

And as before, write wait times are much reduced.

This patch:

The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to writeback pages from reclaim
context based on the number of dirty pages encountered.  This situation is
flagged too easily and flushers are not given the chance to catch up
resulting in more pages being written from reclaim context and potentially
impacting IO performance.  The check for PageWriteback is also misplaced
as it happens within a PageDirty check which is nonsense as the dirty may
have been cleared for IO.  The accounting is updated very late and pages
that are already under writeback, were reactivated, could not unmapped or
could not be released are all missed.  Similarly, a page is considered
congested for reasons other than being congested and pages that cannot be
written out in the correct context are skipped.  Finally, it considers
stalling and writing back filesystem pages due to encountering dirty
anonymous pages at the tail of the LRU which is dumb.

This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail
of the LRU were unqueued dirty pages.  Before it starts writing filesystem
pages, it will stall to give flushers a chance to catch up.  The decision
on whether wait_iff_congested is also now determined by dirty filesystem
pages only.  Congested pages are based on whether the underlying BDI is
congested regardless of the context of the reclaiming process.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: move logic from balance_pgdat() to kswapd_shrink_zone()
Mel Gorman [Wed, 19 Jun 2013 00:06:01 +0000 (10:06 +1000)]
mm: vmscan: move logic from balance_pgdat() to kswapd_shrink_zone()

balance_pgdat() is very long and some of the logic can and should be
internal to kswapd_shrink_zone().  Move it so the flow of balance_pgdat()
is marginally easier to follow.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: check if kswapd should writepage once per pgdat scan
Mel Gorman [Wed, 19 Jun 2013 00:06:00 +0000 (10:06 +1000)]
mm: vmscan: check if kswapd should writepage once per pgdat scan

Currently kswapd checks if it should start writepage as it shrinks each
zone without taking into consideration if the zone is balanced or not.
This is not wrong as such but it does not make much sense either.  This
patch checks once per pgdat scan if kswapd should be writing pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm-vmscan-block-kswapd-if-it-is-encountering-pages-under-writeback-fix-2
Mel Gorman [Wed, 19 Jun 2013 00:06:00 +0000 (10:06 +1000)]
mm-vmscan-block-kswapd-if-it-is-encountering-pages-under-writeback-fix-2

The patch "mm: vmscan: Block kswapd if it is encountering pages under
writeback" stalls in congestion_wait it encounters a page under writeback
that is marked for immediate reclaim.  Initially this was a
wait_on_page_writeback() but after the switch to congestion_wait(), there
is no guarantee the page has completed writeback and it can be placed on a
list for freeing.

This is a fix for
mm-vmscan-block-kswapd-if-it-is-encountering-pages-under-writeback.patch

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm-vmscan-block-kswapd-if-it-is-encountering-pages-under-writeback-fix
Mel Gorman [Wed, 19 Jun 2013 00:06:00 +0000 (10:06 +1000)]
mm-vmscan-block-kswapd-if-it-is-encountering-pages-under-writeback-fix

Historically, kswapd used to congestion_wait() at higher priorities if it
was not making forward progress. This made no sense as the failure to make
progress could be completely independent of IO. It was later replaced by
wait_iff_congested() and removed entirely by commit 258401a6 (mm: don't
wait on congested zones in balance_pgdat()) as it was duplicating logic
in shrink_inactive_list().

This is problematic. If kswapd encounters many pages under writeback and
it continues to scan until it reaches the high watermark then it will
quickly skip over the pages under writeback and reclaim clean young
pages or push applications out to swap.

The use of wait_iff_congested() is not suited to kswapd as it will only
stall if the underlying BDI is really congested or a direct reclaimer was
unable to write to the underlying BDI. kswapd bypasses the BDI congestion
as it sets PF_SWAPWRITE but even if this was taken into account then it
would cause direct reclaimers to stall on writeback which is not desirable.

This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
encountering too many pages under writeback. If this flag is set and
kswapd encounters a PageReclaim page under writeback then it'll assume
that the LRU lists are being recycled too quickly before IO can complete
and block waiting for some IO to complete.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: block kswapd if it is encountering pages under writeback
Mel Gorman [Wed, 19 Jun 2013 00:06:00 +0000 (10:06 +1000)]
mm: vmscan: block kswapd if it is encountering pages under writeback

Historically, kswapd used to congestion_wait() at higher priorities if it
was not making forward progress.  This made no sense as the failure to
make progress could be completely independent of IO.  It was later
replaced by wait_iff_congested() and removed entirely by commit 258401a6
(mm: don't wait on congested zones in balance_pgdat()) as it was
duplicating logic in shrink_inactive_list().

This is problematic.  If kswapd encounters many pages under writeback and
it continues to scan until it reaches the high watermark then it will
quickly skip over the pages under writeback and reclaim clean young pages
or push applications out to swap.

The use of wait_iff_congested() is not suited to kswapd as it will only
stall if the underlying BDI is really congested or a direct reclaimer was
unable to write to the underlying BDI.  kswapd bypasses the BDI congestion
as it sets PF_SWAPWRITE but even if this was taken into account then it
would cause direct reclaimers to stall on writeback which is not
desirable.

This patch sets a ZONE_WRITEBACK flag if direct reclaim or kswapd is
encountering too many pages under writeback.  If this flag is set and
kswapd encounters a PageReclaim page under writeback then it'll assume
that the LRU lists are being recycled too quickly before IO can complete
and block waiting for some IO to complete.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: have kswapd writeback pages based on dirty pages encountered, not priority
Mel Gorman [Wed, 19 Jun 2013 00:05:59 +0000 (10:05 +1000)]
mm: vmscan: have kswapd writeback pages based on dirty pages encountered, not priority

Currently kswapd queues dirty pages for writeback if scanning at an
elevated priority but the priority kswapd scans at is not related to the
number of unqueued dirty encountered.  Since commit "mm: vmscan: Flatten
kswapd priority loop", the priority is related to the size of the LRU and
the zone watermark which is no indication as to whether kswapd should
write pages or not.

This patch tracks if an excessive number of unqueued dirty pages are being
encountered at the end of the LRU.  If so, it indicates that dirty pages
are being recycled before flusher threads can clean them and flags the
zone so that kswapd will start writing pages until the zone is balanced.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: do not allow kswapd to scan at maximum priority
Mel Gorman [Wed, 19 Jun 2013 00:05:59 +0000 (10:05 +1000)]
mm: vmscan: do not allow kswapd to scan at maximum priority

Page reclaim at priority 0 will scan the entire LRU as priority 0 is
considered to be a near OOM condition.  Kswapd can reach priority 0 quite
easily if it is encountering a large number of pages it cannot reclaim
such as pages under writeback.  When this happens, kswapd reclaims very
aggressively even though there may be no real risk of allocation failure
or OOM.

This patch prevents kswapd reaching priority 0 and trying to reclaim the
world.  Direct reclaimers will still reach priority 0 in the event of an
OOM situation.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: decide whether to compact the pgdat based on reclaim progress
Mel Gorman [Wed, 19 Jun 2013 00:05:59 +0000 (10:05 +1000)]
mm: vmscan: decide whether to compact the pgdat based on reclaim progress

In the past, kswapd makes a decision on whether to compact memory after
the pgdat was considered balanced.  This more or less worked but it is
late to make such a decision and does not fit well now that kswapd makes a
decision whether to exit the zone scanning loop depending on reclaim
progress.

This patch will compact a pgdat if at least the requested number of pages
were reclaimed from unbalanced zones for a given priority.  If any zone is
currently balanced, kswapd will not call compaction as it is expected the
necessary pages are already available.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: flatten kswapd priority loop
Mel Gorman [Wed, 19 Jun 2013 00:05:58 +0000 (10:05 +1000)]
mm: vmscan: flatten kswapd priority loop

kswapd stops raising the scanning priority when at least SWAP_CLUSTER_MAX
pages have been reclaimed or the pgdat is considered balanced.  It then
rechecks if it needs to restart at DEF_PRIORITY and whether high-order
reclaim needs to be reset.  This is not wrong per-se but it is confusing
to follow and forcing kswapd to stay at DEF_PRIORITY may require several
restarts before it has scanned enough pages to meet the high watermark
even at 100% efficiency.  This patch irons out the logic a bit by
controlling when priority is raised and removing the "goto loop_again".

This patch has kswapd raise the scanning priority until it is scanning
enough pages that it could meet the high watermark in one shrink of the
LRU lists if it is able to reclaim at 100% efficiency.  It will not raise
the scanning prioirty higher unless it is failing to reclaim any pages.

To avoid infinite looping for high-order allocation requests kswapd will
not reclaim for high-order allocations when it has reclaimed at least
twice the number of pages as the allocation request.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: obey proportional scanning requirements for kswapd
Mel Gorman [Wed, 19 Jun 2013 00:05:58 +0000 (10:05 +1000)]
mm: vmscan: obey proportional scanning requirements for kswapd

Simplistically, the anon and file LRU lists are scanned proportionally
depending on the value of vm.swappiness although there are other factors
taken into account by get_scan_count().  The patch "mm: vmscan: Limit the
number of pages kswapd reclaims" limits the number of pages kswapd
reclaims but it breaks this proportional scanning and may evenly shrink
anon/file LRUs regardless of vm.swappiness.

This patch preserves the proportional scanning and reclaim.  It does mean
that kswapd will reclaim more than requested but the number of pages will
be related to the high watermark.

[mhocko@suse.cz: Correct proportional reclaim for memcg and simplify]
[kamezawa.hiroyu@jp.fujitsu.com: Recalculate scan based on target]
[hannes@cmpxchg.org: Account for already scanned pages properly]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: limit the number of pages kswapd reclaims at each priority
Mel Gorman [Wed, 19 Jun 2013 00:05:58 +0000 (10:05 +1000)]
mm: vmscan: limit the number of pages kswapd reclaims at each priority

This series does not fix all the current known problems with reclaim but
it addresses one important swapping bug when there is background IO.

Changelog since V3
o Drop the slab shrink changes in light of Glaubers series and
  discussions highlighted that there were a number of potential
  problems with the patch. (mel)
o Rebased to 3.10-rc1

Changelog since V2
o Preserve ratio properly for proportional scanning (kamezawa)

Changelog since V1
o Rename ZONE_DIRTY to ZONE_TAIL_LRU_DIRTY (andi)
o Reformat comment in shrink_page_list (andi)
o Clarify some comments (dhillf)
o Rework how the proportional scanning is preserved
o Add PageReclaim check before kswapd starts writeback
o Reset sc.nr_reclaimed on every full zone scan

Kswapd and page reclaim behaviour has been screwy in one way or the other
for a long time.  Very broadly speaking it worked in the far past because
machines were limited in memory so it did not have that many pages to scan
and it stalled congestion_wait() frequently to prevent it going completely
nuts.  In recent times it has behaved very unsatisfactorily with some of
the problems compounded by the removal of stall logic and the introduction
of transparent hugepage support with high-order reclaims.

There are many variations of bugs that are rooted in this area.  One
example is reports of a large copy operations or backup causing the
machine to grind to a halt or applications pushed to swap.  Sometimes in
low memory situations a large percentage of memory suddenly gets
reclaimed.  In other cases an application starts and kswapd hits 100% CPU
usage for prolonged periods of time and so on.  There is now talk of
introducing features like an extra free kbytes tunable to work around
aspects of the problem instead of trying to deal with it.  It's compounded
by the problem that it can be very workload and machine specific.

This series aims at addressing some of the worst of these problems without
attempting to fundmentally alter how page reclaim works.

Patches 1-2 limits the number of pages kswapd reclaims while still obeying
the anon/file proportion of the LRUs it should be scanning.

Patches 3-4 control how and when kswapd raises its scanning priority and
deletes the scanning restart logic which is tricky to follow.

Patch 5 notes that it is too easy for kswapd to reach priority 0 when
scanning and then reclaim the world. Down with that sort of thing.

Patch 6 notes that kswapd starts writeback based on scanning priority which
is not necessarily related to dirty pages. It will have kswapd
writeback pages if a number of unqueued dirty pages have been
recently encountered at the tail of the LRU.

Patch 7 notes that sometimes kswapd should stall waiting on IO to complete
to reduce LRU churn and the likelihood that it'll reclaim young
clean pages or push applications to swap. It will cause kswapd
to block on IO if it detects that pages being reclaimed under
writeback are recycling through the LRU before the IO completes.

Patchies 8-9 are cosmetic but balance_pgdat() is easier to follow after they
are applied.

This was tested using memcached+memcachetest while some background IO was
in progress as implemented by the parallel IO tests implement in MM Tests.
 memcachetest benchmarks how many operations/second memcached can service
and it is run multiple times.  It starts with no background IO and then
re-runs the test with larger amounts of IO in the background to roughly
simulate a large copy in progress.  The expectation is that the IO should
have little or no impact on memcachetest which is running entirely in
memory.

                                        3.10.0-rc1                  3.10.0-rc1
                                           vanilla            lessdisrupt-v4
Ops memcachetest-0M             22155.00 (  0.00%)          22180.00 (  0.11%)
Ops memcachetest-715M           22720.00 (  0.00%)          22355.00 ( -1.61%)
Ops memcachetest-2385M           3939.00 (  0.00%)          23450.00 (495.33%)
Ops memcachetest-4055M           3628.00 (  0.00%)          24341.00 (570.92%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)
Ops io-duration-2385M             118.00 (  0.00%)             21.00 ( 82.20%)
Ops io-duration-4055M             162.00 (  0.00%)             36.00 ( 77.78%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140134.00 (  0.00%)             18.00 ( 99.99%)
Ops swaptotal-2385M            392438.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            449037.00 (  0.00%)          27864.00 ( 93.79%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               148031.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               135109.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1529984.00 (  0.00%)        1530235.00 ( -0.02%)
Ops minorfaults-715M          1794168.00 (  0.00%)        1613750.00 ( 10.06%)
Ops minorfaults-2385M         1739813.00 (  0.00%)        1609396.00 (  7.50%)
Ops minorfaults-4055M         1754460.00 (  0.00%)        1614810.00 (  7.96%)
Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              185.00 (  0.00%)            180.00 (  2.70%)
Ops majorfaults-2385M           24472.00 (  0.00%)            101.00 ( 99.59%)
Ops majorfaults-4055M           22302.00 (  0.00%)            229.00 ( 98.97%)

Note how the vanilla kernels performance collapses when there is enough IO
taking place in the background.  This drop in performance is part of what
users complain of when they start backups.  Note how the swapin and major
fault figures indicate that processes were being pushed to swap
prematurely.  With the series applied, there is no noticable performance
drop and while there is still some swap activity, it's tiny.

20 iterations of this test were run in total and averaged.  Every 5
iterations, additional IO was generated in the background using dd to
measure how the workload was impacted.  The 0M, 715M, 2385M and 4055M
subblock refer to the amount of IO going on in the background at each
iteration.  So memcachetest-2385M is reporting how many
transactions/second memcachetest recorded on average over 5 iterations
while there was 2385M of IO going on in the ground.  There are six blocks
of information reported here

memcachetest is the transactions/second reported by memcachetest. In
the vanilla kernel note that performance drops from around
22K/sec to just under 4K/second when there is 2385M of IO going
on in the background. This is one type of performance collapse
users complain about if a large cp or backup starts in the
background

io-duration refers to how long it takes for the background IO to
complete. It's showing that with the patched kernel that the IO
completes faster while not interfering with the memcache
workload

swaptotal is the total amount of swap traffic. With the patched kernel,
the total amount of swapping is much reduced although it is
still not zero.

swapin in this case is an indication as to whether we are swap trashing.
The closer the swapin/swapout ratio is to 1, the worse the
trashing is.  Note with the patched kernel that there is no swapin
activity indicating that all the pages swapped were really inactive
unused pages.

minorfaults are just minor faults. An increased number of minor faults
can indicate that page reclaim is unmapping the pages but not
swapping them out before they are faulted back in. With the
patched kernel, there is only a small change in minor faults

majorfaults are just major faults in the target workload and a high
number can indicate that a workload is being prematurely
swapped. With the patched kernel, major faults are much reduced. As
there are no swapin's recorded so it's not being swapped. The likely
explanation is that that libraries or configuration files used by
the workload during startup get paged out by the background IO.

Overall with the series applied, there is no noticable performance drop
due to background IO and while there is still some swap activity, it's
tiny and the lack of swapins imply that the swapped pages were inactive
and unused.

                            3.10.0-rc1  3.10.0-rc1
                               vanilla lessdisrupt-v4
Page Ins                       1234608      101892
Page Outs                     12446272    11810468
Swap Ins                        283406           0
Swap Outs                       698469       27882
Direct pages scanned                 0      136480
Kswapd pages scanned           6266537     5369364
Kswapd pages reclaimed         1088989      930832
Direct pages reclaimed               0      120901
Kswapd efficiency                  17%         17%
Kswapd velocity               5398.371    4635.115
Direct efficiency                 100%         88%
Direct velocity                  0.000     117.817
Percentage direct scans             0%          2%
Page writes by reclaim         1655843     4009929
Page writes file                957374     3982047
Page writes anon                698469       27882
Page reclaim immediate            5245        1745
Page rescued immediate               0           0
Slabs scanned                    33664       25216
Direct inode steals                  0           0
Kswapd inode steals              19409         778
Kswapd skipped wait                  0           0
THP fault alloc                     35          30
THP collapse alloc                 472         401
THP splits                          27          22
THP fault fallback                   0           0
THP collapse fail                    0           1
Compaction stalls                    0           4
Compaction success                   0           0
Compaction failures                  0           4
Page migrate success                 0           0
Page migrate failure                 0           0
Compaction pages isolated            0           0
Compaction migrate scanned           0           0
Compaction free scanned              0           0
Compaction cost                      0           0
NUMA PTE updates                     0           0
NUMA hint faults                     0           0
NUMA hint local faults               0           0
NUMA pages migrated                  0           0
AutoNUMA cost                        0           0

Unfortunately, note that there is a small amount of direct reclaim due to
kswapd no longer reclaiming the world.  ftrace indicates that the direct
reclaim stalls are mostly harmless with the vast bulk of the stalls
incurred by dd

     23 tclsh-3367
     38 memcachetest-13733
     49 memcachetest-12443
     57 tee-3368
   1541 dd-13826
   1981 dd-12539

A consequence of the direct reclaim for dd is that the processes for the
IO workload may show a higher system CPU usage.  There is also a risk that
kswapd not reclaiming the world may mean that it stays awake balancing
zones, does not stall on the appropriate events and continually scans
pages it cannot reclaim consuming CPU.  This will be visible as continued
high CPU usage but in my own tests I only saw a single spike lasting less
than a second and I did not observe any problems related to reclaim while
running the series on my desktop.

This patch:

The number of pages kswapd can reclaim is bound by the number of pages it
scans which is related to the size of the zone and the scanning priority.
In many cases the priority remains low because it's reset every
SWAP_CLUSTER_MAX reclaimed pages but in the event kswapd scans a large
number of pages it cannot reclaim, it will raise the priority and
potentially discard a large percentage of the zone as sc->nr_to_reclaim is
ULONG_MAX.  The user-visible effect is a reclaim "spike" where a large
percentage of memory is suddenly freed.  It would be bad enough if this
was just unused memory but because of how anon/file pages are balanced it
is possible that applications get pushed to swap unnecessarily.

This patch limits the number of pages kswapd will reclaim to the high
watermark.  Reclaim will still overshoot due to it not being a hard limit
as shrink_lruvec() will ignore the sc.nr_to_reclaim at DEF_PRIORITY but it
prevents kswapd reclaiming the world at higher priorities.  The number of
pages it reclaims is not adjusted for high-order allocations as kswapd
will reclaim excessively if it is to balance zones for high-order
allocations.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
Cc: dormando <dormando@rydia.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/THP: deposit the transpare huge pgtable before set_pmd
Aneesh Kumar K.V [Wed, 19 Jun 2013 00:05:57 +0000 (10:05 +1000)]
mm/THP: deposit the transpare huge pgtable before set_pmd

Architectures like powerpc use the deposited pgtable to store hash index
values.  We need to make the deposted pgtable is visible to other cpus
before we are ready to take a hash fault.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agothp: define HPAGE_PMD_* constants as BUILD_BUG() if !THP
Kirill A. Shutemov [Wed, 19 Jun 2013 00:05:57 +0000 (10:05 +1000)]
thp: define HPAGE_PMD_* constants as BUILD_BUG() if !THP

Currently, HPAGE_PMD_* constans rely on PMD_SHIFT regardless of
CONFIG_TRANSPARENT_HUGEPAGE.  PMD_SHIFT is not defined everywhere (e.g.
arm nommu case).

It means we can't use anything like this in generic code:

        if (PageTransHuge(page))
                zero_huge_user(page, 0, HPAGE_PMD_SIZE);
        else
                clear_highpage(page);

For !THP case, PageTransHuge() is 0 and compiler can eliminate
zero_huge_user() call.  But it still need to be valid C expression, means
HPAGE_PMD_SIZE has to expand to something compiler can understand.

Previously, HPAGE_PMD_* were defined to BUILD_BUG() for !THP. Let's come
back to it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/THP: don't use HPAGE_SHIFT in transparent hugepage code
Aneesh Kumar K.V [Wed, 19 Jun 2013 00:05:57 +0000 (10:05 +1000)]
mm/THP: don't use HPAGE_SHIFT in transparent hugepage code

For architectures like powerpc that support multiple explicit hugepage
sizes, HPAGE_SHIFT indicate the default explicit hugepage shift.  For THP
to work the hugepage size should be same as PMD_SIZE.  So use PMD_SHIFT
directly.  So move the define outside CONFIG_TRANSPARENT_HUGEPAGE #ifdef
because we want to use these defines in generic code with if
(pmd_trans_huge()) conditional.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/THP: withdraw the pgtable after pmdp related operations
Aneesh Kumar K.V [Wed, 19 Jun 2013 00:05:57 +0000 (10:05 +1000)]
mm/THP: withdraw the pgtable after pmdp related operations

For architectures like ppc64 we look at deposited pgtable when calling
pmdp_get_and_clear.  So do the pgtable_trans_huge_withdraw after finishing
pmdp related operations.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/THP: add pmd args to pgtable deposit and withdraw APIs
Aneesh Kumar K.V [Wed, 19 Jun 2013 00:05:56 +0000 (10:05 +1000)]
mm/THP: add pmd args to pgtable deposit and withdraw APIs

This will be later used by powerpc THP support.  In powerpc we want to use
pgtable for storing the hash index values.  So instead of adding them to
mm_context list, we would like to store them in the second half of pmd

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/thp: use the correct function when updating access flags
Aneesh Kumar K.V [Wed, 19 Jun 2013 00:05:56 +0000 (10:05 +1000)]
mm/thp: use the correct function when updating access flags

We should use pmdp_set_access_flags to update access flags.  Archs like
powerpc use extra checks(_PAGE_BUSY) when updating a hugepage PTE.  A
set_pmd_at doesn't do those checks.  We should use set_pmd_at only when
updating a none hugepage PTE.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>a
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/page_alloc: don't re-init pageset in zone_pcp_update()
Cody P Schafer [Wed, 19 Jun 2013 00:05:56 +0000 (10:05 +1000)]
mm/page_alloc: don't re-init pageset in zone_pcp_update()

When memory hotplug is triggered, we call pageset_init() on
per-cpu-pagesets which both contain pages and are in use, causing both the
leakage of those pages and (potentially) bad behaviour if a page is
allocated from a pageset while it is being cleared.

Avoid this by factoring out pageset_set_high_and_batch() (which contains
all needed logic too set a pageset's ->high and ->batch inrespective of
system state) from zone_pageset_init() and using the new
pageset_set_high_and_batch() instead of zone_pageset_init() in
zone_pcp_update().

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch()
Cody P Schafer [Wed, 19 Jun 2013 00:05:55 +0000 (10:05 +1000)]
mm/page_alloc: rename setup_pagelist_highmark() to match naming of pageset_set_batch()

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/page_alloc: in zone_pcp_update(), uze zone_pageset_init()
Cody P Schafer [Wed, 19 Jun 2013 00:05:55 +0000 (10:05 +1000)]
mm/page_alloc: in zone_pcp_update(), uze zone_pageset_init()

Previously, zone_pcp_update() called pageset_set_batch() directly,
essentially assuming that percpu_pagelist_fraction == 0.  Correct this by
calling zone_pageset_init(), which chooses the appropriate ->batch and
->high calculations.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset()
Cody P Schafer [Wed, 19 Jun 2013 00:05:55 +0000 (10:05 +1000)]
mm/page_alloc: factor zone_pageset_init() out of setup_zone_pageset()

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/page_alloc: relocate comment to be directly above code it refers to.
Cody P Schafer [Wed, 19 Jun 2013 00:05:54 +0000 (10:05 +1000)]
mm/page_alloc: relocate comment to be directly above code it refers to.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch()
Cody P Schafer [Wed, 19 Jun 2013 00:05:54 +0000 (10:05 +1000)]
mm/page_alloc: factor setup_pageset() into pageset_init() and pageset_set_batch()

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/page_alloc: when handling percpu_pagelist_fraction, don't unneedly recalulate...
Cody P Schafer [Wed, 19 Jun 2013 00:05:54 +0000 (10:05 +1000)]
mm/page_alloc: when handling percpu_pagelist_fraction, don't unneedly recalulate high

Simply moves calculation of the new 'high' value outside the
for_each_possible_cpu() loop, as it does not depend on the cpu.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_m...
Cody P Schafer [Wed, 19 Jun 2013 00:05:53 +0000 (10:05 +1000)]
mm/page_alloc: convert zone_pcp_update() to rely on memory barriers instead of stop_machine()

zone_pcp_update()'s goal is to adjust the ->high and ->mark members of a
percpu pageset based on a zone's ->managed_pages.  We don't need to drain
the entire percpu pageset just to modify these fields.

This lets us avoid calling setup_pageset() (and the draining required to
call it) and instead allows simply setting the fields' values (with some
attention paid to memory barriers to prevent the relationship between
->batch and ->high from being thrown off).

This does change the behavior of zone_pcp_update() as the percpu pagesets
will not be drained when zone_pcp_update() is called (they will end up
being shrunk, not completely drained, later when a 0-order page is freed
in free_hot_cold_page()).

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE
Cody P Schafer [Wed, 19 Jun 2013 00:05:53 +0000 (10:05 +1000)]
mm/page_alloc: protect pcp->batch accesses with ACCESS_ONCE

pcp->batch could change at any point, avoid relying on it being a stable
value.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/page_alloc: insert memory barriers to allow async update of pcp batch and high
Cody P Schafer [Wed, 19 Jun 2013 00:05:53 +0000 (10:05 +1000)]
mm/page_alloc: insert memory barriers to allow async update of pcp batch and high

Introduce pageset_update() to perform a safe transision from one set of
pcp->{batch,high} to a new set using memory barriers.

This ensures that batch is always set to a safe value (1) prior to
updating high, and ensure that high is fully updated before setting the
real value of batch.  It avoids ->batch ever rising above ->high.

Suggested by Gilad Ben-Yossef in these threads:

https://lkml.org/lkml/2013/4/9/23
https://lkml.org/lkml/2013/4/10/49

Also reproduces his proposed comment.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Reviewed-by: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high
Cody P Schafer [Wed, 19 Jun 2013 00:05:52 +0000 (10:05 +1000)]
mm/page_alloc: prevent concurrent updaters of pcp ->batch and ->high

Because we are going to rely upon a careful transision between old and new
->high and ->batch values using memory barriers and will remove
stop_machine(), we need to prevent multiple updaters from interweaving
their memory writes.

Add a simple mutex to protect both update loops.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/page_alloc: factor out setting of pcp->high and pcp->batch
Cody P Schafer [Wed, 19 Jun 2013 00:05:52 +0000 (10:05 +1000)]
mm/page_alloc: factor out setting of pcp->high and pcp->batch

"Problems" with the current code:

1: there is a lack of synchronization in setting ->high and ->batch in
   percpu_pagelist_fraction_sysctl_handler()

2: stop_machine() in zone_pcp_update() is unnecissary.

3: zone_pcp_update() does not consider the case where
   percpu_pagelist_fraction is non-zero

To fix:

1: add memory barriers, a safe ->batch value, an update side mutex when
   updating ->high and ->batch, and use ACCESS_ONCE() for ->batch users
   that expect a stable value.

2: avoid draining pages in zone_pcp_update(), rely upon the memory
   barriers added to fix #1

3: factor out quite a few functions, and then call the appropriate one.

Note that it results in a change to the behavior of zone_pcp_update(),
which is used by memory_hotplug.  I'm rather certain that I've diserned
(and preserved) the essential behavior (changing ->high and ->batch), and
only eliminated unneeded actions (draining the per cpu pages), but this
may not be the case.

Further note that the draining of pages that previously took place in
zone_pcp_update() occured after repeated draining when attempting to
offline a page, and after the offline has "succeeded".  It appears that
the draining was added to zone_pcp_update() to avoid refactoring
setup_pageset() into 2 funtions.

This patch:

Creates pageset_set_batch() for use in setup_pageset().
pageset_set_batch() imitates the functionality of
setup_pagelist_highmark(), but uses the boot time
(percpu_pagelist_fraction == 0) calculations for determining ->high based
on ->batch.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Gilad Ben-Yossef <gilad@benyossef.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agouio: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT
Libin [Wed, 19 Jun 2013 00:05:52 +0000 (10:05 +1000)]
uio: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT

(*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
as a inline funcion vma_pages() in linux/mm.h, so using it.

Signed-off-by: Libin <huawei.libin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoncpfs: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT
Libin [Wed, 19 Jun 2013 00:05:52 +0000 (10:05 +1000)]
ncpfs: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT

(*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
as a inline funcion vma_pages() in linux/mm.h, so using it.

Signed-off-by: Libin <huawei.libin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT
Libin [Wed, 19 Jun 2013 00:05:51 +0000 (10:05 +1000)]
mm: use vma_pages() to replace (vm_end - vm_start) >> PAGE_SHIFT

(*->vm_end - *->vm_start) >> PAGE_SHIFT operation is implemented
as a inline funcion vma_pages() in linux/mm.h, so using it.

Signed-off-by: Libin <huawei.libin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoswap: swapin_nr_pages() can be static
Fengguang Wu [Wed, 19 Jun 2013 00:05:51 +0000 (10:05 +1000)]
swap: swapin_nr_pages() can be static

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoswap: add a simple detector for inappropriate swapin readahead
Shaohua Li [Wed, 19 Jun 2013 00:05:51 +0000 (10:05 +1000)]
swap: add a simple detector for inappropriate swapin readahead

This is a patch to improve swap readahead algorithm. It's from Hugh and I
slightly changed it.

Hugh's original changelog:

swapin readahead does a blind readahead, whether or not the swapin
is sequential.  This may be ok on harddisk, because large reads have
relatively small costs, and if the readahead pages are unneeded they
can be reclaimed easily - though, what if their allocation forced
reclaim of useful pages?  But on SSD devices large reads are more
expensive than small ones: if the readahead pages are unneeded,
reading them in caused significant overhead.

This patch adds very simplistic random read detection.  Stealing
the PageReadahead technique from Konstantin Khlebnikov's patch,
avoiding the vma/anon_vma sophistications of Shaohua Li's patch,
swapin_nr_pages() simply looks at readahead's current success
rate, and narrows or widens its readahead window accordingly.
There is little science to its heuristic: it's about as stupid
as can be whilst remaining effective.

The table below shows elapsed times (in centiseconds) when running
a single repetitive swapping load across a 1000MB mapping in 900MB
ram with 1GB swap (the harddisk tests had taken painfully too long
when I used mem=500M, but SSD shows similar results for that).

Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes
his Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1
patch which Shaohua showed to be defective; HughNew this Nov 14
patch, with page_cluster as usual at default of 3 (8-page reads);
HughPC4 this same patch with page_cluster 4 (16-page reads);
HughPC0 with page_cluster 0 (1-page reads: no readahead).

HDD for swapping to harddisk, SSD for swapping to VertexII SSD.
Seq for sequential access to the mapping, cycling five times around;
Rand for the same number of random touches.  Anon for a MAP_PRIVATE
anon mapping; Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

One weakness of Shaohua's vma/anon_vma approach was that it did
not optimize Shmem: seen below.  Konstantin's approach was perhaps
mistuned, 50% slower on Seq: did not compete and is not shown below.

HDD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     73921   76210   75611   76904   78191  121542
Seq Shmem    73601   73176   73855   72947   74543  118322
Rand Anon   895392  831243  871569  845197  846496  841680
Rand Shmem 1058375 1053486  827935  764955  764376  756489

SSD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     24634   24198   24673   25107   21614   70018
Seq Shmem    24959   24932   25052   25703   22030   69678
Rand Anon    43014   26146   28075   25989   26935   25901
Rand Shmem   45349   45215   28249   24268   24138   24332

These tests are, of course, two extremes of a very simple case:
under heavier mixed loads I've not yet observed any consistent
improvement or degradation, and wider testing would be welcome.

Shaohua Li:

Test shows Vanilla is slightly better in sequential workload than Hugh's patch.
I observed with Hugh's patch sometimes the readahead size is shrinked too fast
(from 8 to 1 immediately) in sequential workload if there is no hit. And in
such case, continuing doing readahead is good actually.

I don't prepare a sophisticated algorithm for the sequential workload because
so far we can't guarantee sequential accessed pages are swap out sequentially.
So I slightly change Hugh's heuristic - don't shrink readahead size too fast.

Here is my test result (unit second, 3 runs average):
Vanilla Hugh New
Seq 356 370 360
Random 4525 2447 2444

Attached graph is the swapin/swapout throughput I collected with 'vmstat 2'.
The first part is running a random workload (till around 1200 of the x-axis)
and the second part is running a sequential workload. swapin and swapout
throughput are almost identical in steady state in both workloads. These are
expected behavior. while in Vanilla, swapin is much bigger than swapout
especially in random workload (because wrong readahead).

Original patches by: Shaohua Li and Konstantin Khlebnikov.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm-remove-compressed-copy-from-zram-in-memory-fix-2-fix
Andrew Morton [Wed, 19 Jun 2013 00:05:50 +0000 (10:05 +1000)]
mm-remove-compressed-copy-from-zram-in-memory-fix-2-fix

invert unlikely() test, augment comment, 80-col cleanup

Cc: Artem Savkov <artem.savkov@gmail.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agonon-swapcache pages in end_swap_bio_read()
Artem Savkov [Wed, 19 Jun 2013 00:05:50 +0000 (10:05 +1000)]
non-swapcache pages in end_swap_bio_read()

There is no guarantee that page in end_swap_bio_read is in swapcache so we
need to check it before calling page_swap_info().  Otherwise kernel hits a
bug on like the one below.  Introduced in "mm: remove compressed copy from
zram in-memory"

kernel BUG at mm/swapfile.c:2361!
invalid opcode: 0000 [#1] SMP
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.10.0-rc4-next-20130607+ #61
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
task: ffff88001e5ccfc0 ti: ffff88001e5ea000 task.ti: ffff88001e5ea000
RIP: 0010:[<ffffffff811462eb>]  [<ffffffff811462eb>] page_swap_info+0xab/0xb0
RSP: 0000:ffff88001ec03c78  EFLAGS: 00010246
RAX: 0100000000000009 RBX: ffffea0000794780 RCX: 0000000000000c0b
RDX: 0000000000000046 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88001ec03c88 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000001 R14: ffff88001e7f6200 R15: 0000000000001000
FS:  0000000000000000(0000) GS:ffff88001ec00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000000240b000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
ffffea0000794780 ffff88001e7f6200 ffff88001ec03cb8 ffffffff81145486
ffff88001e5cd5c0 ffff88001c02cd20 0000000000001000 0000000000000000
ffff88001ec03cc8 ffffffff81199518 ffff88001ec03d28 ffffffff81518ec3
Call Trace:
<IRQ>
[<ffffffff81145486>] end_swap_bio_read+0x96/0x130
[<ffffffff81199518>] bio_endio+0x18/0x40
[<ffffffff81518ec3>] blk_update_request+0x213/0x540
[<ffffffff81518fa0>] ? blk_update_request+0x2f0/0x540
[<ffffffff817986a6>] ? ata_hsm_qc_complete+0x46/0x130
[<ffffffff81519212>] blk_update_bidi_request+0x22/0x90
[<ffffffff8151b9ea>] blk_end_bidi_request+0x2a/0x80
[<ffffffff8151ba8b>] blk_end_request+0xb/0x10
[<ffffffff817693aa>] scsi_io_completion+0xaa/0x6b0
[<ffffffff817608d8>] scsi_finish_command+0xc8/0x130
[<ffffffff81769aff>] scsi_softirq_done+0x13f/0x160
[<ffffffff81521ebc>] blk_done_softirq+0x7c/0x90
[<ffffffff81049030>] __do_softirq+0x130/0x3f0
[<ffffffff810d454e>] ? handle_irq_event+0x4e/0x70
[<ffffffff81049405>] irq_exit+0xa5/0xb0
[<ffffffff81003cb1>] do_IRQ+0x61/0xe0
[<ffffffff81c2832f>] common_interrupt+0x6f/0x6f
<EOI>
[<ffffffff8107ebff>] ? local_clock+0x4f/0x60
[<ffffffff81c27f85>] ? _raw_spin_unlock_irq+0x35/0x50
[<ffffffff81c27f7b>] ? _raw_spin_unlock_irq+0x2b/0x50
[<ffffffff81078bd0>] finish_task_switch+0x80/0x110
[<ffffffff81078b93>] ? finish_task_switch+0x43/0x110
[<ffffffff81c2525c>] __schedule+0x32c/0x8c0
[<ffffffff81c2c010>] ? notifier_call_chain+0x150/0x150
[<ffffffff81c259d4>] schedule+0x24/0x70
[<ffffffff81c25d42>] schedule_preempt_disabled+0x22/0x30
[<ffffffff81093645>] cpu_startup_entry+0x335/0x380
[<ffffffff81c1ed7e>] start_secondary+0x217/0x219
Code: 69 bc 16 82 48 c7 c7 77 bc 16 82 31 c0 49 c1 ec 39 49 c1 e9 10 41 83 e1 01 e8 6c d2 ad 00 5b 4a 8b 04 e5 e0 bf 14 83 41 5c c9 c3 <0f> 0b eb fe 90 48 8b 07 55 48 89 e5 a9 00 00 01 00 74 12 e8 3d
RIP  [<ffffffff811462eb>] page_swap_info+0xab/0xb0
RSP <ffff88001ec03c78>

Signed-off-by: Artem Savkov <artem.savkov@gmail.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm-remove-compressed-copy-from-zram-in-memory-fix
Andrew Morton [Wed, 19 Jun 2013 00:05:50 +0000 (10:05 +1000)]
mm-remove-compressed-copy-from-zram-in-memory-fix

tweak comment text

Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: remove compressed copy from zram in-memory
Minchan Kim [Wed, 19 Jun 2013 00:05:49 +0000 (10:05 +1000)]
mm: remove compressed copy from zram in-memory

Swap subsystem does lazy swap slot free with expecting the page would be
swapped out again so we can avoid unnecessary write.

But the problem in in-memory swap(ex, zram) is that it consumes memory
space until vm_swap_full(ie, used half of all of swap device) condition
meet.  It could be bad if we use multiple swap device, small in-memory
swap and big storage swap or in-memory swap alone.

This patch makes swap subsystem free swap slot as soon as swap-read is
completed and make the swapcache page dirty so the page should be written
out the swap device to reclaim it.  It means we never lose it.

I tested this patch with kernel compile workload.

1. before

compile time : 9882.42
zram max wasted space by fragmentation: 13471881 byte
memory space consumed by zram: 174227456 byte
the number of slot free notify: 206684

2. after

compile time : 9653.90
zram max wasted space by fragmentation: 11805932 byte
memory space consumed by zram: 154001408 byte
the number of slot free notify: 426972

* changelog from v3
  * Rebased on next-20130508

* changelog from v1
  * Add more comment

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Konrad Rzeszutek Wilk <konrad@darnok.org>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: remove free_area_cache
Michel Lespinasse [Wed, 19 Jun 2013 00:05:49 +0000 (10:05 +1000)]
mm: remove free_area_cache

Since all architectures have been converted to use vm_unmapped_area(),
there is no remaining use for the free_area_cache.

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Richard Henderson <rth@twiddle.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm, memcg: don't take task_lock in task_in_mem_cgroup
David Rientjes [Wed, 19 Jun 2013 00:05:49 +0000 (10:05 +1000)]
mm, memcg: don't take task_lock in task_in_mem_cgroup

For processes that have detached their mm's, task_in_mem_cgroup()
unnecessarily takes task_lock() when rcu_read_lock() is all that is
necessary to call mem_cgroup_from_task().

While we're here, switch task_in_mem_cgroup() to return bool.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agopagemap: prepare to reuse constant bits with page-shift
Pavel Emelyanov [Wed, 19 Jun 2013 00:05:48 +0000 (10:05 +1000)]
pagemap: prepare to reuse constant bits with page-shift

In order to reuse bits from pagemap entries gracefully, we leave the
entries as is but on pagemap open emit a warning in dmesg, that bits 55-60
are about to change in a couple of releases.  Next, if a user issues
soft-dirty clear command via the clear_refs file (it was disabled before
v3.9) we assume that he's aware of the new pagemap format, note that fact
and report the bits in pagemap in the new manner.

The "migration strategy" looks like this then:

1. existing users are not affected -- they don't touch soft-dirty feature, thus
   see old bits in pagemap, but are warned and have time to fix themselves
2. those who use soft-dirty know about new pagemap format
3. some time soon we get rid of any signs of page-shift in pagemap as well as
   this trick with clear-soft-dirty affecting pagemap format.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agosoft-dirty: call mmu notifiers when write-protecting ptes
Pavel Emelyanov [Wed, 19 Jun 2013 00:05:48 +0000 (10:05 +1000)]
soft-dirty: call mmu notifiers when write-protecting ptes

As noticed by Xiao, since soft-dirty clear command modifies page tables we
have to flush tlbs and call mmu notifiers.  While the former is done by
the clear_refs engine itself, the latter is to be done.

One thing to note about this -- in order not to call per-page invalidate
notifier (_all_ address space is about to be changed), the
_invalidate_range_start and _end are used.  But for those start and end
are not known exactly.  To address this, the same trick as in exit_mmap()
is used -- start is 0 and end is (unsigned long)-1.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: soft-dirty bits for user memory changes tracking
Pavel Emelyanov [Wed, 19 Jun 2013 00:05:48 +0000 (10:05 +1000)]
mm: soft-dirty bits for user memory changes tracking

The soft-dirty is a bit on a PTE which helps to track which pages a task
writes to. In order to do this tracking one should

  1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
  2. Wait some time.
  3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)

To do this tracking, the writable bit is cleared from PTEs when the
soft-dirty bit is. Thus, after this, when the task tries to modify a page
at some virtual address the #PF occurs and the kernel sets the soft-dirty
bit on the respective PTE.

Note, that although all the task's address space is marked as r/o after the
soft-dirty bits clear, the #PF-s that occur after that are processed fast.
This is so, since the pages are still mapped to physical memory, and thus
all the kernel does is finds this fact out and puts back writable, dirty
and soft-dirty bits on the PTE.

Another thing to note, is that when mremap moves PTEs they are marked with
soft-dirty as well, since from the user perspective mremap modifies the
virtual memory at mremap's new address.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agopagemap-introduce-pagemap_entry_t-without-pmshift-bits-v4
Pavel Emelyanov [Wed, 19 Jun 2013 00:05:47 +0000 (10:05 +1000)]
pagemap-introduce-pagemap_entry_t-without-pmshift-bits-v4

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agopagemap: introduce pagemap_entry_t without pmshift bits
Pavel Emelyanov [Wed, 19 Jun 2013 00:05:47 +0000 (10:05 +1000)]
pagemap: introduce pagemap_entry_t without pmshift bits

These bits are always constant (== PAGE_SHIFT) and just occupy space in
the entry.  Moreover, in next patch we will need to report one more bit in
the pagemap, but all bits are already busy on it.

That said, describe the pagemap entry that has 6 more free zero bits.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoclear_refs: introduce private struct for mm_walk
Pavel Emelyanov [Wed, 19 Jun 2013 00:05:47 +0000 (10:05 +1000)]
clear_refs: introduce private struct for mm_walk

In the next patch the clear-refs-type will be required in
clear_refs_pte_range funciton, so prepare the walk->private to carry this
info.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoclear_refs: sanitize accepted commands declaration
Pavel Emelyanov [Wed, 19 Jun 2013 00:05:46 +0000 (10:05 +1000)]
clear_refs: sanitize accepted commands declaration

This is the implementation of the soft-dirty bit concept that should help
keep track of changes in user memory, which in turn is very-very required
by the checkpoint-restore project (http://criu.org).

To create a dump of an application(s) we save all the information about it
to files, and the biggest part of such dump is the contents of tasks' memory.
However, there are usage scenarios where it's not required to get _all_ the
task memory while creating a dump. For example, when doing periodical dumps,
it's only required to take full memory dump only at the first step and then
take incremental changes of memory. Another example is live migration. We
copy all the memory to the destination node without stopping all tasks, then
stop them, check for what pages has changed, dump it and the rest of the state,
then copy it to the destination node. This decreases freeze time significantly.

That said, some help from kernel to watch how processes modify the contents
of their memory is required.

The proposal is to track changes with the help of new soft-dirty bit this way:

1. First do "echo 4 > /proc/$pid/clear_refs".
   At that point kernel clears the soft dirty _and_ the writable bits from all
   ptes of process $pid. From now on every write to any page will result in #pf
   and the subsequent call to pte_mkdirty/pmd_mkdirty, which in turn will set
   the soft dirty flag.

2. Then read the /proc/$pid/pagemap2 and check the soft-dirty bit reported there
   (the 55'th one). If set, the respective pte was written to since last call
   to clear refs.

The soft-dirty bit is the _PAGE_BIT_HIDDEN one. Although it's used by kmemcheck,
the latter one marks kernel pages with it, while the former bit is put on user
pages so they do not conflict to each other.

This patch:

A new clear-refs type will be added in the next patch, so prepare
code for that.

[akpm@linux-foundation.org: don't assume that sizeof(enum clear_refs_types) == sizeof(int)]
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>