]> git.karo-electronics.de Git - karo-tx-linux.git/log
karo-tx-linux.git
11 years agodrivers/misc/sgi-gru/grufault.c: fix a sanity test in gru_set_context_option()
Dimitri Sivanich [Thu, 27 Jun 2013 23:52:45 +0000 (09:52 +1000)]
drivers/misc/sgi-gru/grufault.c: fix a sanity test in gru_set_context_option()

"req.val1 == -1" is valid but it doesn't make sense to check gru_base[-1].
 gru_base[] is a global array.

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Robin Holt <holt@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agodrivers/dma: remove unused support for MEMSET operations
Bartlomiej Zolnierkiewicz [Thu, 27 Jun 2013 23:52:44 +0000 (09:52 +1000)]
drivers/dma: remove unused support for MEMSET operations

There have never been any real users of MEMSET operations since they have
been introduced in January 2007 by 7405f74badf4 ("dmaengine: refactor
dmaengine around dma_async_tx_descriptor").  Therefore remove support for
them for now, it can be always brought back when needed.

Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: Dan Williams <djbw@fb.com>
Cc: Tomasz Figa <t.figa@samsung.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agosmp: Give WARN()ing when calling smp_call_function_many()/single() in serving irq
Chuansheng Liu [Thu, 27 Jun 2013 23:52:44 +0000 (09:52 +1000)]
smp: Give WARN()ing when calling smp_call_function_many()/single() in serving irq

Currently the functions smp_call_function_many()/single() will give a
WARN()ing only in the case of irqs_disabled(), but that check is not
enough to guarantee execution of the SMP cross-calls.

In many other cases such as softirq handling/interrupt handling, the two
APIs still can not be called, just as the smp_call_function_many()
comments say:

  * You must not call this function with disabled interrupts or from a
  * hardware interrupt handler or from a bottom half handler. Preemption
  * must be disabled when calling this function.

There is a real case for softirq DEADLOCK case:

CPUA                            CPUB
                                spin_lock(&spinlock)
                                Any irq coming, call the irq handler
                                irq_exit()
spin_lock_irq(&spinlock)
<== Blocking here due to
CPUB hold it
                                  __do_softirq()
                                    run_timer_softirq()
                                      timer_cb()
                                        call smp_call_function_many()
                                          send IPI interrupt to CPUA
                                            wait_csd()

Then both CPUA and CPUB will be deadlocked here.

So we should give a warning in the nmi, hardirq or softirq context as well.

Moreover, adding one new macro in_serving_irq() which indicates we are
processing nmi, hardirq or sofirq.

Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Lai Jiangshan <eag0628@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agodrm/i915: quirk away phantom LVDS on Intel's D525MW mainboard
Jani Nikula [Thu, 27 Jun 2013 23:52:44 +0000 (09:52 +1000)]
drm/i915: quirk away phantom LVDS on Intel's D525MW mainboard

This replaceable mainboard only has a VGA-out, yet it claims to also have
a connected LVDS header.

Addresses https://bugs.freedesktop.org/show_bug.cgi?id=65256

Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Reported-by: Cornel Panceac <cpanceac@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: <annndddrr@gmail.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agodrm/i915: quirk away phantom LVDS on Intel's D510MO mainboard
Chris Wilson [Thu, 27 Jun 2013 23:52:43 +0000 (09:52 +1000)]
drm/i915: quirk away phantom LVDS on Intel's D510MO mainboard

This replaceable mainboard only has a VGA-out, yet it claims to also have
a connected LVDS header.

Addresses https://bugs.freedesktop.org/show_bug.cgi?id=63860

[jani.nikula@intel.com: use DMI_EXACT_MATCH for board name.]
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Reported-by: <annndddrr@gmail.com>
Cc: Cornel Panceac <cpanceac@gmail.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agodmi: add support for exact DMI matches in addition to substring matching
Jani Nikula [Thu, 27 Jun 2013 23:52:43 +0000 (09:52 +1000)]
dmi: add support for exact DMI matches in addition to substring matching

dmi_match() considers a substring match to be a successful match.  This is
not always sufficient to distinguish between DMI data for different
systems.  Add support for exact string matching using strcmp() in addition
to the substring matching using strstr().

The specific use case in the i915 driver is to allow us to use an exact
match for D510MO, without also incorrectly matching D510MOV:

{
.ident = "Intel D510MO",
.matches = {
DMI_MATCH(DMI_BOARD_VENDOR, "Intel"),
DMI_EXACT_MATCH(DMI_BOARD_NAME, "D510MO"),
},
}

Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Cc: <annndddrr@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Cornel Panceac <cpanceac@gmail.com>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agokernel/sys.c:do_sysinfo(): use get_monotonic_boottime()
Oleg Nesterov [Thu, 27 Jun 2013 23:52:43 +0000 (09:52 +1000)]
kernel/sys.c:do_sysinfo(): use get_monotonic_boottime()

Change do_sysinfo() to use get_monotonic_boottime() instead of
do_posix_clock_monotonic_gettime() + monotonic_to_bootbased().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Tomas Janousek <tjanouse@redhat.com>
Cc: Tomas Smetana <tsmetana@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agokernel/sys.c: sys_reboot(): fix malformed panic message
liguang [Thu, 27 Jun 2013 23:52:42 +0000 (09:52 +1000)]
kernel/sys.c: sys_reboot(): fix malformed panic message

If LINUX_REBOOT_CMD_HALT for reboot failed, the message "cannot halt" will
stay on the same line with the next message, so append a '\n'.

Signed-off-by: liguang <lig.fnst@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agopanic-add-cpu-pid-to-warn_slowpath_common-in-warning-printks-fix
Andrew Morton [Thu, 27 Jun 2013 23:52:42 +0000 (09:52 +1000)]
panic-add-cpu-pid-to-warn_slowpath_common-in-warning-printks-fix

remove stray quote

Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agopanic: add cpu/pid to warn_slowpath_common in WARNING printk()s
Alex Thorlton [Thu, 27 Jun 2013 23:52:42 +0000 (09:52 +1000)]
panic: add cpu/pid to warn_slowpath_common in WARNING printk()s

Add the cpu/pid that called WARN() so that the stack traces can be matched
up with the WARNING messages.

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Reviewed-by: Robin Holt <holt@sgi.com>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Vikram Mulukutla <markivx@codeaurora.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agodump_stack-serialize-the-output-from-dump_stack-fix
Andrew Morton [Thu, 27 Jun 2013 23:52:42 +0000 (09:52 +1000)]
dump_stack-serialize-the-output-from-dump_stack-fix

- fix comment indenting
- avoid inclusion of <asm/> files - use <linux/> where possible
- fix uniprocessor build (__dump_stack undefined)
- remove unneeded ifdef around smp.h inclusion

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Robin Holt <holt@sgi.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: athorlton@sgi.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agodump_stack: serialize the output from dump_stack()
Alex Thorlton [Thu, 27 Jun 2013 23:52:41 +0000 (09:52 +1000)]
dump_stack: serialize the output from dump_stack()

tAdd adds functionality to serialize the output from dump_stack() to avoid
mangling of the output when dump_stack is called simultaneously from
multiple cpus.

Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Reported-by: Russ Anderson <rja@sgi.com>
Reviewed-by: Robin Holt <holt@sgi.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agodrivers: avoid parsing names as kthread_run() format strings
Kees Cook [Thu, 27 Jun 2013 23:52:41 +0000 (09:52 +1000)]
drivers: avoid parsing names as kthread_run() format strings

Calling kthread_run with a single name parameter causes it to be handled
as a format string. Many callers are passing potentially dynamic string
content, so use "%s" in those cases to avoid any potential accidents.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agodrivers: avoid format strings in names passed to alloc_workqueue()
Kees Cook [Thu, 27 Jun 2013 23:52:41 +0000 (09:52 +1000)]
drivers: avoid format strings in names passed to alloc_workqueue()

For the workqueue creation interfaces that do not expect format strings,
make sure they cannot accidently be parsed that way.  Additionally, clean
up calls made with a single parameter that would be handled as a format
string.  Many callers are passing potentially dynamic string content, so
use "%s" in those cases to avoid any potential accidents.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agodrivers: avoid format string in dev_set_name
Kees Cook [Thu, 27 Jun 2013 23:52:40 +0000 (09:52 +1000)]
drivers: avoid format string in dev_set_name

Calling dev_set_name with a single paramter causes it to be handled as a
format string.  Many callers are passing potentially dynamic string
content, so use "%s" in those cases to avoid any potential accidents,
including wrappers like device_create*() and bdi_register().

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoclean-up-scary-strncpydst-src-strlensrc-uses-fix
Andrew Morton [Thu, 27 Jun 2013 23:52:40 +0000 (09:52 +1000)]
clean-up-scary-strncpydst-src-strlensrc-uses-fix

revert getdelays.c part due to missing bsd/string.h

Cc: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [staging]
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [acpi]
Cc: Richard Weinberger <richard@nod.at>
Acked-by: Ursula Braun <ursula.braun@de.ibm.com> [s390]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoclean up scary strncpy(dst, src, strlen(src)) uses
Kees Cook [Thu, 27 Jun 2013 23:52:40 +0000 (09:52 +1000)]
clean up scary strncpy(dst, src, strlen(src)) uses

Fix various weird constructions of strncpy(dst, src, strlen(src)).  Length
limits should be about the space available in the destination, not
repurposed as a method to either always include or always exclude a
trailing NULL byte.  Either the NULL should always be copied (using
strlcpy), or it should not be copied (using something like memcpy).
Readable code should not depend on the weird behavior of strncpy when it
hits the length limit.  Better to avoid the anti-pattern entirely.

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [staging]
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [acpi]
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ursula Braun <ursula.braun@de.ibm.com>
Cc: Frank Blaschka <blaschka@linux.vnet.ibm.com>
Cc: Richard Weinberger <richard@nod.at>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoerr.h: IS_ERR() can accept __user pointers
Dan Carpenter [Thu, 27 Jun 2013 23:52:39 +0000 (09:52 +1000)]
err.h: IS_ERR() can accept __user pointers

Sparse generates a false positive when you pass a __user or __iomem
pointer to the IS_ERR() functions.

drivers/rtc/rtc-ds1286.c:344:36: sparse: incorrect type in argument 1 (different address spaces)
drivers/rtc/rtc-ds1286.c:344:36:    expected void const *ptr
drivers/rtc/rtc-ds1286.c:344:36:    got unsigned int [noderef] [usertype] <asn:2>*rtcregs

We can silence these by adding a __force here and upgrading to Sparse
v0.4.5-rc1 or later.

This change has no effect when using current Sparse releases.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Christopher Li <sparse@chrisli.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoarch/frv/kernel/setup.c: use strncmp() instead of memcmp()
Chen Gang [Thu, 27 Jun 2013 23:52:39 +0000 (09:52 +1000)]
arch/frv/kernel/setup.c: use strncmp() instead of memcmp()

'cmdline' is a NUL terminated string, when its length < 4, memcmp()
will cause memory access out of boundary.

So use strncmp() instead of memcmp().

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoarch/frv/kernel/traps.c: using vsnprintf() instead of vsprintf()
Chen Gang [Thu, 27 Jun 2013 23:52:39 +0000 (09:52 +1000)]
arch/frv/kernel/traps.c: using vsnprintf() instead of vsprintf()

Since die_if_kernel() is an extern common used function, better always
check the buffer length to avoid memory overflow by a long 'str'.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: add vm event counters for balloon pages compaction
Rafael Aquini [Thu, 27 Jun 2013 23:52:38 +0000 (09:52 +1000)]
mm: add vm event counters for balloon pages compaction

Introduce a new set of vm event counters to keep track of ballooned pages
compaction activity.

Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/dmapool.c: fix null dev in dma_pool_create()
Xi Wang [Thu, 27 Jun 2013 23:52:38 +0000 (09:52 +1000)]
mm/dmapool.c: fix null dev in dma_pool_create()

A few drivers invoke dma_pool_create() with a null dev.  Note that dev is
dereferenced in dev_to_node(dev), causing a null pointer dereference.

A long term solution is to disallow null dev.  Once the drivers are fixed,
we can simplify the core code here.  For now we add WARN_ON(!dev) to
notify the driver maintainers and avoid the null pointer dereference.

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agodrivers/usb/gadget/amd5536udc.c: avoid calling dma_pool_create() with NULL dev
Xi Wang [Thu, 27 Jun 2013 23:52:38 +0000 (09:52 +1000)]
drivers/usb/gadget/amd5536udc.c: avoid calling dma_pool_create() with NULL dev

Calling dma_pool_create() with dev==NULL will oops on a NUMA machine.
Rather than changing dma_pool_create() we wish to disallow passing
dev==NULL.  This requires fixing up the small number of drivers which are
passing in dev==NULL.

Use &dev->pdev->dev instead of NULL.

Signed-off-by: Xi Wang <xi.wang@gmail.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agodrop_caches: add some documentation and info message
Michal Hocko [Thu, 27 Jun 2013 23:52:37 +0000 (09:52 +1000)]
drop_caches: add some documentation and info message

I would like to resurrect Dave's patch.  The last time it was posted was
here https://lkml.org/lkml/2010/9/16/250 and there didn't seem to be any
strong opposition.

Kosaki was worried about possible excessive logging when somebody drops
caches too often (but then he claimed he didn't have a strong opinion on
that) but I would say opposite.  If somebody does that then I would really
like to know that from the log when supporting a system because it almost
for sure means that there is something fishy going on.  It is also worth
mentioning that only root can write drop caches so this is not an flooding
attack vector.

I am bringing that up again because this can be really helpful when
chasing strange performance issues which (surprise surprise) turn out to
be related to artificially dropped caches done because the admin thinks
this would help...

I have just refreshed the original patch on top of the current mm tree
but I could live with KERN_INFO as well if people think that KERN_NOTICE
is too hysterical.

: From: Dave Hansen <dave@linux.vnet.ibm.com>
: Date: Fri, 12 Oct 2012 14:30:54 +0200
:
: There is plenty of anecdotal evidence and a load of blog posts
: suggesting that using "drop_caches" periodically keeps your system
: running in "tip top shape".  Perhaps adding some kernel
: documentation will increase the amount of accurate data on its use.
:
: If we are not shrinking caches effectively, then we have real bugs.
: Using drop_caches will simply mask the bugs and make them harder
: to find, but certainly does not fix them, nor is it an appropriate
: "workaround" to limit the size of the caches.
:
: It's a great debugging tool, and is really handy for doing things
: like repeatable benchmark runs.  So, add a bit more documentation
: about it, and add a little KERN_NOTICE.  It should help developers
: who are chasing down reclaim-related bugs.

[mhocko@suse.cz: refreshed to current -mm tree]
[akpm@linux-foundation.org: checkpatch fixes]
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: memmap_init_zone() performance improvement
Mike Yoknis [Thu, 27 Jun 2013 23:52:37 +0000 (09:52 +1000)]
mm: memmap_init_zone() performance improvement

We have what we call an "architectural simulator".  It is a computer
program that pretends that it is a computer system.  We use it to test the
firmware before real hardware is available.  We have booted Linux on our
simulator.  As you would expect it takes longer to boot on the simulator
than it does on real hardware.

With my patch - boot time 41 minutes
Without patch - boot time 94 minutes

These numbers do not scale linearly to real hardware.  But indicate to me
a place where Linux can be improved.

memmap_init_zone() loops through every Page Frame Number (pfn), including
pfn values that are within the gaps between existing memory sections.  The
unneeded looping will become a boot performance issue when machines
configure larger memory ranges that will contain larger and more numerous
gaps.

The code will skip across invalid pfn values to reduce the number of loops
executed.

Signed-off-by: Mike Yoknis <mike.yoknis@hp.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoinclude/linux/mmzone.h: cleanups
Andrew Morton [Thu, 27 Jun 2013 23:52:37 +0000 (09:52 +1000)]
include/linux/mmzone.h: cleanups

- implement zone_idx() in C to fix its references-args-twice macro bug

- use zone_idx() in is_highmem() to remove large amounts of silly fluff.

Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: do not scale writeback pages when deciding whether to set ZONE_WRITEBACK
Mel Gorman [Thu, 27 Jun 2013 23:52:36 +0000 (09:52 +1000)]
mm: vmscan: do not scale writeback pages when deciding whether to set ZONE_WRITEBACK

After the patch "mm: vmscan: Flatten kswapd priority loop" was merged the
scanning priority of kswapd changed.  The priority now raises until it is
scanning enough pages to meet the high watermark.  shrink_inactive_list
sets ZONE_WRITEBACK if a number of pages were encountered under writeback
but this value is scaled based on the priority.  As kswapd frequently
scans with a higher priority now it is relatively easy to set
ZONE_WRITEBACK.  This patch removes the scaling and treates writeback
pages similar to how it treats unqueued dirty pages and congested pages.
The user-visible effect should be that kswapd will writeback fewer pages
from reclaim context.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: vmscan: avoid direct reclaim scanning at maximum priority
Mel Gorman [Thu, 27 Jun 2013 23:52:36 +0000 (09:52 +1000)]
mm: vmscan: avoid direct reclaim scanning at maximum priority

Further testing revealed that swapping was still higher than expected for
the parallel IO tests.  There was also a performance regression reported
building kernels but there appears to be multiple sources of that problem.
 This follow-up series primarily addresses the first swapping issue.

The tests were based on three kernels

vanilla:        kernel 3.10-rc4 as that is what the current mmotm uses as a baseline
mmotm-20130606  is mmotm as of that date.
lessdisrupt-v1 is this follow-up series on top of the mmotm kernel

The first test used memcached+memcachetest while some background IO was in
progress as implemented by the parallel IO tests implement in MM Tests.
memcachetest benchmarks how many operations/second memcached can service.
It starts with no background IO on a freshly created ext4 filesystem and
then re-runs the test with larger amounts of IO in the background to
roughly simulate a large copy in progress.  The expectation is that the IO
should have little or no impact on memcachetest which is running entirely
in memory.

parallelio
                                        3.10.0-rc4                  3.10.0-rc4                  3.10.0-rc4
                                           vanilla          mm1-mmotm-20130606        mm1-lessdisrupt-v1
Ops memcachetest-0M             23018.00 (  0.00%)          22412.00 ( -2.63%)          22556.00 ( -2.01%)
Ops memcachetest-715M           23383.00 (  0.00%)          22810.00 ( -2.45%)          22431.00 ( -4.07%)
Ops memcachetest-2385M          10989.00 (  0.00%)          23564.00 (114.43%)          23054.00 (109.79%)
Ops memcachetest-4055M           3798.00 (  0.00%)          24004.00 (532.02%)          24050.00 (533.23%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              7.00 ( 41.67%)
Ops io-duration-2385M             133.00 (  0.00%)             21.00 ( 84.21%)             22.00 ( 83.46%)
Ops io-duration-4055M             159.00 (  0.00%)             36.00 ( 77.36%)             36.00 ( 77.36%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             139693.00 (  0.00%)             19.00 ( 99.99%)              0.00 (  0.00%)
Ops swaptotal-2385M            268541.00 (  0.00%)              0.00 (  0.00%)             19.00 ( 99.99%)
Ops swaptotal-4055M            414269.00 (  0.00%)          22059.00 ( 94.68%)              2.00 (100.00%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                     0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M                73189.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               126292.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1536018.00 (  0.00%)        1533536.00 (  0.16%)        1607381.00 ( -4.65%)
Ops minorfaults-715M          1789978.00 (  0.00%)        1616152.00 (  9.71%)        1533462.00 ( 14.33%)
Ops minorfaults-2385M         1910448.00 (  0.00%)        1614060.00 ( 15.51%)        1661727.00 ( 13.02%)
Ops minorfaults-4055M         1760518.00 (  0.00%)        1613980.00 (  8.32%)        1615116.00 (  8.26%)
Ops majorfaults-0M                  0.00 (  0.00%)              0.00 (  0.00%)              5.00 (-99.00%)
Ops majorfaults-715M              169.00 (  0.00%)            234.00 (-38.46%)             48.00 ( 71.60%)
Ops majorfaults-2385M           14899.00 (  0.00%)            100.00 ( 99.33%)            222.00 ( 98.51%)
Ops majorfaults-4055M           21853.00 (  0.00%)            150.00 ( 99.31%)            128.00 ( 99.41%)

memcachetest is the transactions/second reported by memcachetest. In
        the vanilla kernel note that performance drops from around
        23K/sec to just over 4K/second when there is 2385M of IO going
        on in the background. With current mmotm and the follow-on
series performance is good.

swaptotal is the total amount of swap traffic. With mmotm the total amount
of swapping is much reduced. Note that with 4G of background IO that
this follow-up series almost completely eliminated swap IO.

                            3.10.0-rc4  3.10.0-rc4  3.10.0-rc4
                               vanillamm1-mmotm-20130606mm1-lessdisrupt-v1
Minor Faults                  11230171    10689656    10650607
Major Faults                     37255         786         705
Swap Ins                        199724           0           0
Swap Outs                       623022       22078          21
Direct pages scanned                 0        5378       51660
Kswapd pages scanned          15892718     1610408     1653629
Kswapd pages reclaimed         1097093     1083339     1107652
Direct pages reclaimed               0        5024       47241
Kswapd efficiency                   6%         67%         66%
Kswapd velocity              13633.275    1385.369    1420.058
Direct efficiency                 100%         93%         91%
Direct velocity                  0.000       4.626      44.363
Percentage direct scans             0%          0%          3%
Zone normal velocity         13474.405     671.123     697.927
Zone dma32 velocity            158.870     718.872     766.494
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     3065275.000   27259.000    6316.000
Page writes file               2442253        5181        6295
Page writes anon                623022       22078          21
Page reclaim immediate            8019         429         318
Sector Reads                    963320       99096      151864
Sector Writes                 13057396    10887480    10878500
Page rescued immediate               0           0           0
Slabs scanned                    64896       23168       34176
Direct inode steals                  0           0           0
Kswapd inode steals               8668           0           0
Kswapd skipped wait                  0           0           0

Few observations

1. Swap outs were almost completely eliminated and there were no swap-ins.

2. Direct reclaim is active due to reduced activity from kswapd and the fact
   that it is no longer reclaiming at priority 0

3. Zone scanning is still relatively balanced.

4. Page writes from reclaim context is still reasonable low.

                  3.10.0-rc4  3.10.0-rc4  3.10.0-rc4
                     vanillamm1-mmotm-20130606mm1-lessdisrupt-v1
Mean sda-avgqz        168.05       34.64       35.60
Mean sda-await        831.76      216.31      207.05
Mean sda-r_await        7.88        9.68        7.25
Mean sda-w_await     3088.32      223.90      218.28
Max  sda-avgqz       1162.17      766.85      795.69
Max  sda-await       6788.75     4130.01     3728.43
Max  sda-r_await      106.93      242.00       65.97
Max  sda-w_await    30565.93     4145.75     3959.87

Wait times are marginally reduced by the follow-up and still a massive
improve against the mainline kernel.

I tested parallel kernel builds when booted with 1G of RAM. 12 kernels
were built with 2 being compiled at any given time.

multibuild
                          3.10.0-rc4            3.10.0-rc4            3.10.0-rc4
                             vanilla    mm1-mmotm-20130606  mm1-lessdisrupt-v1
User    min         584.99 (  0.00%)      553.31 (  5.42%)      569.08 (  2.72%)
User    mean        598.35 (  0.00%)      574.48 (  3.99%)      581.65 (  2.79%)
User    stddev       10.01 (  0.00%)       17.90 (-78.78%)       10.03 ( -0.14%)
User    max         614.64 (  0.00%)      598.94 (  2.55%)      597.97 (  2.71%)
User    range        29.65 (  0.00%)       45.63 (-53.90%)       28.89 (  2.56%)
System  min          35.78 (  0.00%)       35.05 (  2.04%)       35.54 (  0.67%)
System  mean         36.12 (  0.00%)       35.69 (  1.20%)       35.88 (  0.69%)
System  stddev        0.26 (  0.00%)        0.55 (-113.69%)        0.21 ( 17.51%)
System  max          36.53 (  0.00%)       36.44 (  0.25%)       36.13 (  1.09%)
System  range         0.75 (  0.00%)        1.39 (-85.33%)        0.59 ( 21.33%)
Elapsed min         190.54 (  0.00%)      190.56 ( -0.01%)      192.99 ( -1.29%)
Elapsed mean        197.58 (  0.00%)      203.30 ( -2.89%)      200.53 ( -1.49%)
Elapsed stddev        4.65 (  0.00%)        5.26 (-13.16%)        5.66 (-21.79%)
Elapsed max         203.72 (  0.00%)      210.23 ( -3.20%)      210.46 ( -3.31%)
Elapsed range        13.18 (  0.00%)       19.67 (-49.24%)       17.47 (-32.55%)
CPU     min         308.00 (  0.00%)      282.00 (  8.44%)      294.00 (  4.55%)
CPU     mean        320.80 (  0.00%)      299.78 (  6.55%)      307.67 (  4.09%)
CPU     stddev       10.44 (  0.00%)       13.83 (-32.50%)        9.71 (  7.01%)
CPU     max         340.00 (  0.00%)      333.00 (  2.06%)      328.00 (  3.53%)
CPU     range        32.00 (  0.00%)       51.00 (-59.38%)       34.00 ( -6.25%)

Average kernel build times are still impacted but the follow-up series
helps marginally (it's too noisy to be sure).  A preliminary bisection
indicated that there were multiple sources of the regression.  The two
other points are the patches that cause mark_page_accessed to be obeyed
and the slab shrinker series.  As there a number of patches in flight to
mmotm at the moment in different areas it would be best to confirm this
after this follow-up is merged.

This patch (of 2):

Page reclaim at priority 0 will scan the entire LRU as priority 0 is
considered to be a near OOM condition.  Direct reclaim can reach this
priority while still making reclaim progress.  This patch avoids
reclaiming at priority 0 unless no reclaim progress was made and the page
allocator would consider firing the OOM killer.  The user-visible impact
is that direct reclaim will not easily reach priority 0 and start swapping
prematurely.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/memory_hotplug.c: fix a comment typo in register_page_bootmem_info_node()
Tang Chen [Thu, 27 Jun 2013 23:52:36 +0000 (09:52 +1000)]
mm/memory_hotplug.c: fix a comment typo in register_page_bootmem_info_node()

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/memblock.c: fix wrong comment in __next_free_mem_range()
Tang Chen [Thu, 27 Jun 2013 23:52:35 +0000 (09:52 +1000)]
mm/memblock.c: fix wrong comment in __next_free_mem_range()

Remove one redundant "nid" in the comment.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
11 years agopage migration: fix wrong comment in address_space_operations.migratepage()
Tang Chen [Thu, 27 Jun 2013 23:52:35 +0000 (09:52 +1000)]
page migration: fix wrong comment in address_space_operations.migratepage()

There is no parameter "sync" in address_space_operations->migratepage().
It should be mograte_mode.  And the comment is for MIGRATE_ASYNC.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/vmalloc.c: fix an overflow bug in alloc_vmap_area()
Zhang Yanfei [Thu, 27 Jun 2013 23:52:35 +0000 (09:52 +1000)]
mm/vmalloc.c: fix an overflow bug in alloc_vmap_area()

When searching a vmap area in the vmalloc space, we use (addr + size - 1)
to check if the value is less than addr, which is an overflow.  But we
assign (addr + size) to vmap_area->va_end.

So if we come across the  below case:

(addr + size - 1) : not overflow
(addr + size)     : overflow

we will assign an overflow value (e.g 0) to vmap_area->va_end, And this
will trigger BUG in __insert_vmap_area, causing system panic.

So using (addr + size) to check the overflow should be the correct
behaviour, not (addr + size - 1).

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reported-by: Ghennadi Procopciuc <unix140@gmail.com>
Tested-by: Daniel Baluta <dbaluta@ixiacom.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: remove unused VM_<READfoo> macros and expand other in-place
Joe Perches [Thu, 27 Jun 2013 23:52:34 +0000 (09:52 +1000)]
mm: remove unused VM_<READfoo> macros and expand other in-place

These VM_<READfoo> macros aren't used very often and three of them aren't
used at all.

Expand the ones that are used in-place, and remove all the now unused
#define VM_<foo> macros.

VM_READHINTMASK, VM_NormalReadHint and VM_ClearReadHint were added just
before 2.4 and appears have never been used.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/pgtable: don't accumulate addr during pgd prepopulate pmd
Wanpeng Li [Thu, 27 Jun 2013 23:52:34 +0000 (09:52 +1000)]
mm/pgtable: don't accumulate addr during pgd prepopulate pmd

The old codes accumulate addr to get right pmd, however, currently pmds
are preallocated and transfered as a parameter, there is unnecessary to
accumulate addr variable any more, this patch remove it.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/thp: fix doc for transparent huge zero page
Wanpeng Li [Thu, 27 Jun 2013 23:52:34 +0000 (09:52 +1000)]
mm/thp: fix doc for transparent huge zero page

Transparent huge zero page is used during the page fault instead of in
khugepaged.

# ls /sys/kernel/mm/transparent_hugepage/
defrag  enabled  khugepaged  use_zero_page
# ls /sys/kernel/mm/transparent_hugepage/khugepaged/
alloc_sleep_millisecs  defrag  full_scans  max_ptes_none  pages_collapsed  pages_to_scan  scan_sleep_millisecs

This patch corrects the documentation just like the codes done.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/page_alloc: fix doc for numa_zonelist_order
Wanpeng Li [Thu, 27 Jun 2013 23:52:34 +0000 (09:52 +1000)]
mm/page_alloc: fix doc for numa_zonelist_order

The default zonelist order selecter will select "node" order if any node's
DMA zone comprises greater than 70% of its local memory instead of 60%,
according to default_zonelist_order::low_kmem_size > total * 70/100.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/writeback: commit reason of WB_REASON_FORKER_THREAD mismatch name
Wanpeng Li [Thu, 27 Jun 2013 23:52:33 +0000 (09:52 +1000)]
mm/writeback: commit reason of WB_REASON_FORKER_THREAD mismatch name

After 839a8e86 ("writeback: replace custom worker pool implementation with
unbound workqueue"), there is no bdi forker thread any more.  However,
WB_REASON_FORKER_THREAD is still used due to it is TPs userland visible
and we won't be exposing exactly the same information with just a
different name.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/writeback: don't check force_wait to handle bdi->work_list
Wanpeng Li [Thu, 27 Jun 2013 23:52:33 +0000 (09:52 +1000)]
mm/writeback: don't check force_wait to handle bdi->work_list

After 839a8e86 ("writeback: replace custom worker pool implementation with
unbound workqueue"), bdi_writeback_workfn runs off bdi_writeback->dwork,
on each execution, it processes bdi->work_list and reschedules if there
are more things to do instead of flush any work that race with us
existing.  It is unecessary to check force_wait in wb_do_writeback since
it is always 0 after the mentioned commit.  This patch remove the
force_wait in wb_do_writeback.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/writeback: remove wb_reason_name
Wanpeng Li [Thu, 27 Jun 2013 23:52:33 +0000 (09:52 +1000)]
mm/writeback: remove wb_reason_name

wb_reason_name is not used any more - remove it.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agofs/fs-writeback.c: : make wb_do_writeback() as static
Haicheng Li [Thu, 27 Jun 2013 23:52:32 +0000 (09:52 +1000)]
fs/fs-writeback.c: : make wb_do_writeback() as static

It's not used globally and could be static.

Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/sparse.c: put clear_hwpoisoned_pages within CONFIG_MEMORY_HOTREMOVE
Zhang Yanfei [Thu, 27 Jun 2013 23:52:32 +0000 (09:52 +1000)]
mm/sparse.c: put clear_hwpoisoned_pages within CONFIG_MEMORY_HOTREMOVE

With CONFIG_MEMORY_HOTREMOVE unset, there is a compile warning:

mm/sparse.c:755: warning: `clear_hwpoisoned_pages' defined but not used

And Bisecting it ended up pointing to 4edd7ceff ("mm, hotplug: avoid
compiling memory hotremove functions when disabled").

This is because the commit above put sparse_remove_one_section() within
the protection of CONFIG_MEMORY_HOTREMOVE but the only user of
clear_hwpoisoned_pages() is sparse_remove_one_section(), and it is not
within the protection of CONFIG_MEMORY_HOTREMOVE.

So put clear_hwpoisoned_pages within CONFIG_MEMORY_HOTREMOVE should fix
the warning.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: remove unused __put_page()
Zhang Yanfei [Thu, 27 Jun 2013 23:52:32 +0000 (09:52 +1000)]
mm: remove unused __put_page()

This function is nowhere used, and it has a confusing name with put_page
in mm/swap.c.  So better to remove it.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovfree: don't schedule free_work() if llist_add() returns false
Oleg Nesterov [Thu, 27 Jun 2013 23:52:31 +0000 (09:52 +1000)]
vfree: don't schedule free_work() if llist_add() returns false

vfree() only needs schedule_work(&p->wq) if p->list was empty, otherwise
vfree_deferred->wq is already pending or it is running and didn't do
llist_del_all() yet.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/page_alloc.c: remove unlikely() from the current_order test
Zhang Yanfei [Thu, 27 Jun 2013 23:52:31 +0000 (09:52 +1000)]
mm/page_alloc.c: remove unlikely() from the current_order test

In __rmqueue_fallback(), current_order loops down from MAX_ORDER - 1 to
the order passed.  MAX_ORDER is typically 11 and pageblock_order is
typically 9 on x86.  Integer division truncates, so pageblock_order / 2 is
4.  For the first eight iterations, it's guaranteed that current_order >=
pageblock_order / 2 if it even gets that far!

So just remove the unlikely(), it's completely bogus.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Suggested-by: David Rientjes <rientjes@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: remove unused functions is_{normal_idx, normal, dma32, dma}
Zhang Yanfei [Thu, 27 Jun 2013 23:52:31 +0000 (09:52 +1000)]
mm: remove unused functions is_{normal_idx, normal, dma32, dma}

These functions are nowhere used, so remove them.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/page_alloc.c: remove zone_type argument of build_zonelists_node
Zhang Yanfei [Thu, 27 Jun 2013 23:52:30 +0000 (09:52 +1000)]
mm/page_alloc.c: remove zone_type argument of build_zonelists_node

The callers of build_zonelists_node always pass MAX_NR_ZONES -1 as the
zone_type argument, so we can directly use the value in
build_zonelists_node and remove zone_type argument.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoMAINTAINERS: add zswap and zbud maintainer
Seth Jennings [Thu, 27 Jun 2013 23:52:30 +0000 (09:52 +1000)]
MAINTAINERS: add zswap and zbud maintainer

Add maintainer information for zswap and zbud into the MAINTAINERS file.

Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agozswap: add documentation
Seth Jennings [Thu, 27 Jun 2013 23:52:30 +0000 (09:52 +1000)]
zswap: add documentation

Add the documentation file for the zswap functionality

Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Jenifer Hopper <jhopper@us.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Joe Perches <joe@perches.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agozswap: fix Kconfig to depend on CRYPTO=y
Seth Jennings [Thu, 27 Jun 2013 23:52:29 +0000 (09:52 +1000)]
zswap: fix Kconfig to depend on CRYPTO=y

The Kconfig entry for zswap currently allows CRYPTO=m to satisfy the
zswap dependency.  However zswap is boolean (i.e. built-in) and has
symbol dependencies on CRYPTO.  Additionally, because the CRYPTO dependency
is satisfied with =m, the additional selects zswap does can also
be satisfied with =m, which leads to additional linking errors.

>From the report:
=====
tree:   git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git since-3.9
head:   57cefddb141d6c9d0ab4f5a8589dd017f796a3f7
commit: 563f51c95b552e0d663df5bdf6cfc8e8a72d3ec6 [494/499] zswap: add to mm/
config: x86_64-randconfig-x004-0619 (attached as .config)

All error/warnings:

   mm/built-in.o: In function `zswap_frontswap_invalidate_area':
>> zswap.c:(.text+0x3a705): undefined reference to `zbud_free'
   mm/built-in.o: In function `zswap_free_entry':
>> zswap.c:(.text+0x3a76b): undefined reference to `zbud_free'
>> zswap.c:(.text+0x3a789): undefined reference to `zbud_get_pool_size'
on and on...
=====

This patch makes CRYPTO a built-in dependency of ZSWAP.  This has the
side effect of also making the selects built-in.

Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agozswap: add to mm/
Seth Jennings [Thu, 27 Jun 2013 23:52:29 +0000 (09:52 +1000)]
zswap: add to mm/

zswap is a thin backend for frontswap that takes pages that are in the
process of being swapped out and attempts to compress them and store them
in a RAM-based memory pool.  This can result in a significant I/O
reduction on the swap device and, in the case where decompressing from RAM
is faster than reading from the swap device, can also improve workload
performance.

It also has support for evicting swap pages that are currently compressed
in zswap to the swap device on an LRU(ish) basis.  This functionality
makes zswap a true cache in that, once the cache is full, the oldest pages
can be moved out of zswap to the swap device so newer pages can be
compressed and stored in zswap.

This patch adds the zswap driver to mm/

Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Jenifer Hopper <jhopper@us.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Joe Perches <joe@perches.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agozswap: init under_reclaim
Seth Jennings [Thu, 27 Jun 2013 23:52:29 +0000 (09:52 +1000)]
zswap: init under_reclaim

Bob Liu reported a memory leak in zswap.  This was due to the
under_reclaim field in the zbud header not being initialized
to 0, which resulted in zbud_free() not freeing the page
under the false assumption that the page was undergoing
zbud reclaim.

This patch properly initializes the under_reclaim field.

Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Reported-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agozbud: add to mm/
Seth Jennings [Thu, 27 Jun 2013 23:52:29 +0000 (09:52 +1000)]
zbud: add to mm/

zbud is an special purpose allocator for storing compressed pages.  It is
designed to store up to two compressed pages per physical page.  While
this design limits storage density, it has simple and deterministic
reclaim properties that make it preferable to a higher density approach
when reclaim will be used.

zbud works by storing compressed pages, or "zpages", together in pairs in
a single memory page called a "zbud page".  The first buddy is "left
justifed" at the beginning of the zbud page, and the last buddy is "right
justified" at the end of the zbud page.  The benefit is that if either
buddy is freed, the freed buddy space, coalesced with whatever slack space
that existed between the buddies, results in the largest possible free
region within the zbud page.

zbud also provides an attractive lower bound on density.  The ratio of
zpages to zbud pages can not be less than 1.  This ensures that zbud can
never "do harm" by using more pages to store zpages than the uncompressed
zpages would have used on their own.

This implementation is a rewrite of the zbud allocator internally used by
zcache in the driver/staging tree.  The rewrite was necessary to remove
some of the zcache specific elements that were ingrained throughout and
provide a generic allocation interface that can later be used by zsmalloc
and others.

This patch adds zbud to mm/ for later use by zswap.

Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Jenifer Hopper <jhopper@us.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: Joe Perches <joe@perches.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoinclude/linux/gfp.h: fix the comment for GFP_ZONE_TABLE
Zhang Yanfei [Thu, 27 Jun 2013 23:52:28 +0000 (09:52 +1000)]
include/linux/gfp.h: fix the comment for GFP_ZONE_TABLE

0xc just means MOVABLE + DMA32, which results in zone DMA32.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemcg: do not account memory used for cache creation
Glauber Costa [Thu, 27 Jun 2013 23:52:28 +0000 (09:52 +1000)]
memcg: do not account memory used for cache creation

The memory we used to hold the memcg arrays is currently accounted to the
current memcg.  But that creates a problem, because that memory can only
be freed after the last user is gone.  Our only way to know which is the
last user, is to hook up to freeing time, but the fact that we still have
some in flight kmallocs will prevent freeing to happen.  I believe
therefore to be just easier to account this memory as global overhead.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemcg: also test for skip accounting at the page allocation level
Glauber Costa [Thu, 27 Jun 2013 23:52:28 +0000 (09:52 +1000)]
memcg: also test for skip accounting at the page allocation level

The memory we used to hold the memcg arrays is currently accounted to the
current memcg.  But that creates a problem, because that memory can only
be freed after the last user is gone.  Our only way to know which is the
last user, is to hook up to freeing time, but the fact that we still have
some in flight kmallocs will prevent freeing to happen.  I believe
therefore to be just easier to account this memory as global overhead.

This patch (of 2):

Disabling accounting is only relevant for some specific memcg internal
allocations.  Therefore we would initially not have such check at
memcg_kmem_newpage_charge, since direct calls to the page allocator that
are marked with GFP_KMEMCG only happen outside memcg core.  We are mostly
concerned with cache allocations and by having this test at
memcg_kmem_get_cache we are already able to relay the allocation to the
root cache and bypass the memcg caches altogether.

There is one exception, though: the SLUB allocator does not create large
order caches, but rather service large kmallocs directly from the page
allocator.  Therefore, the following sequence, when backed by the SLUB
allocator:

memcg_stop_kmem_account();
kmalloc(<large_number>)
memcg_resume_kmem_account();

would effectively ignore the fact that we should skip accounting, since it
will drive us directly to this function without passing through the cache
selector memcg_kmem_get_cache.  Such large allocations are extremely rare
but can happen, for instance, for the cache arrays.

This was never a problem in practice, because we weren't skipping
accounting for the cache arrays.  All the allocations we were skipping
were fairly small.  However, the fact that we were not skipping those
allocations are a problem and can prevent the memcgs from going away.  As
we fix that, we need to make sure that the fix will also work with the
SLUB allocator.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Reported-by: Michal Hocko <mhocko@suze.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/vmalloc.c: check VM_UNINITIALIZED flag in s_show instead of show_numa_info
Zhang Yanfei [Thu, 27 Jun 2013 23:52:27 +0000 (09:52 +1000)]
mm/vmalloc.c: check VM_UNINITIALIZED flag in s_show instead of show_numa_info

We should check the VM_UNITIALIZED flag in s_show().  If this flag is set,
that said, the vm_struct is not fully initialized.  So it is unnecessary
to try to show the information contained in vm_struct.

We checked this flag in show_numa_info(), but I think it's better
to check it earlier.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/vmalloc.c: rename VM_UNLIST to VM_UNINITIALIZED
Zhang Yanfei [Thu, 27 Jun 2013 23:52:27 +0000 (09:52 +1000)]
mm/vmalloc.c: rename VM_UNLIST to VM_UNINITIALIZED

VM_UNLIST was used to indicate that the vm_struct is not listed in vmlist.
 But after commit 4341fa4 ("mm, vmalloc: remove list management of vmlist
after initializing vmalloc"), the meaning of this flag changed.  It now
means the vm_struct is not fully initialized.  So renaming it to
VM_UNINITIALIZED seems more reasonable.

Also change clear_vm_unlist to clear_vm_uninitialized_flag.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/vmalloc.c: emit the failure message before return
Zhang Yanfei [Thu, 27 Jun 2013 23:52:27 +0000 (09:52 +1000)]
mm/vmalloc.c: emit the failure message before return

Use goto to jump to the fail label to give a failure message before
returning NULL.  This makes the failure handling in this function
consistent.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/vmalloc.c: remove alloc_map from vmap_block
Zhang Yanfei [Thu, 27 Jun 2013 23:52:26 +0000 (09:52 +1000)]
mm/vmalloc.c: remove alloc_map from vmap_block

As we have removed the dead code in the vb_alloc, it seems there is no
place to use the alloc_map.  So there is no reason to maintain the
alloc_map in vmap_block.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/vmalloc.c: remove unused purge_fragmented_blocks_thiscpu
Zhang Yanfei [Thu, 27 Jun 2013 23:52:26 +0000 (09:52 +1000)]
mm/vmalloc.c: remove unused purge_fragmented_blocks_thiscpu

This function is nowhere used now, so remove it.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/vmalloc.c: remove dead code in vb_alloc
Zhang Yanfei [Thu, 27 Jun 2013 23:52:26 +0000 (09:52 +1000)]
mm/vmalloc.c: remove dead code in vb_alloc

Space in a vmap block that was once allocated is considered dirty and
not made available for allocation again before the whole block is
recycled. The result is that free space within a vmap block is always
contiguous.

So if a vmap block has enough free space for allocation, the allocation
is impossible to fail. Thus, the fragmented block purging was never invoked
from vb_alloc(). So remove this dead code.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/vmalloc.c: unbreak __vunmap()
Dan Carpenter [Thu, 27 Jun 2013 23:52:25 +0000 (09:52 +1000)]
mm/vmalloc.c: unbreak __vunmap()

There is an extra semi-colon so the function always returns.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm-remove-duplicated-call-of-get_pfn_range_for_nid-v2-fix
Andrew Morton [Thu, 27 Jun 2013 23:52:25 +0000 (09:52 +1000)]
mm-remove-duplicated-call-of-get_pfn_range_for_nid-v2-fix

make definitions more readable

Cc: Michal Hocko <mhocko@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm-remove-duplicated-call-of-get_pfn_range_for_nid-v2
Zhang Yanfei [Thu, 27 Jun 2013 23:52:25 +0000 (09:52 +1000)]
mm-remove-duplicated-call-of-get_pfn_range_for_nid-v2

It fixes the uninitialized warnings in Arches with no
CONFIG_HAVE_MEMBLOCK_NODE_MAP.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: remove duplicated call of get_pfn_range_for_nid
Zhang Yanfei [Thu, 27 Jun 2013 23:52:24 +0000 (09:52 +1000)]
mm: remove duplicated call of get_pfn_range_for_nid

When calculating pages in a node, for each zone in that node, we will have

  zone_spanned_pages_in_node
    --> get_pfn_range_for_nid
  zone_absent_pages_in_node
    --> get_pfn_range_for_nid

That is to say, we call the get_pfn_range_for_nid to get start_pfn and
end_pfn of the node for MAX_NR_ZONES * 2 times.  And this is totally
unnecessary if we call the get_pfn_range_for_nid before
zone_*_pages_in_node add two extra arguments node_start_pfn and
node_end_pfn for zone_*_pages_in_node, then we can remove the
get_pfn_range_in_node in zone_*_pages_in_node.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: invoke oom-killer from remaining unconverted page fault handlers
Johannes Weiner [Thu, 27 Jun 2013 23:52:24 +0000 (09:52 +1000)]
mm: invoke oom-killer from remaining unconverted page fault handlers

A few remaining architectures directly kill the page faulting task in an
out of memory situation.  This is usually not a good idea since that task
might not even use a significant amount of memory and so may not be the
optimal victim to resolve the situation.

Since 2.6.29's 1c0fe6e ("mm: invoke oom-killer from page fault") there is
a hook that architecture page fault handlers are supposed to call to
invoke the OOM killer and let it pick the right task to kill.  Convert the
remaining architectures over to this hook.

To have the previous behavior of simply taking out the faulting task the
vm.oom_kill_allocating_task sysctl can be set to 1.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vineet Gupta <vgupta@synopsys.com> [arch/arc bits]
Cc: James Hogan <james.hogan@imgtec.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemcg: clean up memcg->nodeinfo
Johannes Weiner [Thu, 27 Jun 2013 23:52:24 +0000 (09:52 +1000)]
memcg: clean up memcg->nodeinfo

Remove struct mem_cgroup_lru_info and fold its single member, the variably
sized nodeinfo[0], directly into struct mem_cgroup.  This should make it
more obvious why it has to be the last member there.

Also move the comment that's above that special last member below it, so
it is more visible to somebody that considers appending to the struct
mem_cgroup.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: mremap: validate input before taking lock
Rasmus Villemoes [Thu, 27 Jun 2013 23:52:24 +0000 (09:52 +1000)]
mm: mremap: validate input before taking lock

This patch is very similar to 84d96d897671 ("mm: madvise: complete input
validation before taking lock"): perform some basic validation of the
input to mremap() before taking the &current->mm->mmap_sem lock.  This
also makes the MREMAP_FIXED => MREMAP_MAYMOVE dependency slightly more
explicit.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agosuper: fix for destroy lrus
Glauber Costa [Thu, 27 Jun 2013 23:52:23 +0000 (09:52 +1000)]
super: fix for destroy lrus

This patch adds the missing call to list_lru_destroy (spotted by Li Zhong)
and moves the deletion to after the shrinker is unregistered, as correctly
spotted by Dave

Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agolist_lru: dynamically adjust node arrays
Glauber Costa [Thu, 27 Jun 2013 23:52:23 +0000 (09:52 +1000)]
list_lru: dynamically adjust node arrays

We currently use a compile-time constant to size the node array for the
list_lru structure.  Due to this, we don't need to allocate any memory at
initialization time.  But as a consequence, the structures that contain
embedded list_lru lists can become way too big (the superblock for
instance contains two of them).

This patch aims at ameliorating this situation by dynamically allocating
the node arrays with the firmware provided nr_node_ids.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoshrinker-kill-old-shrink-api-fix
Andrew Morton [Thu, 27 Jun 2013 23:52:23 +0000 (09:52 +1000)]
shrinker-kill-old-shrink-api-fix

fix whitespace

Cc: Glauber Costa <glommer@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoshrinker: Kill old ->shrink API.
Dave Chinner [Thu, 27 Jun 2013 23:52:22 +0000 (09:52 +1000)]
shrinker: Kill old ->shrink API.

There are no more users of this API, so kill it dead, dead, dead and
quietly bury the corpse in a shallow, unmarked grave in a dark forest deep
in the hills...

[glommer@openvz.org: added flowers to the grave]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Greg Thelen <gthelen@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agohugepage-convert-huge-zero-page-shrinker-to-new-shrinker-api-fix
Andrew Morton [Thu, 27 Jun 2013 23:52:22 +0000 (09:52 +1000)]
hugepage-convert-huge-zero-page-shrinker-to-new-shrinker-api-fix

fix warnings

Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agohugepage: convert huge zero page shrinker to new shrinker API
Glauber Costa [Thu, 27 Jun 2013 23:52:22 +0000 (09:52 +1000)]
hugepage: convert huge zero page shrinker to new shrinker API

It consists of:

* returning long instead of int
* separating count from scan
* returning the number of freed entities in scan

Signed-off-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Greg Thelen <gthelen@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoshrinker-convert-remaining-shrinkers-to-count-scan-api-fix
Andrew Morton [Thu, 27 Jun 2013 23:52:21 +0000 (09:52 +1000)]
shrinker-convert-remaining-shrinkers-to-count-scan-api-fix

fix warnings

Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoshrinker: convert remaining shrinkers to count/scan API
Dave Chinner [Thu, 27 Jun 2013 23:52:21 +0000 (09:52 +1000)]
shrinker: convert remaining shrinkers to count/scan API

Convert the remaining couple of random shrinkers in the tree to the new
API.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoi915: bail out earlier when shrinker cannot acquire mutex
Glauber Costa [Thu, 27 Jun 2013 23:52:21 +0000 (09:52 +1000)]
i915: bail out earlier when shrinker cannot acquire mutex

The main shrinker driver will keep trying for a while to free objects if
the returned value from the shrink scan procedure is 0.  That means "no
objects now", but a retry could very well succeed.

But what we should say here is a different thing: that it is impossible to
shrink, and we would better bail out soon.  We find this behavior more
appropriate for the case where the lock cannot be taken.  Specially given
the hammer behavior of the i915: if another thread is already shrinking,
we are likely not to be able to shrink anything anyway when we finally
acquire the mutex.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agodrivers-convert-shrinkers-to-new-count-scan-api-fix-2
Andrew Morton [Thu, 27 Jun 2013 23:52:20 +0000 (09:52 +1000)]
drivers-convert-shrinkers-to-new-count-scan-api-fix-2

fix warnings

Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agodrivers-convert-shrinkers-to-new-count-scan-api-fix
Andrew Morton [Thu, 27 Jun 2013 23:52:20 +0000 (09:52 +1000)]
drivers-convert-shrinkers-to-new-count-scan-api-fix

fix warnings

Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agodrivers: convert shrinkers to new count/scan API
Dave Chinner [Thu, 27 Jun 2013 23:52:20 +0000 (09:52 +1000)]
drivers: convert shrinkers to new count/scan API

Convert the driver shrinkers to the new API.  Most changes are compile
tested only because I either don't have the hardware or it's staging
stuff.

FWIW, the md and android code is pretty good, but the rest of it makes me
want to claw my eyes out.  The amount of broken code I just encountered is
mind boggling.  I've added comments explaining what is broken, but I fear
that some of the code would be best dealt with by being dragged behind the
bike shed, burying in mud up to it's neck and then run over repeatedly
with a blunt lawn mower.

Special mention goes to the zcache/zcache2 drivers.  They can't co-exist
in the build at the same time, they are under different menu options in
menuconfig, they only show up when you've got the right set of mm
subsystem options configured and so even compile testing is an exercise in
pulling teeth.  And that doesn't even take into account the horrible,
broken code...

[glommer@openvz.org: fixes for i915, android lowmem, zcache, bcache]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoUBIFS: signedness bug in ubifs_shrink_count()
Dan Carpenter [Thu, 27 Jun 2013 23:52:19 +0000 (09:52 +1000)]
UBIFS: signedness bug in ubifs_shrink_count()

We test "clean_zn_cnt" for negative later in the function.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agofs-convert-fs-shrinkers-to-new-scan-count-api-fix
Andrew Morton [Thu, 27 Jun 2013 23:52:19 +0000 (09:52 +1000)]
fs-convert-fs-shrinkers-to-new-scan-count-api-fix

fix warnings

Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agofs: convert fs shrinkers to new scan/count API
Dave Chinner [Thu, 27 Jun 2013 23:52:19 +0000 (09:52 +1000)]
fs: convert fs shrinkers to new scan/count API

Convert the filesystem shrinkers to use the new API, and standardise some
of the behaviours of the shrinkers at the same time.  For example,
nr_to_scan means the number of objects to scan, not the number of objects
to free.

I refactored the CIFS idmap shrinker a little - it really needs to be
broken up into a shrinker per tree and keep an item count with the tree
root so that we don't need to walk the tree every time the shrinker needs
to count the number of objects in the tree (i.e.  all the time under
memory pressure).

[glommer@openvz.org: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are needed mainly due to new code merged in the tree]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoxfs-convert-dquot-cache-lru-to-list_lru-fix
Andrew Morton [Thu, 27 Jun 2013 23:52:19 +0000 (09:52 +1000)]
xfs-convert-dquot-cache-lru-to-list_lru-fix

fix warnings

Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoxfs: convert dquot cache lru to list_lru
Dave Chinner [Thu, 27 Jun 2013 23:52:18 +0000 (09:52 +1000)]
xfs: convert dquot cache lru to list_lru

Convert the XFS dquot lru to use the list_lru construct and convert the
shrinker to being node aware.

[glommer@openvz.org: edited for conflicts + warning fixes]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoxfs: rework buffer dispose list tracking
Dave Chinner [Thu, 27 Jun 2013 23:52:18 +0000 (09:52 +1000)]
xfs: rework buffer dispose list tracking

In converting the buffer lru lists to use the generic code, the locking
for marking the buffers as on the dispose list was lost.  This results in
confusion in LRU buffer tracking and acocunting, resulting in reference
counts being mucked up and filesystem beig unmountable.

To fix this, introduce an internal buffer spinlock to protect the state
field that holds the dispose list information.  Because there is now
locking needed around xfs_buf_lru_add/del, and they are used in exactly
one place each two lines apart, get rid of the wrappers and code the logic
directly in place.

Further, the LRU emptying code used on unmount is less than optimal.
Convert it to use a dispose list as per a normal shrinker walk, and repeat
the walk that fills the dispose list until the LRU is empty.  Thi avoids
needing to drop and regain the LRU lock for every item being freed, and
allows the same logic as the shrinker isolate call to be used.  Simpler,
easier to understand.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoxfs-convert-buftarg-lru-to-generic-code-fix
Andrew Morton [Thu, 27 Jun 2013 23:52:18 +0000 (09:52 +1000)]
xfs-convert-buftarg-lru-to-generic-code-fix

fix warnings

Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoxfs: convert buftarg LRU to generic code
Dave Chinner [Thu, 27 Jun 2013 23:52:17 +0000 (09:52 +1000)]
xfs: convert buftarg LRU to generic code

Convert the buftarg LRU to use the new generic LRU list and take advantage
of the functionality it supplies to make the buffer cache shrinker node
aware.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agofs: convert inode and dentry shrinking to be node aware
Dave Chinner [Thu, 27 Jun 2013 23:52:17 +0000 (09:52 +1000)]
fs: convert inode and dentry shrinking to be node aware

Now that the shrinker is passing a node in the scan control structure, we
can pass this to the the generic LRU list code to isolate reclaim to the
lists on matching nodes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agovmscan: per-node deferred work
Glauber Costa [Thu, 27 Jun 2013 23:52:17 +0000 (09:52 +1000)]
vmscan: per-node deferred work

The list_lru infrastructure already keeps per-node LRU lists in its
node-specific list_lru_node arrays and provide us with a per-node API, and
the shrinkers are properly equiped with node information.  This means that
we can now focus our shrinking effort in a single node, but the work that
is deferred from one run to another is kept global at nr_in_batch.  Work
can be deferred, for instance, during direct reclaim under a GFP_NOFS
allocation, where situation, all the filesystem shrinkers will be
prevented from running and accumulate in nr_in_batch the amount of work
they should have done, but could not.

This creates an impedance problem, where upon node pressure, work deferred
will accumulate and end up being flushed in other nodes.  The problem we
describe is particularly harmful in big machines, where many nodes can
accumulate at the same time, all adding to the global counter nr_in_batch.
 As we accumulate more and more, we start to ask for the caches to flush
even bigger numbers.  The result is that the caches are depleted and do
not stabilize.  To achieve stable steady state behavior, we need to tackle
it differently.

In this patch we keep the deferred count per-node, in the new array
nr_deferred[] (the name is also a bit more descriptive) and will never
accumulate that to other nodes.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoshrinker: add node awareness
Dave Chinner [Thu, 27 Jun 2013 23:52:16 +0000 (09:52 +1000)]
shrinker: add node awareness

Pass the node of the current zone being reclaimed to shrink_slab(),
allowing the shrinker control nodemask to be set appropriately for node
aware shrinkers.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agolist_lru: remove special case function list_lru_dispose_all.
Glauber Costa [Thu, 27 Jun 2013 23:52:16 +0000 (09:52 +1000)]
list_lru: remove special case function list_lru_dispose_all.

The list_lru implementation has one function, list_lru_dispose_all, with
only one user (the dentry code).  At first, such function appears to make
sense because we are really not interested in the result of isolating each
dentry separately - all of them are going away anyway.  However, it's
implementation is buggy in the following way:

When we call list_lru_dispose_all in fs/dcache.c, we scan all dentries
marking them with DCACHE_SHRINK_LIST.  However, this is done without the
nlru->lock taken.  The imediate result of that is that someone else may
add or remove the dentry from the LRU at the same time.  When list_lru_del
happens in that scenario we will see an element that is not yet marked
with DCACHE_SHRINK_LIST (even though it will be in the future) and
obviously remove it from an lru where the element no longer is.  Since
list_lru_dispose_all will in effect count down nlru's nr_items and
list_lru_del will do the same, this will lead to an imbalance.

The solution for this would not be so simple: we can obviously just keep
the lru_lock taken, but then we have no guarantees that we will be able to
acquire the dentry lock (dentry->d_lock).  To properly solve this, we need
a communication mechanism between the lru and dentry code, so they can
coordinate this with each other.

Such mechanism already exists in the form of the list_lru_walk_cb
callback.  So it is possible to construct a dcache-side prune function
that does the right thing only by calling list_lru_walk in a loop until no
more dentries are available.

With only one user, plus the fact that a sane solution for the problem
would involve boucing between dcache and list_lru anyway, I see little
justification to keep the special case list_lru_dispose_all in tree.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Michal Hocko <mhocko@suse.cz>
Acked-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agolist_lru: per-node API
Glauber Costa [Thu, 27 Jun 2013 23:52:16 +0000 (09:52 +1000)]
list_lru: per-node API

This patch adapts the list_lru API to accept an optional node argument, to
be used by NUMA aware shrinking functions.  Code that does not care about
the NUMA placement of objects can still call into the very same functions
as before.  They will simply iterate over all nodes.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agolist_lru: per-node list infrastructure fix
Glauber Costa [Thu, 27 Jun 2013 23:52:15 +0000 (09:52 +1000)]
list_lru: per-node list infrastructure fix

After a while investigating, it seems to us that the imbalance we are
seeing are due to a multi-node race already in tree (our guess).  Although
the WARN is useful to show us the race, BUG_ON is too much, since it seems
the kernel should be fine going on after that.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agolist_lru: per-node list infrastructure
Dave Chinner [Thu, 27 Jun 2013 23:52:15 +0000 (09:52 +1000)]
list_lru: per-node list infrastructure

Now that we have an LRU list API, we can start to enhance the
implementation.  This splits the single LRU list into per-node lists and
locks to enhance scalability.  Items are placed on lists according to the
node the memory belongs to.  To make scanning the lists efficient, also
track whether the per-node lists have entries in them in a active
nodemask.

Note: We use a fixed-size array for the node LRU, this struct can be very
big if MAX_NUMNODES is big.  If this becomes a problem this is fixable by
turning this into a pointer and dynamically allocating this to
nr_node_ids.  This quantity is firwmare-provided, and still would provide
room for all nodes at the cost of a pointer lookup and an extra
allocation.  Because that allocation will most likely come from a may very
well fail.

[glommer@openvz.org: fix warnings, added note about node lru]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Greg Thelen <gthelen@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agodcache: convert to use new lru list infrastructure
Dave Chinner [Thu, 27 Jun 2013 23:52:15 +0000 (09:52 +1000)]
dcache: convert to use new lru list infrastructure

[glommer@openvz.org: don't reintroduce double decrement of nr_unused_dentries, adapted for new LRU return codes]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoinode: move inode to a different list inside lock
Glauber Costa [Thu, 27 Jun 2013 23:52:14 +0000 (09:52 +1000)]
inode: move inode to a different list inside lock

When removing an element from the lru, this will be done today after the lock
is released. This is a clear mistake, although we are not sure if the bugs we
are seeing are related to this. All list manipulations are done inside the
lock, and so should this one.

Signed-off-by: Glauber Costa <glommer@openvz.org>
Tested-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoinode: convert inode lru list to generic lru list code.
Dave Chinner [Thu, 27 Jun 2013 23:52:14 +0000 (09:52 +1000)]
inode: convert inode lru list to generic lru list code.

[glommer@openvz.org: adapted for new LRU return codes]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agolist: add a new LRU list type
Dave Chinner [Thu, 27 Jun 2013 23:52:14 +0000 (09:52 +1000)]
list: add a new LRU list type

Several subsystems use the same construct for LRU lists - a list head, a
spin lock and and item count.  They also use exactly the same code for
adding and removing items from the LRU.  Create a generic type for these
LRU lists.

This is the beginning of generic, node aware LRUs for shrinkers to work
with.

[glommer@openvz.org: enum defined constants for lru. Suggested by gthelen, don't relock over retry]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Reviewed-by: Greg Thelen <gthelen@google.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoshrinker-convert-superblock-shrinkers-to-new-api-fix
Andrew Morton [Thu, 27 Jun 2013 23:52:13 +0000 (09:52 +1000)]
shrinker-convert-superblock-shrinkers-to-new-api-fix

fix warnings

fs/super.c: In function 'alloc_super':
fs/super.c:240: warning: assignment from incompatible pointer type
fs/super.c:241: warning: assignment from incompatible pointer type

Cc: Dave Chinner <dchinner@redhat.com>
Cc: Glauber Costa <glommer@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>