Joe Perches [Fri, 2 Dec 2011 03:12:41 +0000 (14:12 +1100)]
checkpatch: update signature "might be better as" warning
email header lines can look like signature tags. It's valid to have
multiple email recipients on a single line but not valid to have multiple
signatures on a single line.
Validate signatures only when not in the email headers.
Clear the $in_commit_log flag when the patch filename appears.
Add '-' to the valid chars in a message header for headers
like "Message-Id:" and "In-Reply-To:".
Signed-off-by: Joe Perches <joe@perches.com> Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Lin [Fri, 2 Dec 2011 03:12:41 +0000 (14:12 +1100)]
drivers/leds/leds-netxbig.c: use gpio_request_one()
Use gpio_request_one() instead of multiple gpiolib calls. This also
simplifies error handling a bit.
Signed-off-by: Axel Lin <axel.lin@gmail.com> Cc: Simon Guinot <sguinot@lacie.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Lin [Fri, 2 Dec 2011 03:12:40 +0000 (14:12 +1100)]
drivers/leds/leds-bd2802.c: use gpio_request_one()
Use gpio_request_one() instead of multiple gpiolib calls.
Signed-off-by: Axel Lin <axel.lin@gmail.com> Cc: Kim Kyuwon <q1.kim@samsung.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Axel Lin <axel.lin@gmail.com> Cc: Samu Onkalo <samu.p.onkalo@nokia.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Lin [Fri, 2 Dec 2011 03:12:40 +0000 (14:12 +1100)]
leds: convert leds-dac124s085 to module_spi_driver
Factor out some boilerplate code for spi driver registration into
module_spi_driver.
Signed-off-by: Axel Lin <axel.lin@gmail.com> Cc: Haojian Zhuang <hzhuang1@marvell.com> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Michael Hennerich <hennerich@blackfin.uclinux.org> Cc: Mike Rapoport <mike@compulab.co.il> Acked-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Lin [Fri, 2 Dec 2011 03:12:39 +0000 (14:12 +1100)]
leds: convert led i2c drivers to module_i2c_driver
Factor out some boilerplate code for i2c driver registration
into module_i2c_driver.
Signed-off-by: Axel Lin <axel.lin@gmail.com> Cc: Haojian Zhuang <hzhuang1@marvell.com> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Michael Hennerich <hennerich@blackfin.uclinux.org> Cc: Mike Rapoport <mike@compulab.co.il> Cc: Guennadi Liakhovetski <g.liakhovetski@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Lin [Fri, 2 Dec 2011 03:12:39 +0000 (14:12 +1100)]
leds: convert led platform drivers to module_platform_driver
Factor out some boilerplate code for platform driver registration into
module_platform_driver.
Signed-off-by: Axel Lin <axel.lin@gmail.com> Acked-by: Haojian Zhuang <hzhuang1@marvell.com> [led-88pm860x.c] Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Michael Hennerich <hennerich@blackfin.uclinux.org> Cc: Mike Rapoport <mike@compulab.co.il> Cc: Guennadi Liakhovetski <g.liakhovetski@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Lin [Fri, 2 Dec 2011 03:12:37 +0000 (14:12 +1100)]
backlight: convert drivers/video/backlight/* to use module_platform_driver()
Convert the drivers in drivers/video/backlight/* to use the
module_platform_driver() macro which makes the code smaller and a bit
simpler.
Signed-off-by: Axel Lin <axel.lin@gmail.com> Acked-by: Haojian Zhuang <haojian.zhuang@gmail.com> Acked-by: H Hartley Sweeten <hsweeten@visionengravers.com> [ep93xx_bl.c] Cc: Mike Rapoport <mike@compulab.co.il> Cc: Richard Purdie <rpurdie@rpsys.net> Acked-by: Michael Hennerich <michael.hennerich@analog.com> Acked-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Paul Bolle [Fri, 2 Dec 2011 03:12:13 +0000 (14:12 +1100)]
backlight: remove ADX backlight device support
Support for the Avionic Design Xanthos backlight device got added in
commit 3b96ea9ef8 ("backlight: Add support for the Avionic Design Xanthos
backlight device."). That support depends on ARCH_PXA_ADX. The code that
should have provided that Kconfig symbol never got submitted. It has
never been possible to even build this driver. Remove it.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Acked-by: Thierry Reding <thierry.reding@avionic-design.de> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Wim Van Sebroeck <wim@iguana.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ian Campbell [Fri, 2 Dec 2011 03:12:13 +0000 (14:12 +1100)]
get_maintainers.pl: follow renames when looking up commit signers
I happen to have had a commit to various network drivers since the big
renaming/reorg which happened to drivers/net recently. This means that I
now appear to be in the top few commit signers (by %age) for many of them
so am getting sent all sorts of stuff and people who are involved with the
driver are not. e.g. (to pick one at random):
$ ./scripts/get_maintainer.pl -f drivers/net/ethernet/nvidia/forcedeth.c
"David S. Miller" <davem@davemloft.net> (commit_signer:5/7=71%)
Ian Campbell <ian.campbell@citrix.com> (commit_signer:2/7=29%)
Eric Dumazet <eric.dumazet@gmail.com> (commit_signer:1/7=14%)
Jeff Kirsher <jeffrey.t.kirsher@intel.com> (commit_signer:1/7=14%)
Jiri Pirko <jpirko@redhat.com> (commit_signer:1/7=14%)
netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
linux-kernel@vger.kernel.org (open list)
With the following patch the renames are followed and the result appears
much more sensible:
1 is a power of two, therefore rounddown_pow_of_two(1) should return 1.
It does in case the argument is a variable but in case it's a constant it
behaves wrong and returns 0. Probably nobody ever did it so this was
never noticed, however net/drivers/vmxnet3 with latest GCC does and breaks
on unicpu systems.
This is similar to Rolf's patch to roundup_pow_of_two(1).
Cc: Rolf Eike Beer <eike-kernel@sf-tec.de> Reviewed-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: Andrei Warkentin <andreiw@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andi Kleen [Fri, 2 Dec 2011 03:12:12 +0000 (14:12 +1100)]
brlocks/lglocks: clean up code
lglocks and brlocks are currently generated with some complicated macros
in lglock.h. But there's no reason I can see to not just use common
utility functions that get pointers to the lglock.
Since there are at least two users it makes sense to share this code in a
library.
This will also make it later possible to dynamically allocate lglocks.
In general the users now look more like normal function calls with
pointers, not magic macros.
The patch is rather large because I move over all users in one go to keep
it bisectable. This impacts the VFS somewhat in terms of lines changed.
But no actual behaviour change.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jesper Juhl [Fri, 2 Dec 2011 03:12:11 +0000 (14:12 +1100)]
audit: always follow va_copy() with va_end()
A call to va_copy() should always be followed by a call to va_end() in the
same function. In kernel/autit.c::audit_log_vformat() this is not always
done. This patch makes sure va_end() is always called.
Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Heiko Carstens [Fri, 2 Dec 2011 03:12:10 +0000 (14:12 +1100)]
mm,slub,x86: decouple size of struct page from CONFIG_CMPXCHG_LOCAL
While implementing cmpxchg_double() on s390 I realized that we don't set
CONFIG_CMPXCHG_LOCAL besides the fact that we have support for it.
However setting that option will increase the size of struct page by eight
bytes on 64 bit, which we certainly do not want. Also, it doesn't make
sense that a present cpu feature should increase the size of struct page.
Besides that it looks like the dependency to CMPXCHG_LOCAL is wrong and
that it should depend on CMPXCHG_DOUBLE instead.
This patch:
If an architecture supports CMPXCHG_LOCAL this shouldn't result
automatically in larger struct pages if the SLUB allocator is used.
Instead introduce a new config option "HAVE_ALIGNED_STRUCT_PAGE" which can
be selected if a double word aligned struct page is required. Also update
x86 Kconfig so that it should work as before.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
WARNING: please, no spaces at the start of a line
#57: FILE: arch/m68k/amiga/config.c:515:
+ __noreturn;$
total: 0 errors, 1 warnings, 106 lines checked
./patches/treewide-convert-uses-of-attrib_noreturn-to-__noreturn.patch has style problems, please review.
If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Fri, 2 Dec 2011 03:12:07 +0000 (14:12 +1100)]
intel_idle: fix API misuse
smp_call_function() only lets all other CPUs execute a specific function,
while we expect all CPUs do in intel_idle. Without the fix, we could have
one cpu which has auto_demotion enabled or has no boradcast timer setup.
Usually we don't see impact because auto demotion just harms power and the
intel_idle init is called in CPU 0, where boradcast timer delivers
interrupt, but this still could be a problem.
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Len Brown <lenb@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Magnus Lynch [Fri, 2 Dec 2011 03:12:07 +0000 (14:12 +1100)]
hpet: factor timer allocate from open
The current implementation of the /dev/hpet driver couples opening the
device with allocating one of the (scarce) timers (aka comparators). This
is a limitation in that the main counter may be valuable to applications
seeking a high-resolution timer who have no use for the interrupt
generating functionality of the comparators.
This patch alters the open semantics so that when the device is opened, no
timer is allocated. Operations that depend on a timer being in context
implicitly attempt allocating a timer, to maintain backward compatibility.
There is also an IOCTL (HPET_ALLOC_TIMER _IO) added so that the
allocation may be done explicitly. (I prefer the explicit open then
allocate pattern but don't know how practical it would be to require all
existing code to be changed.)
/dev/hpet is accessed via mmap(). This is the only interface of /dev/hpet
that is actually used in practice.
[akpm@linux-foundation.org: coding-style tweaks]
[arnd@arndb.de: fix build] Signed-off-by: Magnus Lynch <maglyx@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Acked-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When (no)bootmem finishes its operation, it passes pages to the buddy
allocator. Since debug_pagealloc_enabled is not set, we will not protect
these pages, which is not what we want with CONFIG_DEBUG_PAGEALLOC=y.
To fix this, remove debug_pagealloc_enabled. That variable was introduced
by commit 12d6f21e ("x86: do not PSE on CONFIG_DEBUG_PAGEALLOC=y") to get
more CPA (change page attribude) code testing. But currently we have
CONFIG_CPA_DEBUG, which tests CPA.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hillf Danton [Fri, 2 Dec 2011 03:12:06 +0000 (14:12 +1100)]
mm: compaction: push isolate search base of compact control one pfn ahead
After isolated the current pfn will no longer be scanned and isolated if
the next round is necessary, so push the isolate_migratepages search base
of the given compact_control one step ahead.
Signed-off-by: Hillf Danton <dhillf@gmail.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Fri, 2 Dec 2011 03:12:06 +0000 (14:12 +1100)]
Btrfs: pass __GFP_WRITE for buffered write page allocations
Tell the page allocator that pages allocated for a buffered write are
expected to become dirty soon.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Fri, 2 Dec 2011 03:12:06 +0000 (14:12 +1100)]
mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin()
Tell the page allocator that pages allocated through
grab_cache_page_write_begin() are expected to become dirty soon.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Fri, 2 Dec 2011 03:12:05 +0000 (14:12 +1100)]
mm: try to distribute dirty pages fairly across zones
The maximum number of dirty pages that exist in the system at any time is
determined by a number of pages considered dirtyable and a user-configured
percentage of those, or an absolute number in bytes.
This number of dirtyable pages is the sum of memory provided by all the
zones in the system minus their lowmem reserves and high watermarks, so
that the system can retain a healthy number of free pages without having
to reclaim dirty pages.
But there is a flaw in that we have a zoned page allocator which does not
care about the global state but rather the state of individual memory
zones. And right now there is nothing that prevents one zone from filling
up with dirty pages while other zones are spared, which frequently leads
to situations where kswapd, in order to restore the watermark of free
pages, does indeed have to write pages from that zone's LRU list. This
can interfere so badly with IO from the flusher threads that major
filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
already, taking away the VM's only possibility to keep such a zone
balanced, aside from hoping the flushers will soon clean pages from that
zone.
Enter per-zone dirty limits. They are to a zone's dirtyable memory what
the global limit is to the global amount of dirtyable memory, and try to
make sure that no single zone receives more than its fair share of the
globally allowed dirty pages in the first place. As the number of pages
considered dirtyable excludes the zones' lowmem reserves and high
watermarks, the maximum number of dirty pages in a zone is such that the
zone can always be balanced without requiring page cleaning.
As this is a placement decision in the page allocator and pages are
dirtied only after the allocation, this patch allows allocators to pass
__GFP_WRITE when they know in advance that the page will be written to and
become dirty soon. The page allocator will then attempt to allocate from
the first zone of the zonelist - which on NUMA is determined by the task's
NUMA memory policy - that has not exceeded its dirty limit.
At first glance, it would appear that the diversion to lower zones can
increase pressure on them, but this is not the case. With a full high
zone, allocations will be diverted to lower zones eventually, so it is
more of a shift in timing of the lower zone allocations. Workloads that
previously could fit their dirty pages completely in the higher zone may
be forced to allocate from lower zones, but the amount of pages that
"spill over" are limited themselves by the lower zones' dirty constraints,
and thus unlikely to become a problem.
For now, the problem of unfair dirty page distribution remains for NUMA
configurations where the zones allowed for allocation are in sum not big
enough to trigger the global dirty limits, wake up the flusher threads and
remedy the situation. Because of this, an allocation that could not
succeed on any of the considered zones is allowed to ignore the dirty
limits before going into direct reclaim or even failing the allocation,
until a future patch changes the global dirty throttling and flusher
thread activation so that they take individual zone states into account.
Test results
15M DMA + 3246M DMA32 + 504 Normal = 3765M memory
40% dirty ratio
16G USB thumb drive
10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Michal Hocko <mhocko@suse.cz> Tested-by: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Fri, 2 Dec 2011 03:12:05 +0000 (14:12 +1100)]
mm: writeback: cleanups in preparation for per-zone dirty limits
The next patch will introduce per-zone dirty limiting functions in
addition to the traditional global dirty limiting.
Rename determine_dirtyable_memory() to global_dirtyable_memory() before
adding the zone-specific version, and fix up its documentation.
Also, move the functions to determine the dirtyable memory and the
function to calculate the dirty limit based on that together so that their
relationship is more apparent and that they can be commented on as a
group.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Mel Gorman <mel@suse.de> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Chris Mason <chris.mason@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <jweiner@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Fri, 2 Dec 2011 03:12:04 +0000 (14:12 +1100)]
mm: exclude reserved pages from dirtyable memory
Per-zone dirty limits try to distribute page cache pages allocated for
writing across zones in proportion to the individual zone sizes, to reduce
the likelihood of reclaim having to write back individual pages from the
LRU lists in order to make progress.
This patch:
The amount of dirtyable pages should not include the full number of free
pages: there is a number of reserved pages that the page allocator and
kswapd always try to keep free.
The closer (reclaimable pages - dirty pages) is to the number of reserved
pages, the more likely it becomes for reclaim to run into dirty pages:
This patch introduces a per-zone dirty reserve that takes both the lowmem
reserve as well as the high watermark of the zone into account, and a
global sum of those per-zone values that is subtracted from the global
amount of dirtyable pages. The lowmem reserve is unavailable to page
cache allocations and kswapd tries to keep the high watermark free. We
don't want to end up in a situation where reclaim has to clean pages in
order to balance zones.
Not treating reserved pages as dirtyable on a global level is only a
conceptual fix. In reality, dirty pages are not distributed equally
across zones and reclaim runs into dirty pages on a regular basis.
But it is important to get this right before tackling the problem on a
per-zone level, where the distance between reclaim and the dirty pages is
mostly much smaller in absolute numbers.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Jan Kara <jack@suse.cz> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Fri, 2 Dec 2011 03:12:04 +0000 (14:12 +1100)]
mm, debug: test for online nid when allocating on single node
Calling alloc_pages_exact_node() means the allocation only passes the
zonelist of a single node into the page allocator. If that node isn't
online, it's zonelist may never have been initialized causing a strange
oops that may not immediately be clear.
I recently debugged an issue where node 0 wasn't online and an allocator
was passing 0 to alloc_pages_exact_node() and it resulted in a NULL
pointer on zonelist->_zoneref. If CONFIG_DEBUG_VM is enabled, though, it
would be nice to catch this a bit earlier.
Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shawn Bohrer [Fri, 2 Dec 2011 03:12:03 +0000 (14:12 +1100)]
fadvise: only initiate writeback for specified range with FADV_DONTNEED
Previously POSIX_FADV_DONTNEED would start writeback for the entire file
when the bdi was not write congested. This negatively impacts performance
if the file contians dirty pages outside of the requested range. This
change uses __filemap_fdatawrite_range() to only initiate writeback for
the requested range.
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Disable slub debug facilities and allocate slabs at minimal order when
debug_guardpage_minorder > 0 to increase probability to catch random
memory corruption by cpu exception.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When debugging with CONFIG_DEBUG_PAGEALLOC and debug_guardpage_minorder >
0, we have lot of free pages that are not marked so. Snapshot code
account them as savable, what cause hibernate memory preallocation
failure.
It is pretty hard to make hibernate allocation succeed with
debug_guardpage_minorder=1. This change at least make it possible when
system has relatively big amount of RAM.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
With CONFIG_DEBUG_PAGEALLOC configured, the CPU will generate an exception
on access (read,write) to an unallocated page, which permits us to catch
code which corrupts memory. However the kernel is trying to maximise
memory usage, hence there are usually few free pages in the system and
buggy code usually corrupts some crucial data.
This patch changes the buddy allocator to keep more free/protected pages
and to interlace free/protected and allocated pages to increase the
probability of catching corruption.
When the kernel is compiled with CONFIG_DEBUG_PAGEALLOC,
debug_guardpage_minorder defines the minimum order used by the page
allocator to grant a request. The requested size will be returned with
the remaining pages used as guard pages.
The default value of debug_guardpage_minorder is zero: no change from
current behaviour.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Daney [Fri, 2 Dec 2011 03:12:01 +0000 (14:12 +1100)]
hugetlb: replace BUG() with BUILD_BUG() for dummy definitions
The file linux/hugetlb.h has many places where dummy symbols were defined
so that the main source code would contain fewer:
#ifdef CONFIG_HUGETLBFS
or
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
If there were any misuse of these symbols, the only symptom would be an
OOPS at runtime. Change the BUG() to BUILD_BUG() to catch any such abuse
at compile time instead.
Signed-off-by: David Daney <david.daney@cavium.com> Cc: David Rientjes <rientjes@google.com> Cc: DM <dm.n9107@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Daney [Fri, 2 Dec 2011 03:12:01 +0000 (14:12 +1100)]
kernel.h: Add BUILD_BUG() macro.
We can place this in definitions that we expect the compiler to remove
by dead code elimination. If this assertion fails, we get a nice
error message at build time.
The GCC function attribute error("message") was added in version 4.3,
so we define a new macro __linktime_error(message) to expand to this
for GCC-4.3 and later. This will give us an error diagnostic from the
compiler on the line that fails. For other compilers
__linktime_error(message) expands to nothing, and we have to be
content with a link time error, but at least we will still get a build
error.
BUILD_BUG() expands to the undefined function __build_bug_failed() and
will fail at link time if the compiler ever emits code for it. On
GCC-4.3 and later, attribute((error())) is used so that the failure
will be noted at compile time instead.
Acked-by: David Howells <dhowells@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: David Daney <david.daney@cavium.com> Cc: David Rientjes <rientjes@google.com> Cc: DM <dm.n9107@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Daney [Fri, 2 Dec 2011 03:12:01 +0000 (14:12 +1100)]
kernel.h: add BUILD_BUG() macro
We can place this in definitions that we expect the compiler to remove by
dead code elimination. If this assertion fails, we get a nice error
message at build time.
The GCC function attribute error("message") was added in version 4.3, so
we define a new macro __linktime_error(message) to expand to this for
GCC-4.3 and later. This will give us an error diagnostic from the
compiler on the line that fails. For other compilers
__linktime_error(message) expands to nothing, and we have to be content
with a link time error, but at least we will still get a build error.
BUILD_BUG() expands to the undefined function __build_bug_failed() and
will fail at link time if the compiler ever emits code for it. On GCC-4.3
and later, attribute((error())) is used so that the failure will be noted
at compile time instead.
Signed-off-by: David Daney <david.daney@cavium.com> Acked-by: David Rientjes <rientjes@google.com> Cc: DM <dm.n9107@gmail.com> Cc: Ralf Baechle <ralf@linux-mips.org> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/hugetlb.c: fix virtual address handling in hugetlb fault
handle_mm_fault() passes 'faulted' address to hugetlb_fault(). This
address is not aligned to a hugepage boundary.
Most of the functions for hugetlb pages are aware of that and calculate an
alignment themselves. However some functions such as
copy_user_huge_page() and clear_huge_page() don't handle alignment by
themselves.
This patch make hugeltb_fault() fix the alignment and pass an aligned
addresss (to address of a faulted hugepage) to functions.
Let's make it clear that we cannot race with other fault handlers due to
hugetlb (global) mutex. Also make it clear that we want to keep pte_same
checks anayway to have a transition from the global mutex easier.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Hillf Danton <dhillf@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hillf Danton [Fri, 2 Dec 2011 03:11:59 +0000 (14:11 +1100)]
hugetlb: detect race upon page allocation failure during COW
Currently we are not rechecking pte_same in hugetlb_cow after we take ptl
lock again in the page allocation failure code path and simply retry
again. This is not an issue at the moment because hugetlb fault path is
protected by hugetlb_instantiation_mutex so we cannot race.
The original page is locked and so we cannot race even with the page
migration.
Let's add the pte_same check anyway as we want to be consistent with the
other check later in this function and be safe if we ever remove the
mutex.
[mhocko@suse.cz: reworded the changelog] Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: account reaped page cache on inode cache pruning
Inode cache pruning indirectly reclaims page-cache by invalidating mapping
pages. Let's account them into reclaim-state to notice this progress in
memory reclaimer.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Fri, 2 Dec 2011 03:11:41 +0000 (14:11 +1100)]
mm: avoid livelock on !__GFP_FS allocations
Colin Cross reported;
Under the following conditions, __alloc_pages_slowpath can loop forever:
gfp_mask & __GFP_WAIT is true
gfp_mask & __GFP_FS is false
reclaim and compaction make no progress
order <= PAGE_ALLOC_COSTLY_ORDER
These conditions happen very often during suspend and resume,
when pm_restrict_gfp_mask() effectively converts all GFP_KERNEL
allocations into __GFP_WAIT.
The oom killer is not run because gfp_mask & __GFP_FS is false,
but should_alloc_retry will always return true when order is less
than PAGE_ALLOC_COSTLY_ORDER.
In his fix, he avoided retrying the allocation if reclaim made no progress
and __GFP_FS was not set. The problem is that this would result in
GFP_NOIO allocations failing that previously succeeded which would be very
unfortunate.
The big difference between GFP_NOIO and suspend converting GFP_KERNEL to
behave like GFP_NOIO is that normally flushers will be cleaning pages and
kswapd reclaims pages allowing GFP_NOIO to succeed after a short delay.
The same does not necessarily apply during suspend as the storage device
may be suspended.
This patch special cases the suspend case to fail the page allocation if
reclaim cannot make progress and adds some documentation on how
gfp_allowed_mask is currently used. Failing allocations like this may
cause suspend to abort but that is better than a livelock.
[mgorman@suse.de: Rework fix to be suspend specific]
[rientjes@google.com: Move suspended device check to should_alloc_retry] Reported-by: Colin Cross <ccross@android.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
WARNING: line over 80 characters
#42: FILE: mm/page_alloc.c:3464:
+ /* Blocks with reserved pages will never free, skip them. */
WARNING: line over 80 characters
#61: FILE: mm/page_alloc.c:3477:
+ set_pageblock_migratetype(page, MIGRATE_RESERVE);
WARNING: line over 80 characters
#62: FILE: mm/page_alloc.c:3478:
+ move_freepages_block(zone, page, MIGRATE_RESERVE);
total: 0 errors, 3 warnings, 44 lines checked
./patches/mm-reduce-the-amount-of-work-done-when-updating-min_free_kbytes.patch has style problems, please review.
If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Fri, 2 Dec 2011 03:11:40 +0000 (14:11 +1100)]
mm: reduce the amount of work done when updating min_free_kbytes
When min_free_kbytes is updated, some pageblocks are marked
MIGRATE_RESERVE. Ordinarily, this work is unnoticable as it happens early
in boot but on large machines with 1TB of memory, this has been reported
to delay boot times, probably due to the NUMA distances involved.
The bulk of the work is due to calling calling pageblock_is_reserved() an
unnecessary amount of times and accessing far more struct page metadata
than is necessary. This patch significantly reduces the amount of work
done by setup_zone_migrate_reserve() improving boot times on 1TB machines.
Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Isaacson <adi@hexapodia.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Fri, 2 Dec 2011 03:11:39 +0000 (14:11 +1100)]
mm: do not stall in synchronous compaction for THP allocations
Occasionally during large file copies to slow storage, there are still
reports of user-visible stalls when THP is enabled. Reports on this have
been intermittent and not reliable to reproduce locally but;
Andy Isaacson reported a problem copying to VFAT on SD Card
https://lkml.org/lkml/2011/11/7/2
In this case, it was stuck in munmap for betwen 20 and 60
seconds in compaction. It is also possible that khugepaged
was holding mmap_sem on this process if CONFIG_NUMA was set.
Johannes Weiner reported stalls on USB
https://lkml.org/lkml/2011/7/25/378
In this case, there is no stack trace but it looks like the
same problem. The USB stick may have been using NTFS as a
filesystem based on other work done related to writing back
to USB around the same time.
Internally in SUSE, I received a bug report related to stalls in firefox
when using Java and Flash heavily while copying from NFS
to VFAT on USB. It has not been confirmed to be the same problem
but if it looks like a duck and quacks like a duck.....
In the past, commit [11bc82d6: mm: compaction: Use async migration for
__GFP_NO_KSWAPD and enforce no writeback] forced that sync compaction
would never be used for THP allocations. This was reverted in commit
[c6a140bf: mm/compaction: reverse the change that forbade sync migraton
with __GFP_NO_KSWAPD] on the grounds that it was uncertain it was
beneficial.
While user-visible stalls do not happen for me when writing to USB, I
setup a test running postmark while short-lived processes created
anonymous mapping. The objective was to exercise the paths that allocate
transparent huge pages. I then logged when processes were stalled for
more than 1 second, recorded a stack strace and did some analysis to
aggregate unique "stall events" which revealed
Time stalled in this event: 47369 ms
Event count: 20
usemem sleep_on_page 3690 ms
usemem sleep_on_page 2148 ms
usemem sleep_on_page 1534 ms
usemem sleep_on_page 1518 ms
usemem sleep_on_page 1225 ms
usemem sleep_on_page 2205 ms
usemem sleep_on_page 2399 ms
usemem sleep_on_page 2398 ms
usemem sleep_on_page 3760 ms
usemem sleep_on_page 1861 ms
usemem sleep_on_page 2948 ms
usemem sleep_on_page 1515 ms
usemem sleep_on_page 1386 ms
usemem sleep_on_page 1882 ms
usemem sleep_on_page 1850 ms
usemem sleep_on_page 3715 ms
usemem sleep_on_page 3716 ms
usemem sleep_on_page 4846 ms
usemem sleep_on_page 1306 ms
usemem sleep_on_page 1467 ms
[<ffffffff810ef30c>] wait_on_page_bit+0x6c/0x80
[<ffffffff8113de9f>] unmap_and_move+0x1bf/0x360
[<ffffffff8113e0e2>] migrate_pages+0xa2/0x1b0
[<ffffffff81134273>] compact_zone+0x1f3/0x2f0
[<ffffffff811345d8>] compact_zone_order+0xa8/0xf0
[<ffffffff811346ff>] try_to_compact_pages+0xdf/0x110
[<ffffffff810f773a>] __alloc_pages_direct_compact+0xda/0x1a0
[<ffffffff810f7d5d>] __alloc_pages_slowpath+0x55d/0x7a0
[<ffffffff810f8151>] __alloc_pages_nodemask+0x1b1/0x1c0
[<ffffffff811331db>] alloc_pages_vma+0x9b/0x160
[<ffffffff81142bb0>] do_huge_pmd_anonymous_page+0x160/0x270
[<ffffffff814410a7>] do_page_fault+0x207/0x4c0
[<ffffffff8143dde5>] page_fault+0x25/0x30
The stall times are approximate at best but the estimates represent 25% of
the worst stalls and even if the estimates are off by a factor of 10, it's
severe.
This patch once again prevents sync migration for transparent hugepage
allocations as it is preferable to fail a THP allocation than stall.
It was suggested that __GFP_NORETRY be used instead of __GFP_NO_KSWAPD to
look less like a special case. This would prevent THP allocation using
sync compaction but it would have other side-effects. There are existing
users of __GFP_NORETRY that are doing high-order allocations and while
they can handle allocation failure, it seems reasonable that they continue
to use sync compaction unless there is a deliberate reason to change that.
To help clarify this for the future, this patch updates the comment for
__GFP_NO_KSWAPD.
If accepted, this is a -stable candidate.
Reported-by: Andy Isaacson <adi@hexapodia.org> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Acked-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jacobo Giralt [Fri, 2 Dec 2011 03:11:39 +0000 (14:11 +1100)]
mm: migrate: one less atomic operation
migrate_page_move_mapping() drops a reference from the old page after
unfreezing its counter. Both operations can be merged into a single
atomic operation by directly unfreezing to one less reference.
The same applies to migrate_huge_page_move_mapping().
Signed-off-by: Jacobo Giralt <jacobo.giralt@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Fri, 2 Dec 2011 03:11:38 +0000 (14:11 +1100)]
mm-add-extra-free-kbytes-tunable-update
All the fixes suggested by Andrew Morton. Not much of a changelog
since the patch should probably be folded into
mm-add-extra-free-kbytes-tunable.patch
Thank you for pointing these out, Andrew.
Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rik van Riel [Fri, 2 Dec 2011 03:11:38 +0000 (14:11 +1100)]
mm: add extra free kbytes tunable
Add a userspace visible knob to tell the VM to keep an extra amount of
memory free, by increasing the gap between each zone's min and low
watermarks.
This is useful for realtime applications that call system calls and have a
bound on the number of allocations that happen in any short time period.
In this application, extra_free_kbytes would be left at an amount equal to
or larger than than the maximum number of allocations that happen in any
burst.
It may also be useful to reduce the memory use of virtual machines
(temporarily?), in a way that does not cause memory fragmentation like
ballooning does.
Testing results from Satoru Moriya:
: I ran some sample workloads and measure memory allocation latency
: (latency of __alloc_page_nodemask()).
: The test is like following:
:
: - CPU: 1 socket, 4 core
: - Memory: 4GB
:
: - Background load:
: $ dd if=3D/dev/zero of=3D/tmp/tmp1
: $ dd if=3D/dev/zero of=3D/tmp/tmp2
: $ dd if=3D/dev/zero of=3D/tmp/tmp3
:
: - Main load:
: $ mapped-file-stream 1 $((1024 * 1024 * 640)) --(*)
:
: (*) This is made by Johannes Weiner
: https://lkml.org/lkml/2010/8/30/226
:
: It allocates/access 640MByte memory at a burst.
:
: The result is follwoing:
:
: | | extra |
: | default | kbytes |
: --------------------------------------------------------------
: min_free_kbytes | 8113 | 8113 |
: extra_free_kbytes | 0 | 640*1024 | (KB)
: --------------------------------------------------------------
: worst latency | 517.762 | 20.775 | (usec)
: --------------------------------------------------------------
: vmstat result | | |
: nr_vmscan_write | 0 | 0 |
: pgsteal_dma | 0 | 0 |
: pgsteal_dma32 | 143667 | 144882 |
: pgsteal_normal | 31486 | 27001 |
: pgsteal_movable | 0 | 0 |
: pgscan_kswapd_dma | 0 | 0 |
: pgscan_kswapd_dma32 | 138617 | 156351 |
: pgscan_kswapd_normal | 30593 | 27955 |
: pgscan_kswapd_movable | 0 | 0 |
: pgscan_direct_dma | 0 | 0 |
: pgscan_direct_dma32 | 5050 | 0 |
: pgscan_direct_normal | 896 | 0 |
: pgscan_direct_movable | 0 | 0 |
: kswapd_steal | 169207 | 171883 |
: kswapd_inodesteal | 0 | 0 |
: kswapd_low_wmark_hit_quickly | 43 | 45 |
: kswapd_high_wmark_hit_quickly | 1 | 0 |
: allocstall | 32 | 0 |
:
:
: As you can see, in the default case there were 32 direct reclaim
: (allocstal= l) and its worst latency was 517.762 usecs. This value may be
: larger if a process would sleep or issue I/O in the direct reclaim path.
: OTOH, ii the other case where I add extra free bytes, there were no direct
: reclaim and its worst latency was 20.775 usecs.
:
: In this test case, we can avoid direct reclaim and keep a latency low.
Signed-off-by: Rik van Riel<riel@redhat.com> Acked-by: Johannes Weiner <jweiner@redhat.com> Tested-by: Satoru Moriya <satoru.moriya@hds.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After commit v2.6.36-5896-gd065bd8 "mm: retry page fault when blocking on
disk transfer" we usually wait in page-faults without mmap_sem held, so
all swap-token logic was broken, because it based on using
rwsem_is_locked(&mm->mmap_sem) as sign of in progress page-faults.
Add an atomic counter of in progress page-faults for mm to the mm_struct
with swap-token.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rename mm_page_free_direct into mm_page_free and mm_pagevec_free into
mm_page_free_batched
Since v2.6.33-5426-gc475dab the kernel triggers mm_page_free_direct for
all freed pages, not only for directly freed. So, let's name it properly.
For pages freed via page-list we also trigger mm_page_free_batched event.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch adds helper free_hot_cold_page_list() to free list of 0-order
pages. It frees pages directly from list without temporary page-vector.
It also calls trace_mm_pagevec_free() to simulate pagevec_free()
behaviour.
vmscan: activate executable pages after first usage
Logic added in commit 8cab4754d24a0 ("vmscan: make mapped executable pages
the first class citizen") was noticeably weakened in commit 645747462435d84 ("vmscan: detect mapped file pages used only once").
Currently these pages can become "first class citizens" only after second
usage. After this patch page_check_references() will activate they after
first usage, and executable code gets yet better chance to stay in memory.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 645747462435 ("vmscan: detect mapped file pages used only once")
greatly decreases lifetime of single-used mapped file pages.
Unfortunately it also decreases life time of all shared mapped file pages.
Because after commit bf3f3bc5e7347 ("mm: don't mark_page_accessed
in fault path") page-fault handler does not mark page active or even
referenced.
Thus page_check_references() activates file page only if it was used twice
while it stays in inactive list, meanwhile it activates anon pages after
first access. Inactive list can be small enough, this way reclaimer can
accidentally throw away any widely used page if it wasn't used twice in
short period.
After this patch page_check_references() also activate file mapped page at
first inactive list scan if this page is already used multiple times via
several ptes.
I found this while trying to fix degragation in rhel6 (~2.6.32) from rhel5
(~2.6.18). There a complete mess with >100 web/mail/spam/ftp containers,
they share all their files but there a lot of anonymous pages: ~500mb
shared file mapped memory and 15-20Gb non-shared anonymous memory. In
this situation major-pagefaults are very costly, because all containers
share the same page. In my load kernel created a disproportionate
pressure on the file memory, compared with the anonymous, they equaled
only if I raise swappiness up to 150 =)
These patches actually wasn't helped a lot in my problem, but I saw
noticable (10-20 times) reduce in count and average time of
major-pagefault in file-mapped areas.
Actually both patches are fixes for commit v2.6.33-5448-g6457474, because
it was aimed at one scenario (singly used pages), but it breaks the logic
in other scenarios (shared and/or executable pages)
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Pekka Enberg <penberg@kernel.org> Acked-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Shaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Lin [Fri, 2 Dec 2011 03:09:35 +0000 (14:09 +1100)]
sbus: convert drivers/sbus/char/* to use module_platform_driver()
Convert the drivers in drivers/sbus/char/* to use the
module_platform_driver() macro which makes the code smaller and a bit
simpler.
Signed-off-by: Axel Lin <axel.lin@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stephen Boyd [Fri, 2 Dec 2011 03:09:17 +0000 (14:09 +1100)]
drivers/scsi/sg.c: convert to kstrtoul_from_user()
Instead of open coding this function use kstrtoul_from_user() directly.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Doug Gilbert <dgilbert@interlog.com> Cc: Douglas Gilbert <dougg@torque.net> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jesper Juhl [Fri, 2 Dec 2011 03:09:17 +0000 (14:09 +1100)]
drivers/scsi/aacraid/commctrl.c: fix mem leak in aac_send_raw_srb()
We leak in drivers/scsi/aacraid/commctrl.c::aac_send_raw_srb() :
We allocate memory:
...
struct user_sgmap* usg;
usg = kmalloc(actual_fibsize - sizeof(struct aac_srb)
+ sizeof(struct sgmap), GFP_KERNEL);
and then neglect to free it:
...
for (i = 0; i < usg->count; i++) {
u64 addr;
void* p;
if (usg->sg[i].count >
((dev->adapter_info.options &
AAC_OPT_NEW_COMM) ?
(dev->scsi_host_ptr->max_sectors << 9) :
65536)) {
rcode = -EINVAL;
goto cleanup;
... this 'goto' makes 'usg' go out of scope and leak the memory we
allocated.
Other exits properly kfree(usg), it's just here it is neglected.
Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Randy Dunlap [Fri, 2 Dec 2011 03:09:17 +0000 (14:09 +1100)]
drivers/scsi/megaraid.c: fix sparse warnings
Fix sparse warnings of right shift bigger than source value size:
drivers/scsi/megaraid.c:311:65: warning: right shift by bigger than source value
drivers/scsi/megaraid.c:313:65: warning: right shift by bigger than source value
drivers/scsi/megaraid.c:317:67: warning: right shift by bigger than source value
drivers/scsi/megaraid.c:319:67: warning: right shift by bigger than source value
Patch suggestion from email by Al Viro:
"Since both are claimed to be strings, I really suspect that this >> 8 is
misspelled >> 4 and they have a character followed by pair of two-digit
packed decimals in there..."
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Neela Syam Kolli <megaraidlinux@lsi.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
For headers that get exported to userland and make use of u32 style
type names, it is advised to include linux/types.h.
This fixes a headers_check warning.
Signed-off-by: Alexander Shishkin <virtuoso@slind.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Akinobu Mita [Fri, 2 Dec 2011 03:09:15 +0000 (14:09 +1100)]
ext4: use proper little-endian bitops
ext4_{set,clear}_bit() is defined as __test_and_{set,clear}_bit_le() for
ext4. Only two ext4_{set,clear}_bit() calls check the return value. The
rest of calls ignore the return value and they can be replaced with
__{set,clear}_bit_le().
This changes ext4_{set,clear}_bit() from __test_and_{set,clear}_bit_le()
to __{set,clear}_bit_le() and introduces ext4_test_and_{set,clear}_bit()
for the two places where old bit needs to be returned.
This ext4_{set,clear}_bit() change is considered safe, because if someone
uses these macros without noticing the change, new ext4_{set,clear}_bit
don't have return value and causes compiler errors where the return value
is used.
This also removes unused ext4_find_first_zero_bit().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christine Chan [Fri, 2 Dec 2011 03:09:15 +0000 (14:09 +1100)]
kernel/timer.c: use debugobjects to catch deletion of uninitialized timers
del_timer_sync() calls debug_object_assert_init() to assert that a timer
has been initialized before calling lock_timer_base(). lock_timer_base()
would spin forever on a NULL(uninit-ed) base. The check is added to
del_timer() to prevent silent failure, even though it would not get stuck
in an infinite loop.
[sboyd@codeaurora.org: remove WARN, intialize timer function] Signed-off-by: Christine Chan <cschan@codeaurora.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christine Chan [Fri, 2 Dec 2011 03:09:15 +0000 (14:09 +1100)]
debugobjects: extend to assert that an object is initialized
Calling del_timer_sync() on an uninitialized timer leads to a never ending
loop in lock_timer_base() that spins checking for a non-NULL timer base.
Add an assertion to debugobjects to catch usage of uninitialized objects
so that we can initialize timers in the del_timer_sync() path before it
calls lock_timer_base().
[sboyd@codeaurora.org: clarify commit message] Signed-off-by: Christine Chan <cschan@codeaurora.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stephen Boyd [Fri, 2 Dec 2011 03:09:14 +0000 (14:09 +1100)]
debugobjects: be smarter about static objects
Remove the WARN_ON() in timer_fixup_activate() and actually use the return
code from fixup to tell the debugobjects code to print a warning. This
provides better diagnostic information via a nice debugobjects warning
instead of a simple WARN_ON(1) in the timer code with no information as to
what is wrong. We also assign a dummy timer callback so that if the timer
is actually set to fire we don't oops.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Christine Chan <cschan@codeaurora.org> Cc: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Amerigo Wang <amwang@redhat.com>
ERROR: Macros with complex values should be enclosed in parenthesis
#87: FILE: include/linux/ipc_namespace.h:126:
+#define DFLT_MSGSIZEMAX 1024*1024
ERROR: Macros with complex values should be enclosed in parenthesis
#88: FILE: include/linux/ipc_namespace.h:127:
+#define HARD_MSGSIZEMAX 16*1024*1024
total: 2 errors, 0 warnings, 75 lines checked
./patches/ipc-mqueue-update-maximums-for-the-mqueue-subsystem.patch has style problems, please review.
If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Doug Ledford <dledford@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ipc/mqueue.c: In function 'mqueue_get_inode':
ipc/mqueue.c:154:4: error: implicit declaration of function 'vmalloc'
ipc/mqueue.c:154:19: warning: assignment makes pointer from integer without=
a cast
ipc/mqueue.c: In function 'mqueue_evict_inode':
ipc/mqueue.c:278:3: error: implicit declaration of function 'vfree'
Caused by commit 8a53f9442429 ("ipc/mqueue: update maximums for the
mqueue subsystem"). See Rule 1 in Documentation/SubmitChecklist.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Doug Ledford <dledford@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Doug Ledford [Fri, 2 Dec 2011 03:09:13 +0000 (14:09 +1100)]
ipc/mqueue: update maximums for the mqueue subsystem
Commit b231cca4381ee ("message queues: increase range limits") changed the
maximum size of a message in a message queue from INT_MAX to 8192*128.
Unfortunately, we had customers that relied on a size much larger than
8192*128 on their production systems. After reviewing POSIX, we found
that it is silent on the maximum message size. We did find a couple other
areas in which it was not silent. Fix up the mqueue maximums so that the
customer's system can continue to work, and document both the POSIX and
real world requirements in ipc_namespace.h so that we don't have this
issue crop back up.
Also, commit 9cf18e1dd74c ("ipc: HARD_MSGMAX should be higher not lower on
64bit") fiddled with HARD_MSGMAX without realizing that the number was
intentionally in place to limit the msg queue depth to one that was small
enough to kmalloc an array of pointers (hence why we divided 128k by
sizeof(long)). If we wish to meet POSIX requirements, we have no choice
but to change our allocation to a vmalloc instead (at least for the large
queue size case). With that, it's possible to increase our allowed
maximum to the POSIX requirements (or more if we choose).
Signed-off-by: Doug Ledford <dledford@redhat.com> Cc: Amerigo Wang <amwang@redhat.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Joe Korty <joe.korty@ccur.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Doug Ledford [Fri, 2 Dec 2011 03:09:13 +0000 (14:09 +1100)]
ipc/mqueue: enforce hard limits
In two places we don't enforce the hard limits for CAP_SYS_RESOURCE apps.
In preparation for making more reasonable hard limits, start enforcing
them even on CAP_SYS_RESOURCE.
Signed-off-by: Doug Ledford <dledford@redhat.com> Cc: Amerigo Wang <amwang@redhat.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Joe Korty <joe.korty@ccur.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Doug Ledford [Fri, 2 Dec 2011 03:09:13 +0000 (14:09 +1100)]
ipc/mqueue: switch back to using non-max values on create
Commit b231cca4381ee15e ("message queues: increase range limits") changed
how we create a queue that does not include an attr struct passed to open
so that it creates the queue with whatever the maximum values are.
However, if the admin has set the maximums to allow flexibility in
creating a queue (aka, both a large size and large queue are allowed, but
combined they create a queue too large for the RLIMIT_MSGQUEUE of the
user), then attempts to create a queue without an attr struct will fail.
Switch back to using acceptable defaults regardless of what the maximums
are.
Signed-off-by: Doug Ledford <dledford@redhat.com> Cc: Amerigo Wang <amwang@redhat.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Joe Korty <joe.korty@ccur.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Doug Ledford [Fri, 2 Dec 2011 03:09:12 +0000 (14:09 +1100)]
ipc/mqueue: cleanup definition names and locations
We had a customer come up with a problem while trying to upgrade from our
2.6.18 kernel to our 2.6.32 kernel. In diagnosing their problem, it was
determined that when commit b231cca4 ("message queues: increase range
limits") changed the msg size max from INT_MAX to 8192*128, that's what
broke their setup.
While fixing this problem, testing showed that if you increase the max
values of a msg queue, then attempt to create one without an attr struct
passed in to the open call, it could fail because it sets the queue size
to the max of both the msg size and queue size. If these are large
enough, they over run the default RLIMIT_MSGQUEUE. This change was also
introduced in the b231cca4 ("message queues: increase range limits")
commit.
We then found that the msg queue limits were not all being enforced on
CAP_SYS_RESOURCE apps.
Finally, we found that commit 9cf18e1d ("ipc: HARD_MSGMAX should be higher
not lower on 64bit") fiddled with HARD_MSGMAX without realizing that the
reason it was set to what it was, was to avoid trying to kmalloc a chunk
larger than 128K.
So this series of patches cleans up the various defines, takes us back to
having a larger HARD_MSGSIZEMAX, goes back to using a separate define for
the case where a user doesn't pass in an attr struct in case the maxes
have been raised too large for RLIMIT_MSGQUEUE, enforces the maximums on
CAP_SYS_RESOURCE apps, uses vmalloc instead of kmalloc when the msg
pointer array is too large, and documents all of this so it shouldn't
happen again.
This patch:
The various defines for minimums and maximums of the sysctl controllable
mqueue values are scattered amongst different files and named
inconsistently. Move them all into ipc_namespace.h and make them have
consistent names. Additionally, make the number of queues per namespace
also have a minimum and maximum and use the same sysctl function as the
other two settable variables.
Signed-off-by: Doug Ledford <dledford@redhat.com> Cc: Amerigo Wang <amwang@redhat.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Joe Korty <joe.korty@ccur.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
John Stultz [Fri, 2 Dec 2011 03:09:12 +0000 (14:09 +1100)]
merge_config.sh: fix bug in final check
Arnaud Lacombe pointed out the final checking that the requested configs
were included in the final .config was broken.
The example was that if you had a fragment that disabled
CONFIG_DECOMPRESS_GZIP applied to a normal defconfig, there would be no
final warning that CONFIG_DECOMPRESS_GZIP was acutally set in the final
.config.
This bug was introduced by me in v3 of the original patch, and the
following patch reverts the invalid change.
Signed-off-by: John Stultz <john.stultz@linaro.org> Reported-by: Arnaud Lacombe <lacombar@gmail.com> Cc: Darren Hart <dvhart@linux.intel.com> Cc: Michal Marek <mmarek@suse.cz> Cc: Arnaud Lacombe <lacombar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Darren Hart [Fri, 2 Dec 2011 03:09:11 +0000 (14:09 +1100)]
merge_config.sh: whitespace cleanup
Fix whitespace usage in the clean_up routine.
Signed-off-by: Darren Hart <dvhart@linux.intel.com> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Michal Marek <mmarek@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Darren Hart [Fri, 2 Dec 2011 03:09:11 +0000 (14:09 +1100)]
merge_config.sh: use signal names compatible with dash and bash
The SIGHUP SIGINT and SIGTERM names caused failures when running
merge_config.sh with the dash shell. Dropping the "SIG" component makes
the script work in both bash and dash.
Signed-off-by: Darren Hart <dvhart@linux.intel.com> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Michal Marek <mmarek@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
john stultz [Fri, 2 Dec 2011 03:09:08 +0000 (14:09 +1100)]
kconfig: add merge_config.sh script
After noticing almost every distro has their own method of managing config
fragments, I went looking at some best practices, and wanted to try to
consolidate some of the different approaches so this fairly simple
infrastructure can be shared (and new distros/build systems don't have to
implement yet another config fragment merge script).
This script is most influenced by the Windriver tools used in the Yocto
Project, reusing some portions found there.
This script merges multiple config fragments, warning on any overridden
values. It then sets any unspecified values to their default, then
finally checks to make sure no specified value was dropped due to
unsatisfied dependencies.
I'm sure this implementation won't work for everyone, and I expect it will
need to evolve to adapt for various use cases. But I think its a
reasonable starting point.
Cc: Sam Ravnborg <sam@ravnborg.org> Cc: Greg Thelen <gthelen@google.com> Cc: <tartler@cs.fau.de> Cc: Dmitry Fink <Dmitry.Fink@palm.com> Cc: Darren Hart <dvhart@linux.intel.com> Cc: Eric B Munson <ebmunson@us.ibm.com> Cc: Bruce Ashfield <Bruce.Ashfield@windriver.com> Cc: Michal Marek <mmarek@suse.cz> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
tick-sched: add specific do_timer_cpu value for nohz off mode
Show and modify the tick_do_timer_cpu via sysfs. This determines the cpu
on which global time (jiffies) updates occur. Modification can only be
done on systems with nohz mode turned off.
While not necessarily harmful, doing jiffies updates on an application cpu
does cause some extra overhead that HPC benchmarking people notice. They
prefer to have OS activity isolated to certain cpus. They like
reproducibility of results, and having jiffies updates bouncing around
introduces variability.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Garrett [Fri, 2 Dec 2011 03:08:46 +0000 (14:08 +1100)]
hrtimers: Special-case zero length sleeps
sleep(0) is a common construct used by applications that want to trigger
the scheduler. sched_yield() might make more sense, but only appeared in
POSIX.1-2001 and so plenty of example code still uses the sleep(0) form.
This wouldn't normally be a problem, but it means that event-driven
applications that are merely trying to avoid starving other processes may
actually end up sleeping due to having large timer_slack values. Special-
casing this seems reasonable.
Signed-off-by: Matthew Garrett <mjg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Witold Baryluk [Fri, 2 Dec 2011 03:08:46 +0000 (14:08 +1100)]
intel-iommu: Fix __init section missmatch of dmar_parse_rmrr_atsr_dev
dmar_parse_rmrr_atsr_dev() (drivers/iommu/dmar.c) is called from
dmar_dev_scope_init() (drivers/iommu/intel-iommu.c), but
dmar_dev_scope_init() is annotated with __init, when
dmar_parse_rmrr_atsr_dev() is not, causing full section missmatch
analsysis to abort compilation.
Fix problem by adding __init annotation to dmar_parse_rmrr_atsr_dev.
Signed-off-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Allen Kay <allen.m.kay@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mathias Krause [Fri, 2 Dec 2011 03:08:46 +0000 (14:08 +1100)]
arm, exec: remove redundant set_fs(USER_DS)
The address limit is already set in flush_old_exec() so this
set_fs(USER_DS) is redundant.
Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>