Most mobile phones have an ambient light sensor and adjust the backlight
brightness according to the measured lux. This means the backlight brightness
changes frequently via a simple sysfs write, and each change generates a uevent.
Usually there is no consumer for these backlight-change uevents, but each one
forks a udev worker thread and takes about 5ms. The main problem is that this
hurts other processes' activity, so remove the uevent.
Kay said:
"Uevents are for the major, low-frequent, global device state-changes,
not for carrying-out any sort of measurement data. Subsystems which
need that should use other facilities like poll()-able sysfs file or
any other subscription-based, client-tracking interface which does not
cause overhead if it isn't used. Uevents are not the right thing to
use here, and upstream udev should not paper-over broken kernel
subsystems."
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Acked-by: Jingoo Han <jg1.han@samsung.com> Cc: Henrique de Moraes Holschuh <ibm-acpi@hmh.eng.br> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joe Perches [Mon, 16 Dec 2013 23:45:25 +0000 (10:45 +1100)]
get_maintainer: add commit author information to --rolestats
get_maintainer currently uses "Signed-off-by" style lines to find
interested parties to send patches to when the MAINTAINERS file does not
have a specific section entry with a matching file pattern.
Add statistics for commit authors and lines added and deleted to the
information provided by --rolestats.
These statistics are also emitted whenever --rolestats and --git are
selected even when there is a specified maintainer.
This can have the effect of expanding the number of people that are shown
as possible "maintainers" of a particular file because "authors",
"added_lines", and "removed_lines" are also used as criterion for the
--max-maintainers option separate from the "commit_signers".
The first "--git-max-maintainers" values of each criterion
are emitted. Any "ties" are not shown.
For example: (forcedeth does not have a named maintainer)
Old output:
$ ./scripts/get_maintainer.pl -f drivers/net/ethernet/nvidia/forcedeth.c
"David S. Miller" <davem@davemloft.net> (commit_signer:8/10=80%)
Jiri Pirko <jiri@resnulli.us> (commit_signer:2/10=20%)
Patrick McHardy <kaber@trash.net> (commit_signer:2/10=20%)
Larry Finger <Larry.Finger@lwfinger.net> (commit_signer:1/10=10%)
Peter Zijlstra <peterz@infradead.org> (commit_signer:1/10=10%)
netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
linux-kernel@vger.kernel.org (open list)
New output:
$ ./scripts/get_maintainer.pl -f drivers/net/ethernet/nvidia/forcedeth.c
"David S. Miller" <davem@davemloft.net> (commit_signer:8/10=80%)
Jiri Pirko <jiri@resnulli.us> (commit_signer:2/10=20%,authored:2/10=20%,removed_lines:3/33=9%)
Patrick McHardy <kaber@trash.net> (commit_signer:2/10=20%,authored:2/10=20%,added_lines:12/95=13%,removed_lines:10/33=30%)
Larry Finger <Larry.Finger@lwfinger.net> (commit_signer:1/10=10%,authored:1/10=10%,added_lines:35/95=37%)
Peter Zijlstra <peterz@infradead.org> (commit_signer:1/10=10%)
"Peter Hüwe" <PeterHuewe@gmx.de> (authored:1/10=10%,removed_lines:15/33=45%)
Joe Perches <joe@perches.com> (authored:1/10=10%)
Neil Horman <nhorman@tuxdriver.com> (added_lines:40/95=42%)
Bill Pemberton <wfp5p@virginia.edu> (removed_lines:3/33=9%)
netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
linux-kernel@vger.kernel.org (open list)
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joe Perches [Mon, 16 Dec 2013 23:45:24 +0000 (10:45 +1100)]
printk/cache: Mark printk_once test variable __read_mostly
Add #include <linux/cache.h> to define __read_mostly.
Convert cache.h to use uapi/linux/kernel.h instead
of linux/kernel.h to avoid recursive #includes.
Convert the ALIGN macro to __ALIGN_KERNEL.
printk_once only sets the tested bool variable once, so mark it
__read_mostly.
Neaten the alignment so it matches the rest of the
pr_<level>_once #defines too.
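For reference, the pr_<level>_once machinery this touches boils down to a
static bool guarding a single printk; a rough sketch (not the exact kernel
macro) is:
#define printk_once(fmt, ...)                           \
({                                                      \
        /* written at most once, read on every call */  \
        static bool __print_once __read_mostly;         \
                                                        \
        if (!__print_once) {                            \
                __print_once = true;                    \
                printk(fmt, ##__VA_ARGS__);             \
        }                                               \
})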
Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Du, Changbin [Mon, 16 Dec 2013 23:45:24 +0000 (10:45 +1100)]
dynamic-debug-howto.txt: update since new wildcard support
Document the usage of the new wildcard support feature.
Signed-off-by: Du, Changbin <changbin.du@gmail.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Du, Changbin [Mon, 16 Dec 2013 23:45:24 +0000 (10:45 +1100)]
dynamic_debug: add wildcard support to filter files/functions/modules
Add support for the wildcards '*' (matches zero or more characters) and '?'
(matches one character) when querying debug flags.
Now we can enable debug messages using keywords, e.g.:
1. enable debug logs in all usb drivers
echo "file drivers/usb/* +p" > <debugfs>/dynamic_debug/control
2. enable debug logs for the usb xhci code
echo "file *xhci* +p" > <debugfs>/dynamic_debug/control
Signed-off-by: Du, Changbin <changbin.du@gmail.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Du, Changbin [Mon, 16 Dec 2013 23:45:23 +0000 (10:45 +1100)]
lib/parser.c: add match_wildcard function
The match_wildcard function is a simple implementation of a wildcard
matching algorithm. It supports only the two usual wildcards:
'*' - matches zero or more characters
'?' - matches one character
The algorithm is safe because it is non-recursive.
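For illustration, an iterative (non-recursive) '*'/'?' matcher can be written
with a single backtrack point; this is a sketch of the idea, not the actual
lib/parser.c code:
/* Sketch of a non-recursive wildcard matcher: '*' = any run, '?' = any char. */
static bool wildcard_match(const char *pattern, const char *str)
{
        const char *star = NULL;        /* last '*' seen in the pattern */
        const char *backtrack = NULL;   /* position in str to retry from */

        while (*str) {
                if (*pattern == '?' || *pattern == *str) {
                        pattern++;
                        str++;
                } else if (*pattern == '*') {
                        star = pattern++;       /* remember the '*' ... */
                        backtrack = str;        /* ... and where we were */
                } else if (star) {
                        pattern = star + 1;     /* let '*' swallow one more char */
                        str = ++backtrack;
                } else {
                        return false;
                }
        }
        while (*pattern == '*')         /* trailing '*'s match the empty tail */
                pattern++;
        return *pattern == '\0';
}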
Signed-off-by: Du, Changbin <changbin.du@gmail.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gustavo Padovan [Mon, 16 Dec 2013 23:45:23 +0000 (10:45 +1100)]
drivers/misc/ti-st/st_core.c: fix NULL dereference on protocol type check
If the type we receive is greater than ST_MAX_CHANNELS, we can't rely on the
type as a vector index, since using it as an index would access unknown
memory.
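The shape of the fix is a simple bounds check in the receive path before the
type is used as an index; a hypothetical sketch (identifier names assumed)
from inside the frame-processing loop:
/* Hypothetical guard; drop the frame instead of indexing out of bounds. */
if (unlikely(type >= ST_MAX_CHANNELS)) {
        pr_err("%s: protocol type %u out of range, dropping frame\n",
               __func__, type);
        continue;       /* resume with the next byte of the stream */
}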
Mark Salter [Mon, 16 Dec 2013 23:45:23 +0000 (10:45 +1100)]
um: use generic fixmap.h
Signed-off-by: Mark Salter <msalter@redhat.com> Acked-by: Richard Weinberger <richard@nod.at> Cc: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mark Salter [Mon, 16 Dec 2013 23:45:22 +0000 (10:45 +1100)]
powerpc: use generic fixmap.h
Signed-off-by: Mark Salter <msalter@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mark Salter [Mon, 16 Dec 2013 23:45:22 +0000 (10:45 +1100)]
metag: use generic fixmap.h
Signed-off-by: Mark Salter <msalter@redhat.com> Acked-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mark Salter [Mon, 16 Dec 2013 23:45:21 +0000 (10:45 +1100)]
arm: use generic fixmap.h
ARM is different from other architectures in that fixmap pages are indexed
with a positive offset from FIXADDR_START. Other architectures index with
a negative offset from FIXADDR_TOP. In order to use the generic fixmap.h
definitions, this patch redefines FIXADDR_TOP to be inclusive of the
usable range. That is, FIXADDR_TOP is the virtual address of the topmost
fixed page. The newly defined FIXADDR_END is the first virtual address
past the fixed mappings.
Signed-off-by: Mark Salter <msalter@redhat.com> Cc: Russell King <linux@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mark Salter [Mon, 16 Dec 2013 23:45:21 +0000 (10:45 +1100)]
add generic fixmap.h
Many architectures provide an asm/fixmap.h which defines support for
compile-time 'special' virtual mappings which need to be made before
paging_init() has run. This support is also used for early ioremap on
x86. Much of this support is identical across the architectures. This
patch consolidates all of the common bits into asm-generic/fixmap.h which
is intended to be included from arch/*/include/asm/fixmap.h.
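The consolidated header essentially captures the usual index-to-address
arithmetic, with indices counting down from FIXADDR_TOP; roughly (a sketch of
the common definitions):
/* Shared fixmap arithmetic, per the usual asm-generic convention. */
#define __fix_to_virt(x)        (FIXADDR_TOP - ((x) << PAGE_SHIFT))
#define __virt_to_fix(x)        ((FIXADDR_TOP - ((x) & PAGE_MASK)) >> PAGE_SHIFT)

static __always_inline unsigned long fix_to_virt(const unsigned int idx)
{
        BUILD_BUG_ON(idx >= __end_of_fixed_addresses);
        return __fix_to_virt(idx);
}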
Signed-off-by: Mark Salter <msalter@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: James Hogan <james.hogan@imgtec.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jonas Bonn <jonas.bonn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fabian Frederick [Mon, 16 Dec 2013 23:45:20 +0000 (10:45 +1100)]
drivers/block/Kconfig: update RAM block device module name
The RAM block device support module was renamed to brd.ko some years ago,
with an "rd" alias to match the previous module name. This patch updates its
Kconfig definition accordingly.
Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now that all 64-bit architectures have been converted to int-ll64.h, we can
remove int-l64.h in kernelspace.
For backwards compatibility, alpha, ia64, mips64, and powerpc64 still use
int-l64.h in userspace.
This is the (reworked for UAPI) non-documentation part of the more than
two-year-old "asm/types.h: All architectures use int-ll64.h in kernelspace"
(https://lkml.org/lkml/2011/8/13/104)
Since <asm/types.h> (from include/uapi/asm-generic/types.h) is used for
both kernel and user space, include/asm-generic/int-ll64.h cannot just
become include/asm-generic/types.h, as Arnd suggested.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kernel: use lockless list for smp_call_function_single
Make smp_call_function_single and friends more efficient by using
a lockless list.
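As a rough illustration of the idea (names approximate, ordering and csd
reuse details elided; the csd struct is assumed to carry an llist_node
member), the per-cpu call queue becomes an llist that the sender pushes to
and the IPI handler drains without taking a lock:
/* Sender: push the csd; llist_add() returns true if the list was empty,
 * which is exactly when an IPI actually needs to be sent. */
if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
        arch_send_call_function_single_ipi(cpu);

/* Receiver (IPI handler): grab the whole pending list in one atomic op. */
entry = llist_del_all(&__get_cpu_var(call_single_queue));
llist_for_each_entry(csd, entry, llist)
        csd->func(csd->info);   /* real code must be careful about csd reuse */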
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Mon, 16 Dec 2013 23:45:19 +0000 (10:45 +1100)]
swap: add a simple detector for inappropriate swapin readahead
This patch improves the swap readahead algorithm. It's from Hugh, and I
changed it slightly.
Hugh's original changelog:
swapin readahead does a blind readahead, whether or not the swapin
is sequential. This may be ok on harddisk, because large reads have
relatively small costs, and if the readahead pages are unneeded they
can be reclaimed easily - though, what if their allocation forced
reclaim of useful pages? But on SSD devices large reads are more
expensive than small ones: if the readahead pages are unneeded,
reading them in caused significant overhead.
This patch adds very simplistic random read detection. Stealing
the PageReadahead technique from Konstantin Khlebnikov's patch,
avoiding the vma/anon_vma sophistications of Shaohua Li's patch,
swapin_nr_pages() simply looks at readahead's current success
rate, and narrows or widens its readahead window accordingly.
There is little science to its heuristic: it's about as stupid
as can be whilst remaining effective.
The table below shows elapsed times (in centiseconds) when running
a single repetitive swapping load across a 1000MB mapping in 900MB
ram with 1GB swap (the harddisk tests had taken painfully too long
when I used mem=500M, but SSD shows similar results for that).
Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes
his Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1
patch which Shaohua showed to be defective; HughNew this Nov 14
patch, with page_cluster as usual at default of 3 (8-page reads);
HughPC4 this same patch with page_cluster 4 (16-page reads);
HughPC0 with page_cluster 0 (1-page reads: no readahead).
HDD for swapping to harddisk, SSD for swapping to VertexII SSD.
Seq for sequential access to the mapping, cycling five times around;
Rand for the same number of random touches. Anon for a MAP_PRIVATE
anon mapping; Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.
One weakness of Shaohua's vma/anon_vma approach was that it did
not optimize Shmem: seen below. Konstantin's approach was perhaps
mistuned, 50% slower on Seq: did not compete and is not shown below.
These tests are, of course, two extremes of a very simple case:
under heavier mixed loads I've not yet observed any consistent
improvement or degradation, and wider testing would be welcome.
Shaohua Li:
Testing shows vanilla is slightly better than Hugh's patch in the sequential
workload. I observed that with Hugh's patch the readahead size sometimes
shrinks too fast (from 8 to 1 immediately) in the sequential workload if there
is no hit, and in that case continuing to do readahead is actually beneficial.
I didn't prepare a sophisticated algorithm for the sequential workload because
so far we can't guarantee that sequentially accessed pages are swapped out
sequentially. So I slightly changed Hugh's heuristic - don't shrink the
readahead size too fast.
Here is my test result (unit second, 3 runs average):
Vanilla Hugh New
Seq 356 370 360
Random 4525 2447 2444
Attached graph is the swapin/swapout throughput I collected with 'vmstat 2'.
The first part is running a random workload (till around 1200 of the x-axis)
and the second part is running a sequential workload. swapin and swapout
throughput are almost identical in steady state in both workloads. This is
the expected behavior, while in vanilla, swapin is much bigger than swapout,
especially in the random workload (because of wrong readahead).
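A toy model of the resulting heuristic, capturing both Hugh's idea (track
recent readahead hits) and Shaohua's tweak (don't collapse the window at the
first miss); this is an illustration, not the code in the patch:
/* Toy sketch: pick the next readahead window from the recent hit count. */
static unsigned int next_swapin_window(unsigned int hits,
                                       unsigned int prev_win,
                                       unsigned int max_win)
{
        unsigned int win = prev_win;

        if (hits >= prev_win / 2)
                win = prev_win * 2;     /* readahead is paying off: widen */
        else if (hits == 0)
                win = prev_win / 2;     /* mostly misses: narrow, but gently */

        if (win > max_win)
                win = max_win;          /* max_win would be 1 << page_cluster */
        if (win < 1)
                win = 1;                /* never below a single page */
        return win;
}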
Original patches by: Shaohua Li and Konstantin Khlebnikov.
Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
swap: fix setting PAGE_SIZE blocksize during swapoff/swapon race
Fix a race between swapoff and swapon that results in the blocksize of block
devices being set to PAGE_SIZE during swapoff.
Swapon modifies swap_info->old_block_size before acquiring swapon_mutex: it
reads the block_size of the bdev, stores it in swap_info->old_block_size and
sets the new block_size to PAGE_SIZE. On the other hand, swapoff sets the
device's block_size back to old_block_size after releasing swapon_mutex.
This patch locks the swapon_mutex much earlier during swapon. It also
releases the swapon_mutex later during swapoff.
The effect of the race can be triggered by the following scenario:
- One block swap device with block size of 512
- thread 1: Swapon is called, swap is activated,
p->old_block_size = block_size(p->bdev); /512/
block_size(p->bdev) = PAGE_SIZE;
Thread ends.
- thread 2: Swapoff is called and proceeds to just after releasing the
swapon_mutex. The swap is now fully disabled except for setting the
block size back to the old value. p->bdev->block_size is still equal to
PAGE_SIZE.
- thread 3: A new swapon is called. This swap is disabled, so without
acquiring the swapon_mutex:
- p->old_block_size = block_size(p->bdev); /PAGE_SIZE (!!!)/
- block_size(p->bdev) = PAGE_SIZE;
Swap is activated and thread ends.
- thread 2: resumes work and sets blocksize to old value:
- set_blocksize(bdev, p->old_block_size)
But now the p->old_block_size is equal to PAGE_SIZE.
The patch swap-fix-set_blocksize-race-during-swapon-swapoff does not fix
this particular issue. It reduces the possibility of races as the swapon
must overwrite p->old_block_size before acquiring swapon_mutex in swapoff.
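With the fix, the swapon side takes swapon_mutex before it samples and
replaces the device's block size, and the swapoff side restores it before
dropping the mutex, so neither can interleave with the other; schematically
(a sketch, details elided):
/* swapon, after the fix (sketch): */
mutex_lock(&swapon_mutex);
p->old_block_size = block_size(p->bdev);   /* still the device's real size */
set_blocksize(p->bdev, PAGE_SIZE);
/* ... activate the swap area ... */
mutex_unlock(&swapon_mutex);

/* swapoff, after the fix (sketch): restore under the same mutex. */
mutex_lock(&swapon_mutex);
/* ... tear down the swap area ... */
set_blocksize(p->bdev, p->old_block_size);
mutex_unlock(&swapon_mutex);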
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Cc: Weijie Yang <weijie.yang.kh@gmail.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Shaohua Li <shli@fusionio.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Streetman [Mon, 16 Dec 2013 23:45:19 +0000 (10:45 +1100)]
mm/zswap.c: change params from hidden to ro
The "compressor" and "enabled" params are currently hidden, this changes
them to read-only, so userspace can tell if zswap is enabled or not and
see what compressor is in use.
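In module_param terms the change is just the permission bits, something like
(variable names assumed):
/* Before: perm 0 hides the params; after: 0444 makes them world-readable. */
module_param_named(enabled,    zswap_enabled,    bool,  0444);
module_param_named(compressor, zswap_compressor, charp, 0444);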
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Cc: Vladimir Murzin <murzin.v@gmail.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Weijie Yang <weijie.yang@samsung.com> Acked-by: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Documentation/vm/locking is a blast from the past. In the entire git
history, it has had precisely three modifications. Two of those look to
be pure renames, and the third was from 2005.
The doc contains such gems as:
> The page_table_lock is grabbed while holding the
> kernel_lock spinning monitor.
> Page stealers hold kernel_lock to protect against a bunch of
> races.
Or this which talks about mmap_sem:
> 4. The exception to this rule is expand_stack, which just
> takes the read lock and the page_table_lock, this is ok
> because it doesn't really modify fields anybody relies on.
expand_stack() doesn't take any locks any more directly, and the
mmap_sem acquisition was long ago moved up in to the page fault
code itself.
It could be argued that we need to rewrite this, but it is
dangerous to leave it as-is. It will confuse more people than it
helps.
Signed-off-by: Dave Hansen <dave.hansen@intel.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Mon, 16 Dec 2013 23:45:18 +0000 (10:45 +1100)]
mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages
Parts of putback_lru_pages() and putback_movable_pages() are duplicated, so
it can be confusing which one we should use. We can remove
putback_lru_pages() since it is not really needed now. This makes the code
easier to understand and maintain.
The comment on putback_movable_pages() is also stale now, so fix it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Mon, 16 Dec 2013 23:45:18 +0000 (10:45 +1100)]
mm/migrate: correct failure handling if !hugepage_migration_support()
We should remove the page from the list if we fail with ENOSYS, since
migrate_pages() considers error cases other than -ENOMEM and -EAGAIN as
permanent failures and assumes that the page has been removed from the
list. Without this patch, we could overcount the number of failures.
In addition, we should put back the new hugepage if
!hugepage_migration_support(). If not, we would leak hugepage memory.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Mon, 16 Dec 2013 23:45:17 +0000 (10:45 +1100)]
mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure
__GFP_NOFAIL may return NULL when coupled with GFP_NOWAIT or GFP_ATOMIC.
Luckily, nothing currently does such craziness. So instead of causing
such allocations to loop (potentially forever), we maintain the current
behavior and also warn about the new users of the deprecated flag.
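The check itself is a one-liner in the allocator slowpath; roughly (using the
flag names of that era, a sketch rather than the exact hunk):
/* A __GFP_NOFAIL caller that is not allowed to sleep cannot be honoured. */
WARN_ON_ONCE((gfp_mask & __GFP_NOFAIL) && !(gfp_mask & __GFP_WAIT));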
Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Mon, 16 Dec 2013 23:45:17 +0000 (10:45 +1100)]
mm: compaction: reset scanner positions immediately when they meet
Compaction used to start its migrate and free page scanners at the zone's
lowest and highest pfn, respectively. Later, caching was introduced to
remember the scanners' progress across compaction attempts so that
pageblocks are not re-scanned uselessly. Additionally, pageblocks where
isolation failed are marked to be quickly skipped when encountered again
in future compactions.
Currently, both the reset of the cached pfn's and the clearing of the
pageblock skip information for a zone are done in __reset_isolation_suitable(). This
function gets called when:
- compaction is restarting after being deferred
- compact_blockskip_flush flag is set in compact_finished() when the scanners
meet (and not again cleared when direct compaction succeeds in allocation)
and kswapd acts upon this flag before going to sleep
This behavior is suboptimal for several reasons:
- when direct sync compaction is called after async compaction fails (in the
allocation slowpath), it will effectively do nothing, unless kswapd
happens to process the compact_blockskip_flush flag meanwhile. This is racy
and goes against the purpose of sync compaction to more thoroughly retry
the compaction of a zone where async compaction has failed.
The restart-after-deferring path cannot help here as deferring happens only
after the sync compaction fails. It is also done only for the preferred
zone, while the compaction might be done for a fallback zone.
- the mechanism of marking pageblock to be skipped has little value since the
cached pfn's are reset only together with the pageblock skip flags. This
effectively limits pageblock skip usage to parallel compactions.
This patch changes compact_finished() so that cached pfn's are reset
immediately when the scanners meet. Clearing pageblock skip flags is
unchanged, as well as the other situations where cached pfn's are reset.
This allows the sync-after-async compaction to retry pageblocks not marked
as skipped, such as the !MIGRATE_MOVABLE blocks that async compaction now
skips without marking them.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Mon, 16 Dec 2013 23:45:17 +0000 (10:45 +1100)]
mm: compaction: do not mark unmovable pageblocks as skipped in async compaction
Compaction temporarily marks pageblocks where it fails to isolate pages as
to-be-skipped in further compactions, in order to improve efficiency. One
of the reasons to fail isolating pages is that isolation is not attempted
in pageblocks that are not of MIGRATE_MOVABLE (or CMA) type.
The problem is that blocks skipped due to not being MIGRATE_MOVABLE in
async compaction become skipped due to the temporary mark also in future
sync compaction. Moreover, this may follow quite soon during
__alloc_page_slowpath, without much time for kswapd to clear the pageblock
skip marks. This goes against the idea that sync compaction should try to
scan these blocks more thoroughly than the async compaction.
The fix is to ensure in async compaction that these !MIGRATE_MOVABLE
blocks are not marked to be skipped. Note this should not affect
performance or locking impact of further async compactions, as skipping a
block due to being !MIGRATE_MOVABLE is done soon after skipping a block
marked to be skipped, both without locking.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Mon, 16 Dec 2013 23:45:17 +0000 (10:45 +1100)]
mm: compaction: detect when scanners meet in isolate_freepages
Compaction of a zone is finished when the migrate scanner (which begins at
the zone's lowest pfn) meets the free page scanner (which begins at the
zone's highest pfn). This is detected in compact_zone() and in the case
of direct compaction, the compact_blockskip_flush flag is set so that
kswapd later resets the cached scanner pfn's, and a new compaction may
again start at the zone's borders.
The meeting of the scanners can happen during either scanner's activity.
However, it may currently fail to be detected when it occurs in the free
page scanner, due to two problems. First, isolate_freepages() keeps
free_pfn at the highest block where it isolated pages from, for the
purposes of not missing the pages that are returned back to allocator when
migration fails. Second, failing to isolate enough free pages due to
scanners meeting results in -ENOMEM being returned by migrate_pages(),
which makes compact_zone() bail out immediately without calling
compact_finished() that would detect scanners meeting.
This failure to detect scanners meeting might result in repeated attempts
at compaction of a zone that keep starting from the cached pfn's close to
the meeting point, and quickly failing through the -ENOMEM path, without
the cached pfns being reset, over and over. This has been observed
(through additional tracepoints) in the third phase of the mmtests
stress-highalloc benchmark, where the allocator runs on an otherwise idle
system. The problem was observed in the DMA32 zone, which was used as a
fallback to the preferred Normal zone, but on the 4GB system it was
actually the largest zone. The problem is even amplified for such
fallback zone - the deferred compaction logic, which could (after being
fixed by a previous patch) reset the cached scanner pfn's, is only applied
to the preferred zone and not for the fallbacks.
The problem in the third phase of the benchmark was further amplified by
commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") which
resulted in a non-deterministic regression of the allocation success rate
from ~85% to ~65%. This occurs in about half of benchmark runs, making
bisection problematic. It is unlikely that the commit itself is buggy,
but it should put more pressure on the DMA32 zone during phases 1 and 2,
which may leave it more fragmented in phase 3 and expose the bugs that
this patch fixes.
The fix is to make scanners that meet in isolate_freepages() stay that way,
and to check in compact_zone() for scanners meeting when migrate_pages()
returns -ENOMEM. The result is that compact_finished() also detects
scanners meeting and sets the compact_blockskip_flush flag to make kswapd
reset the scanner pfn's.
The results in stress-highalloc benchmark show that the "regression" by
commit 81c0a2bb in phase 3 no longer occurs, and phase 1 and 2 allocation
success rates are also significantly improved.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Mon, 16 Dec 2013 23:45:16 +0000 (10:45 +1100)]
mm: compaction: reset cached scanner pfn's before reading them
Compaction caches pfn's for its migrate and free scanners to avoid
scanning the whole zone each time. In compact_zone(), the cached values
are read to set up initial values for the scanners. There are several
situations when these cached pfn's are reset to the first and last pfn of
the zone, respectively. One of these situations is when a compaction has
been deferred for a zone and is now being restarted during a direct
compaction, which is also done in compact_zone().
However, compact_zone() currently reads the cached pfn's *before*
resetting them. This means the reset doesn't affect the compaction that
performs it, and with good chance also subsequent compactions, as
update_pageblock_skip() is likely to be called and update the cached pfn's
to those being processed. Another chance for a successful reset is when a
direct compaction detects that migration and free scanners meet (which has
its own problems addressed by another patch) and sets
update_pageblock_skip flag which kswapd uses to do the reset before it
goes to sleep.
This is clearly a bug that results in non-deterministic behavior, so this
patch moves the cached pfn reset to be performed *before* the values are
read.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Mon, 16 Dec 2013 23:45:16 +0000 (10:45 +1100)]
mm: compaction: encapsulate defer reset logic
Currently there are several functions to manipulate the deferred
compaction state variables. The remaining case where the variables are
touched directly is when a successful allocation occurs in direct
compaction, or is expected to be successful in the future by kswapd.
Here, the lowest order that is expected to fail is updated, and in the
case of successful allocation, the deferred status and counter are reset
completely.
Create a new function compaction_defer_reset() to encapsulate this
functionality and make it easier to understand the code. No functional
change.
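The new helper might look roughly like the following (a sketch based on the
existing per-zone defer fields, not necessarily the exact patch):
static inline void compaction_defer_reset(struct zone *zone, int order,
                                          bool alloc_success)
{
        if (alloc_success) {
                /* allocation succeeded: forget the deferral state entirely */
                zone->compact_considered = 0;
                zone->compact_defer_shift = 0;
        }
        if (order >= zone->compact_order_failed)
                zone->compact_order_failed = order + 1;
}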
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Mon, 16 Dec 2013 23:45:16 +0000 (10:45 +1100)]
mm: compaction: trace compaction begin and end
The broad goal of the series is to improve allocation success rates for
huge pages through memory compaction, while trying not to increase the
compaction overhead. The original objective was to reintroduce capturing
of high-order pages freed by the compaction, before they are split by
concurrent activity. However, several bugs and opportunities for simple
improvements were found in the current implementation, mostly through
extra tracepoints (which are however too ugly for now to be considered for
sending).
The patches mostly deal with two mechanisms that reduce compaction
overhead, which is caching the progress of migrate and free scanners, and
marking pageblocks where isolation failed to be skipped during further
scans.
Patch 1 (from mgorman) adds tracepoints that allow calculating the time spent
in compaction and potentially debugging scanner pfn values.
Patch 2 encapsulates some of the functionality for handling deferred
compaction for better maintainability, without a functional change.
Patch 3 fixes a bug where cached scanner pfn's are sometimes reset only after
they have been read to initialize a compaction run.
Patch 4 fixes a bug where scanners meeting is sometimes not properly detected
and can lead to multiple compaction attempts quitting early without
doing any work.
Patch 5 improves the chances of sync compaction to process pageblocks that
async compaction has skipped due to being !MIGRATE_MOVABLE.
Patch 6 improves the chances of sync direct compaction to actually do anything
when called after async compaction fails during allocation slowpath.
The impact of the patches was validated using mmtests's stress-highalloc
benchmark on an x86_64 machine with 4GB of memory.
Due to instability of the results (mostly related to the bugs fixed by
patches 2 and 3), 10 iterations were performed, taking min,mean,max values
for success rates and mean values for time and vmstat-based metrics.
First, the default GFP_HIGHUSER_MOVABLE allocations were tested with the
patches stacked on top of v3.13-rc2. Patch 2 is OK to serve as baseline
due to no functional changes in 1 and 2. Comments below.
- The "Success 3" line is allocation success rate with system idle
(phases 1 and 2 are with background interference). I used to get stable
values around 85% with vanilla 3.11. The lower min and mean values came
with 3.12. This was bisected to commit 81c0a2bb ("mm: page_alloc: fair
zone allocator policy") As explained in comment for patch 3, I don't
think the commit is wrong, but that it makes the effect of compaction
bugs worse. From patch 3 onwards, the results are OK and match the 3.11
results.
- Patch 4 also clearly helps phases 1 and 2, and exceeds any results
I've seen with 3.11 (I didn't measure it that thoroughly then, but it
was never above 40%).
- Compaction cost and number of scanned pages is higher, especially due
to patch 4. However, keep in mind that patches 3 and 4 fix existing
bugs in the current design of compaction overhead mitigation, they do
not change it. If overhead is found unacceptable, then it should be
decreased differently (and consistently, not due to random conditions)
than the current implementation does. In contrast, patches 5 and 6
(which are not strictly bug fixes) do not increase the overhead (but
also not success rates). This might be a limitation of the
stress-highalloc benchmark as it's quite uniform.
Another set of results is from configuring stress-highalloc to allocate
with similar flags as THP uses:
(GFP_HIGHUSER_MOVABLE|__GFP_NOMEMALLOC|__GFP_NORETRY|__GFP_NO_KSWAPD)
There are some differences from the previous results for THP-like allocations:
- Here, the bad result for unpatched kernel in phase 3 is much more
consistent to be between 65-70% and not related to the "regression" in
3.12. Still there is the improvement from patch 4 onwards, which brings
it on par with simple GFP_HIGHUSER_MOVABLE allocations.
- Compaction costs have increased, but nowhere near as much as the
non-THP case. Again, the patches should be worth the gained
determinism.
- Patches 5 and 6 somewhat increase the number of migrate-scanned pages.
This is most likely due to __GFP_NO_KSWAPD flag, which means the cached
pfn's and pageblock skip bits are not reset by kswapd that often (at
least in phase 3 where no concurrent activity would wake up kswapd) and
the patches thus help the sync-after-async compaction. It doesn't
however show that the sync compaction would help so much with success
rates, which can be again seen as a limitation of the benchmark
scenario.
This patch (of 6):
Add two tracepoints for compaction begin and end of a zone. Using this it
is possible to calculate how much time a workload is spending within
compaction and potentially debug problems related to cached pfns for
scanning. In combination with the direct reclaim and slab trace points it
should be possible to estimate most allocation-related overhead for a
workload.
Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Mon, 16 Dec 2013 23:45:16 +0000 (10:45 +1100)]
memcg, oom: lock mem_cgroup_print_oom_info
mem_cgroup_print_oom_info uses a static buffer (memcg_name) to store the
name of the cgroup. This is not safe as pointed out by David Rientjes
because memcg oom is locked only for its hierarchy and nothing prevents
another parallel hierarchy from triggering oom as well and overwriting the
buffer that is already in use.
This patch introduces oom_info_lock hidden inside
mem_cgroup_print_oom_info which is held throughout the function. It makes
access to memcg_name safe and, as a bonus, it also prevents parallel memcg
ooms from interleaving their statistics, which would otherwise make the
printed data hard to analyze.
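Schematically, the function gains a dedicated static lock held for its whole
body (a sketch; the actual printing is elided):
static DEFINE_SPINLOCK(oom_info_lock);

void mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
        static char memcg_name[PATH_MAX];

        spin_lock(&oom_info_lock);
        /* ... format memcg_name and dump the per-memcg limits/statistics ... */
        spin_unlock(&oom_info_lock);
}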
Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Mon, 16 Dec 2013 23:45:15 +0000 (10:45 +1100)]
sched: add tracepoints related to NUMA task migration
This patch adds three tracepoints
o trace_sched_move_numa when a task is moved to a node
o trace_sched_swap_numa when a task is swapped with another task
o trace_sched_stick_numa when a numa-related migration fails
The tracepoints allow the NUMA scheduler activity to be monitored and the
following high-level metrics can be calculated
o NUMA migrated stuck nr trace_sched_stick_numa
o NUMA migrated idle nr trace_sched_move_numa
o NUMA migrated swapped nr trace_sched_swap_numa
o NUMA local swapped trace_sched_swap_numa src_nid == dst_nid (should never happen)
o NUMA remote swapped trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped)
o NUMA group swapped trace_sched_swap_numa src_ngid == dst_ngid
Maybe a small number of these are acceptable
but a high number would be a major surprise.
It would be even worse if bounces are frequent.
o NUMA avg task migs. Average number of migrations for tasks
o NUMA stddev task mig Self-explanatory
o NUMA max task migs. Maximum number of migrations for a single task
In general the intent of the tracepoints is to help diagnose problems
where automatic NUMA balancing appears to be doing an excessive amount of
useless work.
Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Mon, 16 Dec 2013 23:45:15 +0000 (10:45 +1100)]
mm: numa: do not automatically migrate KSM pages
KSM pages can be shared between tasks that are not necessarily related to
each other from a NUMA perspective. This patch causes those pages to be
ignored by automatic NUMA balancing so they do not migrate and do not
cause unrelated tasks to be grouped together.
Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Mon, 16 Dec 2013 23:45:15 +0000 (10:45 +1100)]
mm: numa: trace tasks that fail migration due to rate limiting
A low local/remote numa hinting fault ratio is potentially explained by
failed migrations. This patch adds a tracepoint that fires when migration
fails due to migration rate limitation.
Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Mon, 16 Dec 2013 23:45:14 +0000 (10:45 +1100)]
mm: numa: limit scope of lock for NUMA migrate rate limiting
NUMA migrate rate limiting protects a migration counter and window using a
lock but in some cases this can be a contended lock. It is not critical
that the number of pages be perfect, lost updates are acceptable. Reduce
the importance of this lock.
Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Alex Thorlton <athorlton@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Mon, 16 Dec 2013 23:45:14 +0000 (10:45 +1100)]
mm, page_alloc: allow __GFP_NOFAIL to allocate below watermarks after reclaim
If direct reclaim has failed to free memory, __GFP_NOFAIL allocations can
potentially loop forever in the page allocator. In this case, it's better
to give them the ability to access memory below the watermarks, so that they
may allocate with the same privilege given to GFP_ATOMIC allocations.
We're careful to ensure this is only done after direct reclaim has had the
chance to free memory, however.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Oleg Nesterov [Mon, 16 Dec 2013 23:45:13 +0000 (10:45 +1100)]
oom_kill: add rcu_read_lock() into find_lock_task_mm()
find_lock_task_mm() expects it is called under rcu or tasklist lock, but
it seems that at least oom_unkillable_task()->task_in_mem_cgroup() and
mem_cgroup_out_of_memory()->oom_badness() can call it lockless.
Perhaps we could fix the callers, but this patch simply adds an rcu lock to
find_lock_task_mm(). This also allows us to simplify one of its callers,
oom_kill_process(), a bit.
At least out_of_memory() calls has_intersects_mems_allowed() without even
rcu_read_lock(), which is obviously buggy.
Add the necessary rcu_read_lock(). This means that we can not simply
return from the loop, we need "bool ret" and "break".
While at it, swap the names of task_struct's (the argument and the local).
This cleans up the code a little bit and avoids the unnecessary
initialization.
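The rough shape of find_lock_task_mm() after the change (a sketch, with the
rename mentioned above glossed over):
struct task_struct *find_lock_task_mm(struct task_struct *p)
{
        struct task_struct *t;

        rcu_read_lock();
        t = p;
        do {
                task_lock(t);
                if (likely(t->mm))
                        goto found;     /* returned locked, as before */
                task_unlock(t);
        } while_each_thread(p, t);
        t = NULL;
found:
        rcu_read_unlock();
        return t;
}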
Oleg Nesterov [Mon, 16 Dec 2013 23:45:13 +0000 (10:45 +1100)]
introduce for_each_thread() to replace the buggy while_each_thread()
while_each_thread() and next_thread() should die; almost every lockless
usage is wrong.
1. Unless g == current, the lockless while_each_thread() is not safe.
while_each_thread(g, t) can loop forever if g exits, next_thread()
can't reach the unhashed thread in this case. Note that this can
happen even if g is the group leader, it can exec.
2. Even if while_each_thread() itself was correct, people often use
it wrongly.
It was never safe to just take rcu_read_lock() and loop unless
you verify that pid_alive(g) == T, even the first next_thread()
can point to the already freed/reused memory.
This patch adds signal_struct->thread_head and task->thread_node to create
the normal rcu-safe list with the stable head. The new for_each_thread(g,
t) helper is always safe under rcu_read_lock() as long as this task_struct
can't go away.
Note: of course it is ugly to have both task_struct->thread_node and the
old task_struct->thread_group; we will kill the latter later, after we change
the users of while_each_thread() to use for_each_thread().
Perhaps we can kill it even before we convert all users, we can
reimplement next_thread(t) using the new thread_head/thread_node. But we
can't do this right now because this will lead to subtle behavioural
changes. For example, do/while_each_thread() always sees at least one
task, while for_each_thread() can do nothing if the whole thread group has
died. Or thread_group_empty(): currently its semantics are not clear
unless thread_group_leader(p), and we need to audit the callers before we
can change it.
So this patch adds the new interface which has to coexist with the old one
for some time, hopefully the next changes will be more or less
straightforward and the old one will go away soon.
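The new iterator is essentially an RCU list walk over the stable head;
roughly (a sketch, with do_something() a placeholder):
#define __for_each_thread(signal, t) \
        list_for_each_entry_rcu(t, &(signal)->thread_head, thread_node)

#define for_each_thread(p, t) \
        __for_each_thread((p)->signal, t)

/* usage: safe under rcu_read_lock() as long as 'p' itself can't go away */
rcu_read_lock();
for_each_thread(p, t)
        do_something(t);
rcu_read_unlock();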
Joonsoo Kim [Mon, 16 Dec 2013 23:45:12 +0000 (10:45 +1100)]
mm/rmap: use rmap_walk() in page_referenced()
Now we have infrastructure in rmap_walk() to handle the differences among
the variants of the rmap traversal functions, so just use it in
page_referenced().
In this patch, I change the following things.
1. remove some variants of rmap traversing functions.
cf> page_referenced_ksm, page_referenced_anon,
page_referenced_file
2. introduce a new struct page_referenced_arg (sketched after this list) and
pass it to page_referenced_one(), the main function of rmap_walk, in order
to count references, store vm_flags and check the finish condition.
3. mechanical change to use rmap_walk() in page_referenced().
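A rough sketch of the argument structure (field names approximate, not
necessarily the exact layout in the patch):
struct page_referenced_arg {
        int mapcount;             /* mappings left to visit; 0 => done early */
        int referenced;           /* accumulated reference count */
        unsigned long vm_flags;   /* OR of flags of the referencing vmas */
        struct mem_cgroup *memcg; /* only count references inside this memcg */
};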
Joonsoo Kim [Mon, 16 Dec 2013 23:45:12 +0000 (10:45 +1100)]
mm/rmap: use rmap_walk() in try_to_munlock()
Now we have infrastructure in rmap_walk() to handle the differences among
the variants of the rmap traversal functions, so just use it in
try_to_munlock().
In this patch, I change the following things.
1. remove some variants of rmap traversing functions.
cf> try_to_unmap_ksm, try_to_unmap_anon, try_to_unmap_file
2. mechanical change to use rmap_walk() in try_to_munlock().
3. copy and paste comments.
Joonsoo Kim [Mon, 16 Dec 2013 23:45:11 +0000 (10:45 +1100)]
mm/rmap: extend rmap_walk_xxx() to cope with different cases
There is a lot of common code in the traversal functions, but there are
also some uncommon parts. By assigning the proper function pointers in each
rmap_walk_control, we can handle these differences correctly.
The following are the differences we should handle:
1. difference of lock function in anon mapping case
2. nonlinear handling in file mapping case
3. prechecked condition:
checking memcg in page_referenced(),
checking VM_SHARED in page_mkclean()
checking temporary vma in try_to_unmap()
4. exit condition:
checking page_mapped() in try_to_unmap()
So, in this patch, I introduce 4 function pointers to handle the above
differences, as sketched below.
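Put together with the rmap_one and arg members introduced when struct
rmap_walk_control was first added, the control structure ends up looking
roughly like this (a sketch, not necessarily the exact layout):
struct rmap_walk_control {
        void *arg;              /* caller's private data */
        int (*rmap_one)(struct page *page, struct vm_area_struct *vma,
                        unsigned long addr, void *arg);
        int (*done)(struct page *page);                 /* exit condition */
        int (*file_nonlinear)(struct page *page,
                              struct address_space *mapping,
                              struct vm_area_struct *vma);
        struct anon_vma *(*anon_lock)(struct page *page);   /* lock choice */
        bool (*invalid_vma)(struct vm_area_struct *vma, void *arg); /* precheck */
};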
Joonsoo Kim [Mon, 16 Dec 2013 23:45:11 +0000 (10:45 +1100)]
mm/rmap: make rmap_walk to get the rmap_walk_control argument
In each rmap traversal case there are some differences, so we need function
pointers and arguments to them in order to handle these cases.
For this purpose, struct rmap_walk_control is introduced in this patch, and
it will be extended in the following patch. Introducing and extending are
kept separate because it clarifies the changes.
Joonsoo Kim [Mon, 16 Dec 2013 23:45:11 +0000 (10:45 +1100)]
mm/rmap: factor lock function out of rmap_walk_anon()
When we traverse an anon_vma, we need to take the read-side anon_lock. But
there are subtle differences between the situations, so we can't use the
same method to take the lock in each case. Therefore, we need to make
rmap_walk_anon() take a different lock function per case.
This patch is the first step, factoring the lock function for anon_lock out
of rmap_walk_anon(). It will be used for removing migration entries and in
the default case of rmap_walk_anon().
Joonsoo Kim [Mon, 16 Dec 2013 23:45:11 +0000 (10:45 +1100)]
mm/rmap: factor nonlinear handling out of try_to_unmap_file()
To merge all kinds of rmap traverse functions, try_to_unmap(),
try_to_munlock(), page_referenced() and page_mkclean(), we need to extract
common parts and separate out non-common parts.
Nonlinear handling is done only in try_to_unmap_file(); the other rmap
traversal functions don't care about it. Therefore it is better to factor
nonlinear handling out of try_to_unmap_file() in order to merge all kinds
of rmap traversal functions easily.
Joonsoo Kim [Mon, 16 Dec 2013 23:45:11 +0000 (10:45 +1100)]
mm/rmap: recompute pgoff for huge page
Rmap traversing is used in five different cases, try_to_unmap(),
try_to_munlock(), page_referenced(), page_mkclean() and
remove_migration_ptes(). Each one implements its own traversing functions
for the cases, anon, file, ksm, respectively. These cause lots of
duplications and cause maintenance overhead. They also make codes being
hard to understand and error-prone. One example is hugepage handling.
There is code to compute the hugepage offset correctly in
try_to_unmap_file(), but there isn't code to do the same in
rmap_walk_file(). These are used pairwise in the migration context, but we
missed modifying them pairwise.
To overcome these drawbacks, we should unify them through one function. I
chose rmap_walk() as the main function since it has nothing unnecessary.
To control the behavior of rmap_walk(), I introduce struct
rmap_walk_control, which holds some function pointers. This makes
rmap_walk() work for each caller's specific needs.
This patchset removes a lot of duplicated code, as you can see in the
short-stat below, and the kernel text size also decreases slightly.
   text    data     bss     dec     hex filename
  10640       1      16   10657    29a1 mm/rmap.o   (before)
  10047       1      16   10064    2750 mm/rmap.o   (after)
We have to recompute pgoff if the given page is huge, since a result based
on HPAGE_SIZE is not appropriate for scanning the vma interval tree, as
shown by commit 36e4f20af833 ("hugetlb: do not use vma_hugecache_offset()
for vma_prio_tree_foreach") and commit 369a713e ("rmap: recompute pgoff
for unmapping huge page").
To handle both cases, a normal page-cache page and a hugetlb page, in the
same way, we can use compound_page(). It returns 0 for a non-compound page
and the proper value for a compound page.
Vladimir Davydov [Mon, 16 Dec 2013 23:45:10 +0000 (10:45 +1100)]
memcg: fix kmem_account_flags check in memcg_can_account_kmem()
We should start kmem accounting for a memory cgroup only after both its
kmem limit is set (KMEM_ACCOUNTED_ACTIVE) and related call sites are
patched (KMEM_ACCOUNTED_ACTIVATED). Currently memcg_can_account_kmem()
allows kmem accounting even if only one of the conditions is true. Fix
it.
This means that a page might get charged by memcg_kmem_newpage_charge
which would see its static key patched already but
memcg_kmem_commit_charge would still see it unpatched and so the charge
won't be committed. The result would be charge inconsistency (page_cgroup
not marked as PageCgroupUsed) and the charge would leak because
__memcg_kmem_uncharge_pages would ignore it.
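The corrected predicate has to require both bits, along the lines of the
following sketch (not the exact patch; helper and field names taken from the
description above):
static bool memcg_can_account_kmem(struct mem_cgroup *memcg)
{
        return !mem_cgroup_disabled() && !mem_cgroup_is_root(memcg) &&
               test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags) &&
               test_bit(KMEM_ACCOUNTED_ACTIVATED, &memcg->kmem_account_flags);
}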
[mhocko@suse.cz: augment changelog] Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Mon, 16 Dec 2013 23:45:10 +0000 (10:45 +1100)]
x86, numa, acpi, memory-hotplug: make movable_node have higher priority
If users specify the original movablecore=nn@ss boot option, the kernel
will arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot
option is similar except it specifies ZONE_NORMAL ranges.
Now, if users specify "movable_node" in kernel commandline, the kernel
will arrange hotpluggable memory in SRAT as ZONE_MOVABLE. And if users do
this, all the other movablecore=nn@ss and kernelcore=nn@ss options should
be ignored.
For those who don't want this, just specify nothing. The kernel will act
as before.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Chen Tang <imtangchen@gmail.com> Cc: Gong Chen <gong.chen@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Liu Jiang <jiang.liu@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Renninger <trenn@suse.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
WARNING: line over 80 characters
#83: FILE: include/linux/memblock.h:83:
+static inline bool memblock_is_hotpluggable(struct memblock_region *m){ return false; }
ERROR: space required before the open brace '{'
#83: FILE: include/linux/memblock.h:83:
+static inline bool memblock_is_hotpluggable(struct memblock_region *m){ return false; }
total: 1 errors, 1 warnings, 67 lines checked
./patches/memblock-mem_hotplug-make-memblock-skip-hotpluggable-regions-if-needed.patch has style problems, please review.
If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Tang Chen [Mon, 16 Dec 2013 23:45:09 +0000 (10:45 +1100)]
memblock, mem_hotplug: make memblock skip hotpluggable regions if needed
The Linux kernel cannot migrate pages used by the kernel itself. As a result,
hotpluggable memory used by the kernel won't be able to be hot-removed.
To solve this problem, the basic idea is to prevent memblock from
allocating hotpluggable memory for the kernel at early time, and arrange
all hotpluggable memory in the ACPI SRAT (System Resource Affinity Table) as
ZONE_MOVABLE when initializing zones.
In the previous patches, we have marked hotpluggable memory regions with
MEMBLOCK_HOTPLUG flag in memblock.memory.
In this patch, we make memblock skip these hotpluggable memory regions in
the default top-down allocation function if movable_node boot option is
specified.
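Conceptually the allocator's walk over candidate regions gains one extra
test; a sketch (the real check lives inside the memblock range iterators,
names as in the memblock API of that time):
for_each_memblock(memory, r) {
        /* with movable_node, leave hotpluggable regions for ZONE_MOVABLE */
        if (movable_node_is_enabled() && memblock_is_hotpluggable(r))
                continue;
        /* ... otherwise [r->base, r->base + r->size) may be allocated from ... */
}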
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Chen Tang <imtangchen@gmail.com> Cc: Gong Chen <gong.chen@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Liu Jiang <jiang.liu@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Renninger <trenn@suse.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Mon, 16 Dec 2013 23:45:09 +0000 (10:45 +1100)]
acpi, numa, mem_hotplug: mark all nodes the kernel resides un-hotpluggable
At very early boot time, the kernel has to use some memory, for example for
loading the kernel image. We cannot prevent this anyway. So any node the
kernel resides in should be marked un-hotpluggable.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Chen Tang <imtangchen@gmail.com> Cc: Gong Chen <gong.chen@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Liu Jiang <jiang.liu@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Renninger <trenn@suse.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Mon, 16 Dec 2013 23:45:09 +0000 (10:45 +1100)]
acpi, numa, mem_hotplug: mark hotpluggable memory in memblock
When parsing the SRAT, we know which memory areas are hotpluggable, so we
invoke the function memblock_mark_hotplug(), introduced by the previous
patch, to mark hotpluggable memory in memblock.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Chen Tang <imtangchen@gmail.com> Cc: Gong Chen <gong.chen@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Liu Jiang <jiang.liu@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Renninger <trenn@suse.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Mon, 16 Dec 2013 23:45:08 +0000 (10:45 +1100)]
memblock, mem_hotplug: introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions
In find_hotpluggable_memory, once we find a memory region which is
hotpluggable, we want to mark it in memblock.memory, so that later we can
prevent the memblock allocator from allocating hotpluggable memory for the
kernel.
To achieve this goal, we introduce a MEMBLOCK_HOTPLUG flag to indicate
hotpluggable memory regions in memblock and a function
memblock_mark_hotplug() to mark hotpluggable memory when we find it.
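A minimal sketch of the new flag and marking interface as described here (the
flag value and the exact declarations are illustrative):
    /* region flags stored in struct memblock_region */
    #define MEMBLOCK_HOTPLUG        0x1     /* hotpluggable region */

    /* set MEMBLOCK_HOTPLUG on memblock.memory regions in [base, base + size) */
    int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);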
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Chen Tang <imtangchen@gmail.com> Cc: Gong Chen <gong.chen@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Liu Jiang <jiang.liu@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Renninger <trenn@suse.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Mon, 16 Dec 2013 23:45:08 +0000 (10:45 +1100)]
memblock, numa: introduce flags field into memblock
There is no flag in memblock to describe what type a memory region is.
Sometimes we may use memblock to reserve some memory for special usage,
and we want to know what kind of memory it is. So we need a way to
distinguish such regions.
In a hotplug environment, we want to reserve hotpluggable memory so the
kernel won't be able to use it; and when the system is up, we have to
free this hotpluggable memory to the buddy allocator. So we need to mark
this memory first.
In order to do so, we need to mark out these special memory regions in
memblock. In this patch, we introduce a new "flags" member into
memblock_region:
struct memblock_region {
        phys_addr_t base;
        phys_addr_t size;
        unsigned long flags;            /* This is new. */
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
        int nid;
#endif
};
This patch does the following things:
1) Add a "flags" member to struct memblock_region.
2) Modify the prototypes of the following APIs:
   memblock_add_region()
   memblock_insert_region()
3) Add memblock_reserve_region() to support reserving memory with flags,
   while keeping memblock_reserve()'s prototype unmodified (sketched below).
4) Modify other APIs to support flags, but keep their prototypes unmodified.
The idea is from Wen Congyang <wency@cn.fujitsu.com> and Liu Jiang <jiang.liu@huawei.com>.
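A minimal sketch of item 3) above, assuming the helper simply forwards to the
flags-aware memblock_add_region() (details are illustrative, not the exact
patch):
    static int __init_memblock memblock_reserve_region(phys_addr_t base,
                                                       phys_addr_t size,
                                                       int nid, unsigned long flags)
    {
            return memblock_add_region(&memblock.reserved, base, size, nid, flags);
    }

    int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
    {
            return memblock_reserve_region(base, size, MAX_NUMNODES, 0);
    }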
Suggested-by: Wen Congyang <wency@cn.fujitsu.com> Suggested-by: Liu Jiang <jiang.liu@huawei.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Chen Tang <imtangchen@gmail.com> Cc: Gong Chen <gong.chen@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Renninger <trenn@suse.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If the system can create a movable node, in which all memory is allocated as
ZONE_MOVABLE, setup_node_data() cannot allocate memory for that node's
pg_data_t. So, invoke memblock_alloc_nid(...MAX_NUMNODES) again to retry
when the first allocation fails. Otherwise, the system could fail to boot.
(We don't use memblock_alloc_try_nid() to retry because if the allocation
fails in that function, it will panic the system.)
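A sketch of the retry in setup_node_data() as described above (nd_pa and
nd_size are illustrative local names):
    nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
    if (!nd_pa) {
            /* node-local allocation failed (e.g. movable node); retry anywhere */
            nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, MAX_NUMNODES);
            if (!nd_pa) {
                    pr_err("Cannot find %zu bytes for node %d data\n", nd_size, nid);
                    return;
            }
    }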
The node_data could live on a hotpluggable node, and so could the page tables
and vmemmap. But for now, doing so would break the memory hot-remove path.
A node could have several memory devices, and the device that holds the node
data should be hot-removed last. But at the NUMA level, we don't know which
memory_block (/sys/devices/system/node/nodeX/memoryXXX) belongs to which
memory device; we only have the node. So we can only do node hotplug.
But in virtualization, developers are now developing memory hotplug in qemu,
which supports hotplug of a single memory device. So whole-node hotplug will
not satisfy virtualization users.
So in the end, we concluded that we'd better do memory hotplug and the
local-node handling (local node data, page tables, vmemmap, ...) in two
steps. Please refer to https://lkml.org/lkml/2013/6/19/73
For now, we put the node_data of a movable node on another node, and will
improve this in the future.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Toshi Kani <toshi.kani@hp.com> Cc: Tejun Heo <tj@kernel.org> CC: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Len Brown <lenb@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Thomas Renninger <trenn@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Gong Chen <gong.chen@linux.intel.com> Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Chen Tang <imtangchen@gmail.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/memblock: debug: correct displaying of upper memory boundary
Current memblock APIs don't work on 32-bit PAE or LPAE arches where the
physical memory start address is beyond 4GB. The problem was discussed
here [3], where Tejun and Yinghai (thanks) proposed a way forward with
memblock interfaces. Based on that proposal, this series adds the necessary
memblock interfaces and converts the core kernel code to use them.
Architectures already converted to NO_BOOTMEM use these new interfaces
directly; for others which still use bootmem, the new interfaces simply fall
back to the existing bootmem APIs.
So there is no functional change in behavior. In the long run, once all
architectures move to NO_BOOTMEM, we can get rid of the bootmem layer
completely. This is one step towards removing the core code's dependency on
bootmem, and it also gives architectures a path to move away from bootmem.
Testing was done on the ARM architecture, on 32-bit ARM LPAE machines with
the normal as well as a (faked) sparse memory model.
This patch (of 23):
When debugging is enabled (the command line has "memblock=debug"), memblock
displays the upper memory boundary of each allocated/freed memory range
incorrectly. For example:
0x0000009e7ed000 is displayed instead of 0x0000009e7ecfff
Hence, correct this by changing the formula used to calculate the upper
memory boundary to (u64)base + size - 1 instead of (u64)base + size
everywhere in the debug messages.
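Roughly what a corrected debug print then looks like (the message text is
illustrative, not the exact diff):
    memblock_dbg("   memblock_free: [%#016llx-%#016llx] %pF\n",
                 (unsigned long long)base,
                 (unsigned long long)base + size - 1,
                 (void *)_RET_IP_);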
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: Yinghai Lu <yinghai@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Paul Walmsley <paul@pwsan.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Tony Lindgren <tony@atomide.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Davidlohr Bueso [Mon, 16 Dec 2013 23:45:07 +0000 (10:45 +1100)]
mm/mlock: prepare params outside critical region
All mlock-related syscalls prepare lock limits, lengths and start parameters
with mmap_sem held. Move this logic outside of the critical region. For the
case of mlock, the addition of the amount already locked (mm->locked_vm) is
still done with the rwsem taken.
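An excerpt-style sketch of the resulting shape of sys_mlock() (not the exact
diff; error handling is elided):
    /* prepared outside the critical region */
    len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
    start &= PAGE_MASK;
    lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
    locked = len >> PAGE_SHIFT;

    down_write(&current->mm->mmap_sem);
    locked += current->mm->locked_vm;       /* still under the rwsem */
    if (locked <= lock_limit || capable(CAP_IPC_LOCK))
            error = do_mlock(start, len, 1);
    up_write(&current->mm->mmap_sem);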
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Michel Lespinasse <walken@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Davidlohr Bueso [Mon, 16 Dec 2013 23:45:07 +0000 (10:45 +1100)]
mm/mmap.c: add mlock_future_check() helper
Both do_brk and do_mmap_pgoff verify that we are actually capable of
locking future pages if the corresponding VM_LOCKED flags are used.
Encapsulate this logic into a single mlock_future_check() helper function.
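A sketch of what such a helper looks like (assuming the usual RLIMIT_MEMLOCK
accounting; details may differ from the actual patch):
    static int mlock_future_check(struct mm_struct *mm, unsigned long flags,
                                  unsigned long len)
    {
            unsigned long locked, lock_limit;

            /* mlock MCL_FUTURE? */
            if (flags & VM_LOCKED) {
                    locked = len >> PAGE_SHIFT;
                    locked += mm->locked_vm;
                    lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
                    if (locked > lock_limit && !capable(CAP_IPC_LOCK))
                            return -EAGAIN;
            }
            return 0;
    }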
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jerome Marchand [Mon, 16 Dec 2013 23:45:06 +0000 (10:45 +1100)]
mm: add overcommit_kbytes sysctl variable
Some applications that run on HPC clusters are designed around the
availability of RAM, and the overcommit ratio is fine-tuned to get the
maximum usage of memory without swapping. With growing memory sizes, the
1%-of-all-RAM granularity provided by overcommit_ratio has become too coarse
for these workloads (on a 2TB machine it represents no less than 20GB).
This patch adds a new overcommit_kbytes sysctl variable that allows a much
finer granularity.
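Roughly how the commit limit computation might look with the new knob (a
sketch, not the exact patch; a non-zero overcommit_kbytes is assumed to take
the place of overcommit_ratio):
    unsigned long vm_commit_limit(void)
    {
            unsigned long allowed;

            if (sysctl_overcommit_kbytes)
                    allowed = sysctl_overcommit_kbytes >> (PAGE_SHIFT - 10);
            else
                    allowed = ((totalram_pages - hugetlb_total_pages())
                               * sysctl_overcommit_ratio / 100);
            allowed += total_swap_pages;

            return allowed;
    }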
Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Mon, 16 Dec 2013 23:45:06 +0000 (10:45 +1100)]
mm, show_mem: remove SHOW_MEM_FILTER_PAGE_COUNT
Commit 4b59e6c4 ("mm, show_mem: suppress page counts in non-blockable
contexts") introduced SHOW_MEM_FILTER_PAGE_COUNT to suppress PFN walks on
large memory machines. Commit c78e9363 ("mm: do not walk all of system
memory during show_mem") avoided a PFN walk in the generic show_mem helper
which removes the requirement for SHOW_MEM_FILTER_PAGE_COUNT in that case.
This patch removes PFN walkers from the arch-specific implementations that
report on a per-node or per-zone granularity. ARM and unicore32 still do
a PFN walk as they report memory usage on each bank which is a much finer
granularity where the debugging information may still be of use. As the
remaining arches doing PFN walks have relatively small amounts of memory,
this patch simply removes SHOW_MEM_FILTER_PAGE_COUNT.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: James Bottomley <jejb@parisc-linux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jianyu Zhan [Mon, 16 Dec 2013 23:45:05 +0000 (10:45 +1100)]
mm/vmalloc: interchange the implementation of vmalloc_to_{pfn,page}
Currently we implement vmalloc_to_pfn() as a wrapper around
vmalloc_to_page(), which is implemented as follows:
1. walk the page tables to generate the corresponding pfn,
2. then convert the pfn to a struct page,
3. return it.
And vmalloc_to_pfn() re-wraps vmalloc_to_page() to get the pfn.
This seems too circuitous, so this patch reverses the layering: implement
vmalloc_to_page() as a wrapper around vmalloc_to_pfn(). This makes
vmalloc_to_pfn() and vmalloc_to_page() slightly more efficient.
No functional change.
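A sketch of the reversed layering (the page-table walk, unchanged in
substance, is elided):
    /* vmalloc_to_pfn() now contains the page-table walk and returns the pfn */
    unsigned long vmalloc_to_pfn(const void *vmalloc_addr);

    /* vmalloc_to_page() becomes a thin wrapper */
    struct page *vmalloc_to_page(const void *vmalloc_addr)
    {
            return pfn_to_page(vmalloc_to_pfn(vmalloc_addr));
    }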
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Cc: Vladimir Murzin <murzin.v@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Mon, 16 Dec 2013 23:45:05 +0000 (10:45 +1100)]
mm, mempolicy: remove unneeded functions for UMA configs
Mempolicies only exist for CONFIG_NUMA configurations. Therefore, a certain
class of functions is unneeded in configurations where CONFIG_NUMA is
disabled, such as functions that duplicate existing mempolicies, look up
existing policies, set certain mempolicy traits, or test mempolicies for
certain attributes.
Remove the unneeded functions so that any future callers get a compile-time
error and protect their code with CONFIG_NUMA as required.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>