Heiko Stuebner [Fri, 3 Jan 2014 03:10:23 +0000 (14:10 +1100)]
rtc: hym8563: include clkout code only if COMMON_CLK active
The contents of clk-provider.h, struct clk_hw etc., are only available if
CONFIG_COMMON_CLK is selected. Therefore IS_ENABLED(CONFIG_COMMON_CLK) is not
sufficient and real preprocessor conditions are necessary to keep the code
in question from being compiled on non-COMMON_CLK systems.
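The difference between a run-time style check and a real preprocessor condition can be sketched in plain userspace C. This is only an illustration: CONFIG_COMMON_CLK is an ordinary macro here, and demo_clk_hw, demo_register_clkout() and demo_probe() are hypothetical names, not taken from the driver.

```c
#include <stdbool.h>

/*
 * Userspace illustration only. CONFIG_COMMON_CLK stands in for the Kconfig
 * symbol; build with -DCONFIG_COMMON_CLK to compile the clkout part in.
 * All demo_* names are hypothetical.
 */
#ifdef CONFIG_COMMON_CLK
struct demo_clk_hw {
    unsigned long rate_hz;
};

/* Exists only when the clock framework types are available. */
static bool demo_register_clkout(struct demo_clk_hw *hw)
{
    hw->rate_hz = 32768;
    return true;
}
#endif

/* Probe routine that has to build in both configurations. */
bool demo_probe(void)
{
    bool clkout_registered = false;

#ifdef CONFIG_COMMON_CLK
    /*
     * An "if (IS_ENABLED(...))" test would still make the compiler parse
     * this block, which fails when struct demo_clk_hw is not declared.
     * The #ifdef removes the code before the compiler ever sees it.
     */
    struct demo_clk_hw hw;

    clkout_registered = demo_register_clkout(&hw);
#endif
    return clkout_registered;
}
```

Built without the macro, the clkout code and its type vanish entirely, which is the property the commit relies on.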
Signed-off-by: Heiko Stuebner <heiko@sntech.de> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Heiko Stuebner [Fri, 3 Jan 2014 03:10:23 +0000 (14:10 +1100)]
rtc: add hym8563 rtc-driver
The Haoyu Microelectronics HYM8563 provides rtc and alarm functions as
well as a clock output of up to 32kHz.
Signed-off-by: Heiko Stuebner <heiko@sntech.de> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Pawel Moll <pawel.moll@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: Ian Campbell <ijc+devicetree@hellion.org.uk> Cc: Grant Likely <grant.likely@linaro.org> Cc: Mike Turquette <mturquette@linaro.org> Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Heiko Stuebner [Fri, 3 Jan 2014 03:10:22 +0000 (14:10 +1100)]
dt-bindings: add hym8563 binding
Add binding documentation for the hym8563 rtc chip.
Signed-off-by: Heiko Stuebner <heiko@sntech.de> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Pawel Moll <pawel.moll@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: Ian Campbell <ijc+devicetree@hellion.org.uk> Cc: Grant Likely <grant.likely@linaro.org> Cc: Mike Turquette <mturquette@linaro.org> Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Mark Brown <broonie@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch allows the driver to be enabled with devicetree.
Signed-off-by: Alexander Shiyan <shc_work@mail.ru> Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The driver core clears the driver data to NULL after device_release or on
probe failure. Thus, there is no need to manually clear the device driver
data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ian Kent [Fri, 3 Jan 2014 03:10:20 +0000 (14:10 +1100)]
autofs: fix symlinks aren't checked for expiry
The autofs4 module doesn't consider symlinks for expire as it did in the
older autofs v3 module (so it's actually a long standing regression).
The user space daemon has focused on the use of bind mounts instead of
symlinks for a long time now and that's why this has not been noticed.
But with the future addition of amd map parsing to automount(8), not to
mention amd itself (of am-utils), symlink expiry will be needed.
The direct and offset mount types can't be symlinks and the tree mounts of
version 4 were always real mounts so only indirect mounts need to expire
symlinks.
Since the current users of the autofs4 module haven't reported this as a
problem to date this patch probably isn't a candidate for backport to
stable.
Signed-off-by: Ian Kent <ikent@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Miklos Szeredi [Fri, 3 Jan 2014 03:10:20 +0000 (14:10 +1100)]
autofs4: translate pids to the right namespace for the daemon
The PID and the TGID of the process triggering the mount are sent to the
daemon. Currently the global pid values are sent (ones valid in the
initial pid namespace) but this is wrong if the autofs daemon itself is
not running in the initial pid namespace.
So send the pid values that are valid in the namespace of the autofs daemon.
The namespace to use is taken from the oz_pgrp pid pointer, which was set
at mount time to the mounting process' pid namespace.
If the pid translation fails (the triggering process is in an unrelated
pid namespace) then the automount fails with ENOENT.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Eric Biederman <ebiederm@xmission.com> Acked-by: Ian Kent <raven@themaw.net> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
autofs4: allow autofs to work outside the initial PID namespace
Enable autofs4 to work in a "container". oz_pgrp is converted from pid_t
to struct pid and this is stored at mount time based on the "pgrp=" option
or if the option is missing then the current pgrp.
The "pgrp=" option is interpreted in the PID namespace of the current
process. This option is flawed in that it doesn't carry the namespace
information, so it should be deprecated. AFAICS the autofs daemon always
sends the current pgrp, which is the default anyway.
The oz_pgrp is also set from the AUTOFS_DEV_IOCTL_SETPIPEFD_CMD ioctl.
This ioctl sets oz_pgrp to the current pgrp. It is not allowed to change
the pid namespace.
oz_pgrp is used mainly to determine whether the process traversing the
autofs mount tree is the autofs daemon itself or not. This function now
compares the pid pointers instead of the pid_t values.
One other use of oz_pgrp is in autofs4_show_options. There it shows the
virtual pid number (i.e. the one that is valid inside the PID namespace
of the calling process).
For debugging printk, convert oz_pgrp to the value in the initial pid
namespace.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Eric Biederman <ebiederm@xmission.com> Acked-by: Ian Kent <raven@themaw.net> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Lin [Fri, 3 Jan 2014 03:10:19 +0000 (14:10 +1100)]
fs/ramfs/file-nommu.c: make ramfs_nommu_get_unmapped_area() and ramfs_nommu_mmap() static
Since commit 853ac43ab194f "shmem: unify regular and tiny shmem",
ramfs_nommu_get_unmapped_area() and ramfs_nommu_mmap() are not directly
referenced outside of file-nommu.c. Thus make them static.
Signed-off-by: Axel Lin <axel.lin@ingics.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We observed this problem has been occurring since 2.6.30 with
fs/binfmt_elf.c: create_elf_tables()->get_random_bytes(), introduced by f06295b44c296c8f ("ELF: implement AT_RANDOM for glibc PRNG seeding").
/*
* Generate 16 random bytes for userspace PRNG seeding.
*/
get_random_bytes(k_rand_bytes, sizeof(k_rand_bytes));
The patch introduces a wrapper around get_random_int() which has lower
overhead than calling get_random_bytes() directly.
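A wrapper of this kind could look roughly like the sketch below. It is a userspace illustration under stated assumptions, not the code from the patch: demo_random_int() is a toy xorshift generator standing in for get_random_int(), and demo_fill_random() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Stand-in for the kernel's get_random_int(). This toy xorshift generator
 * is NOT suitable for real seeding; it only makes the sketch self-contained.
 */
static unsigned int demo_random_int(void)
{
    static unsigned int state = 2463534242u;

    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
}

/* Fill buf with nbytes of data, one integer-sized chunk at a time. */
void demo_fill_random(unsigned char *buf, size_t nbytes)
{
    assert(buf != NULL || nbytes == 0);

    while (nbytes) {
        unsigned int value = demo_random_int();
        size_t chunk = nbytes < sizeof(value) ? nbytes : sizeof(value);

        /* Copy at most sizeof(value) bytes so the tail never overruns. */
        memcpy(buf, &value, chunk);
        buf += chunk;
        nbytes -= chunk;
    }
}
```

For the 16-byte AT_RANDOM buffer this loop would call the integer source four times on a platform with 4-byte unsigned int.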
With this patch applied:
$ cat /proc/sys/kernel/random/entropy_avail
2731
$ cat /proc/sys/kernel/random/entropy_avail
2802
$ cat /proc/sys/kernel/random/entropy_avail
2878
Analyzed by John Sobecki.
This has been applied on a specific Oracle kernel and has been running on
the customer's production environment (the original bug reporter) for
several months; it has worked fine until now.
Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <aedilger@gmail.com> Cc: Alan Cox <alan@linux.intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: John Sobecki <john.sobecki@oracle.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joe Perches [Fri, 3 Jan 2014 03:10:18 +0000 (14:10 +1100)]
checkpatch: improve space before tab --fix option
This test should remove all the spaces before a tab, not just one space.
Substitute a tab for each 8-space block before a tab and remove any remaining
run of fewer than 8 spaces before a tab.
This SPACE_BEFORE_TAB test is done after CODE_INDENT.
If there are spaces used at the beginning of a line that should be
converted to tabs, please make sure that the CODE_INDENT test and
conversion is done before this SPACE_BEFORE_TAB test and conversion.
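The substitution rule described above can be sketched in C as follows. checkpatch itself is a Perl script and only rewrites added patch lines, so this is merely an illustration of the rule; demo_fix_space_before_tab() and DEMO_TAB_WIDTH are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_TAB_WIDTH 8

/*
 * Copy `in` to `out`, rewriting every run of spaces that directly precedes
 * a tab: each full block of 8 spaces becomes one tab, a shorter remainder
 * is dropped. Returns the output length, or (size_t)-1 if `out` is too small.
 */
size_t demo_fix_space_before_tab(const char *in, char *out, size_t out_size)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);
    if (out_size == 0)
        return (size_t)-1;

    while (*in) {
        size_t spaces = 0;

        while (in[spaces] == ' ')
            spaces++;

        if (spaces > 0 && in[spaces] == '\t') {
            /* Spaces before a tab: emit one tab per full 8-space block. */
            size_t tabs = spaces / DEMO_TAB_WIDTH;

            if (o + tabs >= out_size)
                return (size_t)-1;
            while (tabs--)
                out[o++] = '\t';
            in += spaces;   /* the tab itself is copied on the next pass */
        } else {
            /* Ordinary text: copy the spaces (if any) and one more char. */
            size_t n = spaces + (in[spaces] != '\0');

            if (o + n >= out_size)
                return (size_t)-1;
            while (n--)
                out[o++] = *in++;
        }
    }
    out[o] = '\0';
    return o;
}
```

Spaces that are not followed by a tab are left alone, which mirrors the split between this test and CODE_INDENT.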
Reported-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Joe Perches <joe@perches.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
case blocks should end in a break/return/goto/continue.
If a fall-through is used, it should have a comment showing that it is
intentional. Ideally that comment should be something like:
"/* fall-through */"
Add a test to look for missing break statements.
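As a small illustration of the style being checked, the sketch below shows one intentional fall-through marked with the suggested comment, while every other case ends in a break. The enum and function names are hypothetical.

```c
#include <assert.h>

enum demo_event { DEMO_OPEN, DEMO_RESET, DEMO_CLOSE };

/* Returns the number of setup steps performed, or -1 for a close. */
int demo_handle(enum demo_event ev)
{
    int steps = 0;

    switch (ev) {
    case DEMO_RESET:
        steps++;        /* extra work that only a reset needs */
        /* fall-through */
    case DEMO_OPEN:
        steps++;
        break;
    case DEMO_CLOSE:
        steps = -1;
        break;
    default:
        assert(0 && "unknown event");
        break;
    }
    return steps;
}
```

Without the comment, a reader (or a checker) cannot tell whether the missing break after DEMO_RESET is deliberate.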
This looks only at the context lines before an inserted case so it's
possible to have false positives when the context contains a close brace
and the break is before the brace and not part of the patch context.
Looking at recent patches, this is a pretty rare occurrence. The normal
kernel style uses a break as the last line of the previous block.
Signed-off-by: Joe Perches <joe@perches.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joe Perches [Fri, 3 Jan 2014 03:10:16 +0000 (14:10 +1100)]
checkpatch: more comprehensive split strings warning
The current checkpatch test for split strings does not find several cases
that should be found.
For instance:
/* Else poor success; go back to mode in "active" table */
} else {
IWL_DEBUG_RATE(mvm,
- "LQ: GOING BACK TO THE OLD TABLE suc=%d cur-tpt=%d old-tpt=%d\n",
+ "GOING BACK TO THE OLD TABLE: SR %d "
+ "cur-tpt %d old-tpt %d\n",
window->success_ratio,
window->average_tpt,
lq_sta->last_tpt);
does not currently emit a warning.
Improve the test to find these cases.
Add more exceptions to reduce false positives for assembly and octal/hex
string constants.
Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ard Biesheuvel [Fri, 3 Jan 2014 03:10:16 +0000 (14:10 +1100)]
firmware/dmi_scan: generalize for use by other archs
This patch makes a couple of changes to the SMBIOS/DMI scanning
code so it can be used on other archs (such as ARM and arm64):
(a) wrap the calls to ioremap()/iounmap(), this allows the use of a
flavor of ioremap() more suitable for random unaligned access;
(b) allow the non-EFI fallback probe into hardcoded physical address
0xF0000 to be disabled.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Grant Likely <grant.likely@linaro.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Marian Chereji [Fri, 3 Jan 2014 03:10:16 +0000 (14:10 +1100)]
lib: Add CRC64 ECMA module
Add implementation of CRC64 ECMA checksum.
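For orientation, a bitwise CRC-64 using the ECMA-182 generator polynomial might look like the sketch below. The commit text does not state the module's initial value, bit order or table layout, so the choices here (zero initial value, MSB-first, no final XOR) are assumptions, and demo_crc64_ecma() is a hypothetical name rather than the module's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generator polynomial from ECMA-182 (normal, MSB-first representation). */
#define DEMO_CRC64_ECMA_POLY 0x42F0E1EBA9EA3693ULL

/*
 * Bit-at-a-time CRC-64. Assumed parameters: initial value 0, no input or
 * output reflection, no final XOR. A production version would normally
 * use a lookup table instead of the inner bit loop.
 */
uint64_t demo_crc64_ecma(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t crc = 0;

    assert(p != NULL || len == 0);

    while (len--) {
        /* Bring the next byte into the top of the register. */
        crc ^= (uint64_t)*p++ << 56;
        for (int bit = 0; bit < 8; bit++) {
            if (crc & (1ULL << 63))
                crc = (crc << 1) ^ DEMO_CRC64_ECMA_POLY;
            else
                crc <<= 1;
        }
    }
    return crc;
}
```

A caller would pass a buffer and its length, for example demo_crc64_ecma(frame, frame_len), and compare the result against a stored checksum.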
We have an IP Acceleration driver for Freescale network processors which
is using this CRC64. However, it still needs some work in order for it to
become upstreamable.
Signed-off-by: Marian Chereji <marian.chereji@freescale.com> Reviewed-by: Varvara Andrei-B21317 <andrei.varvara@freescale.com> Reviewed-by: Fleming Andrew-AFLEMING <AFLEMING@freescale.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kees Cook [Fri, 3 Jan 2014 03:10:16 +0000 (14:10 +1100)]
test: fix sparse warnings in user_copy tests
Sparse fix for "test: check copy_to/from_user boundary validation":
To keep sparse happy with the horrible things being done with the user
memory pointers, declare both __user and non-__user cases ahead of time to
avoid needing to do the casts later.
Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kees Cook [Fri, 3 Jan 2014 03:10:15 +0000 (14:10 +1100)]
test: check copy_to/from_user boundary validation
To help avoid an architecture failing to correctly check kernel/user
boundaries when handling copy_to_user, copy_from_user, put_user, or
get_user, perform some simple tests and fail to load if any of them behave
unexpectedly.
Specifically, this is to make sure there is a way to notice if things like
what was fixed in 8404663f81 ("ARM: 7527/1: uaccess: explicitly check
__user pointer when !CPU_USE_DOMAINS") ever regresses again, for any
architecture.
Additionally, adds new "user" selftest target, which loads this module.
Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kees Cook [Fri, 3 Jan 2014 03:10:15 +0000 (14:10 +1100)]
test: add minimal module for verification testing
This is a pair of test modules I'd like to see in the tree. I have been
adding various tests that trigger crashes to lkdtm, but these modules don't
make sense there: they either need to be distinctly separate, or their
pass/fail state doesn't need to crash the machine.
These live in lib/ for now, along with a few other in-kernel test modules,
and use the slightly more common "test_" naming convention, instead of
"test-". We should likely standardize on the former.
The first is entirely a no-op module, designed to allow simple testing of
the module loading and verification interface. It's useful to have a
module that has no other uses or dependencies so it can be reliably used
for just testing module loading and verification.
The second is a module that exercises the user memory access functions, in
an effort to make sure that we can quickly catch any regressions in
boundary checking (e.g. like what was recently fixed on ARM).
This patch (of 2):
When doing module loading verification tests (for example, with module
signing, or LSM hooks), it is very handy to have a module that can be
built on all systems under test, isn't auto-loaded at boot, and has no
device or similar dependencies. This creates the "test_module.ko" module
for that purpose, which only reports its load and unload to printk.
Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Most mobile phones have an ambient light sensor and change the backlight
brightness according to the measured lux. The brightness therefore changes
frequently through writes to the sysfs node, and each write generates a uevent.
Usually nothing consumes these backlight change events, yet each one forks udev
worker threads, which takes about 5ms. The main problem is that this hurts
other process activity, so remove it.
Kay said:
"Uevents are for the major, low-frequent, global device state-changes,
not for carrying-out any sort of measurement data. Subsystems which
need that should use other facilities like poll()-able sysfs file or
any other subscription-based, client-tracking interface which does not
cause overhead if it isn't used. Uevents are not the right thing to
use here, and upstream udev should not paper-over broken kernel
subsystems."
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Acked-by: Jingoo Han <jg1.han@samsung.com> Cc: Henrique de Moraes Holschuh <ibm-acpi@hmh.eng.br> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joe Perches [Fri, 3 Jan 2014 03:10:12 +0000 (14:10 +1100)]
get_maintainer: add commit author information to --rolestats
get_maintainer currently uses "Signed-off-by" style lines to find
interested parties to send patches to when the MAINTAINERS file does not
have a specific section entry with a matching file pattern.
Add statistics for commit authors and lines added and deleted to the
information provided by --rolestats.
These statistics are also emitted whenever --rolestats and --git are
selected even when there is a specified maintainer.
This can have the effect of expanding the number of people that are shown
as possible "maintainers" of a particular file because "authors",
"added_lines", and "removed_lines" are also used as criteria for the
--max-maintainers option separate from the "commit_signers".
The first "--git-max-maintainers" values of each criterion
are emitted. Any "ties" are not shown.
For example: (forcedeth does not have a named maintainer)
Old output:
$ ./scripts/get_maintainer.pl -f drivers/net/ethernet/nvidia/forcedeth.c
"David S. Miller" <davem@davemloft.net> (commit_signer:8/10=80%)
Jiri Pirko <jiri@resnulli.us> (commit_signer:2/10=20%)
Patrick McHardy <kaber@trash.net> (commit_signer:2/10=20%)
Larry Finger <Larry.Finger@lwfinger.net> (commit_signer:1/10=10%)
Peter Zijlstra <peterz@infradead.org> (commit_signer:1/10=10%)
netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
linux-kernel@vger.kernel.org (open list)
New output:
$ ./scripts/get_maintainer.pl -f drivers/net/ethernet/nvidia/forcedeth.c
"David S. Miller" <davem@davemloft.net> (commit_signer:8/10=80%)
Jiri Pirko <jiri@resnulli.us> (commit_signer:2/10=20%,authored:2/10=20%,removed_lines:3/33=9%)
Patrick McHardy <kaber@trash.net> (commit_signer:2/10=20%,authored:2/10=20%,added_lines:12/95=13%,removed_lines:10/33=30%)
Larry Finger <Larry.Finger@lwfinger.net> (commit_signer:1/10=10%,authored:1/10=10%,added_lines:35/95=37%)
Peter Zijlstra <peterz@infradead.org> (commit_signer:1/10=10%)
"Peter Hüwe" <PeterHuewe@gmx.de> (authored:1/10=10%,removed_lines:15/33=45%)
Joe Perches <joe@perches.com> (authored:1/10=10%)
Neil Horman <nhorman@tuxdriver.com> (added_lines:40/95=42%)
Bill Pemberton <wfp5p@virginia.edu> (removed_lines:3/33=9%)
netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
linux-kernel@vger.kernel.org (open list)
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joe Perches [Fri, 3 Jan 2014 03:10:11 +0000 (14:10 +1100)]
printk/cache: Mark printk_once test variable __read_mostly
Add #include <linux/cache.h> to define __read_mostly.
Convert cache.h to use uapi/linux/kernel.h instead
of linux/kernel.h to avoid recursive #includes.
Convert the ALIGN macro to __ALIGN_KERNEL.
printk_once only sets the bool variable tested
once so mark it __read_mostly.
Neaten the alignment so it matches the rest of the
pr_<level>_once #defines too.
Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Du, Changbin [Fri, 3 Jan 2014 03:10:11 +0000 (14:10 +1100)]
dynamic-debug-howto.txt: update since new wildcard support
Document how to use the new wildcard support.
Signed-off-by: Du, Changbin <changbin.du@gmail.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Du, Changbin [Fri, 3 Jan 2014 03:10:11 +0000 (14:10 +1100)]
dynamic_debug: add wildcard support to filter files/functions/modules
Add wildcard '*' (matches zero or more characters) and '?' (matches one
character) support when querying debug flags.
Now we can enable debug messages using keywords, e.g.:
1. enable debug logs in all usb drivers
echo "file drivers/usb/* +p" > <debugfs>/dynamic_debug/control
2. enable debug logs for usb xhci code
echo "file *xhci* +p" > <debugfs>/dynamic_debug/control
Signed-off-by: Du, Changbin <changbin.du@gmail.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Du, Changbin [Fri, 3 Jan 2014 03:10:10 +0000 (14:10 +1100)]
lib/parser.c: add match_wildcard function
match_wildcard() is a simple implementation of a wildcard matching
algorithm. It only supports the two usual wildcards:
'*' - matches zero or more characters
'?' - matches one character
This algorithm is safe since it is non-recursive.
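A non-recursive matcher for these two wildcards can be sketched as below. This is a common backtracking formulation written for illustration, not a copy of the function added to lib/parser.c; demo_match_wildcard() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Iterative wildcard match: '*' matches zero or more characters, '?'
 * matches exactly one. On a mismatch the loop backtracks to the most
 * recent '*' and lets it absorb one more character, so no recursion
 * (and no unbounded stack use) is needed.
 */
bool demo_match_wildcard(const char *pattern, const char *str)
{
    const char *star = NULL;    /* pattern position just after the last '*' */
    const char *retry = NULL;   /* where in str that '*' started matching */

    assert(pattern != NULL && str != NULL);

    while (*str) {
        if (*pattern == '*') {
            star = ++pattern;
            retry = str;
        } else if (*pattern == '?' || *pattern == *str) {
            pattern++;
            str++;
        } else if (star) {
            /* Let the last '*' swallow one more character and retry. */
            pattern = star;
            str = ++retry;
        } else {
            return false;
        }
    }

    /* Only trailing '*' may remain once the string is exhausted. */
    while (*pattern == '*')
        pattern++;
    return *pattern == '\0';
}
```

With this sketch, a pattern such as "*xhci*" matches any path that contains "xhci", and "drivers/usb/*" matches every path below that directory.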
Signed-off-by: Du, Changbin <changbin.du@gmail.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gustavo Padovan [Fri, 3 Jan 2014 03:10:10 +0000 (14:10 +1100)]
drivers/misc/ti-st/st_core.c: fix NULL dereference on protocol type check
If the type we receive is greater than ST_MAX_CHANNELS, we can't rely on it
as a vector index, since using it as an index would access unknown memory.
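The general pattern, validating an untrusted index before using it, can be sketched as follows. The table, its size and the demo_* names are hypothetical and do not reflect the driver's data structures; the sketch rejects any index that is not strictly below the table size.

```c
#include <stddef.h>

#define DEMO_MAX_CHANNELS 16    /* hypothetical table size */

struct demo_proto {
    int registered;
};

static struct demo_proto demo_table[DEMO_MAX_CHANNELS];

/*
 * Look up the entry for a protocol type received from the device.
 * Returns NULL instead of indexing past the end of the table.
 */
struct demo_proto *demo_lookup(unsigned int type)
{
    if (type >= DEMO_MAX_CHANNELS)
        return NULL;
    return &demo_table[type];
}
```

Callers then only need a NULL check before touching the entry.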
Kees Cook [Fri, 3 Jan 2014 03:10:10 +0000 (14:10 +1100)]
stack protector: provide -fstack-protector-strong build option
This changes the stack protector config option into a choice of "None",
"Regular", and "Strong". For "Strong", the kernel is built with
-fstack-protector-strong (gcc 4.9 and later). This option increases the
coverage of the stack protector without the heavy performance hit of
-fstack-protector-all.
For reference, the stack protector options available in gcc are:
-fstack-protector-all:
Adds the stack-canary saving prefix and stack-canary checking suffix to
_all_ function entry and exit. Results in substantial use of stack space
for saving the canary for deep stack users (e.g. historically xfs), and
measurable (though shockingly still low) performance hit due to all the
saving/checking. Really not suitable for sane systems, and was entirely
removed as an option from the kernel many years ago.
-fstack-protector:
Adds the canary save/check to functions that define an 8
(--param=ssp-buffer-size=N, N=8 by default) or more byte local char
array. Traditionally, stack overflows happened with string-based
manipulations, so this was a way to find those functions. Very few
total functions actually get the canary; no measurable performance or
size overhead.
-fstack-protector-strong:
Adds the canary for a wider set of functions, since it's not just those
with strings that have ultimately been vulnerable to stack-busting. With
this superset, more functions end up with a canary, but it still remains
small compared to all functions with no measurable change in performance.
Based on the original design document, a function gets the canary when it
contains any of:
- local variable's address used as part of the RHS of an assignment or
function argument
- local variable is an array (or union containing an array), regardless
of array type or length
- uses register local variables
https://docs.google.com/a/google.com/document/d/1xXBH6rRZue4f296vGt9YQcuLVQHeE516stHwt8M9xyU
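To make the criteria above more concrete, the sketch below shows three small functions. Going by the list, the first two would be expected to receive a canary with -fstack-protector-strong and the third would not, but whether a particular compiler version actually instruments them is an assumption best checked by looking for __stack_chk_fail in the generated assembly. All names are hypothetical.

```c
#include <string.h>

/*
 * Local array of only 4 bytes: below the default 8-byte threshold of
 * -fstack-protector, but still "a local array regardless of length".
 */
int demo_small_array(const char *s)
{
    char buf[4];

    strncpy(buf, s, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    return (int)strlen(buf);
}

static void demo_store(int *dst, int value)
{
    *dst = value;
}

/* The address of a local variable is passed as a function argument. */
int demo_address_taken(int value)
{
    int local = 0;

    demo_store(&local, value);
    return local;
}

/* No arrays and no address-taken locals. */
int demo_plain(int a, int b)
{
    return a + b;
}
```

Compiling the file once with -fstack-protector and once with -fstack-protector-strong, then comparing the two disassemblies, is one way to observe the difference in coverage.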
Comparison of "size" and "objdump" output when built with gcc-4.9 in
three configurations:
- defconfig
    11430641 text size
    36110 function bodies
- defconfig + CONFIG_CC_STACKPROTECTOR
    11468490 text size (+0.33%)
    1015 of 36110 functions stack-protected (2.81%)
- defconfig + CONFIG_CC_STACKPROTECTOR_STRONG via this patch
    11692790 text size (+2.24%)
    7401 of 36110 functions stack-protected (20.5%)
With -strong, ARM's compressed boot code now triggers stack protection, so
a static guard was added. Since this is only used during decompression
and was never used before, the exposure here is very small. Once it
switches to the full kernel, the stack guard is back to normal.
Chrome OS has been using -fstack-protector-strong for its kernel builds
for the last 8 months with no problems.
Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Michal Marek <mmarek@suse.cz> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Shawn Guo <shawn.guo@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kees Cook [Fri, 3 Jan 2014 03:10:10 +0000 (14:10 +1100)]
stack protector: create HAVE_CC_STACKPROTECTOR for centralized use
Instead of duplicating the CC_STACKPROTECTOR Kconfig and Makefile logic in
each architecture, switch to using HAVE_CC_STACKPROTECTOR and keep
everything in one place. This retains the x86-specific bug verification
scripts.
Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Michal Marek <mmarek@suse.cz> Cc: Russell King <linux@arm.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Shawn Guo <shawn.guo@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Roman Gushchin [Fri, 3 Jan 2014 03:10:10 +0000 (14:10 +1100)]
kernel/smp.c: remove cpumask_ipi
After 9a46ad6 ("smp: make smp_call_function_many() use logic similar to
smp_call_function_single()"), cfd->cpumask is accessed only in
smp_call_function_many(). So there is no more need to copy it into
cfd->cpumask_ipi before putting csd into the list. The cpumask_ipi field
is obsolete and can be removed.
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Ingo Molnar <mingo@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Wang YanQing <udknight@gmail.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: Shaohua Li <shli@fusionio.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alex Elder [Fri, 3 Jan 2014 03:10:09 +0000 (14:10 +1100)]
remove extra definitions of U32_MAX
Now that the definition is centralized in <linux/kernel.h>, the
definitions of U32_MAX (and related) elsewhere in the kernel can be
removed.
Signed-off-by: Alex Elder <elder@linaro.org> Acked-by: Sage Weil <sage@inktank.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alex Elder [Fri, 3 Jan 2014 03:10:09 +0000 (14:10 +1100)]
kernel.h: define u8, s8, u32, etc. limits
Create constants that define the maximum and minimum values
representable by the kernel types u8, s8, u16, s16, and so on.
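One way such limits can be derived from the types themselves is sketched below for a few of the widths. The DEMO_ prefix and the demo_* typedefs are used to make clear that this is a userspace illustration, not the header's contents.

```c
#include <stdint.h>

typedef uint8_t  demo_u8;
typedef int8_t   demo_s8;
typedef uint16_t demo_u16;
typedef uint32_t demo_u32;
typedef int32_t  demo_s32;

/* Unsigned maximum: all bits set, truncated to the type's width. */
#define DEMO_U8_MAX   ((demo_u8)~0U)
#define DEMO_U16_MAX  ((demo_u16)~0U)
#define DEMO_U32_MAX  ((demo_u32)~0U)

/* Signed maximum: the unsigned maximum with the top bit cleared. */
#define DEMO_S8_MAX   ((demo_s8)(DEMO_U8_MAX >> 1))
#define DEMO_S32_MAX  ((demo_s32)(DEMO_U32_MAX >> 1))

/* Signed minimum on two's complement: one below the negated maximum. */
#define DEMO_S8_MIN   ((demo_s8)(-DEMO_S8_MAX - 1))
#define DEMO_S32_MIN  ((demo_s32)(-DEMO_S32_MAX - 1))

/* Example use: range-check a value before narrowing it. */
int demo_fits_in_u8(long value)
{
    return value >= 0 && value <= DEMO_U8_MAX;
}
```

Deriving the values by casts rather than writing literals keeps each limit tied to the width of its type.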
Signed-off-by: Alex Elder <elder@linaro.org> Cc: Sage Weil <sage@inktank.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alex Elder [Fri, 3 Jan 2014 03:10:09 +0000 (14:10 +1100)]
conditionally define U32_MAX
The symbol U32_MAX is defined in several spots. Change these definitions
to be conditional. This is in preparation for the next patch, which
centralizes the definition in <linux/kernel.h>.
Signed-off-by: Alex Elder <elder@linaro.org> Cc: Sage Weil <sage@inktank.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mark Salter [Fri, 3 Jan 2014 03:10:09 +0000 (14:10 +1100)]
um: use generic fixmap.h
Signed-off-by: Mark Salter <msalter@redhat.com> Acked-by: Richard Weinberger <richard@nod.at> Cc: Jeff Dike <jdike@addtoit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mark Salter [Fri, 3 Jan 2014 03:10:08 +0000 (14:10 +1100)]
powerpc: use generic fixmap.h
Signed-off-by: Mark Salter <msalter@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mark Salter [Fri, 3 Jan 2014 03:10:07 +0000 (14:10 +1100)]
metag: use generic fixmap.h
Signed-off-by: Mark Salter <msalter@redhat.com> Acked-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mark Salter [Fri, 3 Jan 2014 03:10:07 +0000 (14:10 +1100)]
arm: use generic fixmap.h
ARM is different from other architectures in that fixmap pages are indexed
with a positive offset from FIXADDR_START. Other architectures index with
a negative offset from FIXADDR_TOP. In order to use the generic fixmap.h
definitions, this patch redefines FIXADDR_TOP to be inclusive of the
useable range. That is, FIXADDR_TOP is the virtual address of the topmost
fixed page. The newly defined FIXADDR_END is the first virtual address
past the fixed mappings.
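The negative-offset indexing from FIXADDR_TOP described above can be sketched in userspace as follows. The page size, the top address and the enum entries are made-up values for illustration; only the arithmetic is the point.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT   12
#define DEMO_PAGE_SIZE    (1UL << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_MASK    (~(DEMO_PAGE_SIZE - 1))

/* Hypothetical virtual address of the topmost fixed page (inclusive). */
#define DEMO_FIXADDR_TOP  0xfffff000UL

enum demo_fixed_addresses {
    DEMO_FIX_A,
    DEMO_FIX_B,
    DEMO_FIX_C,
    DEMO_END_OF_FIXED
};

/* Index 0 maps to the top page; higher indices go downwards in memory. */
unsigned long demo_fix_to_virt(unsigned int idx)
{
    assert(idx < DEMO_END_OF_FIXED);
    return DEMO_FIXADDR_TOP - ((unsigned long)idx << DEMO_PAGE_SHIFT);
}

/* Inverse mapping: any address inside a fixed page yields its index. */
unsigned int demo_virt_to_fix(unsigned long vaddr)
{
    return (unsigned int)((DEMO_FIXADDR_TOP - (vaddr & DEMO_PAGE_MASK))
                          >> DEMO_PAGE_SHIFT);
}
```

Because the top address is itself a valid fixed page in this scheme, the first address past the mappings has to be named separately, which is the role the commit gives to FIXADDR_END.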
Signed-off-by: Mark Salter <msalter@redhat.com> Cc: Russell King <linux@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mark Salter [Fri, 3 Jan 2014 03:10:06 +0000 (14:10 +1100)]
add generic fixmap.h
Many architectures provide an asm/fixmap.h which defines support for
compile-time 'special' virtual mappings which need to be made before
paging_init() has run. This support is also used for early ioremap on
x86. Much of this support is identical across the architectures. This
patch consolidates all of the common bits into asm-generic/fixmap.h which
is intended to be included from arch/*/include/asm/fixmap.h.
Signed-off-by: Mark Salter <msalter@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: James Hogan <james.hogan@imgtec.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jonas Bonn <jonas.bonn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
drivers/block/Kconfig: update RAM block device module name
RAM block device support module name changed to brd.ko some years ago with
an "rd" alias to match previous module implementation. This patch updates
its Kconfig definition.
Signed-off-by: Fabian Frederick <fabf@skynet.be> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now that all 64-bit architectures have been converted to int-ll64.h, we can
remove int-l64.h in kernelspace.
For backwards compatibility, alpha, ia64, mips64, and powerpc64 still use
int-l64.h in userspace.
This is the (reworked for UAPI) non-documentation part of the more than
two-year-old "asm/types.h: All architectures use int-ll64.h in kernelspace"
(https://lkml.org/lkml/2011/8/13/104)
Since <asm/types.h> (from include/uapi/asm-generic/types.h) is used for
both kernel and user space, include/asm-generic/int-ll64.h cannot just
become include/asm-generic/types.h, as Arnd suggested.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kernel: use lockless list for smp_call_function_single
Make smp_call_function_single and friends more efficient by using
a lockless list.
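The idea behind a lockless list can be sketched with C11 atomics: producers push with a compare-and-swap loop and the consumer detaches the whole list with a single exchange. This is a userspace illustration loosely modelled on that idea, not the kernel's llist implementation, and all demo_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

struct demo_node {
    struct demo_node *next;
    int value;
};

struct demo_head {
    _Atomic(struct demo_node *) first;
};

void demo_llist_init(struct demo_head *head)
{
    atomic_init(&head->first, (struct demo_node *)NULL);
}

/*
 * Push a node without taking a lock. Returns nonzero if the list was
 * empty before the push, which a caller could use to decide whether the
 * consumer needs to be notified.
 */
int demo_llist_add(struct demo_node *node, struct demo_head *head)
{
    struct demo_node *first = atomic_load(&head->first);

    do {
        /* On CAS failure `first` is refreshed, so relink and retry. */
        node->next = first;
    } while (!atomic_compare_exchange_weak(&head->first, &first, node));

    return first == NULL;
}

/* Detach and return the entire list in one atomic step (newest first). */
struct demo_node *demo_llist_del_all(struct demo_head *head)
{
    return atomic_exchange(&head->first, (struct demo_node *)NULL);
}
```

Since the consumer takes everything at once, it never races with producers over individual nodes; the returned list is in last-in, first-out order.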
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Fri, 3 Jan 2014 03:10:05 +0000 (14:10 +1100)]
swap: add a simple detector for inappropriate swapin readahead
This is a patch to improve swap readahead algorithm. It's from Hugh and I
slightly changed it.
Hugh's original changelog:
swapin readahead does a blind readahead, whether or not the swapin
is sequential. This may be ok on harddisk, because large reads have
relatively small costs, and if the readahead pages are unneeded they
can be reclaimed easily - though, what if their allocation forced
reclaim of useful pages? But on SSD devices large reads are more
expensive than small ones: if the readahead pages are unneeded,
reading them in caused significant overhead.
This patch adds very simplistic random read detection. Stealing
the PageReadahead technique from Konstantin Khlebnikov's patch,
avoiding the vma/anon_vma sophistications of Shaohua Li's patch,
swapin_nr_pages() simply looks at readahead's current success
rate, and narrows or widens its readahead window accordingly.
There is little science to its heuristic: it's about as stupid
as can be whilst remaining effective.
The table below shows elapsed times (in centiseconds) when running
a single repetitive swapping load across a 1000MB mapping in 900MB
ram with 1GB swap (the harddisk tests had taken painfully too long
when I used mem=500M, but SSD shows similar results for that).
Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes
his Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1
patch which Shaohua showed to be defective; HughNew this Nov 14
patch, with page_cluster as usual at default of 3 (8-page reads);
HughPC4 this same patch with page_cluster 4 (16-page reads);
HughPC0 with page_cluster 0 (1-page reads: no readahead).
HDD for swapping to harddisk, SSD for swapping to VertexII SSD.
Seq for sequential access to the mapping, cycling five times around;
Rand for the same number of random touches. Anon for a MAP_PRIVATE
anon mapping; Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.
One weakness of Shaohua's vma/anon_vma approach was that it did
not optimize Shmem: seen below. Konstantin's approach was perhaps
mistuned, 50% slower on Seq: did not compete and is not shown below.
These tests are, of course, two extremes of a very simple case:
under heavier mixed loads I've not yet observed any consistent
improvement or degradation, and wider testing would be welcome.
Shaohua Li:
Tests show that Vanilla is slightly better than Hugh's patch in the sequential
workload. I observed that with Hugh's patch the readahead size is sometimes
shrunk too fast (from 8 to 1 immediately) in the sequential workload if there
is no hit. In such a case, continuing to do readahead is actually good.
I have not prepared a sophisticated algorithm for the sequential workload because
so far we can't guarantee that sequentially accessed pages are swapped out
sequentially. So I slightly changed Hugh's heuristic: don't shrink the readahead
size too fast.
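The window sizing discussed above can be sketched in user-space C. This is only a minimal sketch under stated assumptions, not the code of this patch: sketch_next_window() is a hypothetical name, and both the rounding to a power of two and the "never drop below half of the previous window" rule are assumptions chosen to illustrate "narrow or widen by success rate" and "don't shrink too fast".

```c
/*
 * Illustrative only: sketch_next_window() is a hypothetical helper, not a
 * kernel function. It picks the next readahead window (in pages) from the
 * number of readahead hits seen since the previous call.
 */
static unsigned int sketch_next_window(unsigned int hits,
                                       unsigned int prev_window,
                                       unsigned int max_window)
{
    unsigned int pages = 1;

    /* A maximum of one page means no readahead at all. */
    if (max_window <= 1)
        return 1;

    /* Widen with the hit count: smallest power of two greater than hits. */
    if (hits > 0) {
        pages = 2;
        while (pages <= hits && pages < max_window)
            pages <<= 1;
    }

    /* Assumed damping rule: never drop below half of the previous window. */
    if (pages < prev_window / 2)
        pages = prev_window / 2;

    /* Never exceed the configured maximum. */
    if (pages > max_window)
        pages = max_window;

    return pages;
}
```

With a maximum of 8 pages and no hits, repeated calls step the window down 8, 4, 2, 1 instead of dropping from 8 to 1 at once, which is the behavior the changed heuristic asks for.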
Here is my test result (in seconds, average of 3 runs):
          Vanilla    Hugh     New
Seq           356     370     360
Random       4525    2447    2444
Attached graph is the swapin/swapout throughput I collected with 'vmstat 2'.
The first part is running a random workload (till around 1200 of the x-axis)
and the second part is running a sequential workload. swapin and swapout
throughput are almost identical in steady state in both workloads. This is
the expected behavior, while in Vanilla swapin is much bigger than swapout,
especially in the random workload (because of wrong readahead).
Original patches by: Shaohua Li and Konstantin Khlebnikov.
Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Rik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
swap: fix setting PAGE_SIZE blocksize during swapoff/swapon race
Fix race between swapoff and swapon resulting in setting blocksize of
PAGE_SIZE for block devices during swapoff.
The swapon modifies swap_info->old_block_size before acquiring
swapon_mutex. It reads block_size of bdev, stores it under
swap_info->old_block_size and sets new block_size to PAGE_SIZE.
On the other hand the swapoff sets the device's block_size to
old_block_size after releasing swapon_mutex.
This patch locks the swapon_mutex much earlier during swapon. It also
releases the swapon_mutex later during swapoff.
The effect of the race can be triggered by the following scenario:
 - One block swap device with a block size of 512
 - thread 1: Swapon is called, swap is activated,
   p->old_block_size = block_size(p->bdev); /512/
   block_size(p->bdev) = PAGE_SIZE;
   Thread ends.
 - thread 2: Swapoff is called and it proceeds to just after releasing the
   swapon_mutex. The swap is now fully disabled except for setting the
   block size back to the old value. The p->bdev->block_size is still equal
   to PAGE_SIZE.
 - thread 3: New swapon is called. This swap is disabled, so without
   acquiring the swapon_mutex:
   - p->old_block_size = block_size(p->bdev); /PAGE_SIZE (!!!)/
   - block_size(p->bdev) = PAGE_SIZE;
   Swap is activated and thread ends.
 - thread 2: resumes work and sets blocksize to the old value:
   - set_blocksize(bdev, p->old_block_size)
   But now the p->old_block_size is equal to PAGE_SIZE.
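The interleaving above can be replayed deterministically in user space. The following is a minimal single-threaded sketch, not kernel code: struct sketch_swap, SKETCH_PAGE_SIZE and the helper names are hypothetical, and the "threads" are just function calls made in the order the scenario describes.

```c
#define SKETCH_PAGE_SIZE 4096u   /* assumed page size for the sketch */

/* Hypothetical stand-in for the relevant swap_info fields. */
struct sketch_swap {
    unsigned int bdev_block_size;  /* current block size of the device */
    unsigned int old_block_size;   /* value saved by swapon */
};

/* What swapon does to the block size: save the old one, set PAGE_SIZE. */
static void sketch_swapon_set(struct sketch_swap *p)
{
    p->old_block_size = p->bdev_block_size;
    p->bdev_block_size = SKETCH_PAGE_SIZE;
}

/* What swapoff does last: restore the saved block size. */
static void sketch_swapoff_restore(struct sketch_swap *p)
{
    p->bdev_block_size = p->old_block_size;
}

/* Racy order: the second swapon runs before swapoff has restored. */
static unsigned int sketch_racy_order(void)
{
    struct sketch_swap p = { .bdev_block_size = 512, .old_block_size = 0 };

    sketch_swapon_set(&p);       /* thread 1: saves 512 */
    sketch_swapon_set(&p);       /* thread 3: saves PAGE_SIZE (!!!) */
    sketch_swapoff_restore(&p);  /* thread 2: "restores" PAGE_SIZE */
    return p.bdev_block_size;
}

/* Serialized order: the restore completes before the next swapon. */
static unsigned int sketch_serialized_order(void)
{
    struct sketch_swap p = { .bdev_block_size = 512, .old_block_size = 0 };

    sketch_swapon_set(&p);       /* first swapon: saves 512 */
    sketch_swapoff_restore(&p);  /* swapoff: back to 512 */
    sketch_swapon_set(&p);       /* second swapon: saves 512 again */
    sketch_swapoff_restore(&p);  /* second swapoff: back to 512 */
    return p.bdev_block_size;
}
```

In the racy order the original block size of 512 is lost for good, while the serialized order always returns the device to 512. Holding swapon_mutex across both the save and the restore is what enforces the second ordering.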
The patch swap-fix-set_blocksize-race-during-swapon-swapoff does not fix
this particular issue. It reduces the possibility of races as the swapon
must overwrite p->old_block_size before acquiring swapon_mutex in swapoff.
Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com> Cc: Weijie Yang <weijie.yang.kh@gmail.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Shaohua Li <shli@fusionio.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sasha Levin [Fri, 3 Jan 2014 03:10:04 +0000 (14:10 +1100)]
mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE
Most of the VM_BUG_ON assertions are performed on a page. Usually, when
one of these assertions fails we'll get a BUG_ON with a call stack and the
registers.
Based on the requests to add a small piece of code that dumps the page at
various VM_BUG_ON sites, I've recently noticed that the page dump is quite
useful to people debugging issues in mm.
This patch adds VM_BUG_ON_PAGE(cond, page) which, beyond doing what
VM_BUG_ON() does, also dumps the page before executing the actual BUG_ON.
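The "dump first, then BUG" ordering can be illustrated with a user-space analogue. This is a sketch under stated assumptions and not the kernel macro: struct sketch_page, sketch_dump_page() and SKETCH_VM_BUG_ON_PAGE() are hypothetical names, and abort() stands in for BUG().

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced stand-in for struct page. */
struct sketch_page {
    unsigned long flags;
    int count;
};

static int sketch_dump_count;  /* number of dumps, for demonstration */

/* Print the page state so it is visible before the program dies. */
static void sketch_dump_page(const struct sketch_page *page)
{
    sketch_dump_count++;
    fprintf(stderr, "page:%p flags:%#lx count:%d\n",
            (const void *)page, page->flags, page->count);
}

/* Dump the page first, then stop; abort() plays the role of BUG(). */
#define SKETCH_VM_BUG_ON_PAGE(cond, page)      \
    do {                                       \
        if (cond) {                            \
            sketch_dump_page(page);            \
            abort();                           \
        }                                      \
    } while (0)
```

A call site would then read SKETCH_VM_BUG_ON_PAGE(pg.count < 0, &pg); where the plain assertion used to be, so the page state is printed at the point of failure.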
Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Fri, 3 Jan 2014 03:10:03 +0000 (14:10 +1100)]
fs/proc/page.c: add PageAnon check to surely detect thp
stable_page_flags() checks !PageHuge && PageTransCompound && PageLRU to
determine whether a specified page is a thp. But sometimes this is not enough
and we fail to detect a thp while it is on a pagevec. This happens only
for a few seconds after LRU list operations, but it makes it difficult to
control applications that depend on this flag.
So this patch adds another check, PageAnon, to detect thps on a pagevec. It
might not be extensible to a future thp pagecache, but it's OK at
least for now.
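The difference between the two checks can be shown with plain booleans. This is a sketch, not the fs/proc/page.c code: struct sketch_page_state and both helpers are hypothetical, and combining PageLRU and PageAnon with a logical OR is an assumption about how the extra check is applied.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the page predicates named in the changelog. */
struct sketch_page_state {
    bool huge;            /* PageHuge */
    bool trans_compound;  /* PageTransCompound */
    bool lru;             /* PageLRU */
    bool anon;            /* PageAnon */
};

/* Old check: misses a thp that is temporarily off the LRU. */
static bool sketch_is_thp_old(const struct sketch_page_state *s)
{
    return !s->huge && s->trans_compound && s->lru;
}

/* New check (assumed form): an anonymous compound page also counts. */
static bool sketch_is_thp_new(const struct sketch_page_state *s)
{
    return !s->huge && s->trans_compound && (s->lru || s->anon);
}
```

A thp sitting on a pagevec has trans_compound and anon set but lru clear, so only the new check reports it.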
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bob Liu [Fri, 3 Jan 2014 03:10:03 +0000 (14:10 +1100)]
mm: remove BUG_ON() from mlock_vma_page()
objrmap doesn't work for nonlinear VMAs because the assumption that
offset-into-file correlates with offset-into-virtual-addresses does not
hold. Hence what try_to_unmap_cluster does is a mini "virtual scan" of
each nonlinear VMA which maps the file to which the target page belongs.
If the vma is locked, mlock the pages in the cluster rather than unmapping
them. However, only the check page is guaranteed to be page locked, not every
page in the cluster, resulting in the below BUG_ON().
It's safe to mlock_vma_page() without PageLocked, so fix this issue by
removing that BUG_ON().
memcg: do not use vmalloc for mem_cgroup allocations
The vmalloc was introduced by 333279 ("memcgroup: use vmalloc for
mem_cgroup allocation"), because at that time MAX_NUMNODES was used for
defining the per-node array in the mem_cgroup structure so that the
structure could be huge even if the system had only one NUMA node.
The situation was significantly improved by patch 45cf7e ("memcg: reduce
the size of struct memcg 244-fold"), which made the size of the mem_cgroup
structure calculated dynamically depending on the real number of NUMA
nodes installed on the system (nr_node_ids), so now there is no point in
using vmalloc here: the structure is allocated rarely and on most systems
its size is about 1K.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Fri, 3 Jan 2014 03:10:03 +0000 (14:10 +1100)]
mm: munlock: fix potential race with THP page split
Since commit ff6a6da60 ("mm: accelerate munlock() treatment of THP pages")
munlock skips tail pages of a munlocked THP page. There is some attempt
to prevent bad consequences of racing with a THP page split, but code
inspection indicates that there are two problems that may lead to a
non-fatal, yet wrong outcome.
First, __split_huge_page_refcount() copies flags including PageMlocked
from the head page to the tail pages. Clearing PageMlocked by
munlock_vma_page() in the middle of this operation might result in part of
tail pages left with PageMlocked flag. As the head page still appears to
be a THP page until all tail pages are processed, munlock_vma_page() might
think it munlocked the whole THP page and skip all the former tail pages.
Before ff6a6da60, those pages would be cleared in further iterations of
munlock_vma_pages_range(), but NR_MLOCK would still become undercounted
(related to the next point).
Second, NR_MLOCK accounting is based on call to hpage_nr_pages() after the
PageMlocked is cleared. The accounting might also become inconsistent due
to race with __split_huge_page_refcount():
- undercount when HUGE_PMD_NR is subtracted, but some tail pages are
left with PageMlocked set and counted again (only possible before ff6a6da60)
- overcount when hpage_nr_pages() sees a normal page (split has already
finished), but the parallel split has meanwhile cleared PageMlocked from
additional tail pages
This patch prevents both problems via extending the scope of lru_lock in
munlock_vma_page(). This is convenient because:
- __split_huge_page_refcount() takes lru_lock for its whole operation
- munlock_vma_page() typically takes lru_lock anyway for page isolation
As this becomes a second function where page isolation is done with
lru_lock already held, factor this out to a new
__munlock_isolate_lru_page() function and clean up the code around.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dave Hansen [Fri, 3 Jan 2014 03:10:02 +0000 (14:10 +1100)]
mm: print more details for bad_page()
bad_page() is cool in that it prints out a bunch of data about the page.
But, I can never remember which page flags are good and which are bad, or
whether ->index or ->mapping is required to be NULL.
This patch allows bad/dump_page() callers to specify a string about why
they are dumping the page and adds explanation strings to a number of
places. It also adds a 'bad_flags' argument to bad_page(), which it then
dumps out separately from the flags which are actually set.
This way, the messages will show specifically why the page was bad,
*specifically* which flags it is complaining about, if it was a page flag
combination which was the problem.
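The idea of reporting the offending flags separately from the full flag word can be sketched as follows. This is an illustration only: the SKETCH_PG_* bit values and sketch_report_bad_page() are hypothetical and do not match the kernel's page flag layout or its message format.

```c
#include <stdio.h>

/* Hypothetical flag bits for the sketch; not the kernel's PG_* values. */
#define SKETCH_PG_LOCKED (1UL << 0)
#define SKETCH_PG_DIRTY  (1UL << 1)
#define SKETCH_PG_LRU    (1UL << 2)

/*
 * Print all flags, the caller's reason, and then only those flags that the
 * caller considers bad. Returns the offending subset.
 */
static unsigned long sketch_report_bad_page(unsigned long flags,
                                            const char *reason,
                                            unsigned long bad_flags)
{
    unsigned long offending = flags & bad_flags;

    fprintf(stderr, "page flags: %#lx\n", flags);
    if (reason)
        fprintf(stderr, "page dumped because: %s\n", reason);
    if (offending)
        fprintf(stderr, "bad because of flags: %#lx\n", offending);

    return offending;
}
```

For a page with the locked and dirty bits set, passing a bad_flags mask of locked and lru reports only the locked bit as the culprit.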
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Christoph Lameter <cl@linux.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Streetman [Fri, 3 Jan 2014 03:10:02 +0000 (14:10 +1100)]
mm/zswap.c: change params from hidden to ro
The "compressor" and "enabled" params are currently hidden; this changes
them to read-only, so userspace can tell if zswap is enabled or not and
see what compressor is in use.
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Cc: Vladimir Murzin <murzin.v@gmail.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Weijie Yang <weijie.yang@samsung.com> Acked-by: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Documentation/vm/locking is a blast from the past. In the entire git
history, it has had precisely three modifications. Two of those look to
be pure renames, and the third was from 2005.
The doc contains such gems as:
> The page_table_lock is grabbed while holding the
> kernel_lock spinning monitor.
> Page stealers hold kernel_lock to protect against a bunch of
> races.
Or this, which talks about mmap_sem:
> 4. The exception to this rule is expand_stack, which just
> takes the read lock and the page_table_lock, this is ok
> because it doesn't really modify fields anybody relies on.
expand_stack() doesn't take any locks any more directly, and the
mmap_sem acquisition was long ago moved up in to the page fault
code itself.
It could be argued that we need to rewrite this, but it is
dangerous to leave it as-is. It will confuse more people than it
helps.
Signed-off-by: Dave Hansen <dave.hansen@intel.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Fri, 3 Jan 2014 03:10:01 +0000 (14:10 +1100)]
mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages
Parts of putback_lru_pages() and putback_movable_pages() are
duplicated, which makes it unclear which one should be used. We can remove
putback_lru_pages() since it is not really needed now. This makes the
code easier to understand and maintain.
The comment on putback_movable_pages() is also stale now, so fix it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Fri, 3 Jan 2014 03:10:01 +0000 (14:10 +1100)]
mm/migrate: correct failure handling if !hugepage_migration_support()
We should remove the page from the list if we fail with ENOSYS, since
migrate_pages() considers error cases other than -ENOMEM and -EAGAIN to be
permanent failures and assumes that the page has been removed from the
list. Without this patch, we could overcount the number of failures.
In addition, we should put back the new hugepage if
!hugepage_migration_support(). If not, we would leak hugepage memory.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Fri, 3 Jan 2014 03:10:01 +0000 (14:10 +1100)]
mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure
__GFP_NOFAIL may return NULL when coupled with GFP_NOWAIT or GFP_ATOMIC.
Luckily, nothing currently does such craziness. So instead of causing
such allocations to loop (potentially forever), we maintain the current
behavior and also warn about the new users of the deprecated flag.
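The condition being warned about can be written as a small predicate. This is a sketch with hypothetical flag bits (SKETCH_GFP_WAIT, SKETCH_GFP_NOFAIL) rather than the kernel's gfp definitions; it only shows which combination is treated as suspicious, not how the allocator reports it.

```c
#include <stdbool.h>

/* Hypothetical bits for the sketch; not the kernel's gfp_t values. */
#define SKETCH_GFP_WAIT   (1u << 0)  /* caller may block */
#define SKETCH_GFP_NOFAIL (1u << 1)  /* caller cannot handle failure */

/*
 * A no-fail request that is not allowed to block can still fail, so it is
 * the combination worth warning about.
 */
static bool sketch_nofail_needs_warning(unsigned int gfp_mask)
{
    bool can_wait = (gfp_mask & SKETCH_GFP_WAIT) != 0;
    bool nofail = (gfp_mask & SKETCH_GFP_NOFAIL) != 0;

    return nofail && !can_wait;
}
```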
Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Fri, 3 Jan 2014 03:10:00 +0000 (14:10 +1100)]
mm: compaction: reset scanner positions immediately when they meet
Compaction used to start its migrate and free page scanners at the zone's
lowest and highest pfn, respectively. Later, caching was introduced to
remember the scanners' progress across compaction attempts so that
pageblocks are not re-scanned uselessly. Additionally, pageblocks where
isolation failed are marked to be quickly skipped when encountered again
in future compactions.
Currently, both the reset of cached pfn's and clearing of the pageblock
skip information for a zone are done in __reset_isolation_suitable(). This
function gets called when:
- compaction is restarting after being deferred
- compact_blockskip_flush flag is set in compact_finished() when the scanners
meet (and not again cleared when direct compaction succeeds in allocation)
and kswapd acts upon this flag before going to sleep
This behavior is suboptimal for several reasons:
- when direct sync compaction is called after async compaction fails (in the
allocation slowpath), it will effectively do nothing, unless kswapd
happens to process the compact_blockskip_flush flag meanwhile. This is racy
and goes against the purpose of sync compaction to more thoroughly retry
the compaction of a zone where async compaction has failed.
The restart-after-deferring path cannot help here as deferring happens only
after the sync compaction fails. It is also done only for the preferred
zone, while the compaction might be done for a fallback zone.
- the mechanism of marking pageblock to be skipped has little value since the
cached pfn's are reset only together with the pageblock skip flags. This
effectively limits pageblock skip usage to parallel compactions.
This patch changes compact_finished() so that cached pfn's are reset
immediately when the scanners meet. Clearing pageblock skip flags is
unchanged, as well as the other situations where cached pfn's are reset.
This allows the sync-after-async compaction to retry pageblocks not marked
as skipped, such as !MIGRATE_MOVABLE blocks that async compaction
now skips without marking them.
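The "reset the cached positions as soon as the scanners meet" rule can be sketched in user space. This is an illustration under stated assumptions, not the mm/compaction.c code: struct sketch_zone, its field names and sketch_compact_finished() are hypothetical, and the pageblock skip bits are deliberately left out because the patch does not change how they are cleared.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of a zone's compaction state. */
struct sketch_zone {
    unsigned long start_pfn;           /* lowest pfn of the zone */
    unsigned long end_pfn;             /* one past the highest pfn */
    unsigned long cached_migrate_pfn;  /* where the migrate scanner resumes */
    unsigned long cached_free_pfn;     /* where the free scanner resumes */
};

/*
 * Returns true when the scanners have met. In that case the cached
 * positions are reset right away so that the next compaction attempt
 * (for example sync after async) starts from the zone boundaries again.
 */
static bool sketch_compact_finished(struct sketch_zone *zone,
                                    unsigned long migrate_pfn,
                                    unsigned long free_pfn)
{
    if (free_pfn <= migrate_pfn) {
        zone->cached_migrate_pfn = zone->start_pfn;
        zone->cached_free_pfn = zone->end_pfn;
        return true;
    }
    return false;
}
```

While the scanners are still apart the cached positions are left alone, so a later attempt continues where the previous one stopped.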
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Fri, 3 Jan 2014 03:10:00 +0000 (14:10 +1100)]
mm: compaction: do not mark unmovable pageblocks as skipped in async compaction
Compaction temporarily marks pageblocks where it fails to isolate pages as
to-be-skipped in further compactions, in order to improve efficiency. One
of the reasons to fail isolating pages is that isolation is not attempted
in pageblocks that are not of MIGRATE_MOVABLE (or CMA) type.
The problem is that blocks skipped due to not being MIGRATE_MOVABLE in
async compaction become skipped due to the temporary mark also in future
sync compaction. Moreover, this may follow quite soon during
__alloc_pages_slowpath, without much time for kswapd to clear the pageblock
skip marks. This goes against the idea that sync compaction should try to
scan these blocks more thoroughly than the async compaction.
The fix is to ensure in async compaction that these !MIGRATE_MOVABLE
blocks are not marked to be skipped. Note this should not affect
performance or locking impact of further async compactions, as skipping a
block due to being !MIGRATE_MOVABLE is done soon after skipping a block
marked to be skipped, both without locking.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>