]> git.karo-electronics.de Git - karo-tx-linux.git/log
karo-tx-linux.git
10 years agoautofs4: allow autofs to work outside the initial PID namespace
Sukadev Bhattiprolu [Fri, 3 Jan 2014 03:10:20 +0000 (14:10 +1100)]
autofs4: allow autofs to work outside the initial PID namespace

Enable autofs4 to work in a "container".  oz_pgrp is converted from pid_t
to struct pid and this is stored at mount time based on the "pgrp=" option
or if the option is missing then the current pgrp.

The "pgrp=" option is interpreted in the PID namespace of the current
process.  This option is flawed in that it doesn't carry the namespace
information, so it should be deprecated.  AFAICS the autofs daemon always
sends the current pgrp, which is the default anyway.

The oz_pgrp is also set from the AUTOFS_DEV_IOCTL_SETPIPEFD_CMD ioctl.
This ioctl sets oz_pgrp to the current pgrp.  It is not allowed to change
the pid namespace.

oz_pgrp is used mainly to determine whether the process traversing the
autofs mount tree is the autofs daemon itself or not.  This function now
compares the pid pointers instead of the pid_t values.

One other use of oz_pgrp is in autofs4_show_options.  There is shows the
virtual pid number (i.e.  the one that is valid inside the PID namespace
of the calling process)

For debugging printk convert oz_pgrp to the value in the initial pid
namespace.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoinit/main.c: remove unused declaration of tc_init()
Geert Uytterhoeven [Fri, 3 Jan 2014 03:10:19 +0000 (14:10 +1100)]
init/main.c: remove unused declaration of tc_init()

Its user was removed in v2.5.2.4.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/ramfs: move ramfs_aops to inode.c
Axel Lin [Fri, 3 Jan 2014 03:10:19 +0000 (14:10 +1100)]
fs/ramfs: move ramfs_aops to inode.c

ramfs_aops is identical in file-mmu.c and file-nommu.c.  Thus move it to
fs/ramfs/inode.c and make it static.

Signed-off-by: Axel Lin <axel.lin@ingics.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/ramfs/file-nommu.c: make ramfs_nommu_get_unmapped_area() and ramfs_nommu_mmap...
Axel Lin [Fri, 3 Jan 2014 03:10:19 +0000 (14:10 +1100)]
fs/ramfs/file-nommu.c: make ramfs_nommu_get_unmapped_area() and ramfs_nommu_mmap() static

Since commit 853ac43ab194f "shmem: unify regular and tiny shmem",
ramfs_nommu_get_unmapped_area() and ramfs_nommu_mmap() are not directly
referenced outside of file-nommu.c.  Thus make them static.

Signed-off-by: Axel Lin <axel.lin@ingics.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agobinfmt_elf.c: use get_random_int() to fix entropy depleting
Jeff Liu [Fri, 3 Jan 2014 03:10:19 +0000 (14:10 +1100)]
binfmt_elf.c: use get_random_int() to fix entropy depleting

Entropy is quickly depleted under normal operations like ls(1), cat(1),
etc...  between 2.6.30 to current mainline, for instance:

$ cat /proc/sys/kernel/random/entropy_avail
3428
$ cat /proc/sys/kernel/random/entropy_avail
2911
$cat /proc/sys/kernel/random/entropy_avail
2620

We observed this problem has been occurring since 2.6.30 with
fs/binfmt_elf.c: create_elf_tables()->get_random_bytes(), introduced by
f06295b44c296c8f ("ELF: implement AT_RANDOM for glibc PRNG seeding").

/*
 * Generate 16 random bytes for userspace PRNG seeding.
 */
get_random_bytes(k_rand_bytes, sizeof(k_rand_bytes));

The patch introduces a wrapper around get_random_int() which has lower
overhead than calling get_random_bytes() directly.

With this patch applied:
$ cat /proc/sys/kernel/random/entropy_avail
2731
$ cat /proc/sys/kernel/random/entropy_avail
2802
$ cat /proc/sys/kernel/random/entropy_avail
2878

Analyzed by John Sobecki.

This has been applied on a specific Oracle kernel and has been running on
the customer's production environment (the original bug reporter) for
several months; it has worked fine until now.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <aedilger@gmail.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Arnd Bergmann <arnn@arndb.de>
Cc: John Sobecki <john.sobecki@oracle.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jakub Jelinek <jakub@redhat.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agocheckpatch: update the FSF/GPL address check
Joe Perches [Fri, 3 Jan 2014 03:10:18 +0000 (14:10 +1100)]
checkpatch: update the FSF/GPL address check

The FSF address check is a bit too verbose looking for the GPL text.
Quiet it a bit by requiring --strict for the GPL bit.

Also make the address tests match a few uses of abbreviations for street
names and make it case insensitive.

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agocheckpatch: check for if's with unnecessary parentheses
Joe Perches [Fri, 3 Jan 2014 03:10:18 +0000 (14:10 +1100)]
checkpatch: check for if's with unnecessary parentheses

If statements don't need multiple parentheses around tested comparisons
like "if ((foo == bar))".

An == comparison maybe a sign of an intended assignment, so emit a
slightly different message if so.

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agocheckpatch: improve space before tab --fix option
Joe Perches [Fri, 3 Jan 2014 03:10:18 +0000 (14:10 +1100)]
checkpatch: improve space before tab --fix option

This test should remove all the spaces before a tab not just one space.

Substitute a tab for each 8 space block before a tab and remove less than
8 spaces before a tab.

This SPACE_BEFORE_TAB test is done after CODE_INDENT.

If there are spaces used at the beginning of a line that should be
converted to tabs, please make sure that the CODE_INDENT test and
conversion is done before this SPACE_BEFORE_TAB test and conversion.

Reported-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agocheckpatch: add a --fix-inplace option
Joe Perches [Fri, 3 Jan 2014 03:10:18 +0000 (14:10 +1100)]
checkpatch: add a --fix-inplace option

Add the ability to fix and overwrite existing files/patches instead of
creating a new file "<filename>.EXPERIMENTAL-checkpatch-fixes".

Suggested-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Cc: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agocheckpatch-add-tests-for-function-pointer-style-misuses-fix
Josh Triplett [Fri, 3 Jan 2014 03:10:18 +0000 (14:10 +1100)]
checkpatch-add-tests-for-function-pointer-style-misuses-fix

Fix an ambiguity in one warning message and a copy/paste problem in
another.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agocheckpatch: add tests for function pointer style misuses
Joe Perches [Fri, 3 Jan 2014 03:10:17 +0000 (14:10 +1100)]
checkpatch: add tests for function pointer style misuses

Kernel style uses function pointers in this form:
"type (*funcptr)(args...)"

Emit warnings when this function pointer form isn't used.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agocheckpatch: attempt to find missing switch/case break;
Joe Perches [Fri, 3 Jan 2014 03:10:17 +0000 (14:10 +1100)]
checkpatch: attempt to find missing switch/case break;

switch case statements missing a break statement are an unfortunately
common error.

e.g.:
commit 4a2c94c9b6c0 ("HID: kye: Add report fixup for Genius Manticore Keyboard")

case blocks should end in a break/return/goto/continue.

If a fall-through is used, it should have a comment showing that it is
intentional.  Ideally that comment should be something like:
"/* fall-through */"

Add a test to look for missing break statements.

This looks only at the context lines before an inserted case so it's
possible to have false positives when the context contains a close brace
and the break is before the brace and not part of the patch context.

Looking at recent patches, this is a pretty rare occurrence.  The normal
kernel style uses a break as the last line of the previous block.

Signed-off-by: Joe Perches <joe@perche.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Cc: Dave Jones <davej@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agocheckpatch: add warning of future __GFP_NOFAIL use
David Rientjes [Fri, 3 Jan 2014 03:10:17 +0000 (14:10 +1100)]
checkpatch: add warning of future __GFP_NOFAIL use

gfp.h and page_alloc.c already specify that __GFP_NOFAIL is deprecated and
no new users should be added.

Add a warning to checkpatch to catch this.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agocheckpatch: warn only on "space before semicolon" at end of line
Joe Perches [Fri, 3 Jan 2014 03:10:17 +0000 (14:10 +1100)]
checkpatch: warn only on "space before semicolon" at end of line

The "space before a non-naked semicolon" test has unwanted output when
used in "for ( ;; )" loops.

Make the test work only on end-of-line statement
termination semicolons.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agocheckpatch: more comprehensive split strings warning
Joe Perches [Fri, 3 Jan 2014 03:10:16 +0000 (14:10 +1100)]
checkpatch: more comprehensive split strings warning

The current checkpatch test for split strings does not find several cases
that should be found.

For instance:

  /* Else poor success; go back to mode in "active" table */
  } else {
  IWL_DEBUG_RATE(mvm,
-        "LQ: GOING BACK TO THE OLD TABLE suc=%d cur-tpt=%d old-tpt=%d\n",
+        "GOING BACK TO THE OLD TABLE: SR %d "
+        "cur-tpt %d old-tpt %d\n",
         window->success_ratio,
         window->average_tpt,
        lq_sta->last_tpt);

does not currently emit a warning.

Improve the test to find these cases.

Add more exceptions to reduce false positives for assembly and octal/hex
string constants.

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofirmware/dmi_scan: generalize for use by other archs
Ard Biesheuvel [Fri, 3 Jan 2014 03:10:16 +0000 (14:10 +1100)]
firmware/dmi_scan: generalize for use by other archs

This patch makes a couple of changes to the SMBIOS/DMI scanning
code so it can be used on other archs (such as ARM and arm64):
(a) wrap the calls to ioremap()/iounmap(), this allows the use of a
    flavor of ioremap() more suitable for random unaligned access;
(b) allow the non-EFI fallback probe into hardcoded physical address
    0xF0000 to be disabled.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Grant Likely <grant.likely@linaro.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agolib: Add CRC64 ECMA module
Marian Chereji [Fri, 3 Jan 2014 03:10:16 +0000 (14:10 +1100)]
lib: Add CRC64 ECMA module

Add implementation of CRC64 ECMA checksum.

We have an IP Acceleration driver for Freescale network processors which
is using this CRC64.  However, it still needs some work in order for it to
become upstreamable.

Signed-off-by: Marian Chereji <marian.chereji@freescale.com>
Reviewed-by: Varvara Andrei-B21317 <andrei.varvara@freescale.com>
Reviewed-by: Fleming Andrew-AFLEMING <AFLEMING@freescale.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agotest: fix sparse warnings in user_copy tests
Kees Cook [Fri, 3 Jan 2014 03:10:16 +0000 (14:10 +1100)]
test: fix sparse warnings in user_copy tests

Sparse fix for "test: check copy_to/from_user boundary validation":

To keep sparse happy with the horrible things being done with the user
memory pointers, declare both __user and non-__user cases ahead of time to
avoid needing to do the casts later.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agotest: check copy_to/from_user boundary validation
Kees Cook [Fri, 3 Jan 2014 03:10:15 +0000 (14:10 +1100)]
test: check copy_to/from_user boundary validation

To help avoid an architecture failing to correctly check kernel/user
boundaries when handling copy_to_user, copy_from_user, put_user, or
get_user, perform some simple tests and fail to load if any of them behave
unexpectedly.

Specifically, this is to make sure there is a way to notice if things like
what was fixed in 8404663f81 ("ARM: 7527/1: uaccess: explicitly check
__user pointer when !CPU_USE_DOMAINS") ever regresses again, for any
architecture.

Additionally, adds new "user" selftest target, which loads this module.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agotest: add minimal module for verification testing
Kees Cook [Fri, 3 Jan 2014 03:10:15 +0000 (14:10 +1100)]
test: add minimal module for verification testing

This is a pair of test modules I'd like to see in the tree.  Instead of
putting these in lkdtm, where I've been adding various tests that trigger
crashes, these don't make sense there since they need to be either
distinctly separate, or their pass/fail state don't need to crash the
machine.

These live in lib/ for now, along with a few other in-kernel test modules,
and use the slightly more common "test_" naming convention, instead of
"test-".  We should likely standardize on the former:

$ find . -name 'test_*.c' | grep -v /tools/ | wc -l
4
$ find . -name 'test-*.c' | grep -v /tools/ | wc -l
2

The first is entirely a no-op module, designed to allow simple testing of
the module loading and verification interface.  It's useful to have a
module that has no other uses or dependencies so it can be reliably used
for just testing module loading and verification.

The second is a module that exercises the user memory access functions, in
an effort to make sure that we can quickly catch any regressions in
boundary checking (e.g.  like what was recently fixed on ARM).

This patch (of 2):

When doing module loading verification tests (for example, with module
signing, or LSM hooks), it is very handy to have a module that can be
built on all systems under test, isn't auto-loaded at boot, and has no
device or similar dependencies.  This creates the "test_module.ko" module
for that purpose, which only reports its load and unload to printk.

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agolib/cmdline.c: declare exported symbols immediately
Felipe Contreras [Fri, 3 Jan 2014 03:10:15 +0000 (14:10 +1100)]
lib/cmdline.c: declare exported symbols immediately

WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
+EXPORT_SYMBOL(memparse);

WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
+EXPORT_SYMBOL(get_option);

WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable
+EXPORT_SYMBOL(get_options);

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
Cc: Levente Kurusa <levex@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agolib/cmdline.c: fix style issues
Felipe Contreras [Fri, 3 Jan 2014 03:10:15 +0000 (14:10 +1100)]
lib/cmdline.c: fix style issues

WARNING: space prohibited between function name and open parenthesis '('
+int get_option (char **str, int *pint)

WARNING: space prohibited between function name and open parenthesis '('
+ *pint = simple_strtol (cur, str, 0);

ERROR: trailing whitespace
+ $

WARNING: please, no spaces at the start of a line
+ $

WARNING: space prohibited between function name and open parenthesis '('
+ res = get_option ((char **)&str, ints + i);

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agolib/kstrtox.c: remove redundant cleanup
Felipe Contreras [Fri, 3 Jan 2014 03:10:15 +0000 (14:10 +1100)]
lib/kstrtox.c: remove redundant cleanup

We can't reach the cleanup code unless the flag KSTRTOX_OVERFLOW is not
set, so there's not no point in clearing a bit that we know is not set.

Signed-off-by: Felipe Contreras <felipe.contreras@gmail.com>
Acked-by: Levente Kurusa <levex@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agodrivers/video/backlight/backlight.c: remove backlight sysfs uevent
Kyungmin Park [Fri, 3 Jan 2014 03:10:14 +0000 (14:10 +1100)]
drivers/video/backlight/backlight.c: remove backlight sysfs uevent

Most mobile phones have Ambient Light Sensors and it changes brightness
according to the lux.  It means it changes backlight brightness frequently
by just writing sysfs node, so it generates uevent.

Usually there's no user to use this backlight changes.  But it forks udev
worker threads and it takes about 5ms.  The main problem is that it hurts
other process activities.  so remove it.

Kay said
"Uevents are for the major, low-frequent, global device state-changes,
 not for carrying-out any sort of measurement data. Subsystems which
 need that should use other facilities like poll()-able sysfs file or
 any other subscription-based, client-tracking interface which does not
 cause overhead if it isn't used. Uevents are not the right thing to
 use here, and upstream udev should not paper-over broken kernel
 subsystems."

Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Acked-by: Jingoo Han <jg1.han@samsung.com>
Cc: Henrique de Moraes Holschuh <ibm-acpi@hmh.eng.br>
Cc: Richard Purdie <rpurdie@rpsys.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agobacklight: tosa: use devm_lcd_device_register()
Jingoo Han [Fri, 3 Jan 2014 03:10:14 +0000 (14:10 +1100)]
backlight: tosa: use devm_lcd_device_register()

Use devm_lcd_device_register() to make cleanup paths simpler.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agobacklight: l4f00242t03: use devm_lcd_device_register()
Jingoo Han [Fri, 3 Jan 2014 03:10:14 +0000 (14:10 +1100)]
backlight: l4f00242t03: use devm_lcd_device_register()

Use devm_lcd_device_register() to make cleanup paths simpler.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agobacklight: jornada720: use devm_lcd_device_register()
Jingoo Han [Fri, 3 Jan 2014 03:10:14 +0000 (14:10 +1100)]
backlight: jornada720: use devm_lcd_device_register()

Use devm_lcd_device_register() to make cleanup paths simpler,
and remove unnecessary remove().

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agobacklight: tosa: use devm_backlight_device_register()
Jingoo Han [Fri, 3 Jan 2014 03:10:13 +0000 (14:10 +1100)]
backlight: tosa: use devm_backlight_device_register()

Use devm_backlight_device_register() to make cleanup paths simpler.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agobacklight: ot200_bl: use devm_backlight_device_register()
Jingoo Han [Fri, 3 Jan 2014 03:10:13 +0000 (14:10 +1100)]
backlight: ot200_bl: use devm_backlight_device_register()

Use devm_backlight_device_register() to make cleanup paths simpler.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agobacklight: omap1: use devm_backlight_device_register()
Jingoo Han [Fri, 3 Jan 2014 03:10:13 +0000 (14:10 +1100)]
backlight: omap1: use devm_backlight_device_register()

Use devm_backlight_device_register() to make cleanup paths simpler,
and remove unnecessary remove().

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agobacklight: hp680_bl: use devm_backlight_device_register()
Jingoo Han [Fri, 3 Jan 2014 03:10:13 +0000 (14:10 +1100)]
backlight: hp680_bl: use devm_backlight_device_register()

Use devm_backlight_device_register() to make cleanup paths simpler.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agobacklight: jornada720: use devm_backlight_device_register()
Jingoo Han [Fri, 3 Jan 2014 03:10:13 +0000 (14:10 +1100)]
backlight: jornada720: use devm_backlight_device_register()

Use devm_backlight_device_register() to make cleanup paths simpler,
and remove unnecessary remove().

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoMAINTAINERS: add an entry for the Macintosh HFSPlus Filesystem
Geert Uytterhoeven [Fri, 3 Jan 2014 03:10:12 +0000 (14:10 +1100)]
MAINTAINERS: add an entry for the Macintosh HFSPlus Filesystem

To make scripts/get_maintainer.pl output something sensible.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoget_maintainer: add commit author information to --rolestats
Joe Perches [Fri, 3 Jan 2014 03:10:12 +0000 (14:10 +1100)]
get_maintainer: add commit author information to --rolestats

get_maintainer currently uses "Signed-off-by" style lines to find
interested parties to send patches to when the MAINTAINERS file does not
have a specific section entry with a matching file pattern.

Add statistics for commit authors and lines added and deleted to the
information provided by --rolestats.

These statistics are also emitted whenever --rolestats and --git are
selected even when there is a specified maintainer.

This can have the effect of expanding the number of people that are shown
as possible "maintainers" of a particular file because "authors",
"added_lines", and "removed_lines" are also used as criterion for the
--max-maintainers option separate from the "commit_signers".

The first "--git-max-maintainers" values of each criterion
are emitted.  Any "ties" are not shown.

For example: (forcedeth does not have a named maintainer)

Old output:

$ ./scripts/get_maintainer.pl -f drivers/net/ethernet/nvidia/forcedeth.c
"David S. Miller" <davem@davemloft.net> (commit_signer:8/10=80%)
Jiri Pirko <jiri@resnulli.us> (commit_signer:2/10=20%)
Patrick McHardy <kaber@trash.net> (commit_signer:2/10=20%)
Larry Finger <Larry.Finger@lwfinger.net> (commit_signer:1/10=10%)
Peter Zijlstra <peterz@infradead.org> (commit_signer:1/10=10%)
netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
linux-kernel@vger.kernel.org (open list)

New output:

$ ./scripts/get_maintainer.pl -f drivers/net/ethernet/nvidia/forcedeth.c
"David S. Miller" <davem@davemloft.net> (commit_signer:8/10=80%)
Jiri Pirko <jiri@resnulli.us> (commit_signer:2/10=20%,authored:2/10=20%,removed_lines:3/33=9%)
Patrick McHardy <kaber@trash.net> (commit_signer:2/10=20%,authored:2/10=20%,added_lines:12/95=13%,removed_lines:10/33=30%)
Larry Finger <Larry.Finger@lwfinger.net> (commit_signer:1/10=10%,authored:1/10=10%,added_lines:35/95=37%)
Peter Zijlstra <peterz@infradead.org> (commit_signer:1/10=10%)
"Peter Hüwe" <PeterHuewe@gmx.de> (authored:1/10=10%,removed_lines:15/33=45%)
Joe Perches <joe@perches.com> (authored:1/10=10%)
Neil Horman <nhorman@tuxdriver.com> (added_lines:40/95=42%)
Bill Pemberton <wfp5p@virginia.edu> (removed_lines:3/33=9%)
netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
linux-kernel@vger.kernel.org (open list)

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agovsprintf: add %pad extension for dma_addr_t use
Joe Perches [Fri, 3 Jan 2014 03:10:12 +0000 (14:10 +1100)]
vsprintf: add %pad extension for dma_addr_t use

dma_addr_t's can be either u32 or u64 depending on a CONFIG option.

There are a few hundred dma_addr_t's printed via either cast to unsigned
long long, unsigned long or no cast at all.

Add %pad to be able to emit them without the cast.

Update Documentation/printk-formats.txt too.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: "Shevchenko, Andriy" <andriy.shevchenko@intel.com>
Cc: Rob Landley <rob@landley.net>
Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoprintk-cache-mark-printk_once-test-variable-__read_mostly-fix
Joe Perches [Fri, 3 Jan 2014 03:10:12 +0000 (14:10 +1100)]
printk-cache-mark-printk_once-test-variable-__read_mostly-fix

fix ia64 build

Cc: James Hogan <james.hogan@imgtec.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoprintk/cache: Mark printk_once test variable __read_mostly
Joe Perches [Fri, 3 Jan 2014 03:10:11 +0000 (14:10 +1100)]
printk/cache: Mark printk_once test variable __read_mostly

Add #include <linux/cache.h> to define __read_mostly.

Convert cache.h to use uapi/linux/kernel.h instead
of linux/kernel.h to avoid recursive #includes.

Convert the ALIGN macro to __ALIGN_KERNEL.

printk_once only sets the bool variable tested
once so mark it __read_mostly.

Neaten the alignment so it matches the rest of the
pr_<level>_once #defines too.

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agodynamic-debug-howto.txt: update since new wildcard support
Du, Changbin [Fri, 3 Jan 2014 03:10:11 +0000 (14:10 +1100)]
dynamic-debug-howto.txt: update since new wildcard support

Add the usage of using new feature wildcard support.

Signed-off-by: Du, Changbin <changbin.du@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agodynamic_debug: add wildcard support to filter files/functions/modules
Du, Changbin [Fri, 3 Jan 2014 03:10:11 +0000 (14:10 +1100)]
dynamic_debug: add wildcard support to filter files/functions/modules

Add wildcard '*'(matches zero or more characters) and '?' (matches one
character) support when qurying debug flags.

Now we can open debug messages using keywords. eg:
1. open debug logs in all usb drivers
    echo "file drivers/usb/* +p" > <debugfs>/dynamic_debug/control
2.  open debug logs for usb xhci code
    echo "file *xhci* +p" > <debugfs>/dynamic_debug/control

Signed-off-by: Du, Changbin <changbin.du@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agolib/parser.c: put EXPORT_SYMBOLs in the conventional place
Andrew Morton [Fri, 3 Jan 2014 03:10:11 +0000 (14:10 +1100)]
lib/parser.c: put EXPORT_SYMBOLs in the conventional place

Cc: Du, Changbin <changbin.du@gmail.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agolib/parser.c: add match_wildcard function
Du, Changbin [Fri, 3 Jan 2014 03:10:10 +0000 (14:10 +1100)]
lib/parser.c: add match_wildcard function

match_wildcard function is a simple implementation of wildcard
matching algorithm. It only supports two usual wildcardes:
    '*' - matches zero or more characters
    '?' - matches one character
This algorithm is safe since it is non-recursive.

Signed-off-by: Du, Changbin <changbin.du@gmail.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agodrivers/misc/ti-st/st_core.c: fix NULL dereference on protocol type check
Gustavo Padovan [Fri, 3 Jan 2014 03:10:10 +0000 (14:10 +1100)]
drivers/misc/ti-st/st_core.c: fix NULL dereference on protocol type check

If the type we receive is greater than ST_MAX_CHANNELS we can't rely on
type as vector index since we would be accessing unknown memory when we use the type
as index.

 Unable to handle kernel NULL pointer dereference at virtual address 0000001b
 pgd = c0004000
 [0000001b] *pgd=00000000
 Internal error: Oops: 17 [#1] PREEMPT SMP ARM
 Modules linked in: btwilink wl12xx wlcore mac80211 cfg80211 rfcomm bnep bluo
 CPU: 0    Tainted: G        W     (3.4.0+ #15)
 PC is at st_int_recv+0x278/0x344
 LR is at get_parent_ip+0x14/0x30
 pc : [<c03b01a8>]    lr : [<c007273c>]    psr: 200f0193
 sp : dc631ed0  ip : e3e21c24  fp : dc631f04
 r10: 00000000  r9 : 600f0113  r8 : 0000003f
 r7 : e3e21b14  r6 : 00000067  r5 : e2e49c1c  r4 : e3e21a80
 r3 : 00000001  r2 : 00000001  r1 : 00000001  r0 : 600f0113
 Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
 Control: 10c5387d  Table: 9c50004a  DAC: 00000015

Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agostack protector: provide -fstack-protector-strong build option
Kees Cook [Fri, 3 Jan 2014 03:10:10 +0000 (14:10 +1100)]
stack protector: provide -fstack-protector-strong build option

This changes the stack protector config option into a choice of "None",
"Regular", and "Strong".  For "Strong", the kernel is built with
-fstack-protector-strong (gcc 4.9 and later).  This options increases the
coverage of the stack protector without the heavy performance hit of
-fstack-protector-all.

For reference, the stack protector options available in gcc are:

-fstack-protector-all:
Adds the stack-canary saving prefix and stack-canary checking suffix to
_all_ function entry and exit. Results in substantial use of stack space
for saving the canary for deep stack users (e.g. historically xfs), and
measurable (though shockingly still low) performance hit due to all the
saving/checking. Really not suitable for sane systems, and was entirely
removed as an option from the kernel many years ago.

-fstack-protector:
Adds the canary save/check to functions that define an 8
(--param=ssp-buffer-size=N, N=8 by default) or more byte local char
array. Traditionally, stack overflows happened with string-based
manipulations, so this was a way to find those functions. Very few
total functions actually get the canary; no measurable performance or
size overhead.

-fstack-protector-strong
Adds the canary for a wider set of functions, since it's not just those
with strings that have ultimately been vulnerable to stack-busting.  With
this superset, more functions end up with a canary, but it still remains
small compared to all functions with no measurable change in performance.
Based on the original design document, a function gets the canary when it
contains any of:

- local variable's address used as part of the RHS of an assignment or
  function argument
- local variable is an array (or union containing an array), regardless
  of array type or length
- uses register local variables
https://docs.google.com/a/google.com/document/d/1xXBH6rRZue4f296vGt9YQcuLVQHeE516stHwt8M9xyU

Comparison of "size" and "objdump" output when built with gcc-4.9 in
three configurations:
- defconfig
11430641 text size
36110 function bodies
- defconfig + CONFIG_CC_STACKPROTECTOR
11468490 text size (+0.33%)
1015 of 36110 functions stack-protected (2.81%)
- defconfig + CONFIG_CC_STACKPROTECTOR_STRONG via this patch
11692790 text size (+2.24%)
7401 of 36110 functions stack-protected (20.5%)

With -strong, ARM's compressed boot code now triggers stack protection, so
a static guard was added.  Since this is only used during decompression
and was never used before, the exposure here is very small.  Once it
switches to the full kernel, the stack guard is back to normal.

Chrome OS has been using -fstack-protector-strong for its kernel builds
for the last 8 months with no problems.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Shawn Guo <shawn.guo@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agostack protector: create HAVE_CC_STACKPROTECTOR for centralized use
Kees Cook [Fri, 3 Jan 2014 03:10:10 +0000 (14:10 +1100)]
stack protector: create HAVE_CC_STACKPROTECTOR for centralized use

Instead of duplicating the CC_STACKPROTECTOR Kconfig and Makefile logic in
each architecture, switch to using HAVE_CC_STACKPROTECTOR and keep
everything in one place.  This retains the x86-specific bug verification
scripts.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Shawn Guo <shawn.guo@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agokernel/smp.c: remove cpumask_ipi
Roman Gushchin [Fri, 3 Jan 2014 03:10:10 +0000 (14:10 +1100)]
kernel/smp.c: remove cpumask_ipi

After 9a46ad6 ("smp: make smp_call_function_many() use logic similar to
smp_call_function_single()"), cfd->cpumask is accessed only in
smp_call_function_many().  So there is no more need to copy it into
cfd->cpumask_ipi before putting csd into the list.  The cpumask_ipi field
is obsolete and can be removed.

Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Wang YanQing <udknight@gmail.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoremove extra definitions of U32_MAX
Alex Elder [Fri, 3 Jan 2014 03:10:09 +0000 (14:10 +1100)]
remove extra definitions of U32_MAX

Now that the definition is centralized in <linux/kernel.h>, the
definitions of U32_MAX (and related) elsewhere in the kernel can be
removed.

Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Sage Weil <sage@inktank.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agokernel.h: define u8, s8, u32, etc. limits
Alex Elder [Fri, 3 Jan 2014 03:10:09 +0000 (14:10 +1100)]
kernel.h: define u8, s8, u32, etc. limits

Create constants that define the maximum and minimum values
representable by the kernel types u8, s8, u16, s16, and so on.

Signed-off-by: Alex Elder <elder@linaro.org>
Cc: Sage Weil <sage@inktank.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoconditionally define U32_MAX
Alex Elder [Fri, 3 Jan 2014 03:10:09 +0000 (14:10 +1100)]
conditionally define U32_MAX

The symbol U32_MAX is defined in several spots.  Change these definitions
to be conditional.  This is in preparation for the next patch, which
centralizes the definition in <linux/kernel.h>.

Signed-off-by: Alex Elder <elder@linaro.org>
Cc: Sage Weil <sage@inktank.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoum: use generic fixmap.h
Mark Salter [Fri, 3 Jan 2014 03:10:09 +0000 (14:10 +1100)]
um: use generic fixmap.h

Signed-off-by: Mark Salter <msalter@redhat.com>
Acked-by: Richard Weinberger <richard@nod.at>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agotile: use generic fixmap.h
Mark Salter [Fri, 3 Jan 2014 03:10:08 +0000 (14:10 +1100)]
tile: use generic fixmap.h

Signed-off-by: Mark Salter <msalter@redhat.com>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agosh: use generic fixmap.h
Mark Salter [Fri, 3 Jan 2014 03:10:08 +0000 (14:10 +1100)]
sh: use generic fixmap.h

Signed-off-by: Mark Salter <msalter@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agopowerpc: use generic fixmap.h
Mark Salter [Fri, 3 Jan 2014 03:10:08 +0000 (14:10 +1100)]
powerpc: use generic fixmap.h

Signed-off-by: Mark Salter <msalter@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomips: use generic fixmap.h
Mark Salter [Fri, 3 Jan 2014 03:10:08 +0000 (14:10 +1100)]
mips: use generic fixmap.h

Signed-off-by: Mark Salter <msalter@redhat.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomicroblaze: use generic fixmap.h
Mark Salter [Fri, 3 Jan 2014 03:10:07 +0000 (14:10 +1100)]
microblaze: use generic fixmap.h

Signed-off-by: Mark Salter <msalter@redhat.com>
Tested-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agometag: use generic fixmap.h
Mark Salter [Fri, 3 Jan 2014 03:10:07 +0000 (14:10 +1100)]
metag: use generic fixmap.h

Signed-off-by: Mark Salter <msalter@redhat.com>
Acked-by: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agohexagon: use generic fixmap.h
Mark Salter [Fri, 3 Jan 2014 03:10:07 +0000 (14:10 +1100)]
hexagon: use generic fixmap.h

Signed-off-by: Mark Salter <msalter@redhat.com>
Acked-by: Richard Kuo <rkuo@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoarm: use generic fixmap.h
Mark Salter [Fri, 3 Jan 2014 03:10:07 +0000 (14:10 +1100)]
arm: use generic fixmap.h

ARM is different from other architectures in that fixmap pages are indexed
with a positive offset from FIXADDR_START.  Other architectures index with
a negative offset from FIXADDR_TOP.  In order to use the generic fixmap.h
definitions, this patch redefines FIXADDR_TOP to be inclusive of the
useable range.  That is, FIXADDR_TOP is the virtual address of the topmost
fixed page.  The newly defined FIXADDR_END is the first virtual address
past the fixed mappings.

Signed-off-by: Mark Salter <msalter@redhat.com>
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agox86: use generic fixmap.h
Mark Salter [Fri, 3 Jan 2014 03:10:07 +0000 (14:10 +1100)]
x86: use generic fixmap.h

Signed-off-by: Mark Salter <msalter@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoadd generic fixmap.h
Mark Salter [Fri, 3 Jan 2014 03:10:06 +0000 (14:10 +1100)]
add generic fixmap.h

Many architectures provide an asm/fixmap.h which defines support for
compile-time 'special' virtual mappings which need to be made before
paging_init() has run.  This support is also used for early ioremap on
x86.  Much of this support is identical across the architectures.  This
patch consolidates all of the common bits into asm-generic/fixmap.h which
is intended to be included from arch/*/include/asm/fixmap.h.

Signed-off-by: Mark Salter <msalter@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jonas Bonn <jonas.bonn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agologfs: check for the return value after calling find_or_create_page()
Younger Liu [Fri, 3 Jan 2014 03:10:06 +0000 (14:10 +1100)]
logfs: check for the return value after calling find_or_create_page()

In get_mapping_page(), after calling find_or_create_page(), the return
value should be checked.

 This patch has been provided:
http://www.spinics.net/lists/linux-fsdevel/msg66948.html but not been
applied now.

Signed-off-by: Younger Liu <liuyiyang@hisense.com>
Cc: Younger Liu <younger.liucn@gmail.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Reviewed-by: Prasad Joshi <prasadjoshi.linux@gmail.com>
Cc: Jörn Engel <joern@logfs.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agodrivers/block/Kconfig: update RAM block device module name
Fabian Frederick [Fri, 3 Jan 2014 03:10:06 +0000 (14:10 +1100)]
drivers/block/Kconfig: update RAM block device module name

RAM block device support module name changed to brd.ko some years ago with
an "rd" alias to match previous module implementation.  This patch updates
its Kconfig definition.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agodrivers/mailbox/omap: make mbox->irq signed for error handling
Dan Carpenter [Fri, 3 Jan 2014 03:10:06 +0000 (14:10 +1100)]
drivers/mailbox/omap: make mbox->irq signed for error handling

There is a bug in omap2_mbox_probe() where we try do:

mbox->irq = platform_get_irq(pdev, info->irq_id);
if (mbox->irq < 0) {

The problem is that mbox->irq is unsigned so the error handling doesn't
work.  I've changed it to a signed integer.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Suman Anna <s-anna@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Omar Ramirez Luna <omar.ramirez@copitl.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoasm/types.h: Remove include/asm-generic/int-l64.h
Geert Uytterhoeven [Fri, 3 Jan 2014 03:10:05 +0000 (14:10 +1100)]
asm/types.h: Remove include/asm-generic/int-l64.h

Now all 64-bit architectures have been converted to int-ll64.h, we can
remove int-l64.h in kernelspace.

For backwards compatibility, alpha, ia64, mips64, and powerpc64 still use
int-l64.h in userspace.

This is the (reworked for UAPI) non-documentation part of more than two
year old "asm/types.h: All architectures use int-ll64.h in kernelspace"
(https://lkml.org/lkml/2011/8/13/104)

Since <asm/types.h> (from include/uapi/asm-generic/types.h) is used for
both kernel and user space, include/asm-generic/int-ll64.h cannot just
become include/asm-generic/types.h, as Arnd suggested.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agokernel: use lockless list for smp_call_function_single
Christoph Hellwig [Fri, 3 Jan 2014 03:10:05 +0000 (14:10 +1100)]
kernel: use lockless list for smp_call_function_single

Make smp_call_function_single and friends more efficient by using
a lockless list.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoswap: swapin_nr_pages() can be static
Fengguang Wu [Fri, 3 Jan 2014 03:10:05 +0000 (14:10 +1100)]
swap: swapin_nr_pages() can be static

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoswap: add a simple detector for inappropriate swapin readahead
Shaohua Li [Fri, 3 Jan 2014 03:10:05 +0000 (14:10 +1100)]
swap: add a simple detector for inappropriate swapin readahead

This is a patch to improve swap readahead algorithm. It's from Hugh and I
slightly changed it.

Hugh's original changelog:

swapin readahead does a blind readahead, whether or not the swapin
is sequential.  This may be ok on harddisk, because large reads have
relatively small costs, and if the readahead pages are unneeded they
can be reclaimed easily - though, what if their allocation forced
reclaim of useful pages?  But on SSD devices large reads are more
expensive than small ones: if the readahead pages are unneeded,
reading them in caused significant overhead.

This patch adds very simplistic random read detection.  Stealing
the PageReadahead technique from Konstantin Khlebnikov's patch,
avoiding the vma/anon_vma sophistications of Shaohua Li's patch,
swapin_nr_pages() simply looks at readahead's current success
rate, and narrows or widens its readahead window accordingly.
There is little science to its heuristic: it's about as stupid
as can be whilst remaining effective.

The table below shows elapsed times (in centiseconds) when running
a single repetitive swapping load across a 1000MB mapping in 900MB
ram with 1GB swap (the harddisk tests had taken painfully too long
when I used mem=500M, but SSD shows similar results for that).

Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes
his Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1
patch which Shaohua showed to be defective; HughNew this Nov 14
patch, with page_cluster as usual at default of 3 (8-page reads);
HughPC4 this same patch with page_cluster 4 (16-page reads);
HughPC0 with page_cluster 0 (1-page reads: no readahead).

HDD for swapping to harddisk, SSD for swapping to VertexII SSD.
Seq for sequential access to the mapping, cycling five times around;
Rand for the same number of random touches.  Anon for a MAP_PRIVATE
anon mapping; Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.

One weakness of Shaohua's vma/anon_vma approach was that it did
not optimize Shmem: seen below.  Konstantin's approach was perhaps
mistuned, 50% slower on Seq: did not compete and is not shown below.

HDD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     73921   76210   75611   76904   78191  121542
Seq Shmem    73601   73176   73855   72947   74543  118322
Rand Anon   895392  831243  871569  845197  846496  841680
Rand Shmem 1058375 1053486  827935  764955  764376  756489

SSD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
Seq Anon     24634   24198   24673   25107   21614   70018
Seq Shmem    24959   24932   25052   25703   22030   69678
Rand Anon    43014   26146   28075   25989   26935   25901
Rand Shmem   45349   45215   28249   24268   24138   24332

These tests are, of course, two extremes of a very simple case:
under heavier mixed loads I've not yet observed any consistent
improvement or degradation, and wider testing would be welcome.

Shaohua Li:

Test shows Vanilla is slightly better in sequential workload than Hugh's patch.
I observed with Hugh's patch sometimes the readahead size is shrinked too fast
(from 8 to 1 immediately) in sequential workload if there is no hit. And in
such case, continuing doing readahead is good actually.

I don't prepare a sophisticated algorithm for the sequential workload because
so far we can't guarantee sequential accessed pages are swap out sequentially.
So I slightly change Hugh's heuristic - don't shrink readahead size too fast.

Here is my test result (unit second, 3 runs average):
Vanilla Hugh New
Seq 356 370 360
Random 4525 2447 2444

Attached graph is the swapin/swapout throughput I collected with 'vmstat 2'.
The first part is running a random workload (till around 1200 of the x-axis)
and the second part is running a sequential workload. swapin and swapout
throughput are almost identical in steady state in both workloads. These are
expected behavior. while in Vanilla, swapin is much bigger than swapout
especially in random workload (because wrong readahead).

Original patches by: Shaohua Li and Konstantin Khlebnikov.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agoswap: fix setting PAGE_SIZE blocksize during swapoff/swapon race
Krzysztof Kozlowski [Fri, 3 Jan 2014 03:10:05 +0000 (14:10 +1100)]
swap: fix setting PAGE_SIZE blocksize during swapoff/swapon race

Fix race between swapoff and swapon resulting in setting blocksize of
PAGE_SIZE for block devices during swapoff.

The swapon modifies swap_info->old_block_size before acquiring
swapon_mutex.  It reads block_size of bdev, stores it under
swap_info->old_block_size and sets new block_size to PAGE_SIZE.

On the other hand the swapoff sets the device's block_size to
old_block_size after releasing swapon_mutex.

This patch locks the swapon_mutex much earlier during swapon. It also
releases the swapon_mutex later during swapoff.

The effect of race can be triggered by following scenario:
 - One block swap device with block size of 512
 - thread 1: Swapon is called, swap is activated,
   p->old_block_size = block_size(p->bdev); /512/
   block_size(p->bdev) = PAGE_SIZE;
   Thread ends.

 - thread 2: Swapoff is called and it goes just after releasing the
   swapon_mutex. The swap is now fully disabled except of setting the
   block size to old value. The p->bdev->block_size is still equal to
   PAGE_SIZE.

 - thread 3: New swapon is called. This swap is disabled so without
   acquiring the swapon_mutex:
   - p->old_block_size = block_size(p->bdev); /PAGE_SIZE (!!!)/
   - block_size(p->bdev) = PAGE_SIZE;
   Swap is activated and thread ends.

 - thread 2: resumes work and sets blocksize to old value:
   - set_blocksize(bdev, p->old_block_size)
   But now the p->old_block_size is equal to PAGE_SIZE.

The patch swap-fix-set_blocksize-race-during-swapon-swapoff does not fix
this particular issue.  It reduces the possibility of races as the swapon
must overwrite p->old_block_size before acquiring swapon_mutex in swapoff.

Signed-off-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Cc: Weijie Yang <weijie.yang.kh@gmail.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm-dump-page-when-hitting-a-vm_bug_on-using-vm_bug_on_page-fix-fix
Andrew Morton [Fri, 3 Jan 2014 03:10:04 +0000 (14:10 +1100)]
mm-dump-page-when-hitting-a-vm_bug_on-using-vm_bug_on_page-fix-fix

Fix the patch for mm-print-more-details-for-bad_page.patch.

Also fix up an include mess - various files were using mmdebug.h
facilities but were not including that file.

Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE fix
Sasha Levin [Fri, 3 Jan 2014 03:10:04 +0000 (14:10 +1100)]
mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE fix

I messed up and forgot to commit this fix before sending out the original
patch.

It fixes build issues in various files using VM_BUG_ON_PAGE.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE
Sasha Levin [Fri, 3 Jan 2014 03:10:04 +0000 (14:10 +1100)]
mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE

Most of the VM_BUG_ON assertions are performed on a page.  Usually, when
one of these assertions fails we'll get a BUG_ON with a call stack and the
registers.

I've recently noticed based on the requests to add a small piece of code
that dumps the page to various VM_BUG_ON sites that the page dump is quite
useful to people debugging issues in mm.

This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
VM_BUG_ON() does, also dumps the page before executing the actual BUG_ON.

Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agofs/proc/page.c: add PageAnon check to surely detect thp
Naoya Horiguchi [Fri, 3 Jan 2014 03:10:03 +0000 (14:10 +1100)]
fs/proc/page.c: add PageAnon check to surely detect thp

stable_page_flags() checks !PageHuge && PageTransCompound && PageLRU to
know that a specified page is thp or not.  But sometimes it's not enough
and we fail to detect thp when the thp is on pagevec.  This happens only
for a few seconds after LRU list operations, but it makes it difficult to
control our applications depending on this flag.

So this patch adds another check PageAnon to detect thps on pagevec.  It
might not give the future extensibility for thp pagecache, but it's OK at
least for now.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: remove BUG_ON() from mlock_vma_page()
Bob Liu [Fri, 3 Jan 2014 03:10:03 +0000 (14:10 +1100)]
mm: remove BUG_ON() from mlock_vma_page()

objrmap doesn't work for nonlinear VMAs because the assumption that
offset-into-file correlates with offset-into-virtual-addresses does not
hold.  Hence what try_to_unmap_cluster does is a mini "virtual scan" of
each nonlinear VMA which maps the file to which the target page belongs.
If vma locked, mlock the pages in the cluster, rather than unmapping them.
However, not all pages are guarantee page locked instead of the check
page, resulting in the below BUG_ON().

It's safe to mlock_vma_page() without PageLocked, so fix this issue by
removing that BUG_ON().

[  253.869145] kernel BUG at mm/mlock.c:82!
[  253.869549] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  253.870098] Dumping ftrace buffer:
[  253.870098]    (ftrace buffer empty)
[  253.870098] Modules linked in:
[  253.870098] CPU: 10 PID: 9162 Comm: trinity-child75 Tainted: G        W 3.13.0-rc4-next-20131216-sasha-00011-g5f105ec-dirty #4137
[  253.873310] task: ffff8800c98cb000 ti: ffff8804d34e8000 task.ti: ffff8804d34e8000
[  253.873310] RIP: 0010:[<ffffffff81281f28>]  [<ffffffff81281f28>] mlock_vma_page+0x18/0xc0
[  253.873310] RSP: 0000:ffff8804d34e99e8  EFLAGS: 00010246
[  253.873310] RAX: 006fffff8038002c RBX: ffffea00474944c0 RCX: ffff880807636000
[  253.873310] RDX: ffffea0000000000 RSI: 00007f17a9bca000 RDI: ffffea00474944c0
[  253.873310] RBP: ffff8804d34e99f8 R08: ffff880807020000 R09: 0000000000000000
[  253.873310] R10: 0000000000000001 R11: 0000000000002000 R12: 00007f17a9bca000
[  253.873310] R13: ffffea00474944c0 R14: 00007f17a9be0000 R15: ffff880807020000
[  253.873310] FS:  00007f17aa31a700(0000) GS:ffff8801c9c00000(0000) knlGS:0000000000000000
[  253.873310] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  253.873310] CR2: 00007f17a94fa000 CR3: 00000004d3b02000 CR4: 00000000000006e0
[  253.873310] DR0: 00007f17a74ca000 DR1: 0000000000000000 DR2: 0000000000000000
[  253.873310] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[  253.873310] Stack:
[  253.873310]  0000000b3de28067 ffff880b3de28e50 ffff8804d34e9aa8 ffffffff8128bc31
[  253.873310]  0000000000000301 ffffea0011850220 ffff8809a4039000 ffffea0011850238
[  253.873310]  ffff8804d34e9aa8 ffff880807636060 0000000000000001 ffff880807636348
[  253.873310] Call Trace:
[  253.873310]  [<ffffffff8128bc31>] try_to_unmap_cluster+0x1c1/0x340
[  253.873310]  [<ffffffff8128c60a>] try_to_unmap_file+0x20a/0x2e0
[  253.873310]  [<ffffffff8128c7b3>] try_to_unmap+0x73/0x90
[  253.873310]  [<ffffffff812b526d>] __unmap_and_move+0x18d/0x250
[  253.873310]  [<ffffffff812b53e9>] unmap_and_move+0xb9/0x180
[  253.873310]  [<ffffffff812b559b>] migrate_pages+0xeb/0x2f0
[  253.873310]  [<ffffffff812a0660>] ? queue_pages_pte_range+0x1a0/0x1a0
[  253.873310]  [<ffffffff812a193c>] migrate_to_node+0x9c/0xc0
[  253.873310]  [<ffffffff812a30b8>] do_migrate_pages+0x1b8/0x240
[  253.873310]  [<ffffffff812a3456>] SYSC_migrate_pages+0x316/0x380
[  253.873310]  [<ffffffff812a31ec>] ? SYSC_migrate_pages+0xac/0x380
[  253.873310]  [<ffffffff811763c6>] ? vtime_account_user+0x96/0xb0
[  253.873310]  [<ffffffff812a34ce>] SyS_migrate_pages+0xe/0x10
[  253.873310]  [<ffffffff843c4990>] tracesys+0xdd/0xe2
[  253.873310] Code: 0f 1f 00 65 48 ff 04 25 10 25 1d 00 48 83 c4 08
5b c9 c3 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 8b 07 48 89 fb
a8 01 75 10 <0f> 0b 66 0f 1f 44 00 00 eb fe 66 0f 1f 44 00 00 f0 0f ba
2f 15
[  253.873310] RIP  [<ffffffff81281f28>] mlock_vma_page+0x18/0xc0
[  253.873310]  RSP <ffff8804d34e99e8>
[  253.904194] ---[ end trace be59c4a7f8edab3f ]---

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomemcg: do not use vmalloc for mem_cgroup allocations
Vladimir Davydov [Fri, 3 Jan 2014 03:10:03 +0000 (14:10 +1100)]
memcg: do not use vmalloc for mem_cgroup allocations

The vmalloc was introduced by 333279 ("memcgroup: use vmalloc for
mem_cgroup allocation"), because at that time MAX_NUMNODES was used for
defining the per-node array in the mem_cgroup structure so that the
structure could be huge even if the system had the only NUMA node.

The situation was significantly improved by patch 45cf7e ("memcg: reduce
the size of struct memcg 244-fold"), which made the size of the mem_cgroup
structure calculated dynamically depending on the real number of NUMA
nodes installed on the system (nr_node_ids), so now there is no point in
using vmalloc here: the structure is allocated rarely and on most systems
its size is about 1K.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@openvz.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm-munlock-fix-potential-race-with-thp-page-split-fix
Andrew Morton [Fri, 3 Jan 2014 03:10:03 +0000 (14:10 +1100)]
mm-munlock-fix-potential-race-with-thp-page-split-fix

avoid a coding-style ugly

Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: munlock: fix potential race with THP page split
Vlastimil Babka [Fri, 3 Jan 2014 03:10:03 +0000 (14:10 +1100)]
mm: munlock: fix potential race with THP page split

Since commit ff6a6da60 ("mm: accelerate munlock() treatment of THP pages")
munlock skips tail pages of a munlocked THP page.  There is some attempt
to prevent bad consequences of racing with a THP page split, but code
inspection indicates that there are two problems that may lead to a
non-fatal, yet wrong outcome.

First, __split_huge_page_refcount() copies flags including PageMlocked
from the head page to the tail pages.  Clearing PageMlocked by
munlock_vma_page() in the middle of this operation might result in part of
tail pages left with PageMlocked flag.  As the head page still appears to
be a THP page until all tail pages are processed, munlock_vma_page() might
think it munlocked the whole THP page and skip all the former tail pages.
Before ff6a6da60, those pages would be cleared in further iterations of
munlock_vma_pages_range(), but NR_MLOCK would still become undercounted
(related the next point).

Second, NR_MLOCK accounting is based on call to hpage_nr_pages() after the
PageMlocked is cleared.  The accounting might also become inconsistent due
to race with __split_huge_page_refcount()

- undercount when HUGE_PMD_NR is subtracted, but some tail pages are
  left with PageMlocked set and counted again (only possible before
  ff6a6da60)

- overcount when hpage_nr_pages() sees a normal page (split has already
  finished), but the parallel split has meanwhile cleared PageMlocked from
  additional tail pages

This patch prevents both problems via extending the scope of lru_lock in
munlock_vma_page().  This is convenient because:

- __split_huge_page_refcount() takes lru_lock for its whole operation

- munlock_vma_page() typically takes lru_lock anyway for page isolation

As this becomes a second function where page isolation is done with
lru_lock already held, factor this out to a new
__munlock_isolate_lru_page() function and clean up the code around.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm-print-more-details-for-bad_page-fix
Andrew Morton [Fri, 3 Jan 2014 03:10:02 +0000 (14:10 +1100)]
mm-print-more-details-for-bad_page-fix

switch to pr_alert.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: print more details for bad_page()
Dave Hansen [Fri, 3 Jan 2014 03:10:02 +0000 (14:10 +1100)]
mm: print more details for bad_page()

bad_page() is cool in that it prints out a bunch of data about the page.
But, I can never remember which page flags are good and which are bad, or
whether ->index or ->mapping is required to be NULL.

This patch allows bad/dump_page() callers to specify a string about why
they are dumping the page and adds explanation strings to a number of
places.  It also adds a 'bad_flags' argument to bad_page(), which it then
dumps out separately from the flags which are actually set.

This way, the messages will show specifically why the page was bad,
*specifically* which flags it is complaining about, if it was a page flag
combination which was the problem.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm/zswap.c: change params from hidden to ro
Dan Streetman [Fri, 3 Jan 2014 03:10:02 +0000 (14:10 +1100)]
mm/zswap.c: change params from hidden to ro

The "compressor" and "enabled" params are currently hidden, this changes
them to read-only, so userspace can tell if zswap is enabled or not and
see what compressor is in use.

Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Cc: Vladimir Murzin <murzin.v@gmail.com>
Cc: Bob Liu <bob.liu@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Weijie Yang <weijie.yang@samsung.com>
Acked-by: Seth Jennings <sjennings@variantweb.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: documentation: remove hopelessly out-of-date locking doc
Dave Hansen [Fri, 3 Jan 2014 03:10:02 +0000 (14:10 +1100)]
mm: documentation: remove hopelessly out-of-date locking doc

Documentation/vm/locking is a blast from the past.  In the entire git
history, it has had precisely Three modifications.  Two of those look to
be pure renames, and the third was from 2005.

The doc contains such gems as:

> The page_table_lock is grabbed while holding the
> kernel_lock spinning monitor.

> Page stealers hold kernel_lock to protect against a bunch of
> races.

Or this which talks about mmap_sem:

> 4. The exception to this rule is expand_stack, which just
>    takes the read lock and the page_table_lock, this is ok
>    because it doesn't really modify fields anybody relies on.

expand_stack() doesn't take any locks any more directly, and the
mmap_sem acquisition was long ago moved up in to the page fault
code itself.

It could be argued that we need to rewrite this, but it is
dangerous to leave it as-is.  It will confuse more people than it
helps.

Signed-off-by: Dave Hansen <dave.hansen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm/migrate: remove unused function, fail_migrate_page()
Joonsoo Kim [Fri, 3 Jan 2014 03:10:01 +0000 (14:10 +1100)]
mm/migrate: remove unused function, fail_migrate_page()

fail_migrate_page() isn't used anywhere, so remove it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages
Joonsoo Kim [Fri, 3 Jan 2014 03:10:01 +0000 (14:10 +1100)]
mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages

Some part of putback_lru_pages() and putback_movable_pages() is
duplicated, so it could confuse us what we should use.  We can remove
putback_lru_pages() since it is not really needed now.  This makes us
undestand and maintain the code more easily.

And comment on putback_movable_pages() is stale now, so fix it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm/migrate: correct failure handling if !hugepage_migration_support()
Joonsoo Kim [Fri, 3 Jan 2014 03:10:01 +0000 (14:10 +1100)]
mm/migrate: correct failure handling if !hugepage_migration_support()

We should remove the page from the list if we fail with ENOSYS, since
migrate_pages() consider error cases except -ENOMEM and -EAGAIN as
permanent failure and it assumes that the page would be removed from the
list.  Without this patch, we could overcount number of failure.

In addition, we should put back the new hugepage if
!hugepage_migration_support().  If not, we would leak hugepage memory.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm/migrate: add comment about permanent failure path
Naoya Horiguchi [Fri, 3 Jan 2014 03:10:01 +0000 (14:10 +1100)]
mm/migrate: add comment about permanent failure path

Let's add a comment about where the failed page goes to, which makes code
more readable.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Christoph Lameter <cl@linux.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure
David Rientjes [Fri, 3 Jan 2014 03:10:01 +0000 (14:10 +1100)]
mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure

__GFP_NOFAIL may return NULL when coupled with GFP_NOWAIT or GFP_ATOMIC.

Luckily, nothing currently does such craziness.  So instead of causing
such allocations to loop (potentially forever), we maintain the current
behavior and also warn about the new users of the deprecated flag.

Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: compaction: reset scanner positions immediately when they meet
Vlastimil Babka [Fri, 3 Jan 2014 03:10:00 +0000 (14:10 +1100)]
mm: compaction: reset scanner positions immediately when they meet

Compaction used to start its migrate and free page scaners at the zone's
lowest and highest pfn, respectively.  Later, caching was introduced to
remember the scanners' progress across compaction attempts so that
pageblocks are not re-scanned uselessly.  Additionally, pageblocks where
isolation failed are marked to be quickly skipped when encountered again
in future compactions.

Currently, both the reset of cached pfn's and clearing of the pageblock
skip information for a zone is done in __reset_isolation_suitable().  This
function gets called when:

 - compaction is restarting after being deferred
 - compact_blockskip_flush flag is set in compact_finished() when the scanners
   meet (and not again cleared when direct compaction succeeds in allocation)
   and kswapd acts upon this flag before going to sleep

This behavior is suboptimal for several reasons:

 - when direct sync compaction is called after async compaction fails (in the
   allocation slowpath), it will effectively do nothing, unless kswapd
   happens to process the compact_blockskip_flush flag meanwhile. This is racy
   and goes against the purpose of sync compaction to more thoroughly retry
   the compaction of a zone where async compaction has failed.
   The restart-after-deferring path cannot help here as deferring happens only
   after the sync compaction fails. It is also done only for the preferred
   zone, while the compaction might be done for a fallback zone.

 - the mechanism of marking pageblock to be skipped has little value since the
   cached pfn's are reset only together with the pageblock skip flags. This
   effectively limits pageblock skip usage to parallel compactions.

This patch changes compact_finished() so that cached pfn's are reset
immediately when the scanners meet.  Clearing pageblock skip flags is
unchanged, as well as the other situations where cached pfn's are reset.
This allows the sync-after-async compaction to retry pageblocks not marked
as skipped, such as blocks !MIGRATE_MOVABLE blocks that async compactions
now skips without marking them.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: compaction: do not mark unmovable pageblocks as skipped in async compaction
Vlastimil Babka [Fri, 3 Jan 2014 03:10:00 +0000 (14:10 +1100)]
mm: compaction: do not mark unmovable pageblocks as skipped in async compaction

Compaction temporarily marks pageblocks where it fails to isolate pages as
to-be-skipped in further compactions, in order to improve efficiency.  One
of the reasons to fail isolating pages is that isolation is not attempted
in pageblocks that are not of MIGRATE_MOVABLE (or CMA) type.

The problem is that blocks skipped due to not being MIGRATE_MOVABLE in
async compaction become skipped due to the temporary mark also in future
sync compaction.  Moreover, this may follow quite soon during
__alloc_page_slowpath, without much time for kswapd to clear the pageblock
skip marks.  This goes against the idea that sync compaction should try to
scan these blocks more thoroughly than the async compaction.

The fix is to ensure in async compaction that these !MIGRATE_MOVABLE
blocks are not marked to be skipped.  Note this should not affect
performance or locking impact of further async compactions, as skipping a
block due to being !MIGRATE_MOVABLE is done soon after skipping a block
marked to be skipped, both without locking.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: compaction: detect when scanners meet in isolate_freepages
Vlastimil Babka [Fri, 3 Jan 2014 03:10:00 +0000 (14:10 +1100)]
mm: compaction: detect when scanners meet in isolate_freepages

Compaction of a zone is finished when the migrate scanner (which begins at
the zone's lowest pfn) meets the free page scanner (which begins at the
zone's highest pfn).  This is detected in compact_zone() and in the case
of direct compaction, the compact_blockskip_flush flag is set so that
kswapd later resets the cached scanner pfn's, and a new compaction may
again start at the zone's borders.

The meeting of the scanners can happen during either scanner's activity.
However, it may currently fail to be detected when it occurs in the free
page scanner, due to two problems.  First, isolate_freepages() keeps
free_pfn at the highest block where it isolated pages from, for the
purposes of not missing the pages that are returned back to allocator when
migration fails.  Second, failing to isolate enough free pages due to
scanners meeting results in -ENOMEM being returned by migrate_pages(),
which makes compact_zone() bail out immediately without calling
compact_finished() that would detect scanners meeting.

This failure to detect scanners meeting might result in repeated attempts
at compaction of a zone that keep starting from the cached pfn's close to
the meeting point, and quickly failing through the -ENOMEM path, without
the cached pfns being reset, over and over.  This has been observed
(through additional tracepoints) in the third phase of the mmtests
stress-highalloc benchmark, where the allocator runs on an otherwise idle
system.  The problem was observed in the DMA32 zone, which was used as a
fallback to the preferred Normal zone, but on the 4GB system it was
actually the largest zone.  The problem is even amplified for such
fallback zone - the deferred compaction logic, which could (after being
fixed by a previous patch) reset the cached scanner pfn's, is only applied
to the preferred zone and not for the fallbacks.

The problem in the third phase of the benchmark was further amplified by
commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") which
resulted in a non-deterministic regression of the allocation success rate
from ~85% to ~65%.  This occurs in about half of benchmark runs, making
bisection problematic.  It is unlikely that the commit itself is buggy,
but it should put more pressure on the DMA32 zone during phases 1 and 2,
which may leave it more fragmented in phase 3 and expose the bugs that
this patch fixes.

The fix is to make scanners meeting in isolate_freepage() stay that way,
and to check in compact_zone() for scanners meeting when migrate_pages()
returns -ENOMEM.  The result is that compact_finished() also detects
scanners meeting and sets the compact_blockskip_flush flag to make kswapd
reset the scanner pfn's.

The results in stress-highalloc benchmark show that the "regression" by
commit 81c0a2bb in phase 3 no longer occurs, and phase 1 and 2 allocation
success rates are also significantly improved.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: compaction: reset cached scanner pfn's before reading them
Vlastimil Babka [Fri, 3 Jan 2014 03:10:00 +0000 (14:10 +1100)]
mm: compaction: reset cached scanner pfn's before reading them

Compaction caches pfn's for its migrate and free scanners to avoid
scanning the whole zone each time.  In compact_zone(), the cached values
are read to set up initial values for the scanners.  There are several
situations when these cached pfn's are reset to the first and last pfn of
the zone, respectively.  One of these situations is when a compaction has
been deferred for a zone and is now being restarted during a direct
compaction, which is also done in compact_zone().

However, compact_zone() currently reads the cached pfn's *before*
resetting them.  This means the reset doesn't affect the compaction that
performs it, and with good chance also subsequent compactions, as
update_pageblock_skip() is likely to be called and update the cached pfn's
to those being processed.  Another chance for a successful reset is when a
direct compaction detects that migration and free scanners meet (which has
its own problems addressed by another patch) and sets
update_pageblock_skip flag which kswapd uses to do the reset because it
goes to sleep.

This is clearly a bug that results in non-deterministic behavior, so this
patch moves the cached pfn reset to be performed *before* the values are
read.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: compaction: encapsulate defer reset logic
Vlastimil Babka [Fri, 3 Jan 2014 03:09:59 +0000 (14:09 +1100)]
mm: compaction: encapsulate defer reset logic

Currently there are several functions to manipulate the deferred
compaction state variables.  The remaining case where the variables are
touched directly is when a successful allocation occurs in direct
compaction, or is expected to be successful in the future by kswapd.
Here, the lowest order that is expected to fail is updated, and in the
case of successful allocation, the deferred status and counter is reset
completely.

Create a new function compaction_defer_reset() to encapsulate this
functionality and make it easier to understand the code.  No functional
change.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: compaction: trace compaction begin and end
Mel Gorman [Fri, 3 Jan 2014 03:09:59 +0000 (14:09 +1100)]
mm: compaction: trace compaction begin and end

The broad goal of the series is to improve allocation success rates for
huge pages through memory compaction, while trying not to increase the
compaction overhead.  The original objective was to reintroduce capturing
of high-order pages freed by the compaction, before they are split by
concurrent activity.  However, several bugs and opportunities for simple
improvements were found in the current implementation, mostly through
extra tracepoints (which are however too ugly for now to be considered for
sending).

The patches mostly deal with two mechanisms that reduce compaction
overhead, which is caching the progress of migrate and free scanners, and
marking pageblocks where isolation failed to be skipped during further
scans.

Patch 1 (from mgorman) adds tracepoints that allow calculate time spent in
        compaction and potentially debug scanner pfn values.

Patch 2 encapsulates the some functionality for handling deferred compactions
        for better maintainability, without a functional change
        type is not determined without being actually needed.

Patch 3 fixes a bug where cached scanner pfn's are sometimes reset only after
        they have been read to initialize a compaction run.

Patch 4 fixes a bug where scanners meeting is sometimes not properly detected
        and can lead to multiple compaction attempts quitting early without
        doing any work.

Patch 5 improves the chances of sync compaction to process pageblocks that
        async compaction has skipped due to being !MIGRATE_MOVABLE.

Patch 6 improves the chances of sync direct compaction to actually do anything
        when called after async compaction fails during allocation slowpath.

The impact of patches were validated using mmtests's stress-highalloc
benchmark with mmtests's stress-highalloc benchmark on a x86_64 machine
with 4GB memory.

Due to instability of the results (mostly related to the bugs fixed by
patches 2 and 3), 10 iterations were performed, taking min,mean,max values
for success rates and mean values for time and vmstat-based metrics.

First, the default GFP_HIGHUSER_MOVABLE allocations were tested with the
patches stacked on top of v3.13-rc2.  Patch 2 is OK to serve as baseline
due to no functional changes in 1 and 2.  Comments below.

stress-highalloc
                             3.13-rc2              3.13-rc2              3.13-rc2              3.13-rc2              3.13-rc2
                              2-nothp               3-nothp               4-nothp               5-nothp               6-nothp
Success 1 Min          9.00 (  0.00%)       10.00 (-11.11%)       43.00 (-377.78%)       43.00 (-377.78%)       33.00 (-266.67%)
Success 1 Mean        27.50 (  0.00%)       25.30 (  8.00%)       45.50 (-65.45%)       45.90 (-66.91%)       46.30 (-68.36%)
Success 1 Max         36.00 (  0.00%)       36.00 (  0.00%)       47.00 (-30.56%)       48.00 (-33.33%)       52.00 (-44.44%)
Success 2 Min         10.00 (  0.00%)        8.00 ( 20.00%)       46.00 (-360.00%)       45.00 (-350.00%)       35.00 (-250.00%)
Success 2 Mean        26.40 (  0.00%)       23.50 ( 10.98%)       47.30 (-79.17%)       47.60 (-80.30%)       48.10 (-82.20%)
Success 2 Max         34.00 (  0.00%)       33.00 (  2.94%)       48.00 (-41.18%)       50.00 (-47.06%)       54.00 (-58.82%)
Success 3 Min         65.00 (  0.00%)       63.00 (  3.08%)       85.00 (-30.77%)       84.00 (-29.23%)       85.00 (-30.77%)
Success 3 Mean        76.70 (  0.00%)       70.50 (  8.08%)       86.20 (-12.39%)       85.50 (-11.47%)       86.00 (-12.13%)
Success 3 Max         87.00 (  0.00%)       86.00 (  1.15%)       88.00 ( -1.15%)       87.00 (  0.00%)       87.00 (  0.00%)

            3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
             2-nothp     3-nothp     4-nothp     5-nothp     6-nothp
User         6437.72     6459.76     5960.32     5974.55     6019.67
System       1049.65     1049.09     1029.32     1031.47     1032.31
Elapsed      1856.77     1874.48     1949.97     1994.22     1983.15

                              3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
                               2-nothp     3-nothp     4-nothp     5-nothp     6-nothp
Minor Faults                 253952267   254581900   250030122   250507333   250157829
Major Faults                       420         407         506         530         530
Swap Ins                             4           9           9           6           6
Swap Outs                          398         375         345         346         333
Direct pages scanned            197538      189017      298574      287019      299063
Kswapd pages scanned           1809843     1801308     1846674     1873184     1861089
Kswapd pages reclaimed         1806972     1798684     1844219     1870509     1858622
Direct pages reclaimed          197227      188829      298380      286822      298835
Kswapd efficiency                  99%         99%         99%         99%         99%
Kswapd velocity                953.382     970.449     952.243     934.569     922.286
Direct efficiency                  99%         99%         99%         99%         99%
Direct velocity                104.058     101.832     153.961     143.200     148.205
Percentage direct scans             9%          9%         13%         13%         13%
Zone normal velocity           347.289     359.676     348.063     339.933     332.983
Zone dma32 velocity            710.151     712.605     758.140     737.835     737.507
Zone dma velocity                0.000       0.000       0.000       0.000       0.000
Page writes by reclaim         557.600     429.000     353.600     426.400     381.800
Page writes file                   159          53           7          79          48
Page writes anon                   398         375         345         346         333
Page reclaim immediate             825         644         411         575         420
Sector Reads                   2781750     2769780     2878547     2939128     2910483
Sector Writes                 12080843    12083351    12012892    12002132    12010745
Page rescued immediate               0           0           0           0           0
Slabs scanned                  1575654     1545344     1778406     1786700     1794073
Direct inode steals               9657       10037       15795       14104       14645
Kswapd inode steals              46857       46335       50543       50716       51796
Kswapd skipped wait                  0           0           0           0           0
THP fault alloc                     97          91          81          71          77
THP collapse alloc                 456         506         546         544         565
THP splits                           6           5           5           4           4
THP fault fallback                   0           1           0           0           0
THP collapse fail                   14          14          12          13          12
Compaction stalls                 1006         980        1537        1536        1548
Compaction success                 303         284         562         559         578
Compaction failures                702         696         974         976         969
Page migrate success           1177325     1070077     3927538     3781870     3877057
Page migrate failure                 0           0           0           0           0
Compaction pages isolated      2547248     2306457     8301218     8008500     8200674
Compaction migrate scanned    42290478    38832618   153961130   154143900   159141197
Compaction free scanned       89199429    79189151   356529027   351943166   356326727
Compaction cost                   1566        1426        5312        5156        5294
NUMA PTE updates                     0           0           0           0           0
NUMA hint faults                     0           0           0           0           0
NUMA hint local faults               0           0           0           0           0
NUMA hint local percent            100         100         100         100         100
NUMA pages migrated                  0           0           0           0           0
AutoNUMA cost                        0           0           0           0           0

Observations:

- The "Success 3" line is allocation success rate with system idle
  (phases 1 and 2 are with background interference).  I used to get stable
  values around 85% with vanilla 3.11.  The lower min and mean values came
  with 3.12.  This was bisected to commit 81c0a2bb ("mm: page_alloc: fair
  zone allocator policy") As explained in comment for patch 3, I don't
  think the commit is wrong, but that it makes the effect of compaction
  bugs worse.  From patch 3 onwards, the results are OK and match the 3.11
  results.

- Patch 4 also clearly helps phases 1 and 2, and exceeds any results
  I've seen with 3.11 (I didn't measure it that thoroughly then, but it
  was never above 40%).

- Compaction cost and number of scanned pages is higher, especially due
  to patch 4.  However, keep in mind that patches 3 and 4 fix existing
  bugs in the current design of compaction overhead mitigation, they do
  not change it.  If overhead is found unacceptable, then it should be
  decreased differently (and consistently, not due to random conditions)
  than the current implementation does.  In contrast, patches 5 and 6
  (which are not strictly bug fixes) do not increase the overhead (but
  also not success rates).  This might be a limitation of the
  stress-highalloc benchmark as it's quite uniform.

Another set of results is when configuring stress-highalloc t allocate
with similar flags as THP uses:
 (GFP_HIGHUSER_MOVABLE|__GFP_NOMEMALLOC|__GFP_NORETRY|__GFP_NO_KSWAPD)

stress-highalloc
                             3.13-rc2              3.13-rc2              3.13-rc2              3.13-rc2              3.13-rc2
                                2-thp                 3-thp                 4-thp                 5-thp                 6-thp
Success 1 Min          2.00 (  0.00%)        7.00 (-250.00%)       18.00 (-800.00%)       19.00 (-850.00%)       26.00 (-1200.00%)
Success 1 Mean        19.20 (  0.00%)       17.80 (  7.29%)       29.20 (-52.08%)       29.90 (-55.73%)       32.80 (-70.83%)
Success 1 Max         27.00 (  0.00%)       29.00 ( -7.41%)       35.00 (-29.63%)       36.00 (-33.33%)       37.00 (-37.04%)
Success 2 Min          3.00 (  0.00%)        8.00 (-166.67%)       21.00 (-600.00%)       21.00 (-600.00%)       32.00 (-966.67%)
Success 2 Mean        19.30 (  0.00%)       17.90 (  7.25%)       32.20 (-66.84%)       32.60 (-68.91%)       35.70 (-84.97%)
Success 2 Max         27.00 (  0.00%)       30.00 (-11.11%)       36.00 (-33.33%)       37.00 (-37.04%)       39.00 (-44.44%)
Success 3 Min         62.00 (  0.00%)       62.00 (  0.00%)       85.00 (-37.10%)       75.00 (-20.97%)       64.00 ( -3.23%)
Success 3 Mean        66.30 (  0.00%)       65.50 (  1.21%)       85.60 (-29.11%)       83.40 (-25.79%)       83.50 (-25.94%)
Success 3 Max         70.00 (  0.00%)       69.00 (  1.43%)       87.00 (-24.29%)       86.00 (-22.86%)       87.00 (-24.29%)

            3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
               2-thp       3-thp       4-thp       5-thp       6-thp
User         6547.93     6475.85     6265.54     6289.46     6189.96
System       1053.42     1047.28     1043.23     1042.73     1038.73
Elapsed      1835.43     1821.96     1908.67     1912.74     1956.38

                              3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
                                 2-thp       3-thp       4-thp       5-thp       6-thp
Minor Faults                 256805673   253106328   253222299   249830289   251184418
Major Faults                       395         375         423         434         448
Swap Ins                            12          10          10          12           9
Swap Outs                          530         537         487         455         415
Direct pages scanned             71859       86046      153244      152764      190713
Kswapd pages scanned           1900994     1870240     1898012     1892864     1880520
Kswapd pages reclaimed         1897814     1867428     1894939     1890125     1877924
Direct pages reclaimed           71766       85908      153167      152643      190600
Kswapd efficiency                  99%         99%         99%         99%         99%
Kswapd velocity               1029.000    1067.782    1000.091     991.049     951.218
Direct efficiency                  99%         99%         99%         99%         99%
Direct velocity                 38.897      49.127      80.747      79.983      96.468
Percentage direct scans             3%          4%          7%          7%          9%
Zone normal velocity           351.377     372.494     348.910     341.689     335.310
Zone dma32 velocity            716.520     744.414     731.928     729.343     712.377
Zone dma velocity                0.000       0.000       0.000       0.000       0.000
Page writes by reclaim         669.300     604.000     545.700     538.900     429.900
Page writes file                   138          66          58          83          14
Page writes anon                   530         537         487         455         415
Page reclaim immediate             806         655         772         548         517
Sector Reads                   2711956     2703239     2811602     2818248     2839459
Sector Writes                 12163238    12018662    12038248    11954736    11994892
Page rescued immediate               0           0           0           0           0
Slabs scanned                  1385088     1388364     1507968     1513292     1558656
Direct inode steals               1739        2564        4622        5496        6007
Kswapd inode steals              47461       46406       47804       48013       48466
Kswapd skipped wait                  0           0           0           0           0
THP fault alloc                    110          82          84          69          70
THP collapse alloc                 445         482         467         462         539
THP splits                           6           5           4           5           3
THP fault fallback                   3           0           0           0           0
THP collapse fail                   15          14          14          14          13
Compaction stalls                  659         685        1033        1073        1111
Compaction success                 222         225         410         427         456
Compaction failures                436         460         622         646         655
Page migrate success            446594      439978     1085640     1095062     1131716
Page migrate failure                 0           0           0           0           0
Compaction pages isolated      1029475     1013490     2453074     2482698     2565400
Compaction migrate scanned     9955461    11344259    24375202    27978356    30494204
Compaction free scanned       27715272    28544654    80150615    82898631    85756132
Compaction cost                    552         555        1344        1379        1436
NUMA PTE updates                     0           0           0           0           0
NUMA hint faults                     0           0           0           0           0
NUMA hint local faults               0           0           0           0           0
NUMA hint local percent            100         100         100         100         100
NUMA pages migrated                  0           0           0           0           0
AutoNUMA cost                        0           0           0           0           0

There are some differences from the previous results for THP-like allocations:

- Here, the bad result for unpatched kernel in phase 3 is much more
  consistent to be between 65-70% and not related to the "regression" in
  3.12.  Still there is the improvement from patch 4 onwards, which brings
  it on par with simple GFP_HIGHUSER_MOVABLE allocations.

- Compaction costs have increased, but nowhere near as much as the
  non-THP case.  Again, the patches should be worth the gained
  determininsm.

- Patches 5 and 6 somewhat increase the number of migrate-scanned pages.
   This is most likely due to __GFP_NO_KSWAPD flag, which means the cached
  pfn's and pageblock skip bits are not reset by kswapd that often (at
  least in phase 3 where no concurrent activity would wake up kswapd) and
  the patches thus help the sync-after-async compaction.  It doesn't
  however show that the sync compaction would help so much with success
  rates, which can be again seen as a limitation of the benchmark
  scenario.

This patch (of 6):

Add two tracepoints for compaction begin and end of a zone.  Using this it
is possible to calculate how much time a workload is spending within
compaction and potentially debug problems related to cached pfns for
scanning.  In combination with the direct reclaim and slab trace points it
should be possible to estimate most allocation-related overhead for a
workload.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomemcg, oom: lock mem_cgroup_print_oom_info
Michal Hocko [Fri, 3 Jan 2014 03:09:59 +0000 (14:09 +1100)]
memcg, oom: lock mem_cgroup_print_oom_info

mem_cgroup_print_oom_info uses a static buffer (memcg_name) to store the
name of the cgroup.  This is not safe as pointed out by David Rientjes
because memcg oom is locked only for its hierarchy and nothing prevents
another parallel hierarchy to trigger oom as well and overwrite the
already in-use buffer.

This patch introduces oom_info_lock hidden inside
mem_cgroup_print_oom_info which is held throughout the function.  It makes
access to memcg_name safe and as a bonus it also prevents parallel memcg
ooms to interleave their statistics which would make the printed data hard
to analyze otherwise.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agosched-add-tracepoints-related-to-numa-task-migration-fix
Andrew Morton [Fri, 3 Jan 2014 03:09:59 +0000 (14:09 +1100)]
sched-add-tracepoints-related-to-numa-task-migration-fix

remove semicolon-after-if, repair coding-style

Cc: Alex Thorlton <athorlton@sgi.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agosched: add tracepoints related to NUMA task migration
Mel Gorman [Fri, 3 Jan 2014 03:09:58 +0000 (14:09 +1100)]
sched: add tracepoints related to NUMA task migration

This patch adds three tracepoints
 o trace_sched_move_numa when a task is moved to a node
 o trace_sched_swap_numa when a task is swapped with another task
 o trace_sched_stick_numa when a numa-related migration fails

The tracepoints allow the NUMA scheduler activity to be monitored and the
following high-level metrics can be calculated

 o NUMA migrated stuck  nr trace_sched_stick_numa
 o NUMA migrated idle  nr trace_sched_move_numa
 o NUMA migrated swapped nr trace_sched_swap_numa
 o NUMA local swapped  trace_sched_swap_numa src_nid == dst_nid (should never happen)
 o NUMA remote swapped  trace_sched_swap_numa src_nid != dst_nid (should == NUMA migrated swapped)
 o NUMA group swapped  trace_sched_swap_numa src_ngid == dst_ngid
 Maybe a small number of these are acceptable
 but a high number would be a major surprise.
 It would be even worse if bounces are frequent.
 o NUMA avg task migs.  Average number of migrations for tasks
 o NUMA stddev task mig  Self-explanatory
 o NUMA max task migs.  Maximum number of migrations for a single task

In general the intent of the tracepoints is to help diagnose problems
where automatic NUMA balancing appears to be doing an excessive amount of
useless work.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: numa: do not automatically migrate KSM pages
Mel Gorman [Fri, 3 Jan 2014 03:09:58 +0000 (14:09 +1100)]
mm: numa: do not automatically migrate KSM pages

KSM pages can be shared between tasks that are not necessarily related to
each other from a NUMA perspective.  This patch causes those pages to be
ignored by automatic NUMA balancing so they do not migrate and do not
cause unrelated tasks to be grouped together.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: numa: trace tasks that fail migration due to rate limiting
Mel Gorman [Fri, 3 Jan 2014 03:09:58 +0000 (14:09 +1100)]
mm: numa: trace tasks that fail migration due to rate limiting

A low local/remote numa hinting fault ratio is potentially explained by
failed migrations.  This patch adds a tracepoint that fires when migration
fails due to migration rate limitation.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: numa: limit scope of lock for NUMA migrate rate limiting
Mel Gorman [Fri, 3 Jan 2014 03:09:58 +0000 (14:09 +1100)]
mm: numa: limit scope of lock for NUMA migrate rate limiting

NUMA migrate rate limiting protects a migration counter and window using a
lock but in some cases this can be a contended lock.  It is not critical
that the number of pages be perfect, lost updates are acceptable.  Reduce
the importance of this lock.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm: numa: make NUMA-migrate related functions static
Mel Gorman [Fri, 3 Jan 2014 03:09:58 +0000 (14:09 +1100)]
mm: numa: make NUMA-migrate related functions static

numamigrate_update_ratelimit and numamigrate_isolate_page only have
callers in mm/migrate.c.  This patch makes them static.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agolib/show_mem.c: show num_poisoned_pages when oom
Xishi Qiu [Fri, 3 Jan 2014 03:09:57 +0000 (14:09 +1100)]
lib/show_mem.c: show num_poisoned_pages when oom

Show num_poisoned_pages when oom, it is a little helpful to find the
reason.  Also it will be emitted anytime show_mem() is called.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm/hwpoison: add '#' to hwpoison_inject
Wanpeng Li [Fri, 3 Jan 2014 03:09:57 +0000 (14:09 +1100)]
mm/hwpoison: add '#' to hwpoison_inject

Add '#' to hwpoison_inject just as done in madvise_hwpoison.

Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Vladimir Murzin <murzin.v@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
10 years agomm/ARM: fix ARMs __ffs() to conform to avoid warning with NO_BOOTMEM
Santosh Shilimkar [Fri, 3 Jan 2014 03:09:57 +0000 (14:09 +1100)]
mm/ARM: fix ARMs __ffs() to conform to avoid warning with NO_BOOTMEM

Building ARM with NO_BOOTMEM generates below warning.

mm/nobootmem.c: In function _____free_pages_memory___:
mm/nobootmem.c:88:11: warning: comparison of distinct pointer types lacks a cast

order = min(MAX_ORDER - 1UL, __ffs(start));

ARM's __ffs() differs from other architectures in that it ends up being
an int, whereas almost everyone else is unsigned long.

So fix ARMs __ffs() to conform to other architectures.  Suggested by
Russell King.

Some more details in below thread -
https://lkml.org/lkml/2013/12/9/807

Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>