]> git.karo-electronics.de Git - karo-tx-linux.git/log
karo-tx-linux.git
9 years agolib-find__bit-reimplementation-fix
Andrew Morton [Tue, 7 Apr 2015 23:45:01 +0000 (09:45 +1000)]
lib-find__bit-reimplementation-fix

include linux/bitmap.h, per Rasmus

Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolib: find_*_bit reimplementation
Yury Norov [Tue, 7 Apr 2015 23:45:01 +0000 (09:45 +1000)]
lib: find_*_bit reimplementation

This patchset does rework to find_bit function family to achieve better
performance, and decrease size of text.  All rework is done in patch 1.
Patches 2 and 3 are about code moving and renaming.

It was boot-tested on x86_64 and MIPS (big-endian) machines.
Performance tests were ran on userspace with code like this:

/* addr[] is filled from /dev/urandom */
start = clock();
while (ret < nbits)
ret = find_next_bit(addr, nbits, ret + 1);

end = clock();
printf("%ld\t", (unsigned long) end - start);

On Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz measurements are: (for
find_next_bit, nbits is 8M, for find_first_bit - 80K)

find_next_bit: find_first_bit:
new current new current
26932 43151 14777 14925
26947 43182 14521 15423
26507 43824 15053 14705
27329 43759 14473 14777
26895 43367 14847 15023
26990 43693 15103 15163
26775 43299 15067 15232
27282 42752 14544 15121
27504 43088 14644 14858
26761 43856 14699 15193
26692 43075 14781 14681
27137 42969 14451 15061
... ...

find_next_bit performance gain is 35-40%;
find_first_bit - no measurable difference.

On ARM machine, there is arch-specific implementation for find_bit.

Thanks a lot to George Spelvin and Rasmus Villemoes for hints and
helpful discussions.

This patch (of 3):

New implementations takes less space in source file (see diffstat) and in
object.  For me it's 710 vs 453 bytes of text.  It also shows better
performance.

find_last_bit description fixed due to obvious typo.

Signed-off-by: Yury Norov <yury.norov@gmail.com>
Reviewed-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reviewed-by: George Spelvin <linux@horizon.com>
Cc: Alexey Klimov <klimov.linux@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Mark Salter <msalter@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Valentin Rothberg <valentinrothberg@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoparisc: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:45:01 +0000 (09:45 +1000)]
parisc: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolru_cache: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:45:00 +0000 (09:45 +1000)]
lru_cache: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Lars Ellenberg <drbd-dev@lists.linbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agotracing: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:45:00 +0000 (09:45 +1000)]
tracing: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Miscellanea:

o Remove unused return value from trace_lookup_stack

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agocgroup: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:45:00 +0000 (09:45 +1000)]
cgroup: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoproc: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:45:00 +0000 (09:45 +1000)]
proc: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agos390: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:45:00 +0000 (09:45 +1000)]
s390: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agocris fasttimer: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:44:59 +0000 (09:44 +1000)]
cris fasttimer: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Miscellanea:

o Coalesce formats, realign arguments

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agocris: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:44:59 +0000 (09:44 +1000)]
cris: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Jesper Nilsson <jesper.nilsson@axis.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoopenrisc: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:44:59 +0000 (09:44 +1000)]
openrisc: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Jonas Bonn <jonas@southpole.se>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoARM: plat-pxa: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:44:59 +0000 (09:44 +1000)]
ARM: plat-pxa: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
(as it is here, it doesn't return # of chars emitted) will
eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agonios2: cpuinfo: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:44:59 +0000 (09:44 +1000)]
nios2: cpuinfo: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Ley Foon Tan <lftan@altera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomicroblaze: mb: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:44:58 +0000 (09:44 +1000)]
microblaze: mb: remove use of seq_printf return value

Remove errant x char after tab found by kbuild test robot

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Michal Simek <monstr@monstr.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomicroblaze: mb: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:44:58 +0000 (09:44 +1000)]
microblaze: mb: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Michal Simek <monstr@monstr.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoipc: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:44:58 +0000 (09:44 +1000)]
ipc: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agortc: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:44:58 +0000 (09:44 +1000)]
rtc: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agopower: wakeup: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:44:58 +0000 (09:44 +1000)]
power: wakeup: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Len Brown <len.brown@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agox86: mtrr: if: remove use of seq_printf return value
Joe Perches [Tue, 7 Apr 2015 23:44:58 +0000 (09:44 +1000)]
x86: mtrr: if: remove use of seq_printf return value

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolinux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
Rasmus Villemoes [Tue, 7 Apr 2015 23:44:57 +0000 (09:44 +1000)]
linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK

The macro BITMAP_LAST_WORD_MASK can be implemented without a conditional,
which will generally lead to slightly better generated code (221 bytes
saved for allmodconfig-GCOV_KERNEL, ~2k with GCOV_KERNEL).  As a small
bonus, this also ensures that the nbits parameter is expanded exactly
once.

In BITMAP_FIRST_WORD_MASK, if start is signed gcc is technically allowed
to assume it is positive (or divisible by BITS_PER_LONG), and hence just
do the simple mask.  It doesn't seem to use this, and even on an
architecture like x86 where the shift only depends on the lower 5 or 6
bits, and these bits are not affected by the signedness of the expression,
gcc still generates code to compute the C99 mandated value of start %
BITS_PER_LONG.  So just use a mask explicitly, also for consistency with
BITMAP_LAST_WORD_MASK.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Tejun Heo <tj@kernel.org>
Reviewed-by: George Spelvin <linux@horizon.com>
Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoMAINTAINERS: CREDITS: remove Stefano Brivio from B43
Joe Perches [Tue, 7 Apr 2015 23:44:57 +0000 (09:44 +1000)]
MAINTAINERS: CREDITS: remove Stefano Brivio from B43

This email address isn't working anymore

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago.mailmap: add Ricardo Ribalda
Ricardo Ribalda Delgado [Tue, 7 Apr 2015 23:44:57 +0000 (09:44 +1000)]
.mailmap: add Ricardo Ribalda

Work and Home computer had different settings in the mail client.  Some
contributions appear as Ricardo Ribalda, others as Ricardo Ribalda Delgado
(and one as just Ricardo).

Signed-off-by: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoCREDITS: add Ricardo Ribalda Delgado
Ricardo Ribalda Delgado [Tue, 7 Apr 2015 23:44:57 +0000 (09:44 +1000)]
CREDITS: add Ricardo Ribalda Delgado

Add personal details to CREDITS file.

Signed-off-by: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoMAINTAINERS: Use tabs consistently
Joe Perches [Tue, 7 Apr 2015 23:44:57 +0000 (09:44 +1000)]
MAINTAINERS: Use tabs consistently

Consistently use a single tab after the "specifier:" type.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolib/vsprintf: add %pT format specifier
Tetsuo Handa [Tue, 7 Apr 2015 23:44:56 +0000 (09:44 +1000)]
lib/vsprintf: add %pT format specifier

Since task_struct->comm can be modified by other threads while the current
thread is reading it, it is recommended to use get_task_comm() for reading
it.

However, since get_task_comm() holds task_struct->alloc_lock spinlock,
some users cannot use get_task_comm().  Also, a lot of users are directly
reading from task_struct->comm even if they can use get_task_comm().  Such
users might obtain inconsistent result.

This patch introduces %pT format specifier for printing task_struct->comm.
Currently %pT does not provide consistency.  I'm planning to change to
use RCU in the future.  By using RCU, the comm name read from
task_struct->comm will be guaranteed to be consistent.  But before
modifying set_task_comm() to use RCU, we need to kill direct ->comm users
who do not use get_task_comm().

An example for converting direct ->comm users is shown below.  Since many
debug printings use p == current, you can pass NULL instead of p if p ==
current.

  pr_info("comm=%s\n", p->comm);       => pr_info("comm=%pT\n", p);
  pr_info("comm=%s\n", current->comm); => pr_info("comm=%pT\n", NULL);

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Pavel Machek <pavel@ucw.cz>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolib-string_helpersc-change-semantics-of-string_escape_mem-fix-fix
Andy Shevchenko [Tue, 7 Apr 2015 23:44:56 +0000 (09:44 +1000)]
lib-string_helpersc-change-semantics-of-string_escape_mem-fix-fix

- add missed curly braces (grr... I have them in initial comment)

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agosunrpc/cache: simplify qword_add
Andy Shevchenko [Tue, 7 Apr 2015 23:44:56 +0000 (09:44 +1000)]
sunrpc/cache: simplify qword_add

The commit 7572d3b29896 (lib/string_helpers.c: change semantics of
string_escape_mem) updates qword_add() to follow the changes in
lib/string_helpers.c. This patch simplifies the approach.

Andrew, I think this one can be folded in the mentioned commit by Rasmus.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolib/string_helpers.c: change semantics of string_escape_mem
Rasmus Villemoes [Tue, 7 Apr 2015 23:44:56 +0000 (09:44 +1000)]
lib/string_helpers.c: change semantics of string_escape_mem

The current semantics of string_escape_mem are inadequate for one of its
current users, vsnprintf().  If that is to honour its contract, it must
know how much space would be needed for the entire escaped buffer, and
string_escape_mem provides no way of obtaining that (short of allocating a
large enough buffer (~4 times input string) to let it play with, and
that's definitely a big no-no inside vsnprintf).

So change the semantics for string_escape_mem to be more snprintf-like:
Return the size of the output that would be generated if the destination
buffer was big enough, but of course still only write to the part of dst
it is allowed to, and (contrary to snprintf) don't do '\0'-termination.
It is then up to the caller to detect whether output was truncated and to
append a '\0' if desired.  Also, we must output partial escape sequences,
otherwise a call such as snprintf(buf, 3, "%1pE", "\123") would cause
printf to write a \0 to buf[2] but leaving buf[0] and buf[1] with whatever
they previously contained.

This also fixes a bug in the escaped_string() helper function, which used
to unconditionally pass a length of "end-buf" to string_escape_mem();
since the latter doesn't check osz for being insanely large, it would
happily write to dst.  For example, kasprintf(GFP_KERNEL, "something and
then %pE", ...); is an easy way to trigger an oops.

In test-string_helpers.c, the -ENOMEM test is replaced with testing for
getting the expected return value even if the buffer is too small.  We
also ensure that nothing is written (by relying on a NULL pointer deref)
if the output size is 0 by passing NULL - this has to work for
kasprintf("%pE") to work.

In net/sunrpc/cache.c, I think qword_add still has the same semantics.
Someone should definitely double-check this.

In fs/proc/array.c, I made the minimum possible change, but longer-term it
should stop poking around in seq_file internals.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolib/string_helpers.c: refactor string_escape_mem
Rasmus Villemoes [Tue, 7 Apr 2015 23:44:56 +0000 (09:44 +1000)]
lib/string_helpers.c: refactor string_escape_mem

When printf is given the format specifier %pE, it needs a way of obtaining
the total output size that would be generated if the buffer was large
enough, and string_escape_mem doesn't easily provide that.  This is a
refactorization of string_escape_mem in preparation of changing its
external API to provide that information.

The somewhat ugly early returns and subsequent seemingly redundant
conditionals are to make the following patch touch as little as possible
in string_helpers.c while still preserving the current behaviour of never
outputting partial escape sequences.  That behaviour must also change for
%pE to work as one expects from every other printf specifier.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolib/vsprintf.c: fix potential NULL deref in hex_string
Rasmus Villemoes [Tue, 7 Apr 2015 23:44:55 +0000 (09:44 +1000)]
lib/vsprintf.c: fix potential NULL deref in hex_string

The helper hex_string() is broken in two ways.  First, it doesn't
increment buf regardless of whether there is room to print, so callers
such as kasprintf() that try to probe the correct storage to allocate will
get a too small return value.  But even worse, kasprintf() (and likely
anyone else trying to find the size of the result) pass NULL for buf and 0
for size, so we also have end == NULL.  But this means that the end-1 in
hex_string() is (char*)-1, so buf < end-1 is true and we get a NULL
pointer deref.  I double-checked this with a trivial kernel module that
just did a kasprintf(GFP_KERNEL, "%14ph", "CrashBoomBang").

Nobody seems to be using %ph with kasprintf, but we might as well fix it
before it hits someone.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolib-vsprintf-add-%pcnr-format-specifiers-for-clocks-fix
Andrew Morton [Tue, 7 Apr 2015 23:44:55 +0000 (09:44 +1000)]
lib-vsprintf-add-%pcnr-format-specifiers-for-clocks-fix

omit code if !CONFIG_HAVE_CLK

Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Turquette <mturquette@linaro.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolib/vsprintf: add %pC{,n,r} format specifiers for clocks
Geert Uytterhoeven [Tue, 7 Apr 2015 23:44:55 +0000 (09:44 +1000)]
lib/vsprintf: add %pC{,n,r} format specifiers for clocks

Add format specifiers for printing struct clk:
  - '%pC' or '%pCn': name (Common Clock Framework) or address (legacy
    clock framework) of the clock,
  - '%pCr': rate of the clock.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Turquette <mturquette@linaro.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolib/vsprintf: Move integer format types to the top
Geert Uytterhoeven [Tue, 7 Apr 2015 23:44:55 +0000 (09:44 +1000)]
lib/vsprintf: Move integer format types to the top

Move the format types for 64-bit integers and configurable size integers
to the top, so they're next to the other integer format types.  While at
it, add the missing format types for s32 and u32.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Turquette <mturquette@linaro.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolib/vsprintf: document %p parameters passed by reference
Geert Uytterhoeven [Tue, 7 Apr 2015 23:44:55 +0000 (09:44 +1000)]
lib/vsprintf: document %p parameters passed by reference

This patch series improves the documentation for printk() formats, and
adds support for printing clocks.  The latter has always been a hassle if
you wanted to support both the common and legacy clock frameworks.

  - '%pC' and '%pCn' print the name (Common Clock Framework) or address
    (legacy clock framework) of a clock,
  - '%pCr' prints the current clock rate.

This patch (of 3):

Make sure all %p extensions that take parameters by references are
documented to do so.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Turquette <mturquette@linaro.org>
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolib/vsprintf.c: another small hack
Rasmus Villemoes [Tue, 7 Apr 2015 23:44:54 +0000 (09:44 +1000)]
lib/vsprintf.c: another small hack

Making ZEROPAD == '0'-' ', we can eliminate a few more instructions.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolib/vsprintf.c: eliminate duplicate hex string array
Rasmus Villemoes [Tue, 7 Apr 2015 23:44:54 +0000 (09:44 +1000)]
lib/vsprintf.c: eliminate duplicate hex string array

gcc doesn't merge or overlap const char[] objects with identical contents
(probably language lawyers would also insist that these things have
different addresses), but there's no reason to have the string
"0123456789ABCDEF" occur in multiple places.  hex_asc_upper is declared in
kernel.h and defined in lib/hexdump.c, which is unconditionally compiled
in.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolib/vsprintf.c: reduce stack use in number()
Rasmus Villemoes [Tue, 7 Apr 2015 23:44:54 +0000 (09:44 +1000)]
lib/vsprintf.c: reduce stack use in number()

At least since the initial git commit, when base was passed as a separate
parameter, number() has only been called with bases 8, 10 and 16.  I'm
guessing that 66 was to accommodate 64 0/1, a sign and a '\0', but the
buffer is only used for the actual digits.  Octal digits carry 3 bits of
information, so 24 is enough.  Spell that 3*sizeof(num) so one less place
needs to be changed should long long ever be 128 bits.  Also remove the
commented-out code that would handle an arbitrary base.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agolib/vsprintf.c: eliminate some branches
Rasmus Villemoes [Tue, 7 Apr 2015 23:44:54 +0000 (09:44 +1000)]
lib/vsprintf.c: eliminate some branches

Since FORMAT_TYPE_INT is simply 1 more than FORMAT_TYPE_UINT, and
similarly for BYTE/UBYTE, SHORT/USHORT, LONG/ULONG, we can eliminate a few
instructions by making SIGN have the value 1 instead of 2, and then use
arithmetic instead of branches for computing the right spec->type.  It's a
little hacky, but certainly in the same spirit as SMALL needing to have
the value 0x20.  For example for the spec->qualifier == 'l' case, gcc now
generates

     75e:       0f b6 53 01             movzbl 0x1(%rbx),%edx
     762:       83 e2 01                and    $0x1,%edx
     765:       83 c2 09                add    $0x9,%edx
     768:       88 13                   mov    %dl,(%rbx)

instead of

     763:       0f b6 53 01             movzbl 0x1(%rbx),%edx
     767:       83 e2 02                and    $0x2,%edx
     76a:       80 fa 01                cmp    $0x1,%dl
     76d:       19 d2                   sbb    %edx,%edx
     76f:       83 c2 0a                add    $0xa,%edx
     772:       88 13                   mov    %dl,(%rbx)

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoprintk: comment pr_cont() stating it is only to continue a line
Steven Rostedt [Tue, 7 Apr 2015 23:44:54 +0000 (09:44 +1000)]
printk: comment pr_cont() stating it is only to continue a line

KERN_CONT is nicely commented in kern_levels.h, but pr_cont() is now used
more often, and it lacks the comment stating what it is used for.  It can
be confused as continuing the log level, but that is not its purpose.  Its
purpose is to continue a line that had no newline enclosed.  This should
be documented by pr_cont() as well.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agopowerpc/powernv: reboot when requested by firmware
Joel Stanley [Tue, 7 Apr 2015 23:44:53 +0000 (09:44 +1000)]
powerpc/powernv: reboot when requested by firmware

Use orderly_reboot so userspace will to shut itself down via the reboot
path.  This is required for graceful reboot initiated by the BMC, such as
when a user uses ipmitool to issue a 'chassis power cycle' command.

Signed-off-by: Joel Stanley <joel@jms.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agokernel/reboot.c: add orderly_reboot for graceful reboot
Joel Stanley [Tue, 7 Apr 2015 23:44:53 +0000 (09:44 +1000)]
kernel/reboot.c: add orderly_reboot for graceful reboot

The kernel has orderly_poweroff which allows the kernel to initiate a
graceful shutdown of userspace, by running /sbin/poweroff.  This adds
orderly_reboot that will cause userspace to shut itself down by calling
/sbin/reboot.

This will be used for shutdown initiated by a system controller on
platforms that do not use ACPI.

orderly_reboot() should be used when the system wants to allow userspace
to gracefully shut itself down.  For cases where the system may imminently
catch on fire, the existing emergency_restart() provides an immediate
reboot without involving userspace.

Signed-off-by: Joel Stanley <joel@jms.id.au>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agodrivers/sbus/char/envctrl.c: ignore orderly_poweroff return value
Joel Stanley [Tue, 7 Apr 2015 23:44:53 +0000 (09:44 +1000)]
drivers/sbus/char/envctrl.c: ignore orderly_poweroff return value

orderly_poweroff() unconditionally returns 0, so remove the dead code that
checks the return value.

A future patch will change the return type to void.

Signed-off-by: Joel Stanley <joel@jms.id.au>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agokernel/hung_task.c: change hung_task.c to use for_each_process_thread()
Aaron Tomlin [Tue, 7 Apr 2015 23:44:53 +0000 (09:44 +1000)]
kernel/hung_task.c: change hung_task.c to use for_each_process_thread()

In check_hung_uninterruptible_tasks() avoid the use of deprecated
while_each_thread().

The "max_count" logic will prevent a livelock - see commit 0c740d0a
("introduce for_each_thread() to replace the buggy while_each_thread()").
Having said this let's use for_each_process_thread().

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Wysochanski <dwysocha@redhat.com>
Cc: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agokernel/resource.c: remove deprecated __check_region() and friends
Jakub Sitnicki [Tue, 7 Apr 2015 23:44:53 +0000 (09:44 +1000)]
kernel/resource.c: remove deprecated __check_region() and friends

All users of __check_region(), check_region(), and check_mem_region() are
gone.  We got rid of the last user in v4.0-rc1.  Remove them.

bloat-o-meter on x86_64 shows:

add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-102 (-102)
function                                     old     new   delta
__kstrtab___check_region                      15       -     -15
__ksymtab___check_region                      16       -     -16
__check_region                                71       -     -71

Signed-off-by: Jakub Sitnicki <jsitnicki@gmail.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agokernel-conditionally-support-non-root-users-groups-and-capabilities-checkpatch-fixes
Andrew Morton [Tue, 7 Apr 2015 23:44:52 +0000 (09:44 +1000)]
kernel-conditionally-support-non-root-users-groups-and-capabilities-checkpatch-fixes

ERROR: code indent should use tabs where possible
#120: FILE: include/linux/capability.h:220:
+        return true;$

WARNING: please, no spaces at the start of a line
#120: FILE: include/linux/capability.h:220:
+        return true;$

ERROR: code indent should use tabs where possible
#125: FILE: include/linux/capability.h:225:
+        return true;$

WARNING: please, no spaces at the start of a line
#125: FILE: include/linux/capability.h:225:
+        return true;$

ERROR: code indent should use tabs where possible
#129: FILE: include/linux/capability.h:229:
+        return true;$

WARNING: please, no spaces at the start of a line
#129: FILE: include/linux/capability.h:229:
+        return true;$

ERROR: code indent should use tabs where possible
#134: FILE: include/linux/capability.h:234:
+        return true;$

WARNING: please, no spaces at the start of a line
#134: FILE: include/linux/capability.h:234:
+        return true;$

ERROR: code indent should use tabs where possible
#170: FILE: include/linux/cred.h:79:
+        return 1;$

WARNING: please, no spaces at the start of a line
#170: FILE: include/linux/cred.h:79:
+        return 1;$

ERROR: code indent should use tabs where possible
#174: FILE: include/linux/cred.h:83:
+        return 1;$

WARNING: please, no spaces at the start of a line
#174: FILE: include/linux/cred.h:83:
+        return 1;$

total: 6 errors, 6 warnings, 310 lines checked

NOTE: whitespace errors detected, you may wish to use scripts/cleanpatch or
      scripts/cleanfile

./patches/kernel-conditionally-support-non-root-users-groups-and-capabilities.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Iulia Manda <iulia.manda21@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agokernel: conditionally support non-root users, groups and capabilities
Iulia Manda [Tue, 7 Apr 2015 23:44:52 +0000 (09:44 +1000)]
kernel: conditionally support non-root users, groups and capabilities

There are a lot of embedded systems that run most or all of their
functionality in init, running as root:root.  For these systems,
supporting multiple users is not necessary.

This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
non-root users, non-root groups, and capabilities optional.  It is enabled
under CONFIG_EXPERT menu.

When this symbol is not defined, UID and GID are zero in any possible case
and processes always have all capabilities.

The following syscalls are compiled out: setuid, setregid, setgid,
setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
getgroups, setfsuid, setfsgid, capget, capset.

Also, groups.c is compiled out completely.

In kernel/capability.c, capable function was moved in order to avoid
adding two ifdef blocks.

This change saves about 25 KB on a defconfig build.  The most minimal
kernels have total text sizes in the high hundreds of kB rather than
low MB.  (The 25k goes down a bit with allnoconfig, but not that much.

The kernel was booted in Qemu.  All the common functionalities work.
Adding users/groups is not possible, failing with -ENOSYS.

Bloat-o-meter output:
add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)

Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoparide: fix the "verbose" module param
Dan Carpenter [Tue, 7 Apr 2015 23:44:52 +0000 (09:44 +1000)]
paride: fix the "verbose" module param

The verbose module parameter can be set to 2 for extremely verbose
messages so the type should be int instead of bool.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Tim Waugh <tim@cyberelk.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoinclude/linux: remove empty conditionals
Rasmus Villemoes [Tue, 7 Apr 2015 23:44:52 +0000 (09:44 +1000)]
include/linux: remove empty conditionals

Commit 607ca46e97a1 ("UAPI: (Scripted) Disintegrate include/linux") left
behind some empty conditional blocks.  Since they are useless and may
cause a reader to wonder whether something is missing, remove them.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoDocumentation/filesystems/proc.txt: add guest_nice column to example output of `cat...
Tobias Klauser [Tue, 7 Apr 2015 23:44:51 +0000 (09:44 +1000)]
Documentation/filesystems/proc.txt: add guest_nice column to example output of `cat /proc/stat'

Commit ce0e7b28fb75cb00 ("sched, cpuacct: Fix niced guest time
accounting") added the guest_nice column to /proc/stat, but the example
output of `cat /proc/stat' in Documentation/filesystems/proc.txt wasn't
updated accordingly.  Do so now.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Cc: Ryota Ozaki <ozaki.ryota@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoproc-show-locks-in-proc-pid-fdinfo-x-v2
Andrey Vagin [Tue, 7 Apr 2015 23:44:51 +0000 (09:44 +1000)]
proc-show-locks-in-proc-pid-fdinfo-x-v2

use seq_has_overflowed() properly

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agoproc: show locks in /proc/pid/fdinfo/X
Andrey Vagin [Tue, 7 Apr 2015 23:44:51 +0000 (09:44 +1000)]
proc: show locks in /proc/pid/fdinfo/X

Let's show locks which are associated with a file descriptor in
its fdinfo file.

Currently we don't have a reliable way to determine who holds a lock.  We
can find some information in /proc/locks, but PID which is reported there
can be wrong.  For example, a process takes a lock, then forks a child and
dies.  In this case /proc/locks contains the parent pid, which can be
reused by another process.

$ cat /proc/locks
...
6: FLOCK  ADVISORY  WRITE 324 00:13:13431 0 EOF
...

$ ps -C rpcbind
  PID TTY          TIME CMD
  332 ?        00:00:00 rpcbind

$ cat /proc/332/fdinfo/4
pos: 0
flags: 0100000
mnt_id: 22
lock: 1: FLOCK  ADVISORY  WRITE 324 00:13:13431 0 EOF

$ ls -l /proc/332/fd/4
lr-x------ 1 root root 64 Mar  5 14:43 /proc/332/fd/4 -> /run/rpcbind.lock

$ ls -l /proc/324/fd/
total 0
lrwx------ 1 root root 64 Feb 27 14:50 0 -> /dev/pts/0
lrwx------ 1 root root 64 Feb 27 14:50 1 -> /dev/pts/0
lrwx------ 1 root root 64 Feb 27 14:49 2 -> /dev/pts/0

You can see that the process with the 324 pid doesn't hold the lock.

This information is required for proper dumping and restoring file
locks.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agodocs: add missing and new /proc/PID/status file entries, fix typos
Nathan Scott [Tue, 7 Apr 2015 23:44:51 +0000 (09:44 +1000)]
docs: add missing and new /proc/PID/status file entries, fix typos

docs: add missing and new /proc/PID/status file entries, fix typos

Signed-off-by: Nathan Scott <nathans@redhat.com>
Signed-off-by: Chen Hanxiao <chenhanxiao@cn.fujitsu.com>
Cc: Serge Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years ago/proc/PID/status: show all sets of pid according to ns
Chen Hanxiao [Tue, 7 Apr 2015 23:44:51 +0000 (09:44 +1000)]
/proc/PID/status: show all sets of pid according to ns

If some issues occurred inside a container guest, host user could not know
which process is in trouble just by guest pid: the users of container
guest only knew the pid inside containers.  This will bring obstacle for
trouble shooting.

This patch adds four fields: NStgid, NSpid, NSpgid and NSsid:

a) In init_pid_ns, nothing changed;

b) In one pidns, will tell the pid inside containers:
  NStgid: 21776   5       1
  NSpid:  21776   5       1
  NSpgid: 21776   5       1
  NSsid:  21729   1       0
  ** Process id is 21776 in level 0, 5 in level 1, 1 in level 2.

c) If pidns is nested, it depends on which pidns are you in.
  NStgid: 5       1
  NSpid:  5       1
  NSpgid: 5       1
  NSsid:  1       0
  ** Views from level 1

Signed-off-by: Chen Hanxiao <chenhanxiao@cn.fujitsu.com>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Tested-by: Serge Hallyn <serge.hallyn@canonical.com>
Tested-by: Nathan Scott <nathans@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: fix error return code
Julia Lawall [Tue, 7 Apr 2015 23:44:50 +0000 (09:44 +1000)]
zram: fix error return code

Return a negative error code on failure.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@@
identifier ret; expression e1,e2;
@@
(
if (\(ret < 0\|ret != 0\))
 { ... return ret; }
|
ret = 0
)
... when != ret = e1
    when != &ret
*if(...)
{
  ... when != ret = e2
      when forall
 return ret;
}
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Acked-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozsmalloc: remove extra cond_resched() in __zs_compact
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:50 +0000 (09:44 +1000)]
zsmalloc: remove extra cond_resched() in __zs_compact

Do not perform cond_resched() before the busy compaction loop in
__zs_compact(), because this loop does it when needed.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozsmalloc: fix fatal corruption due to wrong size class selection
Heesub Shin [Tue, 7 Apr 2015 23:44:50 +0000 (09:44 +1000)]
zsmalloc: fix fatal corruption due to wrong size class selection

There is no point in overriding the size class below.  It causes fatal
corruption on the next chunk on the 3264-bytes size class, which is the
last size class that is not huge.

For example, if the requested size was exactly 3264 bytes, current
zsmalloc allocates and returns a chunk from the size class of 3264 bytes,
not 4096.  User access to this chunk may overwrite head of the next
adjacent chunk.

Here is the panic log captured when freelist was corrupted due to this:

    Kernel BUG at ffffffc00030659c [verbose debug info unavailable]
    Internal error: Oops - BUG: 96000006 [#1] PREEMPT SMP
    Modules linked in:
    exynos-snapshot: core register saved(CPU:5)
    CPUMERRSR: 0000000000000000, L2MERRSR: 0000000000000000
    exynos-snapshot: context saved(CPU:5)
    exynos-snapshot: item - log_kevents is disabled
    CPU: 5 PID: 898 Comm: kswapd0 Not tainted 3.10.61-4497415-eng #1
    task: ffffffc0b8783d80 ti: ffffffc0b71e8000 task.ti: ffffffc0b71e8000
    PC is at obj_idx_to_offset+0x0/0x1c
    LR is at obj_malloc+0x44/0xe8
    pc : [<ffffffc00030659c>] lr : [<ffffffc000306604>] pstate: a0000045
    sp : ffffffc0b71eb790
    x29: ffffffc0b71eb790 x28: ffffffc00204c000
    x27: 000000000001d96f x26: 0000000000000000
    x25: ffffffc098cc3500 x24: ffffffc0a13f2810
    x23: ffffffc098cc3501 x22: ffffffc0a13f2800
    x21: 000011e1a02006e3 x20: ffffffc0a13f2800
    x19: ffffffbc02a7e000 x18: 0000000000000000
    x17: 0000000000000000 x16: 0000000000000feb
    x15: 0000000000000000 x14: 00000000a01003e3
    x13: 0000000000000020 x12: fffffffffffffff0
    x11: ffffffc08b264000 x10: 00000000e3a01004
    x9 : ffffffc08b263fea x8 : ffffffc0b1e611c0
    x7 : ffffffc000307d24 x6 : 0000000000000000
    x5 : 0000000000000038 x4 : 000000000000011e
    x3 : ffffffbc00003e90 x2 : 0000000000000cc0
    x1 : 00000000d0100371 x0 : ffffffbc00003e90

Reported-by: Sooyong Suk <s.suk@samsung.com>
Signed-off-by: Heesub Shin <heesub.shin@samsung.com>
Tested-by: Sooyong Suk <s.suk@samsung.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozsmalloc: remove unnecessary insertion/removal of zspage in compaction
Minchan Kim [Tue, 7 Apr 2015 23:44:50 +0000 (09:44 +1000)]
zsmalloc: remove unnecessary insertion/removal of zspage in compaction

In putback_zspage, we don't need to insert a zspage into list of zspage
in size_class again to just fix fullness group. We could do directly
without reinsertion so we could save some instuctions.

Reported-by: Heesub Shin <heesub.shin@samsung.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Juneho Choi <juno.choi@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozsmalloc: micro-optimize zs_object_copy()
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:50 +0000 (09:44 +1000)]
zsmalloc: micro-optimize zs_object_copy()

A micro-optimization.  Avoid additional branching and reduce (a bit)
registry pressure (f.e.  s_off += size; d_off += size; may be calculated
twise: first for >= PAGE_SIZE check and later for offset update in "else"
clause).

scripts/bloat-o-meter shows some improvement

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-10 (-10)
function                          old     new   delta
zs_object_copy                    550     540     -10

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozsmalloc: remove synchronize_rcu from zs_compact()
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:49 +0000 (09:44 +1000)]
zsmalloc: remove synchronize_rcu from zs_compact()

Do not synchronize rcu in zs_compact(). Neither zsmalloc not
zram use rcu.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: deprecate zram attrs sysfs nodes
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:49 +0000 (09:44 +1000)]
zram: deprecate zram attrs sysfs nodes

Add Documentation/ABI/obsolete/sysfs-block-zram file and list obsolete and
deprecated attributes there.  The patch also adds additional information
to zram documentation and describes the basic strategy:

- the existing RW nodes will be downgraded to WO nodes (in 4.11)
- deprecated RO sysfs nodes will eventually be removed (in 4.11)

Users will be additionally notified about deprecated attr usage by
pr_warn_once() (added to every deprecated attr _show()), as suggested by
Minchan Kim.

User space is advised to use zram<id>/stat, zram<id>/io_stat and
zram<id>/mm_stat files.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: export new 'mm_stat' sysfs attrs
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:49 +0000 (09:44 +1000)]
zram: export new 'mm_stat' sysfs attrs

Per-device `zram<id>/mm_stat' file provides mm statistics of a particular
zram device in a format similar to block layer statistics.  The file
consists of a single line and represents the following stats (separated by
whitespace):

orig_data_size
compr_data_size
mem_used_total
mem_limit
mem_used_max
zero_pages
num_migrated

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: export new 'io_stat' sysfs attrs
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:49 +0000 (09:44 +1000)]
zram: export new 'io_stat' sysfs attrs

Per-device `zram<id>/io_stat' file provides accumulated I/O statistics of
particular zram device in a format similar to block layer statistics.  The
file consists of a single line and represents the following stats
(separated by whitespace):

failed_reads
failed_writes
invalid_io
notify_free

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: describe device attrs in documentation
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:49 +0000 (09:44 +1000)]
zram: describe device attrs in documentation

Briefly describe exported device stat attrs in zram documentation.  We
will eventually get rid of per-stat sysfs nodes and, thus, clean up
Documentation/ABI/testing/sysfs-block-zram file, which is the only source
of information about device sysfs nodes.

Add `num_migrated' description, since there is no independent
`num_migrated' sysfs node (and no corresponding sysfs-block-zram entry),
it will be exported via zram<id>/mm_stat file.

At this point we can provide minimal description, because sysfs-block-zram
still contains detailed information.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: use generic start/end io accounting
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:48 +0000 (09:44 +1000)]
zram: use generic start/end io accounting

Use bio generic_start_io_acct() and generic_end_io_acct() to account
device's block layer statistics.  This will let users to monitor zram
activities using sysstat and similar packages/tools.

Apart from the usual per-stat sysfs attr, zram IO stats are now also
available in '/sys/block/zram<id>/stat' and '/proc/diskstats' files.

We will slowly get rid of per-stat sysfs files.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: move compact_store() to sysfs functions area
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:48 +0000 (09:44 +1000)]
zram: move compact_store() to sysfs functions area

A cosmetic change.  We have a new code layout and keep zram per-device
sysfs store and show functions in one place.  Move compact_store() to that
handlers block to conform to current layout.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: remove `num_migrated' device attr
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:48 +0000 (09:44 +1000)]
zram: remove `num_migrated' device attr

This patch introduces rework to zram stats.  We have per-stat sysfs nodes,
and it makes things a bit hard to use in user space: it doesn't give an
immediate stats 'snapshot', it requires user space to use more syscalls -
open, read, close for every stat file, with appropriate error checks on
every step, etc.

First, zram now accounts block layer statistics, available in
/sys/block/zram<id>/stat and /proc/diskstats files.  So some new stats are
available (see Documentation/block/stat.txt), besides, zram's activities
now can be monitored by sysstat's iostat or similar tools.

Example:
cat /sys/block/zram0/stat
248     0    1984    0   251029     0  2008232   5120   0   5116   5116

Second, group currently exported on per-stat basis nodes into two
categories (files):

-- zram<id>/io_stat
accumulates device's IO stats, that are not accounted by block layer,
and contains:
        failed_reads
        failed_writes
        invalid_io
        notify_free

Example:
cat /sys/block/zram0/io_stat
0        0        0   652572

-- zram<id>/mm_stat
accumulates zram mm stats and contains:
        orig_data_size
        compr_data_size
        mem_used_total
        mem_limit
        mem_used_max
        zero_pages
        num_migrated

Example:
cat /sys/block/zram0/mm_stat
434634752 270288572 279158784        0 579895296    15060        0

per-stat sysfs nodes are now considered to be deprecated and we plan to
remove them (and clean up some of the existing stat code) in two years (as
of now, there is no warning printed to syslog about deprecated stats being
used).  User space is advised to use the above mentioned 3 files.

This patch (of 7):

Remove sysfs `num_migrated' attribute.  We are moving away from per-stat
device attrs towards 3 stat files that will accumulate io and mm stats in
a format similar to block layer statistics in /sys/block/<dev>/stat.  That
will be easier to use in user space, and reduce the number of syscalls
needed to read zram device statistics.

`num_migrated' will return back in zram<id>/mm_stat file.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm/zsmalloc.c: fix comment for get_pages_per_zspage
Yinghao Xie [Tue, 7 Apr 2015 23:44:48 +0000 (09:44 +1000)]
mm/zsmalloc.c: fix comment for get_pages_per_zspage

Signed-off-by: Yinghao Xie <yinghao.xie@sumsung.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozsmalloc: zsmalloc documentation
Minchan Kim [Tue, 7 Apr 2015 23:44:47 +0000 (09:44 +1000)]
zsmalloc: zsmalloc documentation

Create zsmalloc doc which explains design concept and stat information.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozsmalloc: add fullness into stat
Minchan Kim [Tue, 7 Apr 2015 23:44:47 +0000 (09:44 +1000)]
zsmalloc: add fullness into stat

During investigating compaction, fullness information of each class is
helpful for investigating how the compaction works well.  With that, we
could know how compaction works well more clear on each size class.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozsmalloc: record handle in page->private for huge object
Minchan Kim [Tue, 7 Apr 2015 23:44:47 +0000 (09:44 +1000)]
zsmalloc: record handle in page->private for huge object

We store handle on header of each allocated object so it increases the
size of each object by sizeof(unsigned long).

If zram stores 4096 bytes to zsmalloc(ie, bad compression), zsmalloc needs
4104B-class to add handle.

However, 4104B-class has 1-pages_per_zspage so wasted size by internal
fragment is 8192 - 4104, which is terrible.

So this patch records the handle in page->private on such huge object(ie,
pages_per_zspage == 1 && maxobj_per_zspage == 1) instead of header of each
object so we could use 4096B-class, not 4104B-class.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: support compaction
Minchan Kim [Tue, 7 Apr 2015 23:44:47 +0000 (09:44 +1000)]
zram: support compaction

Now that zsmalloc supports compaction, zram can use it.  For the first
step, this patch exports compact knob via sysfs so user can do compaction
via "echo 1 > /sys/block/zram0/compact".

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozsmalloc: adjust ZS_ALMOST_FULL
Minchan Kim [Tue, 7 Apr 2015 23:44:47 +0000 (09:44 +1000)]
zsmalloc: adjust ZS_ALMOST_FULL

Curretly, zsmalloc regards a zspage as ZS_ALMOST_EMPTY if the zspage has
under 1/4 used objects(ie, fullness_threshold_frac).  It could make result
in loose packing since zsmalloc migrates only ZS_ALMOST_EMPTY zspage out.

This patch changes the rule so that zsmalloc makes zspage which has above
3/4 used object ZS_ALMOST_FULL so it could make tight packing.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozsmalloc-support-compaction-fix
Andrew Morton [Tue, 7 Apr 2015 23:44:46 +0000 (09:44 +1000)]
zsmalloc-support-compaction-fix

zsmalloc.c needs sched.h for cond_resched()

Cc: Minchan Kim <minchan@kernel.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: Fengguang Wu <fengguang.wu@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozsmalloc: support compaction
Minchan Kim [Tue, 7 Apr 2015 23:44:46 +0000 (09:44 +1000)]
zsmalloc: support compaction

This patch provides core functions for migration of zsmalloc.  Migraion
policy is simple as follows.

for each size class {
while {
src_page = get zs_page from ZS_ALMOST_EMPTY
if (!src_page)
break;
dst_page = get zs_page from ZS_ALMOST_FULL
if (!dst_page)
dst_page = get zs_page from ZS_ALMOST_EMPTY
if (!dst_page)
break;
migrate(from src_page, to dst_page);
}
}

For migration, we need to identify which objects in zspage are allocated
to migrate them out.  We could know it by iterating of freed objects in a
zspage because first_page of zspage keeps free objects singly-linked list
but it's not efficient.  Instead, this patch adds a tag(ie,
OBJ_ALLOCATED_TAG) in header of each object(ie, handle) so we could check
whether the object is allocated easily.

This patch adds another status bit in handle to synchronize between user
access through zs_map_object and migration.  During migration, we cannot
move objects user are using due to data coherency between old object and
new object.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozsmalloc: factor out obj_[malloc|free]
Minchan Kim [Tue, 7 Apr 2015 23:44:46 +0000 (09:44 +1000)]
zsmalloc: factor out obj_[malloc|free]

In later patch, migration needs some part of functions in zs_malloc and
zs_free so this patch factor out them.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozsmalloc: decouple handle and object
Minchan Kim [Tue, 7 Apr 2015 23:44:46 +0000 (09:44 +1000)]
zsmalloc: decouple handle and object

Recently, we started to use zram heavily and some of issues
popped.

1) external fragmentation

I got a report from Juneho Choi that fork failed although there are plenty
of free pages in the system.  His investigation revealed zram is one of
the culprit to make heavy fragmentation so there was no more contiguous
16K page for pgd to fork in the ARM.

2) non-movable pages

Other problem of zram now is that inherently, user want to use zram as
swap in small memory system so they use zRAM with CMA to use memory
efficiently.  However, unfortunately, it doesn't work well because zRAM
cannot use CMA's movable pages unless it doesn't support compaction.  I
got several reports about that OOM happened with zram although there are
lots of swap space and free space in CMA area.

3) internal fragmentation

zRAM has started support memory limitation feature to limit memory usage
and I sent a patchset(https://lkml.org/lkml/2014/9/21/148) for VM to be
harmonized with zram-swap to stop anonymous page reclaim if zram consumed
memory up to the limit although there are free space on the swap.  One
problem for that direction is zram has no way to know any hole in memory
space zsmalloc allocated by internal fragmentation so zram would regard
swap is full although there are free space in zsmalloc.  For solving the
issue, zram want to trigger compaction of zsmalloc before it decides full
or not.

This patchset is first step to support above issues.  For that, it adds
indirect layer between handle and object location and supports manual
compaction to solve 3th problem first of all.

After this patchset got merged, next step is to make VM aware of zsmalloc
compaction so that generic compaction will move zsmalloced-pages
automatically in runtime.

In my imaginary experiment(ie, high compress ratio data with heavy swap
in/out on 8G zram-swap), data is as follows,

Before =
zram allocated object :      60212066 bytes
zram total used:     140103680 bytes
ratio:         42.98 percent
MemFree:          840192 kB

Compaction

After =
frag ratio after compaction
zram allocated object :      60212066 bytes
zram total used:      76185600 bytes
ratio:         79.03 percent
MemFree:          901932 kB

Juneho reported below in his real platform with small aging.
So, I think the benefit would be bigger in real aging system
for a long time.

- frag_ratio increased 3% (ie, higher is better)
- memfree increased about 6MB
- In buddy info, Normal 2^3: 4, 2^2: 1: 2^1 increased, Highmem: 2^1 21 increased

frag ratio after swap fragment
used :        156677 kbytes
total:        166092 kbytes
frag_ratio :  94
meminfo before compaction
MemFree:           83724 kB
Node 0, zone   Normal  13642   1364     57     10     61     17      9      5      4      0      0
Node 0, zone  HighMem    425     29      1      0      0      0      0      0      0      0      0

num_migrated :  23630
compaction done

frag ratio after compaction
used :        156673 kbytes
total:        160564 kbytes
frag_ratio :  97
meminfo after compaction
MemFree:           89060 kB
Node 0, zone   Normal  14076   1544     67     14     61     17      9      5      4      0      0
Node 0, zone  HighMem    863     50      1      0      0      0      0      0      0      0      0

This patchset adds more logics(about 480 lines) in zsmalloc but when I
tested heavy swapin/out program, the regression for swapin/out speed is
marginal because most of overheads were caused by compress/decompress and
other MM reclaim stuff.

This patch (of 7):

Currently, handle of zsmalloc encodes object's location directly so it
makes support of migration hard.

This patch decouples handle and object via adding indirect layer.  For
that, it allocates handle dynamically and returns it to user.  The handle
is the address allocated by slab allocation so it's unique and we could
keep object's location in the memory space allocated for handle.

With it, we can change object's position without changing handle itself.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Juneho Choi <juno.choi@lge.com>
Cc: Gunho Lee <gunho.lee@lge.com>
Cc: Luigi Semenzato <semenzato@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjennings@variantweb.net>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: do not let user enforce new device dev_id
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:46 +0000 (09:44 +1000)]
zram: do not let user enforce new device dev_id

This patch forbids user to enforce device ids for newly added zram
devices, as suggested by Minchan Kim.  There seems to be a little interest
in this functionality and its use-cases are rather non-obvious.

zram_add sysfs attr, thus, is now read only and has only automatic device
id assignment mode.

This operation is no longer allowed:
   echo 1 > /sys/class/zram-control/zram_add
   -bash: /sys/class/zram-control/zram_add: Permission denied

Correct usage example:
   cat /sys/class/zram-control/zram_add
   2

It also removes common zram_control() handler because device creation and
removal do not share a lot of common steps anymore.  Move zram_add and
zram_remove code to zram_add_show() and zram_remove_store()
correspondingly.

LKML reference: https://lkml.org/lkml/2015/3/4/1414

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reported-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram-introduce-automatic-device_id-generation-fix
Andrew Morton [Tue, 7 Apr 2015 23:44:45 +0000 (09:44 +1000)]
zram-introduce-automatic-device_id-generation-fix

fix comment layout

Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: introduce automatic device_id generation
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:45 +0000 (09:44 +1000)]
zram: introduce automatic device_id generation

The existing device creation interface requires user to provide new and
unique device_id for every device being created.  This might be difficult
to handle (f.e.  in automated scripts).

Extend zram-control/zram_add interface to support read and write
operations.  Write operation remains the same:

echo X > /sys/class/zram-control/zram_add

will add /dev/zramX (or return error).

Read operation is treated as 'pick up available device_id, add new device
and return device_id'.

Example:
 cat /sys/class/zram-control/zram_add
2
 cat /sys/class/zram-control/zram_add
3

so user-space can proceed with /dev/zram2, /dev/zram3 init and usage.

An attempt to use already used device_id (-EEXIST)

echo 3 > /sys/class/zram-control/zram_add
-bash: echo: write error: File exists

Returning zram_add error code back to user (-ENOMEM in this case)

cat /sys/class/zram-control/zram_add
cat: /sys/class/zram-control/zram_add: Cannot allocate memory

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: return zram device_id value from zram_add()
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:45 +0000 (09:44 +1000)]
zram: return zram device_id value from zram_add()

zram_add requires valid device_id to be provided, that can be a bit
inconvenient.  Change zram_add() to return negative value upon new device
creation failure, and device_id (>= 0) value otherwise.

This prepares zram_add to perform automatic device_id assignment.  New
device_id will be returned back to user-space (so user can reference that
device as /dev/zramX).

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: trivial: correct flag operations comment
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:45 +0000 (09:44 +1000)]
zram: trivial: correct flag operations comment

We don't have meta->tb_lock anymore and use meta table entry bit_spin_lock
instead.  update comment.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: report every added and removed device
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:45 +0000 (09:44 +1000)]
zram: report every added and removed device

With dynamic device creation/removal printing num_devices in zram_init()
doesn't make a lot of sense, as well as printing the number of destroyed
devices in destroy_devices().  Print per-device action (added/removed) in
zram_add() and zram_remove() instead.

Example:

[ 3645.259652] zram: Added device: zram5
[ 3646.152074] zram: Added device: zram6
[ 3650.585012] zram: Removed device: zram5
[ 3655.845584] zram: Added device: zram8
[ 3660.975223] zram: Removed device: zram6

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: remove max_num_devices limitation
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:44 +0000 (09:44 +1000)]
zram: remove max_num_devices limitation

Limiting the number of zram devices to 32 (default max_num_devices value)
is confusing, let's drop it.  A user with 2TB or 4TB of RAM, for example,
can request as many devices as he can handle.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram-add-dynamic-device-add-remove-functionality-fix
Andrew Morton [Tue, 7 Apr 2015 23:44:44 +0000 (09:44 +1000)]
zram-add-dynamic-device-add-remove-functionality-fix

fix comment layout

Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: add dynamic device add/remove functionality
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:44 +0000 (09:44 +1000)]
zram: add dynamic device add/remove functionality

We currently don't support on-demand device creation.  The only way to
have N zram devices is to specify num_devices module parameter (default
value 1).  That means that if, for some reason, at some point, user wants
to have N + 1 devies he/she must umount all the existing devices, unload
the module, load the module passing num_devices equals to N + 1.  And do
this again, if needed.

Introduce zram control sysfs class, which has two sysfs attrs:
- zram_add      -- add a new specific (device_id) zram device
- zram_remove   -- remove a specific (device_id) zram device

Usage example:
# add a new specific zram device
echo 4 > /sys/class/zram-control/zram_add

# remove a specific zram device
echo 4 > /sys/class/zram-control/zram_remove

There is no automatic device_id generation, so user is expected to
provide one.

NOTE, there might be users who already depend on the fact that at
least zram0 device gets always created by zram_init(). Thus, due to
compatibility reasons, along with requested device_id (f.e. 5) zram0
will also be created.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: reorganize code layout
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:44 +0000 (09:44 +1000)]
zram: reorganize code layout

This patch looks big, but basically it just moves code blocks forward and
backward.  No functional changes.

Our current code layout looks a bit like a sandwitch.

For example,
a) between read/write handlers, we have update_used_max() helper function:

static int zram_decompress_page
static int zram_bvec_read
static inline void update_used_max
static int zram_bvec_write
static int zram_bvec_rw

b) RW request handlers __zram_make_request/zram_bio_discard are divided by
sysfs attr reset_store() function and corresponding zram_reset_device()
handler:

static void zram_bio_discard
static void zram_reset_device
static ssize_t disksize_store
static ssize_t reset_store
static void __zram_make_request

c) we first a bunch of sysfs read/store functions. then a number of
one-liners, then helper functions, RW functions, sysfs functions, helper
functions again, and so on.

Reorganize layout to be more logically grouped (a brief description,
`cat zram_drv.c | grep static` gives a bigger picture):

-- one-liners: zram_test_flag/etc.

-- helpers: is_partial_io/update_position/etc

-- sysfs attr show/store functions + ZRAM_ATTR_RO() generated stats
show() functions
exception: reset and disksize store functions are required to be after meta()
functions. because we do device create/destroy actions in these sysfs
handlers.

-- "mm" functions: meta get/put, meta alloc/free, page free
static inline bool zram_meta_get
static inline void zram_meta_put
static void zram_meta_free
static struct zram_meta *zram_meta_alloc
static void zram_free_page

-- a block of I/O functions
static int zram_decompress_page
static int zram_bvec_read
static int zram_bvec_write
static void zram_bio_discard
static int zram_bvec_rw
static void __zram_make_request
static void zram_make_request
static void zram_slot_free_notify
static int zram_rw_page

-- device contol: add/remove/init/reset functions (+zram-control class
will sit here)
static void zram_reset_device_internal
static int zram_reset_device
static ssize_t reset_store
static ssize_t disksize_store
static int zram_add
static void zram_remove
static int __init zram_init
static void __exit zram_exit

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: factor out device reset from reset_store()
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:44 +0000 (09:44 +1000)]
zram: factor out device reset from reset_store()

Device reset currently consists of two steps:
a) holding ->bd_mutex we ensure that there are no device users
(bdev->bd_openers)

b) and internal part (executed under bdev->bd_mutex and partially
under zram->init_lock) that resets the device - frees allocated
memory and returns the device back to its initial (un-init) state.

Dynamic device removal requires the same amount of work and checks.
We can reuse internal cleanup part (b) zram_reset_device(), but step (a)
is done in sysfs ATTR reset_store() handler.

Rename step (b) from zram_reset_device() to zram_reset_device_internal()
and factor out step (a) to zram_reset_device(). Both reset_store() and
dynamic device removal will use zram_reset_device().

For readability let's keep them separated.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: use idr instead of `zram_devices' array
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:43 +0000 (09:44 +1000)]
zram: use idr instead of `zram_devices' array

This patch makes some preparations for dynamic device ADD/REMOVE
functionality via /dev/zram-control interface.

Remove `zram_devices' array and switch to id-to-pointer translation (idr).
 idr doesn't bloat zram struct with additional members, f.e.  list_head,
yet still provides ability to match the device_id with the device pointer.
 No user-space visible changes.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agozram: cosmetic ZRAM_ATTR_RO code formatting tweak
Sergey Senozhatsky [Tue, 7 Apr 2015 23:44:43 +0000 (09:44 +1000)]
zram: cosmetic ZRAM_ATTR_RO code formatting tweak

We currently don't support zram on-demand device creation.  The only way
to have N zram devices is to specify num_devices module parameter (default
value 1).  That means that if, for some reason, at some point, user wants
to have N + 1 devies he/she must umount all the existing devices, unload
the module, load the module passing num_devices equals to N + 1.  And do
this again, if needed.

This patchset introduces zram-control sysfs class, which has two sysfs
attrs:

 - zram_add     -- add a new specific (device_id) zram device
 - zram_remove  -- remove a specific (device_id) zram device

    Usage example:
        # add a new specific zram device
        echo 4 > /sys/class/zram-control/zram_add

        # remove a specific zram device
        echo 4 > /sys/class/zram-control/zram_remove

The patchset also does some cleanups and huge code reorganization.

This patch (of 8):

Fix a misplaced backslash.

Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: lru_deactivate_fn should clear PG_referenced
Minchan Kim [Tue, 7 Apr 2015 23:44:43 +0000 (09:44 +1000)]
mm: lru_deactivate_fn should clear PG_referenced

deactivate_page aims for accelerate for reclaiming through
moving pages from active list to inactive list so we should
clear PG_referenced for the goal.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm-move-lazy-free-pages-to-inactive-list-fix-fix
Andrew Morton [Tue, 7 Apr 2015 23:44:43 +0000 (09:44 +1000)]
mm-move-lazy-free-pages-to-inactive-list-fix-fix

tweak comment

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Wang, Yalin <Yalin.Wang@sonymobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: document deactivate_page
Minchan Kim [Tue, 7 Apr 2015 23:44:43 +0000 (09:44 +1000)]
mm: document deactivate_page

This patch adds function description for deactivate_page.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Wang, Yalin <Yalin.Wang@sonymobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: move lazily freed pages to inactive list
Minchan Kim [Tue, 7 Apr 2015 23:44:42 +0000 (09:44 +1000)]
mm: move lazily freed pages to inactive list

MADV_FREE is a hint that it's okay to discard pages if there is memory
pressure and we use reclaimers(ie, kswapd and direct reclaim) to free them
so there is no value keeping them in the active anonymous LRU so this
patch moves them to inactive LRU list's head.

This means that MADV_FREE-ed pages which were living on the inactive list
are reclaimed first because they are more likely to be cold rather than
recently active pages.

An arguable issue for the approach would be whether we should put the page
to the head or tail of the inactive list.  I chose head because the kernel
cannot make sure it's really cold or warm for every MADV_FREE usecase but
at least we know it's not *hot*, so landing of inactive head would be a
comprimise for various usecases.

This fixes suboptimal behavior of MADV_FREE when pages living on the
active list will sit there for a long time even under memory pressure
while the inactive list is reclaimed heavily.  This basically breaks the
whole purpose of using MADV_FREE to help the system to free memory which
is might not be used.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Wang, Yalin <Yalin.Wang@sonymobile.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: free swp_entry in madvise_free
Minchan Kim [Tue, 7 Apr 2015 23:44:42 +0000 (09:44 +1000)]
mm: free swp_entry in madvise_free

When I test below piece of code with 12 processes(ie, 512M * 12 = 6G
consume) on my (3G ram + 12 cpu + 8G swap, the madvise_free is siginficat
slower (ie, 2x times) than madvise_dontneed.

loop = 5;
mmap(512M);
while (loop--) {
        memset(512M);
        madvise(MADV_FREE or MADV_DONTNEED);
}

The reason is lots of swapin.

1) dontneed: 1,612 swapin
2) madvfree: 879,585 swapin

If we find hinted pages were already swapped out when syscall is called,
it's pointless to keep the swapped-out pages in pte.
Instead, let's free the cold page because swapin is more expensive
than (alloc page + zeroing).

With this patch, it reduced swapin from 879,585 to 1,878 so elapsed time

1) dontneed: 6.10user 233.50system 0:50.44elapsed
2) madvfree: 6.03user 401.17system 1:30.67elapsed
2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed

Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: remove lock validation check for MADV_FREE
Minchan Kim [Tue, 7 Apr 2015 23:44:42 +0000 (09:44 +1000)]
mm: remove lock validation check for MADV_FREE

Currently, madvise_free_pte_range is called only madvise path which
already holds an mmap_sem so it's pointless to add the lock validation
check.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: Fix comment typo "CONFIG_TRANSPARNTE_HUGE"
Paul Bolle [Tue, 7 Apr 2015 23:44:42 +0000 (09:44 +1000)]
mm: Fix comment typo "CONFIG_TRANSPARNTE_HUGE"

The commit "mm: don't split THP page when syscall is called" added a
reference to CONFIG_TRANSPARNTE_HUGE in a comment.  Use
CONFIG_TRANSPARENT_HUGEPAGE instead, as was probably intended.

Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: don't split THP page when syscall is called
Minchan Kim [Tue, 7 Apr 2015 23:44:42 +0000 (09:44 +1000)]
mm: don't split THP page when syscall is called

We don't need to split THP page when MADV_FREE syscall is called.  It
could be done when VM decide really frees it so we could avoid unnecessary
THP split.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm-support-madvisemadv_free-fix-2
Andrew Morton [Tue, 7 Apr 2015 23:44:41 +0000 (09:44 +1000)]
mm-support-madvisemadv_free-fix-2

Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: define MADV_FREE for some arches
Minchan Kim [Tue, 7 Apr 2015 23:44:41 +0000 (09:44 +1000)]
mm: define MADV_FREE for some arches

Most architectures use asm-generic, but alpha, mips, parisc, xtensa
need their own definitions.

This patch defines MADV_FREE for them so it should fix build break
for their architectures.

Maybe, I should split and feed piecies to arch maintainers but
included here for mmotm convenience.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Chris Zankel <chris@zankel.net>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
9 years agomm: support madvise(MADV_FREE)
Minchan Kim [Tue, 7 Apr 2015 23:44:41 +0000 (09:44 +1000)]
mm: support madvise(MADV_FREE)

Linux doesn't have an ability to free pages lazy while other OS already
have been supported that named by madvise(MADV_FREE).

The gain is clear that kernel can discard freed pages rather than swapping
out or OOM if memory pressure happens.

Without memory pressure, freed pages would be reused by userspace without
another additional overhead(ex, page fault + allocation + zeroing).

How to work is following as.

When madvise syscall is called, VM clears dirty bit of ptes of the range.
If memory pressure happens, VM checks dirty bit of page table and if it
found still "clean", it means it's a "lazyfree pages" so VM could discard
the page instead of swapping out.  Once there was store operation for the
page before VM peek a page to reclaim, dirty bit is set so VM can swap out
the page instead of discarding.

Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc
and hope glibc supports it) and jemalloc/tcmalloc already have supported
the feature for other OS(ex, FreeBSD)

barrios@blaptop:~/benchmark/ebizzy$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             12
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 2
Stepping:              3
CPU MHz:               3200.185
BogoMIPS:              6400.53
Virtualization:        VT-x
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-11
ebizzy benchmark(./ebizzy -S 10 -n 512)

Higher avg is better.

 vanilla-jemalloc MADV_free-jemalloc

1 thread
records: 10     records: 10
avg: 2961.90     avg:   12069.70
std:   71.96(2.43%)     std:     186.68(1.55%)
max: 3070.00     max:   12385.00
min: 2796.00     min:   11746.00

2 thread
records: 10     records: 10
avg: 5020.00     avg:   17827.00
std:  264.87(5.28%)     std:     358.52(2.01%)
max: 5244.00     max:   18760.00
min: 4251.00     min:   17382.00

4 thread
records: 10     records: 10
avg: 8988.80     avg:   27930.80
std: 1175.33(13.08%)     std:    3317.33(11.88%)
max: 9508.00     max:   30879.00
min: 5477.00     min:   21024.00

8 thread
records: 10     records: 10
avg:   13036.50     avg:   33739.40
std:  170.67(1.31%)     std:    5146.22(15.25%)
max:   13371.00     max:   40572.00
min:   12785.00     min:   24088.00

16 thread
records: 10     records: 10
avg:   11092.40     avg:   31424.20
std:  710.60(6.41%)     std:    3763.89(11.98%)
max:   12446.00     max:   36635.00
min: 9949.00     min:   25669.00

32 thread
records: 10     records: 10
avg:   11067.00     avg:   34495.80
std:  971.06(8.77%)     std:    2721.36(7.89%)
max:   12010.00     max:   38598.00
min: 9002.00     min:   30636.00

In summary, MADV_FREE is about much faster than MADV_DONTNEED.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>