namei: fix warning while make xmldocs caused by namei.c
Fix the following warnings:
Warning(.//fs/namei.c:2422): No description found for parameter 'nd'
Warning(.//fs/namei.c:2422): Excess function parameter 'nameidata'
description in 'path_mountpoint'
Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Considering Linus' past rants about the (ab)use of BUG in the kernel, I
took a look at how we deal with such calls in ipc. Given that any errors
or corruption in ipc code are most likely contained within the set of
processes participating in the broken mechanisms, there aren't really many
strong fatal system failure scenarios that would require a BUG call.
Also, if something is seriously wrong, ipc might not be the place for such
a BUG either.
1. For example, recently, a customer hit one of these BUG_ONs in shm
after failing shm_lock(). A busted ID imho does not merit a BUG_ON,
and WARN would have been better.
2. MSG_COPY functionality of posix msgrcv(2) for checkpoint/restore.
I don't see how we can hit this anyway -- at least it should be IS_ERR.
The 'copy' arg from do_msgrcv is always set by calling prepare_copy()
first and foremost. We could also probably drop this check altogether.
Either way, it does not merit a BUG_ON.
3. No ->fault() callback for the fs getting the corresponding page --
seems selfish to make the system unusable.
Yinghai Lu [Wed, 9 Sep 2015 22:39:12 +0000 (15:39 -0700)]
lib/decompressors: use real out buf size for gunzip with kernel
When loading an x86 64-bit kernel above 4GiB with a patched grub2, we get
a kernel gunzip error.
| early console in decompress_kernel
| decompress_kernel:
| input: [0x807f2143b4-0x807ff61aee]
| output: [0x807cc00000-0x807f3ea29b] 0x027ea29c: output_len
| boot via startup_64
| KASLR using RDTSC...
| new output: [0x46fe000000-0x470138cfff] 0x0338d000: output_run_size
| decompress: [0x46fe000000-0x47007ea29b] <=== [0x807f2143b4-0x807ff61aee]
|
| Decompressing Linux... gz...
|
| uncompression error
|
| -- System halted
The new buffer is at 0x46fe000000ULL, and decompressor_gzip is using
0xffffffb901ffffff as out_len. gunzip in lib/zlib_inflate/inflate.c caps
that length to 0x01ffffff, and decompression fails later.
We could also hit this problem when booting a crashkernel that kexec loads
above 4GiB.
We have decompress_* support:
1. inbuf[]/outbuf[] for kernel preboot.
2. inbuf[]/flush() for initramfs
3. fill()/flush() for initrd.
This bug only affects the kernel preboot path, which uses outbuf[].
Add __decompress and pass the real out_buf_len to gunzip instead of
guessing a wrong buffer size.
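The numbers in the log above can be reproduced with a small userspace
sketch. It assumes that the "no limit" length was computed as the distance
from the output buffer to the top of a 64-bit address space, and that the
length then passes through a 32-bit field. Both are assumptions that happen
to match the values quoted above; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed "no limit" formula: everything from out_buf up to the top of
 * a 64-bit address space. */
static uint64_t no_limit_out_len(uint64_t out_buf)
{
	return UINT64_MAX - out_buf;
}

/* Assumed mechanism of the cap: only the low 32 bits survive. */
static uint32_t low_32_bits(uint64_t len)
{
	return (uint32_t)len;
}

/* Real size of an inclusive [start, end] output range. */
static uint64_t real_out_len(uint64_t start, uint64_t end)
{
	assert(end >= start);
	return end - start + 1;
}
```

With out_buf = 0x46fe000000 the first helper gives 0xffffffb901ffffff, its
low 32 bits are 0x01ffffff, and the real size of the "new output" range is
0x0338d000, which is larger than the capped length.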
Fixes: 1431574a1c4 (lib/decompressors: fix "no limit" output buffer length) Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Alexandre Courbot <acourbot@nvidia.com> Cc: Jon Medhurst <tixy@linaro.org> Cc: Stephen Warren <swarren@wwwdotorg.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs/affs: make root lookup from blkdev logical size
This patch resolves https://bugzilla.kernel.org/show_bug.cgi?id=16531.
When logical blkdev size > 512 then sector numbers become larger than the
device can support.
Make affs start lookup based on the device's logical sector size instead
of 512.
Reported-by: Mark <markk@clara.co.uk> Suggested-by: Mark <markk@clara.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sysctl: fix int -> unsigned long assignments in INT_MIN case
The following
if (val < 0)
*lvalp = (unsigned long)-val;
is incorrect because the compiler is free to assume -val to be positive
and use a sign-extend instruction for extending the bit pattern. This is
a problem if val == INT_MIN:
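As an illustration only, here is a userspace sketch of one well-defined way
to get the magnitude of a negative int: convert to unsigned long first and
negate in the unsigned domain, so no signed overflow can occur. The helper
name is hypothetical and this is not necessarily the exact form used in
the patch.

```c
#include <assert.h>
#include <limits.h>

/* Magnitude of a negative int as an unsigned long.
 * The conversion to unsigned long and the unsigned subtraction are both
 * well defined, even for val == INT_MIN, unlike "-val" on a signed int. */
static unsigned long magnitude_of_negative(int val)
{
	assert(val < 0);
	return 0UL - (unsigned long)val;
}
```

For val == INT_MIN this yields INT_MAX + 1 without ever negating a signed
value.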
Baoquan He [Wed, 9 Sep 2015 22:39:03 +0000 (15:39 -0700)]
kexec: export KERNEL_IMAGE_SIZE to vmcoreinfo
On x86_64, KERNEL_IMAGE_SIZE has been 512M since v2.6.26, and accordingly
MODULES_VADDR has been 0xffffffffa0000000. However, in v3.12 Kees Cook
introduced kaslr to randomise the location of the kernel, and the kernel
text mapping addr space was enlarged from 512M to 1G. That means
KERNEL_IMAGE_SIZE is now variable: its value is 512M when kaslr support is
not compiled in and 1G when kaslr support is compiled in.
Accordingly the MODULES_VADDR is changed too to be:
So when kaslr is compiled in and enabled, the kernel text mapping addr
space and the modules vaddr space need to be adjusted. Otherwise
makedumpfile will fail, since the addresses of some symbols are not correct.
Hence KERNEL_IMAGE_SIZE needs to be exported to vmcoreinfo and read by
makedumpfile to help calculate MODULES_VADDR.
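A sketch of the calculation a dump tool could do once KERNEL_IMAGE_SIZE is
known. It assumes MODULES_VADDR is the start of the kernel text mapping
plus KERNEL_IMAGE_SIZE, and that the mapping starts at 0xffffffff80000000;
the base address and the helper name are assumptions, chosen because they
reproduce the 0xffffffffa0000000 value quoted above for 512M.

```c
#include <assert.h>
#include <stdint.h>

#define START_KERNEL_MAP 0xffffffff80000000ULL	/* assumed text mapping base */
#define SZ_512M (512ULL << 20)
#define SZ_1G   (1024ULL << 20)

/* Hypothetical helper: derive MODULES_VADDR from the exported size. */
static uint64_t modules_vaddr(uint64_t kernel_image_size)
{
	assert(kernel_image_size == SZ_512M || kernel_image_size == SZ_1G);
	return START_KERNEL_MAP + kernel_image_size;
}
```

Under these assumptions a kernel built with kaslr support places modules
at 0xffffffffc0000000 rather than 0xffffffffa0000000.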
Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Baoquan He [Wed, 9 Sep 2015 22:39:00 +0000 (15:39 -0700)]
kexec: align crash_notes allocation to make it be inside one physical page
People reported that crash_notes in /proc/vmcore were corrupted and that
this caused kdump to fail. With code debugging and logs we found the root
cause: the percpu variable crash_notes is allocated across 2 vmalloc
pages. Currently percpu is based on vmalloc by default, and vmalloc
can't guarantee that 2 contiguous vmalloc pages are also 2 contiguous
physical pages. So when the 1st kernel exports the starting address and size
of crash_notes through sysfs like below:
the kdump kernel uses them to get the content of crash_notes. However, the
2nd part may not be in the next neighbouring physical page as we expected if
crash_notes is allocated across 2 vmalloc pages. That's why
nhdr_ptr->n_namesz or nhdr_ptr->n_descsz could be very large in
update_note_header_size_elf64() and cause a note header merging failure or
some warnings.
This patch changes the code to call __alloc_percpu() and pass in an align
value obtained by rounding crash_notes_size up to the nearest power of two.
This makes sure crash_notes is allocated inside one physical page, since
sizeof(note_buf_t) on all ARCHS is smaller than PAGE_SIZE. Meanwhile, add
a BUILD_BUG_ON to break the compile if the size is bigger than PAGE_SIZE,
since crash_notes would then definitely span 2 pages. That needs to be
avoided, and needs to be reported if it's unavoidable.
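The alignment argument can be checked with a small userspace sketch. It
assumes a 4096-byte page and uses hypothetical helper names; the note size
of 360 bytes in the usage note is an arbitrary example, not the real
sizeof(note_buf_t).

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size */

/* Smallest power of two that is >= n. */
static size_t roundup_pow2(size_t n)
{
	size_t p = 1;

	assert(n > 0);
	while (p < n)
		p <<= 1;
	return p;
}

/* Round addr up to a power-of-two alignment. */
static size_t align_up(size_t addr, size_t align)
{
	return (addr + align - 1) & ~(align - 1);
}

/* True if [addr, addr + size) touches more than one page. */
static int crosses_page(size_t addr, size_t size)
{
	return addr / SKETCH_PAGE_SIZE != (addr + size - 1) / SKETCH_PAGE_SIZE;
}
```

A 360-byte object placed at offset 4000 crosses a page boundary. Aligned to
roundup_pow2(360) = 512 it starts at 4096 and stays inside one page, and
the same holds for any start address because the page size is a multiple
of the alignment.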
[akpm@linux-foundation.org: use correct comment layout] Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Lisa Mitchell <lisa.mitchell@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kexec: remove unnecessary test in kimage_alloc_crash_control_pages()
Transforming a PFN (Page Frame Number) to a struct page never fails, so we
can simplify the code logic by doing the image->control_page assignment
directly in the loop and removing the unnecessary conditional check.
Signed-off-by: Minfei Huang <mnfhuang@gmail.com> Acked-by: Dave Young <dyoung@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Cc: Simon Horman <horms@verge.net.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Young [Wed, 9 Sep 2015 22:38:55 +0000 (15:38 -0700)]
kexec: split kexec_load syscall from kexec core code
There are two kexec load syscalls, kexec_load and kexec_file_load.
kexec_file_load has already been split out as kernel/kexec_file.c. In this
patch I split the kexec_load syscall code into kernel/kexec.c.
And add a new kconfig option KEXEC_CORE, so we can disable kexec_load and
use kexec_file_load only, or vice versa.
The original requirement is from Ted Ts'o: he wants the kexec kernel
signature to be checked with CONFIG_KEXEC_VERIFY_SIG enabled. But
kexec-tools using the kexec_load syscall can bypass the checking.
Vivek Goyal proposed to create a common kconfig option so the user can
compile in only one syscall for loading a kexec kernel. KEXEC/KEXEC_FILE
selects KEXEC_CORE so that old config files still work.
Because there is general code that needs CONFIG_KEXEC_CORE, I updated all
the architecture Kconfig files with a new option KEXEC_CORE, and let KEXEC
select KEXEC_CORE in the arch Kconfig. Also updated general kernel code
with to kexec_load syscall.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Dave Young <dyoung@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Petr Tesarik <ptesarik@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: David Howells <dhowells@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs: Don't dump core if the corefile would become world-readable.
On a filesystem like vfat, all files are created with the same owner
and mode independent of who created the file. When a vfat filesystem
is mounted with root as owner of all files and read access for everyone,
root's processes left world-readable coredumps on it (but other
users' processes only left empty corefiles when given write access
because of the uid mismatch).
Given that the old behavior was inconsistent and insecure, I don't see
a problem with changing it. Now, all processes refuse to dump core unless
the resulting corefile will only be readable by their owner.
Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs: if a coredump already exists, unlink and recreate with O_EXCL
It was possible for an attacking user to trick root (or another user) into
writing his coredumps into an attacker-readable, pre-existing file using
rename() or link(), causing the disclosure of secret data from the victim
process' virtual memory. Depending on the configuration, it was also
possible to trick root into overwriting system files with coredumps. Fix
that issue by never writing coredumps into existing files.
Requirements for the attack:
- The attack only applies if the victim's process has a nonzero
RLIMIT_CORE and is dumpable.
- The attacker can trick the victim into coredumping into an
attacker-writable directory D, either because the core_pattern is
relative and the victim's cwd is attacker-writable or because an
absolute core_pattern pointing to a world-writable directory is used.
- The attacker has one of these:
A: on a system with protected_hardlinks=0:
execute access to a folder containing a victim-owned,
attacker-readable file on the same partition as D, and the
victim-owned file will be deleted before the main part of the attack
takes place. (In practice, there are lots of files that fulfill
this condition, e.g. entries in Debian's /var/lib/dpkg/info/.)
This does not apply to most Linux systems because most distros set
protected_hardlinks=1.
B: on a system with protected_hardlinks=1:
execute access to a folder containing a victim-owned,
attacker-readable and attacker-writable file on the same partition
as D, and the victim-owned file will be deleted before the main part
of the attack takes place.
(This seems to be uncommon.)
C: on any system, independent of protected_hardlinks:
write access to a non-sticky folder containing a victim-owned,
attacker-readable file on the same partition as D
(This seems to be uncommon.)
The basic idea is that the attacker moves the victim-owned file to where
he expects the victim process to dump its core. The victim process dumps
its core into the existing file, and the attacker reads the coredump from
it.
If the attacker can't move the file because he does not have write access
to the containing directory, he can instead link the file to a directory
he controls, then wait for the original link to the file to be deleted
(because the kernel checks that the link count of the corefile is 1).
A less reliable variant that requires D to be non-sticky works with link()
and does not require deletion of the original link: link() the file into
D, but then unlink() it directly before the kernel performs the link count
check.
On systems with protected_hardlinks=0, this variant allows an attacker to
not only gain information from coredumps, but also clobber existing,
victim-writable files with coredumps. (This could theoretically lead to a
privilege escalation.)
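The idea of "never write into an existing file" can be sketched in
userspace with unlink() followed by open() with O_CREAT | O_EXCL. This is
an analogue of the approach named in the subject line, not the kernel
code; the helper name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Remove whatever is at path, then create a brand new file there.
 * O_EXCL makes open() fail if something (including a symlink) reappears
 * at path between the unlink() and the open(). Returns an fd or -1. */
static int open_fresh_corefile(const char *path)
{
	if (unlink(path) != 0 && errno != ENOENT)
		return -1;
	return open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
}
```

A pre-existing file that an attacker moved or linked into place is
detached from the name by the unlink(), so the data ends up in a new inode
created by the dumping process.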
Signed-off-by: Jann Horn <jann@thejh.net> Cc: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmod: handle UMH_WAIT_PROC from system unbound workqueue
The UMH_WAIT_PROC handler runs in its own thread in order to make sure
that waiting for the exec kernel thread completion won't block other
usermodehelper queued jobs.
On older workqueue implementations, worklets couldn't sleep without
blocking the rest of the queue. But now the workqueue subsystem handles
that. Khelper still had the older limitation due to its singlethread
properties, but we replaced it with system unbound workqueues.
Those are affine to the current node and can block up to some number of
instances.
They are a good candidate to handle UMH_WAIT_PROC assuming that we have
enough system unbound workers to handle lots of parallel usermodehelper
jobs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to launch the usermodehelper kernel threads with the widest
affinity and this is partly why we use khelper. This workqueue has
unbound properties and thus a wide affinity inherited by all its children.
Now khelper also has special properties that we aren't much interested in:
ordered and singlethread. There is really no need for ordering, as all
we do is create kernel threads. This can be done concurrently. And
singlethread is a useless limitation as well.
The workqueue engine already provides generic unbound workqueues that
don't share these useless properties and handle parallel jobs well.
The only worrisome specific is their affinity to the node of the current
CPU. It's fine for creating the usermodehelper kernel threads but those
inherit this affinity for longer jobs such as requesting modules.
This patch proposes to use these node affine unbound workqueues assuming
that a node is sufficient to handle several parallel usermodehelper
requests.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kmod: add up-to-date explanations on the purpose of each asynchronous levels
There seems to be quite some confusion in the comments, likely due to
changes that came after them.
Now since it's far from obvious why we have 3 levels of asynchronous code
to implement usermodehelpers, it's important to comment in detail on the
reason for this layout.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset does a bunch of cleanups and converts khelper to use system
unbound workqueues. The first 3 patches should be uncontroversial. The
last 2 patches are debatable.
Kmod creates kernel threads that perform userspace jobs and we want those
to have a large affinity in order not to contend with busy CPUs. This is
(partly) why we use khelper, which has a wide affinity that the kernel
threads it creates can inherit. Now khelper is a dedicated workqueue
that has singlethread properties which we aren't interested in.
Hence those two debatable changes:
_ We would like to use generic workqueues. System unbound workqueues are
a very good candidate but they are not wide affine, only node affine.
Now probably a node is enough to perform many parallel kmod jobs.
_ We would like to remove the wait_for_helper kernel thread (UMH_WAIT_PROC
handler) to use the workqueue. It means that if the workqueue blocks,
and no other worker can take pending kmod requests, we can be screwed.
Now if we have 512 threads, this should be enough.
This patch (of 5):
Underscores in function names don't do much to explain the purpose of a
function, and kmod has some interesting examples of this.
kmod: correct documentation of return status of request_module
If request_module() successfully runs modprobe, but modprobe exits with a
non-zero status, then the return value from request_module() will be that
(positive) error status. So the return from request_module can be:
negative errno
zero for success
positive exit code.
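For callers, the three cases above can be told apart with a simple sign
test. The sketch below is illustrative only; the enum and helper names are
hypothetical and not part of the kernel API.

```c
/* Hypothetical classification of a request_module() return value. */
enum rm_outcome {
	RM_KERNEL_ERROR,	/* negative errno */
	RM_SUCCESS,		/* zero */
	RM_MODPROBE_FAILED	/* positive modprobe exit code */
};

static enum rm_outcome classify_request_module(int ret)
{
	if (ret < 0)
		return RM_KERNEL_ERROR;
	if (ret == 0)
		return RM_SUCCESS;
	return RM_MODPROBE_FAILED;
}
```

The point is that a check such as "ret < 0" alone misses the case where
modprobe ran but exited with a non-zero status.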
hfs: fix B-tree corruption after insertion at position 0
Fix B-tree corruption when a new record is inserted at position 0 in the
node in hfs_brec_insert().
This is an identical change to the corresponding hfs b-tree code to Sergei
Antonov's "hfsplus: fix B-tree corruption after insertion at position 0",
to keep similar code paths in the hfs and hfsplus drivers in sync, where
appropriate.
Signed-off-by: Hin-Tak Leung <htl10@users.sourceforge.net> Cc: Sergei Antonov <saproj@gmail.com> Cc: Joe Perches <joe@perches.com> Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Anton Altaparmakov <anton@tuxera.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hfs,hfsplus: cache pages correctly between bnode_create and bnode_free
Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
hfs_bnode_find() for finding or creating pages corresponding to an inode)
are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
and should not be page_cache_release()'ed until hfs_bnode_free().
This patch fixes a problem I first saw in July 2012: merely running "du"
on a large hfsplus-mounted directory a few times on a reasonably loaded
system would get the hfsplus driver all confused and complaining about
B-tree inconsistencies, and generate a "BUG: Bad page state". Most
recently, I can generate this problem on up-to-date Fedora 22 with shipped
kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller
mounts) and "du /mnt" simultaneously on two windows, where /mnt is a
lightly-used QEMU VM image of the full Mac OS X 10.9:
After applying the patch, I was able to run "du /" (60+ times) and "du
/mnt" (150+ times) continuously and simultaneously for 6+ hours.
There are many reports of the hfsplus driver getting confused under load
and generating "BUG: Bad page state" or other similar issues over the
years. [1]
The unpatched code [2] has always been wrong since it entered the kernel
tree. The only reason why it gets away with it is that the
kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
the kernel has not had a chance to reuse the memory for something else,
most of the time.
The current RW driver appears to have followed the design and development
of the earlier read-only hfsplus driver [3], whereby version 0.1 (Dec
2001) had a B-tree node-centric approach to
read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
migrating towards version 0.2 (June 2002) of caching and releasing pages
per inode extents. When the current RW code first entered the kernel [2]
in 2005, there was a REF_PAGES conditional (and "//" commented out code)
to switch between B-node centric paging and inode-centric paging. There
was a mistake with the direction of one of the REF_PAGES conditionals in
__hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the
read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
removed, but a page_cache_release() was mistakenly left in (propagating
the "REF_PAGES <-> !REF_PAGE" mistake), and the commented-out
page_cache_release() in bnode_release() (which should be spanned by
!REF_PAGES) was never enabled.
References:
[1]:
Michael Fox, Apr 2013
http://www.spinics.net/lists/linux-fsdevel/msg63807.html
("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")
Sasha Levin, Feb 2015
http://lkml.org/lkml/2015/2/20/85 ("use after free")
Jan Harkes [Wed, 9 Sep 2015 22:38:01 +0000 (15:38 -0700)]
fs/coda: fix readlink buffer overflow
Dan Carpenter discovered a buffer overflow in the Coda file system
readlink code. A userspace file system daemon can return a 4096 byte
result which then triggers a one byte write past the allocated readlink
result buffer.
This does not trigger with an unmodified Coda implementation because Coda
has a 1024 byte limit for symbolic links, however other userspace file
systems using the Coda kernel module could be affected.
Although this is an obvious overflow, I don't think this has to be handled
as too sensitive from a security perspective because the overflow is on
the Coda userspace daemon side which already needs root to open Coda's
kernel device and to mount the file system before we get to the point that
links can be read.
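The general shape of such a fix can be sketched as a bounded copy that
always leaves room for one extra byte. This assumes the stray byte is a
terminating NUL written after the returned data; the helper name is
hypothetical and this is not the Coda code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy at most dstlen - 1 bytes of an untrusted result and terminate it.
 * Returns the number of payload bytes kept. */
static size_t copy_link_result(char *dst, size_t dstlen,
			       const char *src, size_t srclen)
{
	assert(dstlen > 0);
	if (srclen >= dstlen)
		srclen = dstlen - 1;	/* keep one byte for the terminator */
	memcpy(dst, src, srclen);
	dst[srclen] = '\0';
	return srclen;
}
```

With an 8-byte buffer and a 10-byte result, only 7 bytes are kept and the
terminator lands on the last byte of the buffer instead of one past it.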
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Jan Harkes <jaharkes@cs.cmu.edu> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 9 Sep 2015 22:37:58 +0000 (15:37 -0700)]
checkpatch: add constant comparison on left side test
"CONST <comparison> variable" checks like:
if (NULL != foo)
and
while (0 < bar(...))
where a constant (or what appears to be a constant like an upper case
identifier) is on the left of a comparison are generally preferred to be
written using the constant on the right side like:
if (foo != NULL)
and
while (bar(...) > 0)
Add a test for this.
Add a --fix option too, but only do it when the code is immediately
surrounded by parentheses to avoid misfixing things like "(0 < bar() +
constant)"
Signed-off-by: Joe Perches <joe@perches.com> Cc: Nicolas Morey Chaisemartin <nmorey@kalray.eu> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 9 Sep 2015 22:37:55 +0000 (15:37 -0700)]
checkpatch: add __pmem to $Sparse annotations
commit 61031952f4c8 ("arch, x86: pmem api for ensuring durability of
persistent memory updates") added a new __pmem annotation for sparse
verification. Add __pmem to the $Sparse variable so checkpatch can
appropriately ignore uses of this attribute too.
Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 9 Sep 2015 22:37:41 +0000 (15:37 -0700)]
checkpatch: always check block comment styles
Some of the block comment tests that are used only for networking are
appropriate for all patches.
For example, these styles are not encouraged:
/*
block comment without introductory *
*/
and
/*
* block comment with line terminating */
Remove the networking specific test and add comments.
There are some infrequent false positives where code is lazily
commented out using /* and */ rather than using #if 0/#endif blocks
like:
/* case foo:
case bar: */
case baz:
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Wed, 9 Sep 2015 22:37:27 +0000 (15:37 -0700)]
checkpatch: add warning on BUG/BUG_ON use
Using BUG/BUG_ON crashes the kernel and is just unfriendly.
Enable code that emits a warning on BUG/BUG_ON use.
Make the code emit the message at WARNING level when scanning a patch and
at CHECK level when scanning files so that script users don't feel an
obligation to fix code that might be above their pay grade.
Signed-off-by: Joe Perches <joe@perches.com> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wang Long [Wed, 9 Sep 2015 22:37:22 +0000 (15:37 -0700)]
lib/test_kasan.c: make kmalloc_oob_krealloc_less more correctly
In kmalloc_oob_krealloc_less, I think it is better to test
the size2 boundary.
If we do not call krealloc, an access at position size1 will still be
out-of-bounds while an access at position size2 is not. After calling
krealloc, an access at position size2 is out-of-bounds. So using size2 is
more correct.
Signed-off-by: Wang Long <long.wanglong@huawei.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To further clarify the purpose of the "esc" argument, rename it to "only"
to reflect that it is a limit, not a list of additional characters to
escape.
hexdump: do not print debug dumps for !CONFIG_DEBUG
print_hex_dump_debug() is likely supposed to be analogous to pr_debug() or
dev_dbg() & friends. Currently it will adhere to dynamic debug, but will
not stub out prints if CONFIG_DEBUG is not set. Let's make it do the
right thing, because I am tired of having my dmesg buffer full of hex
dumps on production systems.
Pan Xinhui [Wed, 9 Sep 2015 22:37:08 +0000 (15:37 -0700)]
lib/bitmap.c: bitmap_parselist can accept string with whitespaces on head or tail
In __bitmap_parselist we can accept whitespace at the head or tail during
every parsing procedure. If the input has valid ranges, there is no reason
to reject the user.
For example, bitmap_parselist(" 1-3, 5, ", &mask, nmaskbits). After
separating the string, we get " 1-3", " 5", and " ". It's possible and
reasonable to accept such a string as long as the parsing result is correct.
Signed-off-by: Pan Xinhui <xinhuix.pan@intel.com> Cc: Yury Norov <yury.norov@gmail.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pan Xinhui [Wed, 9 Sep 2015 22:37:05 +0000 (15:37 -0700)]
lib/bitmap.c: fix a special string handling bug in __bitmap_parselist
If the string ends with '-', for example bitmap_parselist("1,0-", &mask,
nmaskbits), it is not a valid pattern, so add a check after the loop.
Return -EINVAL in that case.
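Both behaviours (tolerating whitespace around chunks, rejecting a dangling
'-') can be illustrated with a simplified userspace parser. This is a
sketch under assumptions, not the kernel implementation: it fills a 64-bit
mask, the function names are hypothetical, and details such as how empty
chunks between commas are treated may differ from lib/bitmap.c.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

static const char *skip_ws(const char *s)
{
	while (isspace((unsigned char)*s))
		s++;
	return s;
}

/* Parse a list such as " 1-3, 5, " into *mask; *mask is only written
 * on success. */
static int parse_list(const char *s, uint64_t *mask)
{
	uint64_t m = 0;

	for (;;) {
		unsigned long a, b, i;
		char *end;

		s = skip_ws(s);
		if (*s == '\0')
			break;		/* only whitespace was left */
		if (!isdigit((unsigned char)*s))
			return -EINVAL;
		a = b = strtoul(s, &end, 10);
		s = end;
		if (*s == '-') {
			s++;
			if (!isdigit((unsigned char)*s))
				return -EINVAL;	/* e.g. "0-" with no end */
			b = strtoul(s, &end, 10);
			s = end;
		}
		if (a > b)
			return -EINVAL;
		if (b >= 64)
			return -ERANGE;
		s = skip_ws(s);
		if (*s == ',')
			s++;
		else if (*s != '\0')
			return -EINVAL;
		for (i = a; i <= b; i++)
			m |= (uint64_t)1 << i;
	}
	*mask = m;
	return 0;
}
```

With this sketch " 1-3, 5, " sets bits 1, 2, 3 and 5, while "1,0-" is
rejected with -EINVAL.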
Signed-off-by: Pan Xinhui <xinhuix.pan@intel.com> Cc: Yury Norov <yury.norov@gmail.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
include/linux/printk.h: include pr_fmt in pr_debug_ratelimited
The other two implementations of pr_debug_ratelimited include pr_fmt,
along with every other pr_* function. But pr_debug_ratelimited forgot to
add it with the CONFIG_DYNAMIC_DEBUG implementation.
This patch unifies the behavior.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e0e817392b9a ("CRED: Add some configurable debugging [try #6]")
added the kdebug mechanism to this file back in 2009.
The kdebug macro calls no_printk which always evaluates arguments.
Most of the kdebug uses have an unnecessary call of
atomic_read(&cred->usage)
Make the kdebug macro do nothing by defining it with
do { if (0) no_printk(...); } while (0)
when not enabled.
$ size kernel/cred.o* (defconfig x86-64)
text data bss dec hex filename
2748 336 8 3092 c14 kernel/cred.o.new
2788 336 8 3132 c3c kernel/cred.o.old
Miscellanea:
o Neaten the #define kdebug macros while there
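The effect of the "if (0)" wrapper can be shown in userspace: the
arguments are still type-checked by the compiler but never evaluated at
run time. In this sketch no_printk_stub() is a hypothetical stand-in for
the kernel's no_printk(), and the two macro names are made up for the
comparison. It relies on GNU C extensions (##__VA_ARGS__, format
attribute), as kernel code does.

```c
static int evaluations;

/* Side-effecting argument, standing in for atomic_read(&cred->usage). */
static int bump(void)
{
	return ++evaluations;
}

/* Stand-in for no_printk(): checks the format, prints nothing. */
static inline __attribute__((format(printf, 1, 2)))
int no_printk_stub(const char *fmt, ...)
{
	(void)fmt;
	return 0;
}

/* Old style: the call is real, so arguments are evaluated. */
#define kdebug_evaluating(fmt, ...) \
	no_printk_stub(fmt, ##__VA_ARGS__)

/* New style: the call sits in dead code, so arguments are not evaluated. */
#define kdebug_disabled(fmt, ...) \
	do { if (0) no_printk_stub(fmt, ##__VA_ARGS__); } while (0)

static int count_evaluations_old(void)
{
	evaluations = 0;
	kdebug_evaluating("%d\n", bump());
	return evaluations;
}

static int count_evaluations_new(void)
{
	evaluations = 0;
	kdebug_disabled("%d\n", bump());
	return evaluations;
}
```

The old form evaluates bump() once per call site; the new form evaluates
it zero times, which is where a text size saving like the one above would
come from.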
Signed-off-by: Joe Perches <joe@perches.com> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Poison pointer values should be small enough to fit in
non-mmap'able/hardly-mmap'able space. E.g. on x86 "poison pointer space"
is located starting from 0x0. Given unprivileged users cannot mmap
anything below mmap_min_addr, it should be safe to use poison pointers
lower than mmap_min_addr.
The current poison pointer values of LIST_POISON{1,2} might be too big for
mmap_min_addr values equal to or less than 1 MB (common case, e.g. Ubuntu
uses only 0x10000). There is little point in using such a big value given
the "poison pointer space" below 1 MB is not yet exhausted. Changing it
to a smaller value solves the problem for small mmap_min_addr setups.
The values are suggested by Solar Designer:
http://www.openwall.com/lists/oss-security/2015/05/02/6
Signed-off-by: Vasily Kulikov <segoon@openwall.com> Cc: Solar Designer <solar@openwall.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Waiman Long [Wed, 9 Sep 2015 22:35:57 +0000 (15:35 -0700)]
proc: change proc_subdir_lock to a rwlock
The proc_subdir_lock spinlock is used to allow only one task to make
changes to the proc directory structure as well as to look up information
in it. However, the information lookup part can actually be entered by
more than one task, as the pde_get() and pde_put() reference count update
calls in the critical sections are atomic increment and decrement
respectively and so are safe with concurrent updates.
The x86 architecture already uses qrwlock, which is fair, and other
architectures like ARM are in the process of switching to qrwlock. So
unfairness shouldn't be a concern in that conversion.
This patch changed the proc_subdir_lock to a rwlock in order to enable
concurrent lookup. The following functions were modified to take a
write lock:
- proc_register()
- remove_proc_entry()
- remove_proc_subtree()
The following functions were modified to take a read lock:
- xlate_proc_name()
- proc_lookup_de()
- proc_readdir_de()
A parallel /proc filesystem search with the "find" command (1000 threads)
was run on a 4-socket Haswell-EX box (144 threads). Before the patch, the
parallel search took about 39s. After the patch, the parallel find took
only 25s, a saving of about 14s.
The micro-benchmark that I used was artificial, but it was used to
reproduce an exit hanging problem that I saw in a real application. In
fact, allowing only one task to do a lookup seems too limiting to me.
Signed-off-by: Waiman Long <Waiman.Long@hp.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Scott J Norton <scott.norton@hp.com> Cc: Douglas Hatch <doug.hatch@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
procfs: always expose /proc/<pid>/map_files/ and make it readable
Currently, /proc/<pid>/map_files/ is restricted to CAP_SYS_ADMIN, and is
only exposed if CONFIG_CHECKPOINT_RESTORE is set.
Each mapped file region gets a symlink in /proc/<pid>/map_files/
corresponding to the virtual address range at which it is mapped. The
symlinks work like the symlinks in /proc/<pid>/fd/, so you can follow them
to the backing file even if that backing file has been unlinked.
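As a rough illustration of how userspace might consume these entries, the sketch below parses an entry name into its address range and stats the backing file. It assumes entry names have the form START-END in hexadecimal (for example 400000-413000); the helper names are hypothetical. Note that os.stat() follows the symlink, so it is subject to whatever permission check is enforced on following the link.

```python
import os


def parse_map_files_name(name):
    """Split an entry name 'START-END' (hex, no 0x prefix) into two ints."""
    start_s, end_s = name.split("-")
    return int(start_s, 16), int(end_s, 16)


def mapped_file_sizes(pid):
    """Yield (start, end, st_size) for each entry in /proc/<pid>/map_files/.

    os.stat() follows the symlink to the backing file, which works even
    if that file has been unlinked, provided the caller may follow it.
    """
    base = "/proc/%d/map_files" % pid
    for name in os.listdir(base):
        start, end = parse_map_files_name(name)
        st = os.stat(os.path.join(base, name))
        yield start, end, st.st_size
```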
Currently, files which are mapped, unlinked, and closed are impossible to
stat() from userspace. Exposing /proc/<pid>/map_files/ closes this
functionality "hole".
Not being able to stat() such files makes noticing and explicitly
accounting for the space they use on the filesystem impossible. You can
work around this by summing up the space used by every file in the
filesystem and subtracting that total from what statfs() tells you, but
that obviously isn't great, and it becomes unworkable once your filesystem
becomes large enough.
This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
adjusts the permissions enforced on it as follows:
Remove the CAP_SYS_ADMIN restriction, leaving only the current
restriction requiring PTRACE_MODE_READ. The information made
available to userspace by these three functions is already
available in /proc/PID/maps with MODE_READ, so I don't see any
reason to limit them any further (see below for more detail).
* proc_map_files_follow_link()
This stub has been added, and requires that the user have
CAP_SYS_ADMIN in order to follow the links in map_files/,
since there was concern on LKML both about the potential for
bypassing permissions on ancestor directories in the path to
files pointed to, and about what happens with more exotic
memory mappings created by some drivers (i.e. dma-buf).
In older versions of this patch, I changed every permission check in
the four functions above to enforce MODE_ATTACH instead of MODE_READ.
This was an oversight on my part, and after revisiting the discussion
it seems that nobody was concerned about anything outside of what is
made possible by ->follow_link(). So in this version, I've left the
checks for PTRACE_MODE_READ as-is.
[akpm@linux-foundation.org: catch up with concurrent proc_pid_follow_link() changes] Signed-off-by: Calvin Owens <calvinowens@fb.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Joe Perches <joe@perches.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
proc: add cond_resched to /proc/kpage* read/write loop
Reading/writing a /proc/kpage* file may take a long time on machines with
a lot of RAM installed.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Suggested-by: Andres Lagar-Cavilla <andreslc@google.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As noted by Minchan, a benefit of reading idle flag from /proc/kpageflags
is that one can easily filter dirty and/or unevictable pages while
estimating the size of unused memory.
Note that idle flag read from /proc/kpageflags may be stale in case the
page was accessed via a PTE, because it would be too costly to iterate
over all page mappings on each /proc/kpageflags read to provide an
up-to-date value. To make sure the flag is up-to-date one has to read
/sys/kernel/mm/page_idle/bitmap first.
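The filtering described above might look roughly like this. It is a sketch under assumptions: the bit numbers are the values I recall from include/uapi/linux/kernel-page-flags.h and should be checked against the header of the kernel in use, and the function names are hypothetical. The input is a byte buffer already read from /proc/kpageflags (one 64-bit word per PFN).

```python
import struct

# Assumed bit numbers (verify against kernel-page-flags.h for your kernel).
KPF_DIRTY = 4
KPF_UNEVICTABLE = 18
KPF_IDLE = 25


def decode_kpageflags(buf):
    """Decode a buffer read from /proc/kpageflags into a list of 64-bit flags."""
    count = len(buf) // 8
    return list(struct.unpack("=%dQ" % count, buf[:count * 8]))


def is_clean_idle(flags):
    """True if the page is idle and neither dirty nor unevictable."""
    if not (flags >> KPF_IDLE) & 1:
        return False
    if (flags >> KPF_DIRTY) & 1 or (flags >> KPF_UNEVICTABLE) & 1:
        return False
    return True
```

As the note above says, the idle bit seen here may be stale unless /sys/kernel/mm/page_idle/bitmap has been read first.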
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Knowing the portion of memory that is not used by a certain application or
memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced. However,
this method has two serious shortcomings:
- it does not count unmapped file pages
- it affects the reclaimer logic
To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
A page's Idle flag can only be set from userspace by setting the bit in
/sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
and it is cleared whenever the page is accessed either through page tables
(it is cleared in page_referenced() in this case) or using the read(2)
system call (mark_page_accessed()). Thus by setting the Idle flag for
pages of a particular workload, which can be found e.g. by reading
/proc/PID/pagemap, waiting for some time to let the workload access its
working set, and then reading the bitmap file, one can estimate the amount
of pages that are not used by the workload.
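To make "the offset corresponding to the page" concrete, here is a small sketch. It assumes the bitmap is laid out as an array of native-endian 8-byte words, with PFN i stored in bit i % 64 of word i / 64; that layout comes from my recollection of the kernel's idle page tracking documentation, not from this message, and the helper names are hypothetical.

```python
def bitmap_position(pfn):
    """Return (byte offset of the 8-byte word, bit index inside that word)."""
    return (pfn // 64) * 8, pfn % 64


def mark_idle_request(pfns):
    """Group PFNs into {byte_offset: 64-bit mask} writes for the bitmap file."""
    words = {}
    for pfn in pfns:
        offset, bit = bitmap_position(pfn)
        words[offset] = words.get(offset, 0) | (1 << bit)
    return words
```

Each resulting mask would be written as one 8-byte value at its offset; because written values are OR-ed into the bitmap, bits left at zero do not disturb other pages.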
The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to the bitmap file.
If page_referenced() is called on a Young page, it will add 1 to its
return value, therefore concealing the fact that the Access bit was
cleared.
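The interplay of the Idle and Young flags can be modelled with a toy state machine. This is a deliberately simplified sketch, not the kernel code: the Page class and the mark_idle(), touch() and page_referenced() functions are hypothetical stand-ins for the behaviour described above.

```python
class Page:
    """Toy page with one PTE Access bit and the two new flags."""

    def __init__(self):
        self.pte_accessed = False
        self.idle = False
        self.young = False


def touch(page):
    """The workload accesses the page through its page table entry."""
    page.pte_accessed = True


def mark_idle(page):
    """Userspace sets the page's bit in the bitmap file."""
    if page.pte_accessed:
        page.pte_accessed = False
        page.young = True  # remember the reference for the reclaimer
    page.idle = True


def page_referenced(page):
    """Reclaimer's view: count references, including a concealed one."""
    refs = 0
    if page.pte_accessed:
        page.pte_accessed = False
        page.idle = False  # a real access ends the idle state
        refs += 1
    if page.young:
        page.young = False
        refs += 1  # hides the fact that the Access bit was cleared
    return refs
```

In this model the reclaimer sees the same reference count it would have seen had the bitmap file never been written.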
Note, since there is no room for extra page flags on 32 bit, this feature
uses extended page flags when compiled on 32 bit.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: kpageidle requires an MMU]
[akpm@linux-foundation.org: decouple from page-flags rework] Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the scope of the idle memory tracking feature, which is introduced by
the following patch, we need to clear the referenced/accessed bit not only
in primary, but also in secondary ptes. The latter is required in order
to estimate wss of KVM VMs. At the same time we want to avoid flushing
tlb, because it is quite expensive and it won't really affect the final
result.
Currently, there is no function for clearing pte young bit that would meet
our requirements, so this patch introduces one. To achieve that we have
to add a new mmu-notifier callback, clear_young, since there is no method
for testing-and-clearing a secondary pte w/o flushing tlb. The new method
is not mandatory and currently only implemented by KVM.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/kpagecgroup contains a 64-bit inode number of the memory cgroup each
page is charged to, indexed by PFN. Having this information is useful for
estimating a cgroup working set size.
The file is present if CONFIG_PROC_PAGE_MONITOR && CONFIG_MEMCG.
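A minimal sketch of consuming this file, assuming one native-endian 64-bit inode number per PFN as described above. The function name and the first_pfn parameter are hypothetical; the buffer is whatever was read from /proc/kpagecgroup starting at byte offset first_pfn * 8.

```python
import struct


def pfns_of_cgroup(buf, first_pfn, ino):
    """Return the PFNs in buf that are charged to the cgroup with inode ino.

    buf holds consecutive 8-byte entries; entry i describes PFN first_pfn + i.
    """
    count = len(buf) // 8
    inos = struct.unpack("=%dQ" % count, buf[:count * 8])
    return [first_pfn + i for i, value in enumerate(inos) if value == ino]
```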
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is only used in mem_cgroup_try_charge, so fold it in and zap it.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hwpoison: use page_cgroup_ino for filtering by memcg
Hwpoison allows filtering pages by memory cgroup ino. Currently, it
calls try_get_mem_cgroup_from_page to obtain the cgroup from a page and
then gets its ino using cgroup_ino, but now we have a helper method for
that, page_cgroup_ino, so use it instead.
This patch also loosens the hwpoison memcg filter dependency rules: it
makes the filter depend on CONFIG_MEMCG instead of CONFIG_MEMCG_SWAP,
because the hwpoison memcg filter does not require anything from the
CONFIG_MEMCG_SWAP side (nor did it ever).
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset introduces a new user API for tracking user memory pages
that have not been used for a given period of time. The purpose of this
is to provide the userspace with the means of tracking a workload's
working set, i.e. the set of pages that are actively used by the
workload. Knowing the working set size can be useful for partitioning the
system more efficiently, e.g. by tuning memory cgroup limits
appropriately, or for job placement within a compute cluster.
==== USE CASES ====
The unified cgroup hierarchy has memory.low and memory.high knobs, which
are defined as the low and high boundaries for the workload working set
size. However, the working set size of a workload may be unknown or
change in time. With this patch set, one can periodically estimate the
amount of memory unused by each cgroup and tune their memory.low and
memory.high parameters accordingly, therefore optimizing the overall
memory utilization.
Another use case is balancing workloads within a compute cluster. Knowing
how much memory is not really used by a workload unit may help take a more
optimal decision when considering migrating the unit to another node
within the cluster.
Also, as noted by Minchan, this would be useful for per-process reclaim
(https://lwn.net/Articles/545668/). With idle tracking, a smart userspace
memory manager could reclaim only idle pages.
==== USER API ====
The user API consists of two new files:
* /sys/kernel/mm/page_idle/bitmap. This file implements a bitmap where each
bit corresponds to a page, indexed by PFN. When the bit is set, the
corresponding page is idle. A page is considered idle if it has not been
accessed since it was marked idle. To mark a page idle one should set the
bit corresponding to the page by writing to the file. A value written to the
file is OR-ed with the current bitmap value. Only user memory pages can be
marked idle, for other page types input is silently ignored. Writing to this
file beyond max PFN results in the ENXIO error. Only available when
CONFIG_IDLE_PAGE_TRACKING is set.
This file can be used to estimate the amount of pages that are not
used by a particular workload as follows:
1. mark all pages of interest idle by setting corresponding bits in the
/sys/kernel/mm/page_idle/bitmap
2. wait until the workload accesses its working set
3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set
* /proc/kpagecgroup. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
CONFIG_MEMCG is set.
This file can be used to find all pages (including unmapped file pages)
accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one
can then estimate the cgroup working set size.
For an example of using these files for estimating the amount of unused
memory pages per memory cgroup, please see the script attached
below.
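Independently of that script, the way the two files combine can be sketched on in-memory data. This is not the author's script: the function name, the fixed 4096-byte page size, and the assumption that both inputs start at PFN 0 are mine. cgroup_inos is a decoded list of /proc/kpagecgroup entries, and idle_words is a decoded list of 64-bit words from the bitmap file.

```python
from collections import defaultdict


def idle_bytes_per_cgroup(cgroup_inos, idle_words, page_size=4096):
    """Sum the size of idle pages per cgroup inode number.

    cgroup_inos[pfn] is the inode the page is charged to; bit pfn % 64 of
    idle_words[pfn // 64] says whether the page is still idle.
    """
    idle = defaultdict(int)
    for pfn, ino in enumerate(cgroup_inos):
        word = idle_words[pfn // 64]
        if (word >> (pfn % 64)) & 1:
            idle[ino] += page_size
    return dict(idle)
```

A real tool would first mark the pages idle, wait for the workload to touch its working set, and only then read the bitmap.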
==== REASONING ====
The reason to introduce the new user API instead of using
/proc/PID/{clear_refs,smaps} is that the latter has two serious
drawbacks:
- it does not count unmapped file pages
- it affects the reclaimer logic
The new API attempts to overcome them both. For more details on how it
is achieved, please see the comment to patch 6.
==== PATCHSET STRUCTURE ====
The patch set is organized as follows:
- patch 1 adds page_cgroup_ino() helper for the sake of
/proc/kpagecgroup and patches 2-3 do related cleanup
- patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is
charged to
- patch 5 introduces a new mmu notifier callback, clear_young, which is
a lightweight version of clear_flush_young; it is used in patch 6
- patch 6 implements the idle page tracking feature, including the
userspace API, /sys/kernel/mm/page_idle/bitmap
- patch 7 exports idle flag via /proc/kpageflags
==== SIMILAR WORKS ====
Originally, the patch for tracking idle memory was proposed back in 2011
by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main
difference between Michel's patch and this one is that Michel implemented
a kernel space daemon for estimating idle memory size per cgroup while
this patch only provides the userspace with the minimal API for doing the
job, leaving the rest up to the userspace. However, they both share the
same idea of Idle/Young page flags to avoid affecting the reclaimer logic.
==== PERFORMANCE EVALUATION ====
SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
performance impact introduced by this patch set. Three runs were carried
out:
- base: kernel without the patch
- patched: patched kernel, the feature is not used
- patched-active: patched kernel, 1 minute-period daemon is used for
tracking idle memory
For tracking idle memory, idlememstat utility was used:
https://github.com/locker/idlememstat
for dir, subdirs, files in os.walk(CGROUP_MOUNT):
    ino = os.stat(dir)[stat.ST_INO]
    print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
==== END SCRIPT ====
This patch (of 8):
Add page_cgroup_ino() helper to memcg.
This function returns the inode number of the closest online ancestor of
the memory cgroup a page is charged to. It is required for exporting
information about which page is charged to which cgroup to userspace,
which will be introduced by a following patch.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Streetman [Wed, 9 Sep 2015 22:35:21 +0000 (15:35 -0700)]
zswap: change zpool/compressor at runtime
Update the zpool and compressor parameters to be changeable at runtime.
When changed, a new pool is created with the requested zpool/compressor,
and added as the current pool at the front of the pool list. Previous
pools remain in the list only so that existing compressed pages can be
removed from them.
The old pool(s) are removed once they become empty.
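The pool list behaviour described above can be modelled with a short toy. This is a hypothetical sketch of the bookkeeping only (the Pool and PoolList names are made up, and no compression happens): new stores go to the pool at the front, and older pools are dropped once their page count reaches zero.

```python
class Pool:
    """Toy pool identified by its zpool type and compressor name."""

    def __init__(self, zpool, compressor):
        self.zpool = zpool
        self.compressor = compressor
        self.pages = 0


class PoolList:
    def __init__(self, zpool, compressor):
        self.pools = [Pool(zpool, compressor)]

    @property
    def current(self):
        return self.pools[0]

    def change(self, zpool, compressor):
        """Runtime parameter change: new pool becomes current at the front."""
        self.pools.insert(0, Pool(zpool, compressor))
        self._reap()

    def store(self):
        """Store one compressed page in the current pool."""
        self.current.pages += 1
        return self.current

    def free(self, pool):
        """Remove one compressed page from whichever pool holds it."""
        pool.pages -= 1
        self._reap()

    def _reap(self):
        # Keep the current pool always; drop older pools once they are empty.
        self.pools = [p for i, p in enumerate(self.pools)
                      if i == 0 or p.pages > 0]
```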
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Seth Jennings <sjennings@variantweb.net> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Streetman [Wed, 9 Sep 2015 22:35:16 +0000 (15:35 -0700)]
zpool: add zpool_has_pool()
This series makes creation of the zpool and compressor dynamic, so that
they can be changed at runtime. This makes using/configuring zswap
easier, as before this zswap had to be configured at boot time, using boot
params.
This uses a single list to track both the zpool and compressor together,
although Seth had mentioned an alternative which is to track the zpools
and compressors using separate lists. In the most common case, only a
single zpool and single compressor, using one list is slightly simpler
than using two lists, and for the uncommon case of multiple zpools and/or
compressors, using one list is slightly less simple (and uses slightly
more memory, probably) than using two lists.
This patch (of 4):
Add zpool_has_pool() function, indicating if the specified type of zpool
is available (i.e. zsmalloc or zbud). This allows checking if a pool is
available, without actually trying to allocate it, similar to
crypto_has_alg().
This is used by a following patch to zswap that enables the dynamic
runtime creation of zswap zpools.
Signed-off-by: Dan Streetman <ddstreet@ieee.org> Acked-by: Seth Jennings <sjennings@variantweb.net> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull infiniband/rdma updates from Doug Ledford:
"This is a fairly sizeable set of changes. I've put them through a
decent amount of testing prior to sending the pull request due to
that.
There are still a few fixups that I know are coming, but I wanted to
go ahead and get the big, sizable chunk into your hands sooner rather
than waiting for those last few fixups.
Of note is the fact that this creates what is intended to be a
temporary area in the drivers/staging tree specifically for some
cleanups and additions that are coming for the RDMA stack. We
deprecated two drivers (ipath and amso1100) and are waiting to hear
back if we can deprecate another one (ehca). We also put Intel's new
hfi1 driver into this area because it needs to be refactored and a
transfer library created out of the factored out code, and then it and
the qib driver and the soft-roce driver should all be modified to use
that library.
I expect drivers/staging/rdma to be around for three or four kernel
releases and then to go away as all of the work is completed and final
deletions of deprecated drivers are done.
Summary of changes for 4.3:
- Create drivers/staging/rdma
- Move amso1100 driver to staging/rdma and schedule for deletion
- Move ipath driver to staging/rdma and schedule for deletion
- Add hfi1 driver to staging/rdma and set TODO for move to regular
tree
- Initial support for namespaces to be used on RDMA devices
- Add RoCE GID table handling to the RDMA core caching code
- Infrastructure to support handling of devices with differing read
and write scatter gather capabilities
- Various iSER updates
- Kill off unsafe usage of global mr registrations
- Update SRP driver
- Misc mlx4 driver updates
- Support for the mr_alloc verb
- Support for a netlink interface between kernel and user space cache
daemon to speed path record queries and route resolution
- Initial support for safe hot removal of verbs devices"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (136 commits)
IB/ipoib: Suppress warning for send only join failures
IB/ipoib: Clean up send-only multicast joins
IB/srp: Fix possible protection fault
IB/core: Move SM class defines from ib_mad.h to ib_smi.h
IB/core: Remove unnecessary defines from ib_mad.h
IB/hfi1: Add PSM2 user space header to header_install
IB/hfi1: Add CSRs for CONFIG_SDMA_VERBOSITY
mlx5: Fix incorrect wc pkey_index assignment for GSI messages
IB/mlx5: avoid destroying a NULL mr in reg_user_mr error flow
IB/uverbs: reject invalid or unknown opcodes
IB/cxgb4: Fix if statement in pick_local_ip6adddrs
IB/sa: Fix rdma netlink message flags
IB/ucma: HW Device hot-removal support
IB/mlx4_ib: Disassociate support
IB/uverbs: Enable device removal when there are active user space applications
IB/uverbs: Explicitly pass ib_dev to uverbs commands
IB/uverbs: Fix race between ib_uverbs_open and remove_one
IB/uverbs: Fix reference counting usage of event files
IB/core: Make ib_dealloc_pd return void
IB/srp: Create an insecure all physical rkey only if needed
...
Merge tag 'for-linus-4.3' of git://git.code.sf.net/p/openipmi/linux-ipmi
Pull IPMI updates from Corey Minyard:
"Most of these have been sitting in linux-next for more than a release,
particularly commit 0fbcf4af7c83 ("ipmi: Convert the IPMI SI ACPI
handling to a platform device") which is probably the most complex
patch.
That is also the one that changes drivers/acpi/acpi_pnp.c. The change
in that file is only removing IPMI from a "special platform devices"
list, since I convert it to the standard PNP interface. I posted this
one to the ACPI list twice and got no response, and it seems to work
well in my testing, so I'm hoping it's good.
Hidehiro Kawai posted a set of changes that improves the panic time
handling in the IPMI driver.
The rest of the changes are minor bug fixes or cleanups and some
documentation"
* tag 'for-linus-4.3' of git://git.code.sf.net/p/openipmi/linux-ipmi:
ipmi:ssif: Add a module parm to specify that SMBus alerts don't work
ipmi: add of_device_id in MODULE_DEVICE_TABLE
ipmi: Compensate for BMCs that wont set the irq enable bit
ipmi: Don't call receive handler in the panic context
ipmi: Avoid touching possible corrupted lists in the panic context
ipmi: Don't flush messages in sender() in run-to-completion mode
ipmi: Factor out message flushing procedure
ipmi: Remove unneeded set_run_to_completion call
ipmi: Make some data const that was only read
ipmi: constify SSIF ACPI device ids
ipmi: Delete an unnecessary check before the function call "cleanup_one_si"
char:ipmi - Change 1 to true for bool type variables during initialization.
impi:Remove unneeded setting of module owner to THIS_MODULE in the platform structure, powernv_ipmi_driver
ipmi: Add a comment in how messages are delivered from the lower layer
ipmi/powernv: Fix potential invalid pointer dereference
ipmi: Convert the IPMI SI ACPI handling to a platform device
ipmi: Add device tree bindings information
Merge second patch-bomb from Andrew Morton:
"Almost all of the rest of MM. There was an unusually large amount of
MM material this time"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (141 commits)
zpool: remove no-op module init/exit
mm: zbud: constify the zbud_ops
mm: zpool: constify the zpool_ops
mm: swap: zswap: maybe_preload & refactoring
zram: unify error reporting
zsmalloc: remove null check from destroy_handle_cache()
zsmalloc: do not take class lock in zs_shrinker_count()
zsmalloc: use class->pages_per_zspage
zsmalloc: consider ZS_ALMOST_FULL as migrate source
zsmalloc: partial page ordering within a fullness_list
zsmalloc: use shrinker to trigger auto-compaction
zsmalloc: account the number of compacted pages
zsmalloc/zram: introduce zs_pool_stats api
zsmalloc: cosmetic compaction code adjustments
zsmalloc: introduce zs_can_compact() function
zsmalloc: always keep per-class stats
zsmalloc: drop unused variable `nr_to_migrate'
mm/memblock.c: fix comment in __next_mem_range()
mm/page_alloc.c: fix type information of memoryless node
memory-hotplug: fix comments in zone_spanned_pages_in_node() and zone_spanned_pages_in_node()
...
Merge branch 'parisc-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:
"The most important changes in this patchset are:
- re-enable 64bit PCI bus addresses which were temporarily disabled
for PA-RISC in kernel 4.2
- fix the 64bit CAS operation in the LWS path which now enables us to
enable the 64bit gcc atomic builtins even on 32bit userspace with
64bit kernel
- fix a long-standing bug which sometimes crashed the kernel at bootup
when the serial interrupt wasn't registered yet"
* 'parisc-4.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Use platform_device_register_simple("rtc-generic")
parisc: Drop CONFIG_SMP around update_cr16_clocksource()
parisc: Use double word condition in 64bit CAS operation
parisc: Filter out spurious interrupts in PA-RISC irq handler
parisc: Additionally check for in_atomic() in page fault handler
PCI,parisc: Enable 64-bit bus addresses on PA-RISC
parisc: Define ioremap_uc and ioremap_wc
Merge tag 'linux-kselftest-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest update from Shuah Khan:
"This update adds new zram test and fixes to problems found during
testing this new zram test. In addition, there are a few bug fixes
and kselftest improvement patches from Linaro developers.
I will send another update later on this week to fix kselftest
breakage due to commit 2bf9e0ab08c6 ("locking/static_keys: Provide a
selftest") after the fix soaks in next for a couple of days"
* tag 'linux-kselftest-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/zram: Makefile fix
selftests/zram: must be run as root
selftests: breakpoints: fix installing error on the architecture except x86
selftests: check before install
selftests/zram: Adding zram tests
Merge tag 'iommu-updates-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu updates from Joerg Roedel:
"This time the IOMMU updates are mostly cleanups or fixes. No big new
features or drivers this time. In particular the changes include:
- Bigger cleanup of the Domain<->IOMMU data structures and the code
that manages them in the Intel VT-d driver. This makes the code
easier to understand and maintain, and also easier to keep the data
structures in sync. It is also a preparation step to make use of
default domains from the IOMMU core in the Intel VT-d driver.
- Fixes for a couple of DMA-API misuses in ARM IOMMU drivers, namely
in the ARM and Tegra SMMU drivers.
- Fix for a potential buffer overflow in the OMAP iommu driver's
debug code
- A couple of smaller fixes and cleanups in various drivers
- One small new feature: Report domain-id usage in the Intel VT-d
driver to more easily detect bugs where these are leaked"
* tag 'iommu-updates-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (83 commits)
iommu/vt-d: Really use upper context table when necessary
x86/vt-d: Fix documentation of DRHD
iommu/fsl: Really fix init section(s) content
iommu/io-pgtable-arm: Unmap and free table when overwriting with block
iommu/io-pgtable-arm: Move init-fn declarations to io-pgtable.h
iommu/msm: Use BUG_ON instead of if () BUG()
iommu/vt-d: Access iomem correctly
iommu/vt-d: Make two functions static
iommu/vt-d: Use BUG_ON instead of if () BUG()
iommu/vt-d: Return false instead of 0 in irq_remapping_cap()
iommu/amd: Use BUG_ON instead of if () BUG()
iommu/amd: Make a symbol static
iommu/amd: Simplify allocation in irq_remapping_alloc()
iommu/tegra-smmu: Parameterize number of TLB lines
iommu/tegra-smmu: Factor out tegra_smmu_set_pde()
iommu/tegra-smmu: Extract tegra_smmu_pte_get_use()
iommu/tegra-smmu: Use __GFP_ZERO to allocate zeroed pages
iommu/tegra-smmu: Remove PageReserved manipulation
iommu/tegra-smmu: Convert to use DMA API
iommu/tegra-smmu: smmu_flush_ptc() wants device addresses
...
Merge tag 'regmap-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap updates from Mark Brown:
"This has been a busy release for regmap.
By far the biggest set of changes here are those from Markus Pargmann
which implement support for block transfers in smbus devices. This
required quite a bit of refactoring but leaves us better able to
handle odd restrictions that controllers may have and with better
performance on smbus.
Other new features include:
- Fix interactions with lockdep for nested regmaps (eg, when a device
using regmap is connected to a bus where the bus controller has a
separate regmap). Lockdep's default class identification is too
crude to work without help.
- Support for must write bitfield operations, useful for operations
which require writing a bit to trigger them from Kuniori Morimoto.
- Support for delaying during register patch application from Nariman
Poushin.
- Support for overriding cache state via the debugfs implementation
from Richard Fitzgerald"
* tag 'regmap-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: (25 commits)
regmap: fix a NULL pointer dereference in __regmap_init
regmap: Support bulk reads for devices without raw formatting
regmap-i2c: Add smbus i2c block support
regmap: Add raw_write/read checks for max_raw_write/read sizes
regmap: regmap max_raw_read/write getter functions
regmap: Introduce max_raw_read/write for regmap_bulk_read/write
regmap: Add missing comments about struct regmap_bus
regmap: No multi_write support if bus->write does not exist
regmap: Split use_single_rw internally into use_single_read/write
regmap: Fix regmap_bulk_write for bus writes
regmap: regmap_raw_read return error on !bus->read
regulator: core: Print at debug level on debugfs creation failure
regmap: Fix regmap_can_raw_write check
regmap: fix typos in regmap.c
regmap: Fix integertypes for register address and value
regmap: Move documentation to regmap.h
regmap: Use different lockdep class for each regmap init call
thermal: sti: Add parentheses around bridge->ops->regmap_init call
mfd: vexpress: Add parentheses around bridge->ops->regmap_init call
regmap: debugfs: Fix misuse of IS_ENABLED
...
Merge tag 'fbdev-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux
Pull fbdev updates from Tomi Valkeinen:
"Minor fixes and cleanups"
* tag 'fbdev-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux:
video: fbdev: atmel_lcdfb: remove useless include
video: fbdev: pxa168fb: Use devm_clk_get
fbdev: ssd1307fb: fix error return code
fbdev: fix snprintf() limit in show_bl_curve()
video: fbdev: s3c-fb: Constify platform_device_id
video: fbdev: atmel: fix warning for const return value
video: fbdev: Drop owner assignment from platform_driver
video: fbdev: Drop owner assignment from i2c_driver
fbdev: remove unnecessary memset in vfb
framebuffer: disable vgacon on microblaze arch
fbdev: udlfb: remove unneeded initialization in few places
fbdev: Allow compile test of GPIO consumers if !GPIOLIB
fbdev: fix cea_modes array size
Merge tag 'mmc-v4.3' of git://git.linaro.org/people/ulf.hansson/mmc
Pull MMC updates from Ulf Hansson:
"MMC core:
- Fix a race condition in the request handling
- Skip trim commands for some buggy kingston eMMCs
- An optimization and a correction for erase groups
- Set CMD23 quirk for some Sandisk cards
MMC host:
- sdhci: Give GPIO CD higher precedence and don't poll when it's used
- sdhci: Fix DMA memory leakage
- sdhci: Some updates for clock management
- sdhci-of-at91: introduce driver for the Atmel SDMMC
- sdhci-of-arasan: Add support for sdhci-5.1
- sdhci-esdhc-imx: Add support for imx7d which also supports HS400
- sdhci: A collection of fixes and improvements for various sdhci hosts
- omap_hsmmc: Modernization of the regulator code
- dw_mmc: A couple of fixes for DMA and PIO mode
- usdhi6rol0: A few fixes and support probe deferral for regulators
- pxamci: Convert to use dmaengine
- sh_mmcif: Fix the suspend process in a short term solution
- tmio: Adjust timeout for commands
- sunxi: Fix timeout while gating/ungating clock"
* tag 'mmc-v4.3' of git://git.linaro.org/people/ulf.hansson/mmc: (67 commits)
mmc: android-goldfish: remove incorrect __iomem annotation
mmc: core: fix race condition in mmc_wait_data_done
mmc: host: omap_hsmmc: remove CONFIG_REGULATOR check
mmc: host: omap_hsmmc: use ios->vdd for setting vmmc voltage
mmc: host: omap_hsmmc: use regulator_is_enabled to find pbias status
mmc: host: omap_hsmmc: enable/disable vmmc_aux regulator based on previous state
mmc: host: omap_hsmmc: don't use ->set_power to set initial regulator state
mmc: host: omap_hsmmc: avoid pbias regulator enable on power off
mmc: host: omap_hsmmc: add separate function to set pbias
mmc: host: omap_hsmmc: add separate functions for enable/disable supply
mmc: host: omap_hsmmc: return error if any of the regulator APIs fail
mmc: host: omap_hsmmc: remove unnecessary pbias set_voltage
mmc: host: omap_hsmmc: use mmc_host's vmmc and vqmmc
mmc: host: omap_hsmmc: use the ocrmask provided by the vmmc regulator
mmc: host: omap_hsmmc: cleanup omap_hsmmc_reg_get()
mmc: host: omap_hsmmc: return on fatal errors from omap_hsmmc_reg_get
mmc: host: omap_hsmmc: use devm_regulator_get_optional() for vmmc
mmc: sdhci-of-at91: fix platform_no_drv_owner.cocci warnings
mmc: sh_mmcif: Fix suspend process
mmc: usdhi6rol0: fix error return code
...
Merge tag 'platform-drivers-x86-v4.3-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
Pull x86 platform driver updates from Darren Hart:
"Significant work on toshiba_acpi, including new hardware support,
refactoring, and cleanups. Extend device support for asus, ideapad,
and acer systems. New surface pro 3 buttons driver. Misc minor
cleanups for thinkpad and hp-wireless.
acer-wmi:
- No rfkill on HP Omen 15 wifi
thinkpad_acpi:
- Remove side effects from vdbg_printk -> no_printk macro
surface pro 3:
- Add support driver for Surface Pro 3 buttons
hp-wireless:
- remove unneeded goto/label in hpwl_init
ideapad-laptop:
- add alternative representation for Yoga 2 to DMI table
- Add Lenovo Yoga 3 14 to no_hw_rfkill dmi list
asus-laptop:
- Add key found on Asus F3M
MAINTAINERS:
- Remove Toshiba Linux mailing list address
toshiba_acpi:
- Bump driver version to 0.23
- Remove unnecessary checks and returns in HCI/SCI functions
- Refactor *{get, set} functions return value
- Remove "*not supported" feature prints
- Change *available functions return type
- Add set_fan_status function
- Change some variables to avoid warnings from ninja-check
- Reorder toshiba_acpi_alt_keymap entries
- Remove unused wireless defines
- Transflective backlight updates
- Avoid registering input device on WMI event laptops
- Add /dev/toshiba_acpi device
- Adapt /proc/acpi/toshiba/keys to TOS1900 devices"
* tag 'platform-drivers-x86-v4.3-1' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86: (21 commits)
acer-wmi: No rfkill on HP Omen 15 wifi
thinkpad_acpi: Remove side effects from vdbg_printk -> no_printk macro
surface pro 3: Add support driver for Surface Pro 3 buttons
hp-wireless: remove unneeded goto/label in hpwl_init
ideapad-laptop: add alternative representation for Yoga 2 to DMI table
asus-laptop: Add key found on Asus F3M
MAINTAINERS: Remove Toshiba Linux mailing list address
ideapad-laptop: Add Lenovo Yoga 3 14 to no_hw_rfkill dmi list
toshiba_acpi: Bump driver version to 0.23
toshiba_acpi: Remove unnecessary checks and returns in HCI/SCI functions
toshiba_acpi: Refactor *{get, set} functions return value
toshiba_acpi: Remove "*not supported" feature prints
toshiba_acpi: Change *available functions return type
toshiba_acpi: Add set_fan_status function
toshiba_acpi: Change some variables to avoid warnings from ninja-check
toshiba_acpi: Reorder toshiba_acpi_alt_keymap entries
toshiba_acpi: Remove unused wireless defines
toshiba_acpi: Transflective backlight updates
toshiba_acpi: Avoid registering input device on WMI event laptops
toshiba_acpi: Add /dev/toshiba_acpi device
...
Merge branch 'i2c/for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c updates from Wolfram Sang:
"Features:
- new drivers: Renesas EMEV2, register based MUX, NXP LPC2xxx
- core: scans DT and assigns wakeup interrupts. no driver changes needed.
- core: some refcounting issues fixed and better API for that
- core: new helper function for best effort block read emulation
- slave framework: proper DT bindings and userspace instantiation
- some bigger work for xiic, pxa, omap drivers
.. and quite a number of smaller driver fixes, cleanups, improvements"
* 'i2c/for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (65 commits)
i2c: mux: reg Change ioread endianness for readback
i2c: mux: reg: fix compilation warnings
i2c: mux: reg: simplify register size checking
i2c: muxes: fix leaked i2c adapter device node references
i2c: allow specifying separate wakeup interrupt in device tree
of/irq: export of_get_irq_byname()
i2c: xgene-slimpro: dma_mapping_error() doesn't return an error code
i2c: Replace I2C_CROS_EC_TUNNEL dependency
eeprom: at24: use i2c_smbus_read_i2c_block_data_or_emulated
i2c: core: Add support for best effort block read emulation
i2c: lpc2k: add driver
i2c: mux: Add register-based mux i2c-mux-reg
i2c: dt: describe generic bindings
i2c: slave: print warning if slave flag not set
i2c: support 10 bit and slave addresses in sysfs 'new_device'
i2c: take address space into account when checking for used addresses
i2c: apply DT flags when probing
i2c: make address check indpendent from client struct
i2c: rename address check functions
i2c: apply address offset for slaves, too
...
Merge tag 'rtc-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"Core:
- use is_visible() to control sysfs attributes
- switch wakealarm attribute to DEVICE_ATTR_RW
- make rtc_does_wakealarm() return boolean
- properly manage lifetime of dev and cdev in rtc device
- remove unnecessary device_get() in rtc_device_unregister
- fix double free in rtc_register_device() error path
Subsystem wide cleanups:
- fix drivers that consider 0 as a valid IRQ in client->irq
- Drop (un)likely before IS_ERR(_OR_NULL)
- drop the remaining owner assignment for i2c_driver and
platform_driver
- module autoload fixes
Drivers:
- 88pm80x: add device tree support
- abx80x: fix RTC write bit
- ab8500: Add a sentinel to ab85xx_rtc_ids[]
- armada38x: Align RTC set time procedure with the official errata
- as3722: correct month value
- at91sam9: cleanups
- at91rm9200: get and use slow clock and cleanups
- bq32k: remove redundant check
- cmos: century support, proper fix for the spurious wakeup
- ds1307: cleanups and wakeup irq support
- ds1374: Remove unused variable
- ds1685: Use module_platform_driver
- ds3232: fix WARNING trace in resume function
- gemini: fix ptr_ret.cocci warnings
- mt6397: implement suspend/resume
- omap: support internal and external clock enabling
- opal: Enable alarms only when opal supports tpo
- pcf2127: use OFS flag to detect unreliable date and warn the user
- pl031: fix typo for author email
- rx8025: huge cleanup and fixes
- sa1100/pxa: share common code
- s5m: fix to update ctrl register
- s3c: fix clocks and wakeup, cleanup
- sirfsoc: use regmap
- nvram_read()/nvram_write() functions for cmos, ds1305, ds1307,
ds1343, ds1511, ds1553, ds1742, m48t59, rp5c01, stk17ta8, tx4939
- use rtc_valid_tm() error code when reading date/time instead of 0
for isl12022, pcf2123, pcf2127"
* tag 'rtc-v4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (90 commits)
rtc: abx80x: fix RTC write bit
rtc: ab8500: Add a sentinel to ab85xx_rtc_ids[]
rtc: ds1374: Remove unused variable
rtc: Fix module autoload for OF platform drivers
rtc: Fix module autoload for rtc-{ab8500,max8997,s5m} drivers
rtc: omap: Add external clock enabling support
rtc: omap: Add internal clock enabling support
ARM: dts: AM437x: Add the internal and external clock nodes for rtc
rtc: s5m: fix to update ctrl register
rtc: add xilinx zynqmp rtc driver
devicetree: bindings: rtc: add bindings for xilinx zynqmp rtc
rtc: as3722: correct month value
ARM: config: Switch PXA27x platforms to use PXA RTC driver
ARM: mmp: remove unused RTC register definitions
ARM: sa1100: remove unused RTC register definitions
rtc: sa1100/pxa: convert to run-time register mapping
ARM: pxa: add memory resource to SA1100 RTC device
rtc: pxa: convert to use shared sa1100 functions
rtc: sa1100: prepare to share sa1100_rtc_ops
rtc: ds3232: fix WARNING trace in resume function
...
zswap_get_swap_cache_page and read_swap_cache_async have pretty much the
same code; the only significant differences are the return value and the
use of swap_readpage.
Add a helper __read_swap_cache_async() with the common code. Behavior
change: zswap_get_swap_cache_page will now use radix_tree_maybe_preload
instead of radix_tree_preload. It looks like this wasn't changed earlier
only because of the code duplication.
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Seth Jennings <sjennings@variantweb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make zram syslog error reporting more consistent. We have random
error levels in some places. For example, critical errors like
"Error allocating memory for compressed page"
and
"Unable to allocate temp memory"
are reported as KERN_INFO messages.
a) Reassign error levels
Error messages that directly affect zram
functionality -- pr_err():
Error allocating zram address table
Error creating memory pool
Decompression failed! err=%d, page=%u
Unable to allocate temp memory
Compression failed! err=%d
Error allocating memory for compressed page: %u, size=%zu
Cannot initialise %s compressing backend
Error allocating disk queue for device %d
Error allocating disk structure for device %d
Error creating sysfs group for device %d
Unable to register zram-control class
Unable to get major number
Messages that do not affect functionality, but user
must be warned (because sysfs attrs will be removed in
this particular case) -- pr_warn():
%d (%s) Attribute %s (and others) will be removed. %s
Messages that do not affect functionality and mostly are
informative -- pr_info():
Cannot change max compression streams
Can't change algorithm for initialized device
Cannot change disksize for initialized device
Added device: %s
Removed device: %s
b) Update sysfs_create_group() error message
First, it lacks a trailing newline; add it. Second, every error message
in zram_add() has a "for device %d" part, which makes errors more
informative. Add missing part to "Error creating sysfs group" message.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zsmalloc: do not take class lock in zs_shrinker_count()
We can avoid taking class ->lock around zs_can_compact() in
zs_shrinker_count(), because the number that we return is, by design,
generally outdated anyway. We have different sources that are able to
change class's state right after we return from zs_can_compact() --
ongoing I/O operations, manually triggered compaction, or two of them
happening simultaneously.
We re-do these calculations during compaction on a per-class basis
anyway.
zs_unregister_shrinker() will not return while a shrinker is still
active, so classes won't unexpectedly disappear while
zs_shrinker_count() iterates over them.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zsmalloc: partial page ordering within a fullness_list
We want to see more ZS_FULL pages and fewer ZS_ALMOST_{FULL, EMPTY}
pages. Put a page with a higher ->inuse count first within its
->fullness_list, which gives us a better chance of filling this page
with new objects (find_get_zspage() returns the ->fullness_list head for
new object allocation), so some zspages will become
ZS_ALMOST_FULL/ZS_FULL quicker.
It performs a trivial and cheap ->inuse compare which does not slow down
zsmalloc and in the worst case keeps the list pages in no particular
order.
A more expensive solution could sort fullness_list by ->inuse count.
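The head-placement rule described above can be sketched in plain user-space C. This is only an illustration under assumptions: `struct zspage_sketch` and `fullness_list_insert()` are hypothetical names, and a hand-rolled circular list stands in for whatever list helpers zsmalloc actually uses.

```c
#include <stddef.h>

/* Hypothetical stand-in for a zspage on a circular fullness list. */
struct zspage_sketch {
    int inuse;                         /* number of objects in use */
    struct zspage_sketch *prev, *next; /* circular doubly linked list */
};

/*
 * Link the page at the tail of the list, then promote it to head if its
 * ->inuse count is at least as high as the current head's. This is one
 * cheap compare, not a full sort, so only the head is guaranteed to be
 * a "good" candidate for new allocations.
 */
static void fullness_list_insert(struct zspage_sketch **head,
                                 struct zspage_sketch *page)
{
    struct zspage_sketch *h = *head;

    if (h == NULL) {
        /* Empty list: the page becomes the only element and the head. */
        page->prev = page->next = page;
        *head = page;
        return;
    }

    /* Insert just before the head, i.e. at the tail of the circle. */
    page->next = h;
    page->prev = h->prev;
    h->prev->next = page;
    h->prev = page;

    if (page->inuse >= h->inuse)
        *head = page;
}
```

Inserting pages with ->inuse counts of 3, 5 and 2 in that order leaves the page with 5 objects at the head, while the other two stay in insertion order behind it.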
[minchan@kernel.org: code adjustments] Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Perform automatic pool compaction by a shrinker when the system is
getting tight on memory.
User space has very little knowledge of zsmalloc fragmentation and
basically has no mechanism to tell whether compaction will result in
any memory gain. Another issue is that user space is not always aware
that the system is getting tight on memory. This leads to very
uncomfortable scenarios in which user space may start issuing compaction
'randomly' or from crontab (for example). Fragmentation is not
necessarily bad: allocated but unused objects may, after all, be filled
with data later, without the need to allocate a new zspage. On the
other hand, we obviously don't want to waste memory when the system
needs it.
Compaction now has a relatively quick pool scan so we are able to
estimate the number of pages that will be freed easily, which makes it
possible to call this function from a shrinker->count_objects()
callback. We also abort compaction as soon as we detect that we can't
free any more pages, preventing wasteful object migrations.
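A rough user-space model of the count/scan split described above might look like this. Everything here is an assumption made for illustration: `struct class_sketch`, `class_can_free()`, `pool_count_freeable()` and `pool_compact()` are hypothetical names, and "compaction" is reduced to decrementing a counter. The point is only that the count side is a cheap sum of per-class estimates, and that the scan side stops working on a class once its estimate reaches zero.

```c
#include <stddef.h>

/* Hypothetical per-class counters; objs_per_zspage must be non-zero. */
struct class_sketch {
    unsigned long obj_allocated;    /* object slots allocated */
    unsigned long obj_used;         /* slots holding data */
    unsigned long objs_per_zspage;  /* max objects per zspage */
    unsigned long pages_per_zspage; /* pages backing one zspage */
};

/* Pages that compacting this class could release right now. */
static unsigned long class_can_free(const struct class_sketch *c)
{
    if (c->obj_allocated <= c->obj_used)
        return 0;
    return ((c->obj_allocated - c->obj_used) / c->objs_per_zspage)
           * c->pages_per_zspage;
}

/* count_objects()-style estimate: sum the per-class estimates. */
static unsigned long pool_count_freeable(const struct class_sketch *classes,
                                         size_t nr)
{
    unsigned long total = 0;
    size_t i;

    for (i = 0; i < nr; i++)
        total += class_can_free(&classes[i]);
    return total;
}

/* scan-style pass: release one zspage at a time, abort when useless. */
static unsigned long pool_compact(struct class_sketch *classes, size_t nr)
{
    unsigned long freed = 0;
    size_t i;

    for (i = 0; i < nr; i++) {
        while (class_can_free(&classes[i]) > 0) {
            /* Stand-in for migrating objects and freeing one zspage. */
            classes[i].obj_allocated -= classes[i].objs_per_zspage;
            freed += classes[i].pages_per_zspage;
        }
    }
    return freed;
}
```

In this toy model the estimate and the number of pages actually freed always agree. In a live pool they can differ, because allocations and frees continue between the count and the scan.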
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Suggested-by: Minchan Kim <minchan@kernel.org> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Compaction returns back to zram the number of migrated objects, which is
quite uninformative -- we have objects of different sizes so user space
cannot obtain any valuable data from that number. Change compaction to
operate in terms of pages and return back to compaction issuer the
number of pages that were freed during compaction. So from now on we
will export more meaningful value in zram<id>/mm_stat -- the number of
freed (compacted) pages.
This requires:
(a) a rename of `num_migrated' to `pages_compacted'
(b) an internal API change -- return first_page's fullness_group from
putback_zspage(), so we know when putback_zspage() did
free_zspage(). It helps us to account compaction stats correctly.
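The accounting idea in (b) can be illustrated with a small user-space sketch. The names below (`enum fullness_sketch`, `putback_zspage_sketch()`, `account_putback()`) and the three-quarters threshold are assumptions chosen for the example, not the kernel's definitions. What matters is that the putback helper reports the fullness group, so the caller can tell when the zspage was freed and count its pages.

```c
/* Hypothetical fullness groups; the kernel's enum may differ. */
enum fullness_sketch { FG_ALMOST_EMPTY, FG_ALMOST_FULL, FG_FULL, FG_EMPTY };

struct zspage_sketch {
    unsigned int inuse;       /* objects currently in use */
    unsigned int max_objects; /* capacity of the zspage */
    unsigned int pages;       /* pages backing the zspage */
    int freed;                /* set when the zspage has been released */
};

struct pool_stats_sketch {
    unsigned long pages_compacted;
};

static enum fullness_sketch classify(const struct zspage_sketch *p)
{
    if (p->inuse == 0)
        return FG_EMPTY;
    if (p->inuse == p->max_objects)
        return FG_FULL;
    /* Illustrative threshold: up to 3/4 full counts as almost empty. */
    if (p->inuse * 4 <= p->max_objects * 3)
        return FG_ALMOST_EMPTY;
    return FG_ALMOST_FULL;
}

/* Put the zspage back; release it if empty, and report its group. */
static enum fullness_sketch putback_zspage_sketch(struct zspage_sketch *p)
{
    enum fullness_sketch fg = classify(p);

    if (fg == FG_EMPTY)
        p->freed = 1; /* stands in for free_zspage() */
    return fg;
}

/* Count pages, not objects: only a freed zspage adds to the stat. */
static void account_putback(struct pool_stats_sketch *stats,
                            struct zspage_sketch *p)
{
    if (putback_zspage_sketch(p) == FG_EMPTY)
        stats->pages_compacted += p->pages;
}
```

Putting back an empty two-page zspage adds 2 to `pages_compacted`; putting back a partially filled one leaves the counter unchanged.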
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
`zs_compact_control' accounts the number of migrated objects but it has
a limited lifespan -- we lose it as soon as zs_compact() returns back
to zram. It worked fine, because (a) zram had its own counter of
migrated objects and (b) only zram could trigger compaction. However,
this does not work for automatic pool compaction (not issued by zram).
To account objects migrated during auto-compaction (issued by the
shrinker) we need to store this number in zs_pool.
Define a new `struct zs_pool_stats' structure to keep zs_pool's stats
there. It provides only `num_migrated', as of this writing, but it
surely can be extended.
A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to
caller.
Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats.
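As a minimal sketch of this arrangement, assuming simplified user-space types: the pool owns the counter, so migrations are counted no matter who triggers compaction, and callers read the stats through a snapshot function. The `_sketch` names and `pool_record_migration()` are hypothetical; only the `num_migrated' field name comes from the text above.

```c
#include <string.h>

/* Stats kept by the pool itself; more fields could be added later. */
struct zs_pool_stats_sketch {
    unsigned long num_migrated;
};

/* Hypothetical pool; the real zs_pool has many more members. */
struct zs_pool_sketch {
    struct zs_pool_stats_sketch stats;
};

/* Called from any compaction path, manual or shrinker-driven. */
static void pool_record_migration(struct zs_pool_sketch *pool,
                                  unsigned long nr_objects)
{
    pool->stats.num_migrated += nr_objects;
}

/* Copy the pool's stats out to the caller, e.g. for an mm_stat file. */
static void pool_stats_snapshot(const struct zs_pool_sketch *pool,
                                struct zs_pool_stats_sketch *out)
{
    memcpy(out, &pool->stats, sizeof(*out));
}
```

Copying the whole structure means callers do not have to change when a new field is added to the stats.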
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Suggested-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change zs_object_copy() argument order to be (DST, SRC) rather than
(SRC, DST). Copy/move functions usually use (to, from) argument
order.
Rename alloc_target_page() to isolate_target_page(). This function
doesn't allocate anything; it isolates the target page, much like
isolate_source_page().
Tweak __zs_compact() comment.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zs_can_compact() checks whether class compaction will free any pages.
In other words: do we have enough unused objects to form at least one
ZS_EMPTY page and free it? Compaction is aborted if class compaction
will not result in any (further) savings.
EXAMPLE (this debug output is not part of this patch set):
- class size
- number of allocated objects
- number of used objects
- max objects per zspage
- pages per zspage
- estimated number of pages that will be freed
Every "compaction is useless" indicates that we saved CPU cycles.
class-512 has
544 objects allocated
540 objects used
8 objects per-page
Even if we have an ALMOST_EMPTY zspage, we still don't have enough room to
migrate all of its objects and free this zspage, so compaction does not
make much sense; it's better to just leave it as is.
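The class-512 numbers can be checked with a small user-space sketch of the estimate. The struct and function names are hypothetical, and treating the class as having one page per zspage is an assumption made only for this example.

```c
/* Hypothetical per-class counters; objs_per_zspage must be non-zero. */
struct class_stats {
    unsigned long obj_allocated;    /* object slots allocated in the class */
    unsigned long obj_used;         /* slots currently holding data */
    unsigned long objs_per_zspage;  /* max objects that fit in one zspage */
    unsigned long pages_per_zspage; /* pages backing one zspage */
};

/* Estimate how many pages compacting this class could free. */
static unsigned long estimate_freeable_pages(const struct class_stats *c)
{
    unsigned long unused;

    if (c->obj_allocated <= c->obj_used)
        return 0;

    unused = c->obj_allocated - c->obj_used;
    /* Only whole zspages can be released, so round down. */
    return (unused / c->objs_per_zspage) * c->pages_per_zspage;
}
```

With 544 allocated, 540 used and 8 objects per zspage there are only 4 unused slots, fewer than one zspage's worth, so the estimate is 0 and compaction of the class is skipped. A class with 100 allocated and 60 used under the same geometry would yield 40 / 8 = 5 pages.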
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Always account per-class `zs_size_stat' stats. This data will help us
make better decisions during compaction. We are especially interested
in OBJ_ALLOCATED and OBJ_USED, which can tell us if class compaction
will result in any memory gain.
For instance, we know the number of allocated objects in the class, the
number of objects being used (so we also know how many objects are not
used) and the number of objects per-page. So we can check whether we have
enough unused objects to form at least one ZS_EMPTY zspage during
compaction.
We calculate this value on a per-class basis so we can calculate the total
number of zspages that can be released, which is exactly what a
shrinker wants to know.
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>