drivers/pps/kc.c:37:1: sparse: symbol 'pps_kc_hardpps_lock' was not declared. Should it be static?
drivers/pps/kc.c:39:19: sparse: symbol 'pps_kc_hardpps_dev' was not declared. Should it be static?
drivers/pps/kc.c:40:5: sparse: symbol 'pps_kc_hardpps_mode' was not declared. Should it be static?
pps: hide more configuration symbols behind CONFIG_PPS
Make CONFIG_PPS_DEBUG and CONFIG_NTP_PPS be hidden if CONFIG_PPS is not
selected, so that we are not prompted for these configuration options if
CONFIG_PPS is not set.
Michal Belczyk [Tue, 30 Apr 2013 22:28:28 +0000 (15:28 -0700)]
nbd: increase default and max request sizes
Raise the default max request size for nbd to 128KB (from 127KB) to get it
4KB aligned. This patch also allows the max request size to be increased
(via /sys/block/nbd<x>/queue/max_sectors_kb) to 32MB.
The patch makes nbd network traffic more efficient by:
- reducing request fragmentation (4KB alignment)
- reducing the number of requests (fewer round trips, less network overhead)
Especially in high latency networks, larger request size can make a dramatic
Signed-off-by: Paul Clements <paul.clements@steeleye.com> Signed-off-by: Michal Belczyk <belczyk@bsd.krakow.pl> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/pid.c: improve flow of a loop inside alloc_pidmap.
find_next_offset() searches for an available "cleaned bit" in the
respective pid bitmap (page), so returns the offset if found, otherwise
it returns a value equals to BITS_PER_PAGE.
For example, suppose find_next_offset didn't find any available bit, so
there's no purpose to call mk_pid (Wasteful Cpu Cycles).
Therefore, I found it could be better to call mk_pid after the checking
(offset < BITS_PER_PAGE) returned sucessfully! Another point: If (offset
< BITS_PER_PAGE) results in a "failure", then mk_pid would be called
again afterwards.
[akpm@linux-foundation.org: simplify code] Signed-off-by: Raphael S. Carvalho <raphael.scarv@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
exec: do not abuse ->cred_guard_mutex in threadgroup_lock()
threadgroup_lock() takes signal->cred_guard_mutex to ensure that
thread_group_leader() is stable. This doesn't look nice, the scope of
this lock in do_execve() is huge.
And as Dave pointed out this can lead to deadlock, we have the
following dependencies:
Change de_thread() to take threadgroup_change_begin() around the
switch-the-leader code and change threadgroup_lock() to avoid
->cred_guard_mutex.
Note that de_thread() can't sleep with ->group_rwsem held, this can
obviously deadlock with the exiting leader if the writer is active, so it
does threadgroup_change_end() before schedule().
Reported-by: Dave Jones <davej@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Tue, 30 Apr 2013 22:28:18 +0000 (15:28 -0700)]
fs, proc: truncate /proc/pid/comm writes to first TASK_COMM_LEN bytes
Currently, a write to a procfs file will return the number of bytes
successfully written. If the actual string is longer than this, the
remainder of the string will not be be written and userspace will
complete the operation by issuing additional write()s.
Hence
$ echo -n "abcdefghijklmnopqrs" > /proc/self/comm
results in
$ cat /proc/$$/comm
pqrs
since the final four bytes were written with a second write() since
TASK_COMM_LEN == 16. This is obviously an undesired result and not
equivalent to prctl(PR_SET_NAME). The implementation should not need to
know the definition of TASK_COMM_LEN.
This patch truncates the string to the first TASK_COMM_LEN bytes and
returns the bytes written as the length of the string written so the
second write() is suppressed.
$ cat /proc/$$/comm
abcdefghijklmno
Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
coredump: change wait_for_dump_helpers() to use wait_event_interruptible()
wait_for_dump_helpers() calls wake_up/kill_fasync from inside the
wait_event-like loop. This is not needed and in fact this is not
strictly correct, we can/should do this only once after we change
pipe->writers. We could even check if it becomes zero.
Change this code to use use wait_event_interruptible(), this can also
help to make this wait freezable.
With this patch we check pipe->readers without pipe_lock(), this is
fine. Once we see pipe->readers == 1 we know that the handler
decremented the counter, this is all we need.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Mandeep Singh Baines <msb@chromium.org> Cc: Neil Horman <nhorman@redhat.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change dump_write(), dump_seek() and do_coredump() to check
signal_pending() and abort if it is true. dump_seek() does this only
before f_op->llseek(), otherwise it relies on dump_write().
We need this change to ensure that the coredump won't delay suspend, and
to ensure it reacts to SIGKILL "quickly enough", a core dump can take a
lot of time. In particular this can help oom-killer.
We add the new trivial helper, dump_interrupted() to add the comments and
to simplify the potential freezer changes. Perhaps it will have more
callers.
Ideally it should do try_to_freeze() but then we need the unpleasant
changes in dump_write() and wait_for_dump_helpers(). It is not trivial to
change dump_write() to restart if f_op->write() fails because of
freezing(). We need to handle the short writes, we need to clear
TIF_SIGPENDING (and we can't rely on recalc_sigpending() unless we change
it to check PF_DUMPCORE). And if the buggy f_op->write() sets
TIF_SIGPENDING we can not distinguish this case from the race with
freeze_task() + __thaw_task().
So we simply accept the fact that the freezer can truncate a core-dump but
at least you can reliably suspend. Hopefully we can tolerate this
unlikely case and the necessary complications doesn't worth a trouble.
But if we decide to make the coredumping freezable later we can do this on
top of this change.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Mandeep Singh Baines <msb@chromium.org> Cc: Neil Horman <nhorman@redhat.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
coredump: sanitize the setting of signal->group_exit_code
Now that the coredumping process can be SIGKILL'ed, the setting of
->group_exit_code in do_coredump() can race with complete_signal() and
SIGKILL or 0x80 can be "lost", or wait(status) can report status ==
SIGKILL | 0x80.
But the main problem is that it is not clear to me what should we do if
binfmt->core_dump() succeeds but SIGKILL was sent, that is why this patch
comes as a separate change.
This patch adds 0x80 if ->core_dump() succeeds and the process was not
killed. But perhaps we can (should?) re-set ->group_exit_code changed by
SIGKILL back to "siginfo->si_signo |= 0x80" in case when core_dumped == T.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Mandeep Singh Baines <msb@chromium.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Neil Horman <nhorman@redhat.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Roland McGrath <roland@hack.frob.com> Cc: Tejun Heo <tj@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
coredump: ensure that SIGKILL always kills the dumping thread
prepare_signal() blesses SIGKILL sent to the dumping process but this
signal can be "lost" anyway. The problems is, complete_signal() sees
SIGNAL_GROUP_EXIT and skips the "kill them all" logic. And even if the
dumping process is single-threaded (so the target is always "correct"),
the group-wide SIGKILL is not recorded in task->pending and thus
__fatal_signal_pending() won't be true. A multi-threaded case has even
more problems.
And even ignoring all technical details, SIGNAL_GROUP_EXIT doesn't look
right to me. This coredumping process is not exiting yet, it can do a lot
of work dumping the core.
With this patch the dumping process doesn't have SIGNAL_GROUP_EXIT, we set
signal->group_exit_task instead. This makes signal_group_exit() true and
thus this should equally close the races with exit/exec/stop but allows to
kill the dumping thread reliably.
Notes:
- It is not clear what should we do with ->group_exit_code
if the dumper was killed, see the next change.
- we need more (hopefully straightforward) changes to ensure
that SIGKILL actually interrupts the coredump. Basically we
need to check __fatal_signal_pending() in dump_write() and
dump_seek().
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Mandeep Singh Baines <msb@chromium.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Neil Horman <nhorman@redhat.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Roland McGrath <roland@hack.frob.com> Cc: Tejun Heo <tj@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
coredump: only SIGKILL should interrupt the coredumping task
There are 2 well known and ancient problems with coredump/signals, and a
lot of related bug reports:
- do_coredump() clears TIF_SIGPENDING but of course this can't help
if, say, SIGCHLD comes after that.
In this case the coredump can fail unexpectedly. See for example
wait_for_dump_helper()->signal_pending() check but there are other
reasons.
- At the same time, dumping a huge core on the slow media can take a
lot of time/resources and there is no way to kill the coredumping
task reliably. In particular this is not oom_kill-friendly.
This patch tries to fix the 1st problem, and makes the preparation for the
next changes.
We add the new SIGNAL_GROUP_COREDUMP flag set by zap_threads() to indicate
that this process dumps the core. prepare_signal() checks this flag and
nacks any signal except SIGKILL.
Note that this check tries to be conservative, in the long term we should
probably treat the SIGNAL_GROUP_EXIT case equally but this needs more
discussion. See marc.info/?l=linux-kernel&m=120508897917439
Notes:
- recalc_sigpending() doesn't check SIGNAL_GROUP_COREDUMP.
The patch assumes that dump_write/etc paths should never
call it, but we can change it as well.
- There is another source of TIF_SIGPENDING, freezer. This
will be addressed separately.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Mandeep Singh Baines <msb@chromium.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Neil Horman <nhorman@redhat.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Roland McGrath <roland@hack.frob.com> Cc: Tejun Heo <tj@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lucas De Marchi [Tue, 30 Apr 2013 22:28:09 +0000 (15:28 -0700)]
kmod: remove call_usermodehelper_fns()
This function suffers from not being able to determine if the cleanup is
called in case it returns -ENOMEM. Nobody is using it anymore, so let's
remove it.
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lucas De Marchi [Tue, 30 Apr 2013 22:28:07 +0000 (15:28 -0700)]
usermodehelper: split remaining calls to call_usermodehelper_fns()
These are the only users of call_usermodehelper_fns(). This function
suffers from not being able to determine if the cleanup is called. Even
if in this places the cleanup pointer is NULL, convert them to use the
separate call_usermodehelper_setup() + call_usermodehelper_exec()
functions so we can remove the _fns variant.
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lucas De Marchi [Tue, 30 Apr 2013 22:28:05 +0000 (15:28 -0700)]
KEYS: split call to call_usermodehelper_fns()
Use call_usermodehelper_setup() + call_usermodehelper_exec() instead of
calling call_usermodehelper_fns(). In case there's an OOM in this last
function the cleanup function may not be called - in this case we would
miss a call to key_put().
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Acked-by: James Morris <james.l.morris@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lucas De Marchi [Tue, 30 Apr 2013 22:28:03 +0000 (15:28 -0700)]
kmod: split call to call_usermodehelper_fns()
Use call_usermodehelper_setup() + call_usermodehelper_exec() instead of
calling call_usermodehelper_fns(). In case the latter returns -ENOMEM the
cleanup function may had not been called - in this case we would not free
argv and module_name.
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lucas De Marchi [Tue, 30 Apr 2013 22:28:02 +0000 (15:28 -0700)]
usermodehelper: export call_usermodehelper_exec() and call_usermodehelper_setup()
call_usermodehelper_setup() + call_usermodehelper_exec() need to be
called instead of call_usermodehelper_fns() when the cleanup function
needs to be called even when an ENOMEM error occurs. In this case using
call_usermodehelper_fns() the user can't distinguish if the cleanup
function was called or not.
[akpm@linux-foundation.org: export call_usermodehelper_setup() to modules] Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Dump signals from process-wide and per-thread queues with
different sizes of buffers.
* Check error paths for buffers with restricted permissions. A part of
buffer or a whole buffer is for read-only.
* Try to get nonexistent signal.
Signed-off-by: Andrew Vagin <avagin@openvz.org> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pedro Alves <palves@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ptrace: add ability to retrieve signals without removing from a queue (v4)
This patch adds a new ptrace request PTRACE_PEEKSIGINFO.
This request is used to retrieve information about pending signals
starting with the specified sequence number. Siginfo_t structures are
copied from the child into the buffer starting at "data".
The argument "addr" is a pointer to struct ptrace_peeksiginfo_args.
struct ptrace_peeksiginfo_args {
u64 off; /* from which siginfo to start */
u32 flags;
s32 nr; /* how may siginfos to take */
};
"nr" has type "s32", because ptrace() returns "long", which has 32 bits on
i386 and a negative values is used for errors.
Currently here is only one flag PTRACE_PEEKSIGINFO_SHARED for dumping
signals from process-wide queue. If this flag is not set, signals are
read from a per-thread queue.
The request PTRACE_PEEKSIGINFO returns a number of dumped signals. If a
signal with the specified sequence number doesn't exist, ptrace returns
zero. The request returns an error, if no signal has been dumped.
Errors:
EINVAL - one or more specified flags are not supported or nr is negative
EFAULT - buf or addr is outside your accessible address space.
A result siginfo contains a kernel part of si_code which usually striped,
but it's required for queuing the same siginfo back during restore of
pending signals.
This functionality is required for checkpointing pending signals. Pedro
Alves suggested using it in "gdb" to peek at pending signals. gdb already
uses PTRACE_GETSIGINFO to get the siginfo for the signal which was already
dequeued. This functionality allows gdb to look at the pending signals
which were not reported yet.
The prototype of this code was developed by Oleg Nesterov.
Signed-off-by: Andrew Vagin <avagin@openvz.org> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Pedro Alves <palves@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs/hfsplus/bfind.c: In function 'hfs_find_1st_rec_by_cnid':
(1) include/uapi/linux/swab.h:60:2: warning: 'search_cnid' may be used uninitialized in this function [-Wmaybe-uninitialized]
(2) include/uapi/linux/swab.h:60:2: warning: 'cur_cnid' may be used uninitialized in this function [-Wmaybe-uninitialized]
[akpm@linux-foundation.org: make the workaround more explicit] Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
hfs_find_init() may fail with ENOMEM, but there are places, where the
returned value is not checked. The consequences can be very unpleasant,
e.g. kfree uninitialized pointer and inappropriate mutex unlocking.
The patch adds checks for errors in hfs_find_init().
Found by Linux Driver Verification project (linuxtesting.org).
nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption
The NILFS2 driver remounts itself in RO mode in the case of discovering
metadata corruption (for example, discovering a broken bmap). But
usually, this takes place when there have been file system operations
before remounting in RO mode.
Thereby, NILFS2 driver can be in RO mode with presence of dirty pages in
modified inodes' address spaces. It results in flush kernel thread's
infinite trying to flush dirty pages in RO mode. As a result, it is
possible to see such side effects as: (1) flush kernel thread occupies
50% - 99% of CPU time; (2) system can't be shutdowned without manual
power switch off.
SYMPTOMS:
(1) System log contains error message: "Remounting filesystem read-only".
(2) The flush kernel thread occupies 50% - 99% of CPU time.
(3) The system can't be shutdowned without manual power switch off.
REPRODUCTION PATH:
(1) Create volume group with name "unencrypted" by means of vgcreate utility.
(2) Run script (prepared by Anthony Doggett <Anthony2486@interfaces.org.uk>):
This patch implements checking mount state of NILFS2 driver in
nilfs_writepage(), nilfs_writepages() and nilfs_mdt_write_page()
methods. If it is detected the RO mount state then all dirty pages are
simply discarded with warning messages is written in system log.
Move the calls to memcpy_fromio() up into the loop in
dmi_scan_machine(), and move the signature checks back down into
dmi_decode(). We need to check at 16-byte intervals but keep a 32-byte
buffer for an SMBIOS entry, so shift the buffer after each iteration.
Merge smbios_present() into dmi_present(), so we look for an SMBIOS
signature at the beginning of the given buffer and then for a DMI
signature at an offset of 16 bytes.
[artem.savkov@gmail.com: use proper buf type in dmi_present()] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Reported-by: Tim McGrath <tmhikaru@gmail.com> Tested-by: Tim Mcgrath <tmhikaru@gmail.com> Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Artem Savkov <artem.savkov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
binfmt_elf: PIE: make PF_RANDOMIZE check comment more accurate
The comment I originally added in commit a3defbe5c337 ("binfmt_elf: fix
PIE execution with randomization disabled") is not really 100% accurate
-- sysctl is not the only way how PF_RANDOMIZE could be forcibly unset
in runtime.
Another option of course is direct modification of personality flags
(i.e. running through setarch wrapper).
fs: make binfmt support for #! scripts modular and removable
Add a new configuration option CONFIG_BINFMT_SCRIPT to configure support
for interpreted scripts starting with "#!"; allow compiling out that
support, or building it as a module. Embedded systems running exclusively
compiled binaries could leave this support out, and systems that don't
need scripts before mounting the root filesystem can build this as a
module.
Signed-off-by: Josh Triplett <josh@joshtriplett.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Wong [Tue, 30 Apr 2013 22:27:43 +0000 (15:27 -0700)]
epoll: cleanup: use RCU_INIT_POINTER when nulling
It is always safe to use RCU_INIT_POINTER to NULL a pointer. This results
in slightly smaller/faster code.
Signed-off-by: Eric Wong <normalperson@yhbt.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Wong [Tue, 30 Apr 2013 22:27:42 +0000 (15:27 -0700)]
epoll: cleanup: hoist out f_op->poll calls
This reduces the amount of code inside the ready list iteration loops for
better readability IMHO.
Signed-off-by: Eric Wong <normalperson@yhbt.net> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Wong [Tue, 30 Apr 2013 22:27:40 +0000 (15:27 -0700)]
epoll: lock ep->mtx in ep_free to silence lockdep
Technically we do not need to hold ep->mtx during ep_free since we are
certain there are no other users of ep at that point. However, lockdep
complains with a "suspicious rcu_dereference_check() usage!" message; so
lock the mutex before ep_remove to silence the warning.
Signed-off-by: Eric Wong <normalperson@yhbt.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arve Hjønnevåg <arve@android.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: NeilBrown <neilb@suse.de>, Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Wong [Tue, 30 Apr 2013 22:27:38 +0000 (15:27 -0700)]
epoll: trim epitem by one cache line
It is common for epoll users to have thousands of epitems, so saving a
cache line on every allocation leads to large memory savings.
Since epitem allocations are cache-aligned, reducing sizeof(struct
epitem) from 136 bytes to 128 bytes will allow it to squeeze under a
cache line boundary on x86_64.
Via /sys/kernel/slab/eventpoll_epi, I see the following changes on my
x86_64 Core2 Duo (which has 64-byte cache alignment):
object_size : 192 => 128
objs_per_slab: 21 => 32
Also, add a BUILD_BUG_ON() to check for future accidental breakage.
[akpm@linux-foundation.org: use __packed, for all architectures] Signed-off-by: Eric Wong <normalperson@yhbt.net> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fan Du [Tue, 30 Apr 2013 22:27:27 +0000 (15:27 -0700)]
include/linux/fs.h: disable preempt when acquire i_size_seqcount write lock
Two rt tasks bind to one CPU core.
The higher priority rt task A preempts a lower priority rt task B which
has already taken the write seq lock, and then the higher priority rt
task A try to acquire read seq lock, it's doomed to lockup.
rt task A with lower priority: call write
i_size_write rt task B with higher priority: call sync, and preempt task A
write_seqcount_begin(&inode->i_size_seqcount); i_size_read
inode->i_size = i_size; read_seqcount_begin <-- lockup here...
So disable preempt when acquiring every i_size_seqcount *write* lock will
cure the problem.
Signed-off-by: Fan Du <fan.du@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
csd_lock() uses assignment to data->flags rather than |=. That is not
buggy at present because only one bit (CSD_FLAG_LOCK) is defined in
call_single_data.flags.
But it will become buggy if we later add another flag, so fix it now.
writeback: set worker desc to identify writeback workers in task dumps
Writeback has been recently converted to use workqueue instead of its
private thread pool implementation. One negative side effect of this
conversion is that there's no easy to tell which backing device a
writeback work item was working on at the time of task dump, be it
sysrq-t, BUG, WARN or whatever, which, according to our writeback
brethren, is important in tracking down issues with a lot of mounted
file systems on a lot of different devices.
This patch restores that information using the new worker description
facility. bdi_writeback_workfn() calls set_work_desc() to identify
which bdi it's working on. The description is printed out together with
the worqueue name and worker function as in the following example dump.
workqueue: include workqueue info when printing debug dump of a worker task
One of the problems that arise when converting dedicated custom
threadpool to workqueue is that the shared worker pool used by workqueue
anonimizes each worker making it more difficult to identify what the
worker was doing on which target from the output of sysrq-t or debug
dump from oops, BUG() and friends.
This patch implements set_worker_desc() which can be called from any
workqueue work function to set its description. When the worker task is
dumped for whatever reason - sysrq-t, WARN, BUG, oops, lockdep assertion
and so on - the description will be printed out together with the
workqueue name and the worker function pointer.
The printing side is implemented by print_worker_info() which is called
from functions in task dump paths - sched_show_task() and
dump_stack_print_info(). print_worker_info() can be safely called on
any task in any state as long as the task struct itself is accessible.
It uses probe_*() functions to access worker fields. It may print
garbage if something went very wrong, but it wouldn't cause (another)
oops.
The description is currently limited to 24bytes including the
terminating \0. worker->desc_valid and workder->desc[] are added and
the 64 bytes marker which was already incorrect before adding the new
fields is moved to the correct position.
Here's an example dump with writeback updated to set the bdi name as
worker desc.
One of the problems that arise when converting dedicated custom threadpool
to workqueue is that the shared worker pool used by workqueue anonimizes
each worker making it more difficult to identify what the worker was doing
on which target from the output of sysrq-t or debug dump from oops, BUG()
and friends.
For example, after writeback is converted to use workqueue instead of
priviate thread pool, there's no easy to tell which backing device a
writeback work item was working on at the time of task dump, which,
according to our writeback brethren, is important in tracking down issues
with a lot of mounted file systems on a lot of different devices.
This patchset implements a way for a work function to mark its execution
instance so that task dump of the worker task includes information to
indicate what the work item was doing.
An example WARN dump would look like the following.
Implement probe_kthread_data() which returns kthread_data if accessible.
The function is equivalent to kthread_data() except that the specified
@task may not be a kthread or its vfork_done is already cleared rendering
struct kthread inaccessible. In the former case, probe_kthread_data() may
return any value. In the latter, NULL.
This will be used to safely print debug information without affecting
synchronization in the normal paths. Workqueue debug info printing on
dump_stack() and friends will make use of it.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dump_stack: unify debug information printed by show_regs()
show_regs() is inherently arch-dependent but it does make sense to print
generic debug information and some archs already do albeit in slightly
different forms. This patch introduces a generic function to print debug
information from show_regs() so that different archs print out the same
information and it's much easier to modify what's printed.
show_regs_print_info() prints out the same debug info as dump_stack()
does plus task and thread_info pointers.
* Already prints debug info. Replaced with show_regs_print_info().
The printed information is superset of what used to be there.
arm, arm64, avr32, mips, powerpc, sh32, tile, unicore32, x86
* s390 is special in that it used to print arch-specific information
along with generic debug info. Heiko and Martin think that the
arch-specific extra isn't worth keeping s390 specfic implementation.
Converted to use the generic version.
Note that now all archs print the debug info before actual register
dumps.
v3: CPU number dropped from show_regs_print_info() as
dump_stack_print_info() has been updated to print it. s390
specific implementation dropped as requested by s390 maintainers.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Sam Ravnborg <sam@ravnborg.org> Acked-by: Chris Metcalf <cmetcalf@tilera.com> [tile bits] Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon bits] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dump_stack: implement arch-specific hardware description in task dumps
x86 and ia64 can acquire extra hardware identification information
from DMI and print it along with task dumps; however, the usage isn't
consistent.
* x86 show_regs() collects vendor, product and board strings and print
them out with PID, comm and utsname. Some of the information is
printed again later in the same dump.
* warn_slowpath_common() explicitly accesses the DMI board and prints
it out with "Hardware name:" label. This applies to both x86 and
ia64 but is irrelevant on all other archs.
* ia64 doesn't show DMI information on other non-WARN dumps.
This patch introduces arch-specific hardware description used by
dump_stack(). It can be set by calling dump_stack_set_arch_desc()
during boot and, if exists, printed out in a separate line with
"Hardware name:" label.
dmi_set_dump_stack_arch_desc() is added which sets arch-specific
description from DMI data. It uses dmi_ids_string[] which is set from
dmi_present() used for DMI debug message. It is superset of the
information x86 show_regs() is using. The function is called from x86
and ia64 boot code right after dmi_scan_machine().
This makes the explicit DMI handling in warn_slowpath_common()
unnecessary. Removed.
show_regs() isn't yet converted to use generic debug information
printing and this patch doesn't remove the duplicate DMI handling in
x86 show_regs(). The next patch will unify show_regs() handling and
remove the duplication.
v2: Use the same string as the debug message from dmi_present() which
also contains BIOS information. Move hardware name into its own
line as warn_slowpath_common() did. This change was suggested by
Bjorn Helgaas.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dmi: morph dmi_dump_ids() into dmi_format_ids() which formats into a buffer
We're goning to use DMI identification for other purposes too. Morph
dmi_dump_ids() which is used to print DMI identification as a debug
message during boot into dmi_format_ids() which formats the same
information sans the leading "DMI:" tag into a string buffer.
dmi_present() is updated to format the information into dmi_ids_string[]
using the new function and print it with "DMI:" prefix.
dmi_ids_string[] will be used for another purpose by a future patch.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
dump_stack: consolidate dump_stack() implementations and unify their behaviors
Both dump_stack() and show_stack() are currently implemented by each
architecture. show_stack(NULL, NULL) dumps the backtrace for the
current task as does dump_stack(). On some archs, dump_stack() prints
extra information - pid, utsname and so on - in addition to the
backtrace while the two are identical on other archs.
The usages in arch-independent code of the two functions indicate
show_stack(NULL, NULL) should print out bare backtrace while
dump_stack() is used for debugging purposes when something went wrong,
so it does make sense to print additional information on the task which
triggered dump_stack().
There's no reason to require archs to implement two separate but mostly
identical functions. It leads to unnecessary subtle information.
This patch expands the dummy fallback dump_stack() implementation in
lib/dump_stack.c such that it prints out debug information (taken from
x86) and invokes show_stack(NULL, NULL) and drops arch-specific
dump_stack() implementations in all archs except blackfin. Blackfin's
dump_stack() does something wonky that I don't understand.
Debug information can be printed separately by calling
dump_stack_print_info() so that arch-specific dump_stack()
implementation can still emit the same debug information. This is used
in blackfin.
This patch brings the following behavior changes.
* On some archs, an extra level in backtrace for show_stack() could be
printed. This is because the top frame was determined in
dump_stack() on those archs while generic dump_stack() can't do that
reliably. It can be compensated by inlining dump_stack() but not
sure whether that'd be necessary.
* Most archs didn't use to print debug info on dump_stack(). They do
now.
v2: CPU number added to the generic debug info as requested by s390
folks and dropped the s390 specific dump_stack(). This loses %ksp
from the debug message which the maintainers think isn't important
enough to keep the s390-specific dump_stack() implementation.
dump_stack_print_info() is moved to kernel/printk.c from
lib/dump_stack.c. Because linkage is per objecct file,
dump_stack_print_info() living in the same lib file as generic
dump_stack() means that archs which implement custom dump_stack()
- at this point, only blackfin - can't use dump_stack_print_info()
as that will bring in the generic version of dump_stack() too. v1
The v1 patch broke build on blackfin due to this issue. The build
breakage was reported by Fengguang Wu.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Vineet Gupta <vgupta@synopsys.com> Acked-by: Jesper Nilsson <jesper.nilsson@axis.com> Acked-by: Vineet Gupta <vgupta@synopsys.com> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [s390 bits] Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Sam Ravnborg <sam@ravnborg.org> Acked-by: Richard Kuo <rkuo@codeaurora.org> [hexagon bits] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
sparc32: make show_stack() acquire %fp if @_ksp is not specified
show_stack(current or NULL, NULL) is used by arch-independent code to dump
backtrace of the current task; however, sparc32 show_stack() doesn't
implement it and wouldn't print any backtrace when NULL @_ksp is specfied.
Make show_stack() acquire and use %fp if @tsk is NULL or current and @_ksp
is NULL. This makes %fp fetching in dump_stack() unnecessary. Make it
use NULL for @_ksp instead.
Only compile tested.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
x86: don't show trace beyond show_stack(NULL, NULL)
There are multiple ways a task can be dumped - explicit call to
dump_stack(), triggering WARN() or BUG(), through sysrq-t and so on.
Most of what gets printed is upto each architecture and the current
state is not particularly pretty. Different pieces of information are
presented differently depending on which path the dump takes and which
architecture it's running on. This is messy for no good reason and
makes it exceedingly difficult to add or modify debug information to
task dumps.
In all archs except for s390, there's nothing arch-specific about the
printed debug information. This patchset updates all those archs to use
the same helpers to consistently print out the same debug information.
0001-0002 update stack dumping functions in x86 and sparc32 in
preparation.
0003 makes all arches except blackfin use generic dump_stack().
blackfin still uses the generic helper to print the same info.
0004-0005 properly abstract DMI identifier printing in WARN() and
show_regs() so that all dumps print out the information. This enables
show_regs() to use the same debug info message.
0006 updates show_regs() of all arches to use a common generic helper
to print debug info.
0007 removes somem duplicate information from arc dumps.
While this patchset changes how debug info is printed on some archs,
the printed information is always superset of what used to be there.
This patchset makes task dump debug messages consistent and enables
adding more information. Workqueue is scheduled to add worker
information including the workqueue in use and work item specific
description.
While this patch touches a lot of archs, it isn't too likely to cause
non-trivial conflicts with arch-specfic changes and would probably be
best to route together either through -mm.
x86 is tested but other archs are either only compile tested or not
tested at all. Changes to most archs are generally trivial.
This patch:
show_stack(current or NULL, NULL) is used to print the backtrace of the
current task. As trace beyond the function itself isn't of much
interest to anyone, don't show it by determining sp and bp in
show_stack()'s frame and passing them to show_stack_log_lvl().
This brings show_stack(NULL, NULL)'s behavior in line with
dump_stack().
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mike Frysinger <vapier@gentoo.org> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Tue, 30 Apr 2013 22:27:06 +0000 (15:27 -0700)]
selftest: add simple test for soft-dirty bit
It creates a mapping of 3 pages and checks that reads, writes and
clear-refs result in present and soft-dirt bits reported from pagemap2
set as expected.
[akpm@linux-foundation.org: alphasort the Makefile TARGETS to reduce rejects] Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Magenheimer [Tue, 30 Apr 2013 22:27:05 +0000 (15:27 -0700)]
staging: zcache: enable zcache to be built/loaded as a module
Allow zcache to be built/loaded as a module. Note runtime dependency
disallows loading if cleancache/frontswap lazy initialization patches
are not present. Zsmalloc support has not yet been merged into zcache
but, once merged, could now easily be selected via a module_param.
If built-in (not built as a module), the original mechanism of enabling
via a kernel boot parameter is retained, but this should be considered
deprecated.
Note that module unload is explicitly not yet supported.
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v1: Rebased with different order of patches]
[v2: Removed [CLEANCACHE|FRONTSWAP]_HAS_LAZY_INIT ifdef]
[v3: Rebased on top of ramster->zcache move]
[v4: Redid the Makefile]
[v5: s/ZCACHE2/ZCACHE/] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Magenheimer [Tue, 30 Apr 2013 22:27:03 +0000 (15:27 -0700)]
staging: zcache: enable ramster to be built/loaded as a module
Enable module support for ramster. Note runtime dependency disallows
loading if cleancache/frontswap lazy initialization patches are not
present.
If built-in (not built as a module), the original mechanism of enabling
via a kernel boot parameter is retained, but this should be considered
deprecated.
Note that module unload is explicitly not yet supported.
[v1: Fixed compile issues since ramster_init now has four arguments]
[v2: Fixed rebase on ramster->zcache move]
[akpm@linux-foundation.org: use_frontswap_selfshrink cannot be __initdata] Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Magenheimer [Tue, 30 Apr 2013 22:27:00 +0000 (15:27 -0700)]
xen: tmem: enable Xen tmem shim to be built/loaded as a module
Allow Xen tmem shim to be built/loaded as a module. Xen self-ballooning
and frontswap-selfshrinking are now also "lazily" initialized when the
Xen tmem shim is loaded as a module, unless explicitly disabled by
module parameters.
Note runtime dependency disallows loading if cleancache/frontswap lazy
initialization patches are not present.
If built-in (not built as a module), the original mechanism of enabling
via a kernel boot parameter is retained, but this should be considered
deprecated.
Note that module unload is explicitly not yet supported.
[v1: Removed the [CLEANCACHE|FRONTSWAP]_HAS_LAZY_INIT ifdef]
[v2: Squashed the xen/tmem: Remove the subsys call patch in]
[akpm@linux-foundation.org: fix build (disable_frontswap_selfshrinking undeclared)] Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cleancache: Make cleancache_init use a pointer for the ops
Instead of using a backend_registered to determine whether a backend is
enabled. This allows us to remove the backend_register check and just
do 'if (cleancache_ops)'
[v1: Rebase on top of b97c4b430b0a (ramster->zcache move] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Magenheimer [Tue, 30 Apr 2013 22:26:56 +0000 (15:26 -0700)]
mm: cleancache: lazy initialization to allow tmem backends to build/run as modules
With the goal of allowing tmem backends (zcache, ramster, Xen tmem) to
be built/loaded as modules rather than built-in and enabled by a boot
parameter, this patch provides "lazy initialization", allowing backends
to register to cleancache even after filesystems were mounted. Calls to
init_fs and init_shared_fs are remembered as fake poolids but no real
tmem_pools created. On backend registration the fake poolids are mapped
to real poolids and respective tmem_pools.
Signed-off-by: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Florian Schmaus <fschmaus@gmail.com> Signed-off-by: Andor Daam <andor.daam@googlemail.com> Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v1: Minor fixes: used #define for some values and bools]
[v2: Removed CLEANCACHE_HAS_LAZY_INIT]
[v3: Added more comments, added a lock for [shared_|]fs_poolid_map] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Tue, 30 Apr 2013 22:26:54 +0000 (15:26 -0700)]
frontswap: get rid of swap_lock dependency
Frontswap initialization routine depends on swap_lock, which want to be
atomic about frontswap's first appearance. IOW, frontswap is not present
and will fail all calls OR frontswap is fully functional but if new
swap_info_struct isn't registered by enable_swap_info, swap subsystem
doesn't start I/O so there is no race between init procedure and page I/O
working on frontswap.
So let's remove unnecessary swap_lock dependency.
Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Signed-off-by: Minchan Kim <minchan@kernel.org>
[v1: Rebased on my branch, reworked to work with backends loading late]
[v2: Added a check for !map]
[v3: Made the invalidate path follow the init path]
[v4: Address comments by Wanpeng Li <liwanp@linux.vnet.ibm.com>] Signed-off-by: Konrad Rzeszutek Wilk <konrad@darnok.org> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bob Liu [Tue, 30 Apr 2013 22:26:53 +0000 (15:26 -0700)]
mm: frontswap: cleanup code
After allowing tmem backends to build/run as modules, frontswap_enabled
always true if defined CONFIG_FRONTSWAP. But frontswap_test() depends on
whether backend is registered, mv it into frontswap.c using fronstswap_ops
to make the decision.
frontswap_set/clear are not used outside frontswap, so don't export them.
Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
frontswap: make frontswap_init use a pointer for the ops
This simplifies the code in the frontswap - we can get rid of the
'backend_registered' test and instead check against frontswap_ops.
[v1: Rebase on top of 703ba7fe5e0 (ramster->zcache move] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andor Daam <andor.daam@googlemail.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Florian Schmaus <fschmaus@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Magenheimer [Tue, 30 Apr 2013 22:26:50 +0000 (15:26 -0700)]
mm: frontswap: lazy initialization to allow tmem backends to build/run as modules
With the goal of allowing tmem backends (zcache, ramster, Xen tmem) to
be built/loaded as modules rather than built-in and enabled by a boot
parameter, this patch provides "lazy initialization", allowing backends
to register to frontswap even after swapon was run. Before a backend
registers all calls to init are recorded and the creation of tmem_pools
delayed until a backend registers or until a frontswap store is
attempted.
Signed-off-by: Stefan Hengelein <ilendir@googlemail.com> Signed-off-by: Florian Schmaus <fschmaus@gmail.com> Signed-off-by: Andor Daam <andor.daam@googlemail.com> Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
[v1: Fixes per Seth Jennings suggestions]
[v2: Removed FRONTSWAP_HAS_.. ]
[v3: Fix up per Bob Liu <lliubbo@gmail.com> recommendations]
[v4: Fix up per Andrew's comments] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
select_parent() populates the dispose list with dentries which
shrink_dentry_list() then deletes. select_parent() carefully uses
need_resched() to avoid doing too much work at once. But neither
shrink_dcache_parent() nor its called functions call cond_resched(). So
once need_resched() is set select_parent() will return single dentry
dispose list which is then deleted by shrink_dentry_list(). This is
inefficient when there are a lot of dentry to process. This can cause
softlockup and hurts interactivity on non preemptable kernels.
This change adds cond_resched() in shrink_dcache_parent(). The benefit
of this is that need_resched() is quickly cleared so that future calls
to select_parent() are able to efficiently return a big batch of dentry.
These additional cond_resched() do not seem to impact performance, at
least for the workload below.
Here is a program which can cause soft lockup if other system activity
sets need_resched().
int main()
{
struct rlimit rlim;
int i;
int f[100000];
char buf[20];
struct timeval t1, t2;
double diff;
Yan Hong [Tue, 30 Apr 2013 22:26:47 +0000 (15:26 -0700)]
fs/block_dev.c: no need to check inode->i_bdev in bd_forget()
Its only caller evict() has promised a non-NULL inode->i_bdev.
Signed-off-by: Yan Hong <clouds.yan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
inotify: invalid mask should return a error number but not set it
When we run the crackerjack testsuite, the inotify_add_watch test is
stalled.
This is caused by the invalid mask 0 - the task is waiting for the event
but it never comes. inotify_add_watch() should return -EINVAL as it did
before commit 676a0675cf92 ("inotify: remove broken mask checks causing
unmount to be EINVAL"). That commit removes the invalid mask check, but
that check is needed.
Check the mask's ALL_INOTIFY_BITS before the inotify_arg_to_mask() call.
If none are set, just return -EINVAL.
Because IN_UNMOUNT is in ALL_INOTIFY_BITS, this change will not trigger
the problem that above commit fixed.
[akpm@linux-foundation.org: fix build] Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com> Acked-by: Jim Somerville <Jim.Somerville@windriver.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Eric Paris <eparis@parisplace.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64
Pull arm64 update from Catalin Marinas:
"Main features:
- Versatile Express SoC (model) support - DT files and Kconfig
entries (there are no arch/arm64/mach-* directories). The bulk of
the code has already been moved to drivers/ as part of the ARM SoC
clean-up.
- Basic multi-cluster support (CPU logical map initialised from the
DT)
- Simple earlyprintk support for UART 8250/16550 and FastModel
console output
- Optimised kernel library bitops and string functions.
- Automatic initialisation of the irqchip and clocks via DT"
* tag 'arm64-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: (26 commits)
arm64: Use acquire/release semantics instead of explicit DMB
arm64: klib: bitops: fix unpredictable stxr usage
arm64: vexpress: Enable ARMv8 RTSM model (SoC) support
arm64: vexpress: Add dts files for the ARMv8 RTSM models
arm64: Survive invalid cpu enable-methods
arm64: mm: Correct show_pte behaviour
arm64: Fix compat types affecting struct compat_stat
arm64: Execute DSB during thread switching for TLB/cache maintenance
arm64: compiling issue, need add include/asm/vga.h file
arm64: smp: honour #address-size when parsing CPU reg property
arm64: Define cmpxchg64 and cmpxchg64_local for outside use
arm64: Define readq and writeq for driver module using
arm64: Fix task tracing
arm64: add explicit symbols to ESR_EL1 decoding
arm64: Use irqchip_init() for interrupt controller initialisation
arm64: psci: Use the MPIDR values from cpu_logical_map for cpu ids.
arm64: klib: Optimised atomic bitops
arm64: klib: Optimised string functions
arm64: klib: Optimised memory functions
arm64: head: match all affinity levels in the pen of the secondaries
...
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k update from Geert Uytterhoeven.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k: Remove inline strlen() implementation
m68k/atari: USB - add platform devices for EtherNAT/NetUSBee ISP1160 HCD
m68k: Implement ndelay() based on the existing udelay() logic
m68k/atari: EtherNAT - add interrupt chip definition for CPLD interrupts
m68k/atari: EtherNEC - add platform device support
m68k/atari: EtherNAT - platform device and IRQ support code
m68k/atari: use dedicated irq_chip for timer D interrupts
m68k/atari: ROM port ISA adapter support
m68k: Add missing cmpxchg64() if CONFIG_RMW_INSNS=y
Merge branch 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac
Pull edac fixes from Mauro Carvalho Chehab:
"Two edac fixes:
- i7300_edac currently reports a wrong number of DIMMs when the
memory controller is in single channel mode
- on some Sandy Bridge machines, the EDAC driver bails out as one of
the PCI IDs used by the driver is hidden by BIOS. As the driver
uses it only to detect the type of memory, make it optional at the
driver"
* 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac:
edac: sb_edac.c should not require prescence of IMC_DDRIO device
i7300_edac: Fix memory detection in single mode
Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media update from Mauro Carvalho Chehab:
- OF documentation and patches at core and drivers, to be used by for
embedded media systems
- some I2C drivers used on go7007 were rewritten/promoted from staging:
sony-btf-mpx, tw2804, tw9903, tw9906, wis-ov7640, wis-uda1342
- add fimc-is driver (Exynos)
- add a new radio driver: radio-si476x
- add a two new tuners: r820t and tuner_it913x
- split camera code on em28xx driver and add more models
- the cypress firmware load is used outside dvb usb drivers. So, move
it to a common directory to make easier to re-use it
- siano media driver updated to work with sms2270 devices
- several work done in order to promote go7007 and solo6x1x out of
staging (still, there are some pending issues)
- several API compliance fixes at v4l2 drivers that don't behave as
expected
- as usual, lots of driver fixes, improvements, cleanups and new device
addition at the existing drivers.
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (831 commits)
[media] cx88: make core less verbose
[media] em28xx: fix oops at em28xx_dvb_bus_ctrl()
[media] s5c73m3: fix indentation of the help section in Kconfig
[media] cx25821-alsa: get rid of a __must_check warning
[media] cx25821-video: declare cx25821_vidioc_s_std as static
[media] cx25821-video: remove maxw from cx25821_vidioc_try_fmt_vid_cap
[media] r820t: Remove a warning for an unused value
[media] dib0090: Fix a warning at dib0090_set_EFUSE
[media] dib8000: fix a warning
[media] dib8000: Fix sub-channel range
[media] dib8000: store dtv_property_cache in a temp var
[media] dib8000: warning fix: declare internal functions as static
[media] r820t: quiet gcc warning on n_ring
[media] r820t: memory leak in release()
[media] r820t: precendence bug in r820t_xtal_check()
[media] videodev2.h: Remove the unused old V4L1 buffer types
[media] anysee: Grammar s/report the/report to/
[media] anysee: Initialize ret = 0 in anysee_frontend_attach()
[media] media: videobuf2: fix the length check for mmap
[media] em28xx: save isoc endpoint number for DVB only if endpoint has alt settings with xMaxPacketSize != 0
...
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
Pull HID updates from Jiri Kosina:
- hid driver transport cleanup, finalizing the long-desired decoupling
of core from transport layers, by Benjamin Tissoires and Henrik
Rydberg
- support for hybrid finger/pen multitouch HID devices, by Benjamin
Tissoires
- fix for long-standing issue in Logitech unifying driver sometimes not
inializing properly due to device specifics, by Andrew de los Reyes
- Wii remote driver updates to support 2nd generation of devices, by
David Herrmann
- support for Apple IR remote
- roccat driver now supports new devices (Roccat Kone Pure, IskuFX), by
Stefan Achatz
- debugfs locking fixes in hid debug interface, by Jiri Kosina
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (43 commits)
HID: protect hid_debug_list
HID: debug: break out hid_dump_report() into hid-debug
HID: Add PID for Japanese version of NE4K keyboard
HID: hid-lg4ff add support for new version of DFGT wheel
HID: icade: u16 which never < 0
HID: clarify Magic Mouse Kconfig description
HID: appleir: add support for Apple ir devices
HID: roccat: added media key support for Kone
HID: hid-lenovo-tpkbd: remove doubled hid_get_drvdata
HID: i2c-hid: fix length for set/get report in i2c hid
HID: wiimote: parse reduced status reports
HID: wiimote: add 2nd generation Wii Remote IDs
HID: wiimote: use unique battery names
HID: hidraw: warn if userspace headers are outdated
HID: multitouch: force BTN_STYLUS for pen devices
HID: multitouch: append " Pen" to the name of the stylus input
HID: multitouch: add handling for pen in dual-sensors device
HID: multitouch: change touch sensor detection in mt_input_configured()
HID: multitouch: do not map usage from non used reports
HID: multitouch: breaks out touch handling in specific functions
...
Merge branch 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 RAS changes from Ingo Molnar:
- Add an Intel CMCI hotplug fix
- Add AMD family 16h EDAC support
- Make the AMD MCE banks code more flexible for virtual environments
* 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
amd64_edac: Add Family 16h support
x86/mce: Rework cmci_rediscover() to play well with CPU hotplug
x86, MCE, AMD: Use MCG_CAP MSR to find out number of banks on AMD
x86, MCE, AMD: Replace shared_bank array with is_shared_bank() helper
Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 platform changes from Ingo Molnar:
"Small fixes and cleanups all over the map"
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/setup: Drop unneeded include <asm/dmi.h>
x86/olpc/xo1/sci: Don't call input_free_device() after input_unregister_device()
x86/platform/intel/mrst: Remove cast for kmalloc() return value
x86/platform/uv: Replace kmalloc() & memset with kzalloc()
Merge branch 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 paravirt update from Ingo Molnar:
"Various paravirtualization related changes - the biggest one makes
guest support optional via CONFIG_HYPERVISOR_GUEST"
* 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, wakeup, sleep: Use pvops functions for changing GDT entries
x86, xen, gdt: Remove the pvops variant of store_gdt.
x86-32, gdt: Store/load GDT for ACPI S3 or hibernation/resume path is not needed
x86-64, gdt: Store/load GDT for ACPI S3 or hibernate/resume path is not needed.
x86: Make Linux guest support optional
x86, Kconfig: Move PARAVIRT_DEBUG into the paravirt menu
Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm changes from Ingo Molnar:
"Misc smaller changes all over the map"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/iommu/dmar: Remove warning for HPET scope type
x86/mm/gart: Drop unnecessary check
x86/mm/hotplug: Put kernel_physical_mapping_remove() declaration in CONFIG_MEMORY_HOTREMOVE
x86/mm/fixmap: Remove unused FIX_CYCLONE_TIMER
x86/mm/numa: Simplify some bit mangling
x86/mm: Re-enable DEBUG_TLBFLUSH for X86_32
x86/mm/cpa: Cleanup split_large_page() and its callee
x86: Drop always empty .text..page_aligned section
Merge branch 'x86-kaslr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perparatory x86 kasrl changes from Ingo Molnar:
"This contains changes from the ongoing KASLR work, by Kees Cook.
The main changes are the use of a read-only IDT on x86 (which
decouples the userspace visible virtual IDT address from the physical
address), and a rework of ELF relocation support, in preparation of
random, boot-time kernel image relocation."
* 'x86-kaslr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, relocs: Refactor the relocs tool to merge 32- and 64-bit ELF
x86, relocs: Build separate 32/64-bit tools
x86, relocs: Add 64-bit ELF support to relocs tool
x86, relocs: Consolidate processing logic
x86, relocs: Generalize ELF structure names
x86: Use a read-only IDT alias on all CPUs
Merge branch 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 debug update from Ingo Molnar:
"Two small changes: a documentation update and a constification"
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, early-printk: Update earlyprintk documentation (and kill x86 copy)
x86: Constify a few items