Rik van Riel [Fri, 6 Jun 2014 21:38:13 +0000 (14:38 -0700)]
sysrq: rcu-ify __handle_sysrq
Echoing values into /proc/sysrq-trigger seems to be a popular way to get
information out of the kernel. However, dumping information about
thousands of processes, or hundreds of CPUs to serial console can result
in IRQs being blocked for minutes, resulting in various kinds of cascade
failures.
The most common failure is due to interrupts being blocked for a very
long time. This can lead to things like failed IO requests, and other
things the system cannot easily recover from.
This problem is easily fixable by making __handle_sysrq use RCU instead
of spin_lock_irqsave.
This leaves the warning that RCU grace periods have not elapsed for a
long time, but the system will come back from that automatically.
It also leaves sysrq handling in IRQ context when the sysrq keys are
pressed, but that is probably desired, since people want that to work in
situations where the system is already hosed.
The callers of register_sysrq_key and unregister_sysrq_key appear to be
capable of sleeping.
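For illustration, a minimal sketch of the locking pattern involved (table and
lock names are borrowed loosely from drivers/tty/sysrq.c and are illustrative,
not the exact patch): readers walk the handler table under rcu_read_lock()
instead of spin_lock_irqsave(), and the unregister path waits with
synchronize_rcu(), which is why its callers must be able to sleep.

	/* reader side, e.g. __handle_sysrq() */
	rcu_read_lock();
	op = rcu_dereference(sysrq_key_table[i]);
	if (op)
		op->handler(key);
	rcu_read_unlock();

	/* update side, e.g. unregister_sysrq_key(); may sleep */
	spin_lock(&sysrq_key_table_lock);
	rcu_assign_pointer(sysrq_key_table[i], NULL);
	spin_unlock(&sysrq_key_table_lock);
	synchronize_rcu();	/* wait for handlers still running under rcu_read_lock() */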
Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Madper Xie <cxie@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davidlohr Bueso [Fri, 6 Jun 2014 21:38:10 +0000 (14:38 -0700)]
blackfin/ptrace: call find_vma with the mmap_sem held
Performing vma lookups without taking the mm->mmap_sem is asking for
trouble. While doing the search, the vma in question can be modified or
even removed before returning to the caller. Take the lock (shared) in
order to avoid races while iterating through the vmacache and/or rbtree.
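The general pattern being enforced is roughly (a sketch, not the exact
blackfin code):

	struct vm_area_struct *vma;

	down_read(&child->mm->mmap_sem);	/* shared lock is enough for a lookup */
	vma = find_vma(child->mm, addr);
	if (vma && addr >= vma->vm_start && addr + len <= vma->vm_end)
		ret = 0;			/* range is backed by a stable vma */
	up_read(&child->mm->mmap_sem);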
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Steven Miao <realmz6@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Manfred Spraul [Fri, 6 Jun 2014 21:37:52 +0000 (14:37 -0700)]
ipc/sem.c: add a printk_once for semctl(GETNCNT/GETZCNT)
The actual Linux implementation for semctl(GETNCNT) and semctl(GETZCNT)
always (since 0.99.10) reported a thread as sleeping on all semaphores
that are listed in the semop() call.
The documented behavior (both in the Linux man page and in the Single
Unix Specification) is that a task should be reported on exactly one
semaphore: The semaphore that caused the thread to go to sleep.
This patch adds a pr_info_once() that is triggered if a thread hits the
relevant case.
The check triggers slightly too often; avoiding that would require
replicating the old code, which is not worth the bloat given that there
are no known users of GETNCNT or GETZCNT.
The task that triggered the message is reported by name (tsk->comm) and pid.
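A sketch of what such a one-time notice might look like (wording and placement
are illustrative, not the exact patch):

	pr_info_once("semctl(GETNCNT/GETZCNT) is now Single Unix Specification compliant.\n"
		     "The task %s (%d) triggered the difference; watch for misbehavior.\n",
		     current->comm, task_pid_nr(current));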
Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Manfred Spraul [Fri, 6 Jun 2014 21:37:51 +0000 (14:37 -0700)]
ipc/sem.c: make semctl(,,{GETNCNT,GETZCNT}) standard compliant
SUSv4 clearly defines how semncnt and semzcnt must be calculated: A task
waits on exactly one semaphore: The semaphore from the first operation
in the sop array that cannot proceed.
The Linux implementation never followed the standard; it tried to count
all semaphores that might be the reason why a task sleeps.
This patch fixes that.
Note:
a) The implementation assumes that GETNCNT and GETZCNT are rare operations,
therefore the code counts them only on demand.
(If they were not rare, the non-compliance would have
been found earlier.)
b) compared to the initial version of the patch, the BUG_ONs were removed
and it was clarified that the new behavior conforms to SUS.
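As a user-space illustration of the new semantics (a rough sketch, error
handling omitted): a task blocked in a two-operation semop() is now counted
only against the first operation that cannot proceed.

#include <stdio.h>
#include <unistd.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
	/* two semaphores, both initialized to 0 on Linux */
	int semid = semget(IPC_PRIVATE, 2, IPC_CREAT | 0600);
	struct sembuf sops[2] = {
		{ .sem_num = 0, .sem_op = -1, .sem_flg = 0 },
		{ .sem_num = 1, .sem_op = -1, .sem_flg = 0 },
	};

	if (fork() == 0) {
		semop(semid, sops, 2);	/* blocks; sops[0] is the op that cannot proceed */
		return 0;
	}
	sleep(1);			/* crude: let the child block first */
	/* new behavior: 1 and 0; the old code reported 1 for both semaphores */
	printf("semncnt[0]=%d semncnt[1]=%d\n",
	       semctl(semid, 0, GETNCNT), semctl(semid, 1, GETNCNT));
	return 0;
}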
Back-compatibility concerns:
Manfred:
: - there is no application in Fedora that uses GETNCNT or GETZCNT.
:
: - applications that use only single-sop semop() are also safe, the
: difference only affects complex apps.
:
: - portable applications are also safe, the new behavior is standard
: compliant.
:
: But that's it. The old behavior existed in Linux from 0.99.something
: until now.
Michael:
: * These operations seem to be very little used. Grepping the public
: source contained on the Fedora 20 source DVD, there appear to be no
: uses. Of course, this says nothing about uses in private /
: non-mainstream FOSS code, but it seems likely that the same pattern
: is followed there.
:
: * The existing behavior is hard enough to understand that I suspect
: that no one understood it well enough to rely on it anyway
: (especially as that behavior contradicted both man page and POSIX).
:
: So, there's a chance of breakage, but I estimate that it's minute.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Manfred Spraul [Fri, 6 Jun 2014 21:37:48 +0000 (14:37 -0700)]
ipc/sem.c: remove code duplication
count_semzcnt and count_semncnt are more or less identical. The patch
creates a single function that either counts the number of tasks waiting
for zero or waiting due to a decrease operation.
Compared to the initial version, the BUG_ONs were removed.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Manfred Spraul [Fri, 6 Jun 2014 21:37:42 +0000 (14:37 -0700)]
ipc/shm.c: increase the defaults for SHMALL, SHMMAX
System V shared memory
a) can be abused to trigger out-of-memory conditions and the standard
measures against out-of-memory do not work:
- it is not possible to use setrlimit to limit the size of shm segments.
- segments can exist without association with any processes, thus
the oom-killer is unable to free that memory.
b) is typically used for shared information - today often multiple GB.
(e.g. database shared buffers)
The current default is a maximum segment size of 32 MB and a maximum
total size of 8 GB. This is often too much for a) and not enough for
b), which means that lots of users must change the defaults.
This patch increases the default limits (nearly) to the maximum, which
is perfect for case b). The defaults are used after boot and as the
initial value for each new namespace.
Admins/distros that need a protection against a) should reduce the
limits and/or enable shm_rmid_forced.
Unix has historically required setting these limits for shared memory,
and Linux inherited such behavior. The consequence of this is added
complexity for users and administrators. One very common example is
database setup/installation documents and scripts, where users must
manually calculate the values for these limits. This also requires
(some) knowledge of how the underlying memory management works, thus
causing, on many occasions, the limits to be just flat-out wrong.
Disabling these limits sooner could have saved companies a lot of time,
headaches and money for support. But it's never too late, so simplify
users' lives now.
Further notes:
- The patch only changes the default; overrides behave as before:
# sysctl kernel.shmall=33554432
would recreate the previous limit for SHMMAX (for the current namespace).
- Disabling sysv shm allocation is possible with:
# sysctl kernel.shmall=0
(not a new feature, also per-namespace)
- The limits are intentionally set to a value slightly less than ULONG_MAX,
to avoid triggering overflows in user space apps.
[not unreasonable, see http://marc.info/?l=linux-mm&m=139638334330127]
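As a rough user-space illustration (not part of the patch): with the old 32 MB
SHMMAX default the shmget() below fails with EINVAL, while with the new
near-unlimited default it succeeds, memory permitting.

#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
	/* 1 GB segment: far larger than the old 32 MB SHMMAX default */
	int id = shmget(IPC_PRIVATE, 1UL << 30, IPC_CREAT | 0600);

	/* old defaults: id == -1 with errno == EINVAL; new defaults: a valid id */
	return id < 0;
}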
Manfred Spraul [Fri, 6 Jun 2014 21:37:41 +0000 (14:37 -0700)]
ipc/shm.c: check for integer overflow during shmget.
SHMMAX is the upper limit for the size of a shared memory segment, counted
in bytes. The actual allocation is that size, rounded up to the next full
page.
Add a check that prevents the creation of segments where the rounded up
size causes an integer overflow.
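A sketch of the kind of check this needs (illustrative; the real code sits in
newseg() in ipc/shm.c):

	size_t numpages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;

	/* a size close to ULONG_MAX makes the round-up wrap; reject that */
	if (numpages << PAGE_SHIFT < size)
		return -ENOSPC;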
Manfred Spraul [Fri, 6 Jun 2014 21:37:40 +0000 (14:37 -0700)]
ipc/shm.c: check for overflows of shm_tot
shm_tot counts the total number of pages used by shm segments.
If SHMALL is ULONG_MAX (or nearly ULONG_MAX), then the number can
overflow. Subsequent calls to shmctl(,SHM_INFO,) would return wrong
values for shm_tot.
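A sketch of an overflow-aware accounting check (illustrative):

	if (ns->shm_tot + numpages < ns->shm_tot ||	/* would wrap around */
	    ns->shm_tot + numpages > ns->shm_ctlall)	/* or exceed SHMALL */
		return -ENOSPC;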
Manfred Spraul [Fri, 6 Jun 2014 21:37:38 +0000 (14:37 -0700)]
ipc/shm.c: check for ulong overflows in shmat
The increase of SHMMAX/SHMALL is a 4-patch series.
The change itself is trivial; the only problem is integer overflows.
The overflows are not new, but if we make huge values the default, then
the code should be free from overflows.
SHMMAX:
- shmem_file_setup() places a hard limit on the segment size:
MAX_LFS_FILESIZE.
On 32-bit, the limit is > 1 TB, i.e. 4 GB-1 byte segments are
possible. Rounded up to full pages, the actual allocated size
overflows to 0. --> must be fixed, patch 3
- shmat:
- find_vma_intersection does not handle overflows properly.
--> must be fixed, patch 1
- the rest is fine, do_mmap_pgoff limits mappings to TASK_SIZE
and checks for overflows (i.e.: map 2 GB, starting from
addr=2.5GB fails).
SHMALL:
- after creating 8192 segments of size (1L<<63)-1, shm_tot overflows and
returns 0. --> must be fixed, patch 2.
Userspace:
- Obviously, there could be overflows in userspace. There is nothing
we can do, only use values smaller than ULONG_MAX.
I ended with "ULONG_MAX - 1L<<24":
- TASK_SIZE cannot be used because it is the size of the current
task. Could be 4G if it's a 32-bit task on a 64-bit kernel.
- The maximum size is not standardized across archs:
I found TASK_MAX_SIZE, TASK_SIZE_MAX and TASK_SIZE_64.
- Just in case some arch revives a 4G/4G split, nearly
ULONG_MAX is a valid segment size.
- Using "0" as a magic value for infinity is even worse, because
right now 0 means 0, i.e. fail all allocations.
This patch (of 4):
find_vma_intersection() does not work as intended if addr+size overflows.
The patch adds a manual check before the call to find_vma_intersection.
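A sketch of such a manual check (illustrative; in the real code it belongs in
do_shmat() before the lookup):

	err = -EINVAL;
	if (addr + size < addr)		/* addr + size wrapped around */
		goto invalid;
	if (find_vma_intersection(current->mm, addr, addr + size))
		goto invalid;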
Mathias Krause [Fri, 6 Jun 2014 21:37:36 +0000 (14:37 -0700)]
ipc: constify ipc_ops
There is no need to recreate the very same ipc_ops structure on every
kernel entry for msgget/semget/shmget. Just declare it static and be
done with it. While at it, constify it as we don't modify the structure
at runtime.
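Roughly, the change looks like this (a sketch using the shmget() case):

	/* previously rebuilt on the stack at every syscall entry; now: */
	static const struct ipc_ops shm_ops = {
		.getnew		= newseg,
		.associate	= shm_security,
		.more_checks	= shm_more_checks,
	};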
Paul Bolle [Fri, 6 Jun 2014 21:37:35 +0000 (14:37 -0700)]
initramfs: remove "compression mode" choice
Commit 9ba4bcb64589 ("initramfs: read CONFIG_RD_ variables for initramfs
compression") removed the users of the various INITRAMFS_COMPRESSION_*
Kconfig symbols. So since v3.13 the entire "Built-in initramfs
compression mode" choice is a set of knobs connected to nothing. The
entire choice can safely be removed.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Cc: P J P <ppandit@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Also convert spaces to tabs (checkpatch warnings). The if (!dentry)
KERN_NOTICE message is converted to pr_err() (matching the if (!inode)
error handling).
[akpm@linux-foundation.org: use KBUILD_MODNAME, per Joe] Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Joe Perches <joe@perches.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Define pr_fmt in platform.c and ram_core.c for global prefix.
- Coalesce format fragments.
- Separate format/arguments on lines > 80 characters.
Note: Some pr_foo() calls were initially declared without a prefix, so
this could break existing log analyzers.
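The usual pattern is (sketch):

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt	/* must precede the #includes */

#include <linux/kernel.h>
#include <linux/printk.h>

	/* later, inside a function: */
	pr_err("uncorrectable error in header\n");	/* automatically prefixed with the module name */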
[akpm@linux-foundation.org: missed a couple of prefix removals] Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Joe Perches <joe@perches.com> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds several behavioral tests for sysctl string and number writing
to detect cases that behave unexpectedly when the sysctl
kernel.sysctl_writes_strict != 1.
[ original ]
root@localhost:~# make test_num
== Testing sysctl behavior against /proc/sys/kernel/domainname ==
Writing test file ... ok
Checking sysctl is not set to test value ... ok
Writing sysctl from shell ... ok
Resetting sysctl to original value ... ok
Writing entire sysctl in single write ... ok
Writing middle of sysctl after synchronized seek ... FAIL
Writing beyond end of sysctl ... FAIL
Writing sysctl with multiple long writes ... FAIL
Writing entire sysctl in short writes ... FAIL
Writing middle of sysctl after unsynchronized seek ... ok
Checking sysctl maxlen is at least 65 ... ok
Checking sysctl keeps original string on overflow append ... FAIL
Checking sysctl stays NULL terminated on write ... ok
Checking sysctl stays NULL terminated on overwrite ... ok
make: *** [test_num] Error 1
root@localhost:~# make test_string
== Testing sysctl behavior against /proc/sys/vm/swappiness ==
Writing test file ... ok
Checking sysctl is not set to test value ... ok
Writing sysctl from shell ... ok
Resetting sysctl to original value ... ok
Writing entire sysctl in single write ... ok
Writing middle of sysctl after synchronized seek ... FAIL
Writing beyond end of sysctl ... FAIL
Writing sysctl with multiple long writes ... ok
make: *** [test_string] Error 1
[ with CONFIG_PROC_SYSCTL_STRICT_WRITES ]
root@localhost:~# make run_tests
== Testing sysctl behavior against /proc/sys/kernel/domainname ==
Writing test file ... ok
Checking sysctl is not set to test value ... ok
Writing sysctl from shell ... ok
Resetting sysctl to original value ... ok
Writing entire sysctl in single write ... ok
Writing middle of sysctl after synchronized seek ... ok
Writing beyond end of sysctl ... ok
Writing sysctl with multiple long writes ... ok
Writing entire sysctl in short writes ... ok
Writing middle of sysctl after unsynchronized seek ... ok
Checking sysctl maxlen is at least 65 ... ok
Checking sysctl keeps original string on overflow append ... ok
Checking sysctl stays NULL terminated on write ... ok
Checking sysctl stays NULL terminated on overwrite ... ok
== Testing sysctl behavior against /proc/sys/vm/swappiness ==
Writing test file ... ok
Checking sysctl is not set to test value ... ok
Writing sysctl from shell ... ok
Resetting sysctl to original value ... ok
Writing entire sysctl in single write ... ok
Writing middle of sysctl after synchronized seek ... ok
Writing beyond end of sysctl ... ok
Writing sysctl with multiple long writes ... ok
Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Fri, 6 Jun 2014 21:37:19 +0000 (14:37 -0700)]
sysctl: allow for strict write position handling
When writing to a sysctl string, each write, regardless of VFS position,
begins writing the string from the start. This means the contents of
the last write to the sysctl controls the string contents instead of the
first:
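For example (a user-space sketch; /proc/sys/kernel/modprobe is the string
sysctl referenced below):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/proc/sys/kernel/modprobe", O_WRONLY);

	write(fd, "AAAA", 4);	/* file position advances to 4 ... */
	write(fd, "BB", 2);	/* ... but legacy sysctl writing restarts at offset 0 */
	close(fd);
	/* legacy result: "BB"; position-respecting result: "AAAABB" */
	return 0;
}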
Expected behaviour would be to have the sysctl be "AAAA..." capped at
maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
contents of the second write. Similarly, multiple short writes would
not append to the sysctl.
The old behavior is unlike regular POSIX files enough that doing audits
of software that interact with sysctls can end up in unexpected or
dangerous situations. For example, "as long as the input starts with a
trusted path" turns out to be an insufficient filter, as what must also
happen is for the input to be entirely contained in a single write
syscall -- not a common consideration, especially for high level tools.
This provides kernel.sysctl_writes_strict as a way to make this behavior
act in a less surprising manner for strings, and disallows non-zero file
position when writing numeric sysctls (similar to what is already done
when reading from non-zero file positions). For now, the default (0) is
to warn about non-zero file position use, but retain the legacy
behavior. Setting this to -1 disables the warning, and setting this to
1 enables the file position respecting behavior.
[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: move misplaced hunk, per Randy] Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Fri, 6 Jun 2014 21:37:18 +0000 (14:37 -0700)]
sysctl: refactor sysctl string writing logic
Consolidate buffer length checking with new-line/end-of-line checking.
Additionally, instead of reading user memory twice, just do the
assignment during the loop.
This change doesn't affect the potential races here. It was already
possible to read a sysctl that was in the middle of a write. In both
cases, the string will always be NULL terminated. The pre-existing race
remains a problem to be solved.
Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Fri, 6 Jun 2014 21:37:17 +0000 (14:37 -0700)]
sysctl: clean up char buffer arguments
When writing to a sysctl string, each write, regardless of VFS position,
began writing the string from the start. This meant the contents of the
last write to the sysctl controlled the string contents instead of the
first.
This misbehavior was featured in an exploit against Chrome OS. While
it's not in itself a vulnerability, it's a weirdness that isn't on the
mind of most auditors: "This filter looks correct, the first line
written would not be meaningful to sysctl" doesn't apply here, since the
size of the write and the contents of the final write are what matter
when writing to sysctls.
This adds the sysctl kernel.sysctl_writes_strict to control the write
behavior. The default (0) reports when VFS position is non-0 on a
write, but retains legacy behavior, -1 disables the warning, and 1
enables the position-respecting behavior.
The long-term plan here is to wait for userspace to be fixed in response
to the new warning and to then switch the default kernel behavior to the
new position-respecting behavior.
This patch (of 4):
The char buffer arguments are needlessly cast in weird places. Clean it
up so things are easier to read.
Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
rapidio/tsi721: use pci_enable_msix_exact() instead of pci_enable_msix()
As a result of the deprecation of the MSI-X/MSI enablement functions
pci_enable_msix() and pci_enable_msi_block(), all drivers using these two
interfaces need to be updated to use the new pci_enable_msi_range() or
pci_enable_msi_exact() and pci_enable_msix_range() or
pci_enable_msix_exact() interfaces.
The patch has no runtime effect.
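The conversion is mechanical; roughly (pdev, entries and nvec are
placeholders):

	/* before: pci_enable_msix() could also return a positive
	 * "only this many vectors available" count */
	err = pci_enable_msix_exact(pdev, entries, nvec);
	if (err)		/* all-or-nothing: 0 on success, negative errno otherwise */
		return err;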
Signed-off-by: Alexander Gordeev <agordeev@redhat.com> Cc: Matt Porter <mporter@kernel.crashing.org> Acked-by: Alexandre Bounine <alexandre.bounine@idt.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Fri, 6 Jun 2014 21:37:13 +0000 (14:37 -0700)]
idr: don't need to shrink the free list when idr_remove()
After the idr subsystem was changed to be RCU-aware, freed layers no
longer go to the free list, so the free list is not filled up by
idr_remove() and there is no need to shrink it there.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Fri, 6 Jun 2014 21:37:13 +0000 (14:37 -0700)]
idr: fix idr_replace()'s returned error code
When the smaller id is not found, idr_replace() returns -ENOENT, but
when the id is big enough, idr_replace() returns -EINVAL. There is
actually no difference between these two kinds of ids: both are
unallocated, so idr_replace() should return the same value for them:
-ENOENT.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Fri, 6 Jun 2014 21:37:12 +0000 (14:37 -0700)]
idr: fix NULL pointer dereference when ida_remove(unallocated_id)
If the ida has at least one existing id, and an unallocated ID which
meets a certain condition is passed to ida_remove(), the system crashes
on a NULL pointer dereference.
The condition is that the unallocated ID shares the same lowest idr
layer with the existing ID, but the idr slot would be different if the
unallocated ID were to be allocated.
In this case the matching idr slot for the unallocated_id is NULL,
causing @bitmap to be NULL, which the function dereferences without
checking, crashing the kernel.
See the test code:
static void test3(void)
{
	int id;
	DEFINE_IDA(test_ida);

	printk(KERN_INFO "Start test3\n");
	if (!ida_pre_get(&test_ida, GFP_KERNEL))
		return;
	if (ida_get_new(&test_ida, &id) < 0)
		return;
	ida_remove(&test_ida, 4000); /* bug: NULL dereference here */
	printk(KERN_INFO "End of test3\n");
}
It happens only when the caller tries to free an unallocated ID which is
the caller's fault. It is not a bug. But it is better to add the
proper check and complain rather than crashing the kernel.
[tj@kernel.org: updated patch description] Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It happens only when the caller tries to free an unallocated ID which is
the caller's fault. It is not a bug. But it is better to add the
proper check and complain rather than removing an existing_id silently.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lai Jiangshan [Fri, 6 Jun 2014 21:37:10 +0000 (14:37 -0700)]
idr: fix overflow bug during maximum ID calculation at maximum height
idr_replace() open-codes the logic to calculate the maximum valid ID
given the height of the idr tree; unfortunately, the open-coded logic
doesn't account for the fact that the top layer may have unused slots
and over-shifts the limit to zero when the tree is at its maximum
height.
The following test code shows it fails to replace the value for
id=((1<<27)+42):
static void test5(void)
{
	int id;
	DEFINE_IDR(test_idr);
#define TEST5_START ((1<<27)+42) /* use the highest layer */
Fix the bug by using idr_max() which correctly takes into account the
maximum allowed shift.
sub_alloc() shares the same problem and may incorrectly fail with
-EAGAIN; however, this bug doesn't affect correct operation because
idr_get_empty_slot(), which already uses idr_max(), retries with the
increased @id in such cases.
kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump after panic_notifiers
Add a "crash_kexec_post_notifiers" boot option to run kdump after
running panic_notifiers and dumping kmsg. This can help in rare situations
where kdump fails because of unstable crashed kernel or hardware failure
(memory corruption on critical data/code), or the 2nd kernel is already
broken by the 1st kernel (it's a broken behavior, but who can guarantee
that the "crashed" kernel works correctly?).
Usage: add "crash_kexec_post_notifiers" to kernel boot option.
Note that this actually increases the risk of kdump failing. This
option should be set only if you worry about the rare case of kdump
failure rather than about maximizing its chance of success.
smp: print more useful debug info upon receiving IPI on an offline CPU
There is a longstanding problem related to CPU hotplug which causes IPIs
to be delivered to offline CPUs, and the smp-call-function IPI handler
code prints out a warning whenever this is detected. Every once in a
while this (usually harmless) warning gets reported on LKML, but so far
it has not been completely fixed. Usually the solution involves finding
out the IPI sender and fixing it by adding appropriate synchronization
with CPU hotplug.
However, while going through one such internal bug report, I found that
there is a significant bug in the receiver side itself (more
specifically, in stop-machine) that can lead to this problem even when
the sender code is perfectly fine. This patchset fixes that
synchronization problem in the CPU hotplug stop-machine code.
Patch 1 adds some additional debug code to the smp-call-function
framework, to help debug such issues easily.
Patch 2 modifies the stop-machine code to ensure that any IPIs that were
sent while the target CPU was online, would be noticed and handled by
that CPU without fail before it goes offline. Thus, this avoids
scenarios where IPIs are received on offline CPUs (as long as the sender
uses proper hotplug synchronization).
In fact, I debugged the problem by using Patch 1, and found that the
payload of the IPI was always the block layer's trigger_softirq()
function. But I was not able to find anything wrong with the block
layer code. That's when I started looking at the stop-machine code and
realized that there is a race-window which makes the IPI _receiver_ the
culprit, not the sender. Patch 2 fixes that race and hence this should
put an end to most of the hard-to-debug IPI-to-offline-CPU issues.
This patch (of 2):
Today the smp-call-function code just prints a warning if we get an IPI
on an offline CPU. This info is sufficient to let us know that
something went wrong, but often it is very hard to debug exactly who
sent the IPI and why, from this info alone.
In most cases, we get the warning about the IPI to an offline CPU,
immediately after the CPU going offline comes out of the stop-machine
phase and reenables interrupts. Since all online CPUs participate in
stop-machine, the information regarding the sender of the IPI is already
lost by the time we exit the stop-machine loop. So even if we dump the
stack on each CPU at this point, we won't find anything useful since all
of them will show the stack-trace of the stopper thread. So we need a
better way to figure out who sent the IPI and why.
To achieve this, when we detect an IPI targeted to an offline CPU, loop
through the call-single-data linked list and print out the payload
(i.e., the name of the function which was supposed to be executed by the
target CPU). This would give us an insight as to who might have sent
the IPI and help us debug this further.
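Conceptually, the added debug code does something like this (a sketch; queue
and member names follow kernel/smp.c loosely):

	struct llist_node *entry = llist_del_all(&head->llist);	/* pending requests */
	struct call_single_data *csd;

	/* name the function each pending IPI wanted to run on this (offline) CPU */
	llist_for_each_entry(csd, entry, llist)
		pr_warn("IPI callback %pf sent to offline CPU\n", csd->func);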
[akpm@linux-foundation.org: correctly suppress warning output on second and later occurrences] Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <mgalbraith@suse.de> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Fri, 6 Jun 2014 21:37:00 +0000 (14:37 -0700)]
signals: introduce kernel_sigaction()
Now that allow_signal() is really trivial we can unify it with
disallow_signal(). Add the new helper, kernel_sigaction(), and
reimplement allow_signal/disallow_signal as trivial wrappers.
This saves one EXPORT_SYMBOL() and the new helper can have more users.
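The wrappers then become roughly (a sketch of the idea; see
include/linux/signal.h for the real definitions):

static inline void disallow_signal(int sig)
{
	kernel_sigaction(sig, SIG_IGN);
}

static inline void allow_signal(int sig)
{
	/* magic "handled by the kernel thread itself" handler value */
	kernel_sigaction(sig, (__force __sighandler_t)2);
}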
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Fri, 6 Jun 2014 21:36:58 +0000 (14:36 -0700)]
signals: disallow_signal() should flush the potentially pending signal
disallow_signal() simply sets SIG_IGN; this is not enough, and
recalc_sigpending() is pointless because it can never change the state
of TIF_SIGPENDING.
If we ignore a signal, we also need to do flush_sigqueue_mask() for the
case when this signal is pending; this way recalc_sigpending() can
actually clear TIF_SIGPENDING and we do not "leak" the allocated
siginfos.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Fri, 6 Jun 2014 21:36:57 +0000 (14:36 -0700)]
signals: kill the obsolete sigdelset() and recalc_sigpending() in allow_signal()
allow_signal() does sigdelset(current->blocked) for historical reasons:
previously it could be called by a daemonize()'d kthread, and
daemonize() played with current->blocked.
Now that daemonize() has gone away we can remove the sigdelset() and
recalc_sigpending(). If a user really wants to unblock a signal, it
must use sigprocmask() or set_current_blocked() explicitly.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Fri, 6 Jun 2014 21:36:55 +0000 (14:36 -0700)]
signals: jffs2: fix the wrong usage of disallow_signal()
jffs2_garbage_collect_thread() does disallow_signal(SIGHUP) around
jffs2_garbage_collect_pass() and the comment says "We don't want SIGHUP
to interrupt us".
But disallow_signal() can't ensure that jffs2_garbage_collect_pass()
won't be interrupted by SIGHUP: the problem is that SIGHUP can already
be pending when disallow_signal() is called, and in this case any
interruptible sleep won't block.
Note: this is in fact because disallow_signal() is buggy and should be
fixed, see the next changes.
But there is another reason why disallow_signal() is wrong: SIG_IGN set
by disallow_signal() silently discards any SIGHUP which can be sent
before the next allow_signal(SIGHUP).
Change this code to use sigprocmask(SIG_UNBLOCK/SIG_BLOCK, SIGHUP).
This even matches the old (and wrong) semantics allow/disallow had when
this logic was written.
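The resulting pattern looks roughly like this (a sketch using the in-kernel
sigprocmask()/siginitset() helpers):

	sigset_t hupmask;

	siginitset(&hupmask, sigmask(SIGHUP));

	/* keep SIGHUP queued, but not deliverable, while collecting */
	sigprocmask(SIG_BLOCK, &hupmask, NULL);
	jffs2_garbage_collect_pass(c);
	sigprocmask(SIG_UNBLOCK, &hupmask, NULL);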
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Fri, 6 Jun 2014 21:36:53 +0000 (14:36 -0700)]
signals: mv {dis,}allow_signal() from sched.h/exit.c to signal.[ch]
Move the declaration/definition of allow_signal/disallow_signal to
signal.h/signal.c. The new place is more logical and makes it possible to use the
static helpers in signal.c (see the next changes).
While at it, make them return void and remove the valid_signal() check.
Nobody checks the returned value, and in-kernel users must not pass the
wrong signal number.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Fri, 6 Jun 2014 21:36:51 +0000 (14:36 -0700)]
signals: cleanup the usage of t/current in do_sigaction()
The usage of "task_struct *t" and "current" in do_sigaction() looks really
annoying and chaotic. Initially "t" is used as a cached value of current
but not consistently; then it is reused as a loop variable and we have to
use "current" again.
Clean up this mess and also convert the code to use for_each_thread().
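The flushing loop then reads roughly (a sketch; flush_sigqueue_mask() is
introduced elsewhere in this series):

	struct task_struct *t;
	sigset_t mask;

	/* when the new action ignores the signal, drop already-queued instances */
	sigemptyset(&mask);
	sigaddset(&mask, sig);
	flush_sigqueue_mask(&mask, &current->signal->shared_pending);
	for_each_thread(current, t)
		flush_sigqueue_mask(&mask, &t->pending);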
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Fri, 6 Jun 2014 21:36:50 +0000 (14:36 -0700)]
signals: rename rm_from_queue_full() to flush_sigqueue_mask()
"rm_from_queue_full" looks ugly and misleading, especially now that
rm_from_queue() has gone away. Rename it to flush_sigqueue_mask(), this
matches flush_sigqueue() we already have.
Also remove the obsolete comment which explains the difference from
rm_from_queue(), which we already killed.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Weinberger <richard@nod.at> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Matthew Dempsky [Fri, 6 Jun 2014 21:36:42 +0000 (14:36 -0700)]
ptrace: fix fork event messages across pid namespaces
When tracing a process in another pid namespace, it's important for fork
event messages to contain the child's pid as seen from the tracer's pid
namespace, not the parent's. Otherwise, the tracer won't be able to
correlate the fork event with later SIGTRAP signals it receives from the
child.
We still risk a race condition if a ptracer from a different pid
namespace attaches after we compute the pid_t value. However, sending a
bogus fork event message in this unlikely scenario is still a vast
improvement over the status quo where we always send bogus fork event
messages to debuggers in a different pid namespace than the forking
process.
Signed-off-by: Matthew Dempsky <mdempsky@chromium.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Julien Tinnes <jln@chromium.org> Cc: Roland McGrath <mcgrathr@chromium.org> Cc: Jan Kratochvil <jan.kratochvil@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The article linked in seq_file.txt still uses create_proc_entry(), which was
removed in commit 80e928f7ebb9 ("proc: Kill create_proc_entry()").
This patch adds information for kernel 3.10 and above.
Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jacob Keller [Fri, 6 Jun 2014 21:36:39 +0000 (14:36 -0700)]
Documentation/SubmittingPatches: describe the Fixes: tag
Update the SubmittingPatches process to include howto about the new
'Fixes:' tag to be used when a patch fixes an issue in a previous commit
(found by git-bisect for example).
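The tag is an abbreviated (at least 12-character) SHA-1 of the offending
commit plus its quoted subject line, e.g. a line of this shape:

	Fixes: e21d2170f366 ("video: remove unnecessary platform_set_drvdata()")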
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs/fat/inode.c: clean up string initializations (char[] instead of char *)
Initializations like 'char *foo = "bar"' will create two variables: a
static string and a pointer (foo) to that static string. Instead 'char
foo[] = "bar"' will declare a single variable and will end up in shorter
assembly (according to Jeff Garzik on the KernelJanitor's TODO list).
Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conrad Meyer [Fri, 6 Jun 2014 21:36:37 +0000 (14:36 -0700)]
fs/fat/: add support for DOS 1.x formatted volumes
Add structure for parsed BPB information, struct fat_bios_param_block,
and move all of the deserialization and validation logic from
fat_fill_super() into fat_read_bpb().
Add a 'dos1xfloppy' mount option to infer DOS 2.x BIOS Parameter Block
defaults from block device geometry for ancient floppies and floppy
images, as a fall-back from the default BPB parsing logic.
When fat_read_bpb() finds an invalid FAT filesystem and dos1xfloppy is
set, fall back to fat_read_static_bpb(). fat_read_static_bpb()
validates that the entire BPB is zero, and that the floppy has a
DOS-style 8086 code bootstrapping header. Then it fills in default BPB
values from media size and a table.[0]
Media size is assumed to be static for archaic FAT volumes. See also:
[1].
Christian Kujau [Fri, 6 Jun 2014 21:36:32 +0000 (14:36 -0700)]
hfsplus: fix compiler warning on PowerPC
Commit a99b7069aab8 ("hfsplus: Fix undefined __divdi3 in
hfsplus_init_header_node()") introduced do_div() to xattr.c and the
warning below too.
As Geert remarked: "tmp" is "loff_t" which is "__kernel_loff_t", which
is "long long", i.e. signed, while include/asm-generic/div64.h compares
its type with "uint64_t". As inode sizes are positive, it should be
safe to change the type of "tmp" to "u64".
In file included from
arch/powerpc/include/asm/div64.h:1:0,
from include/linux/kernel.h:124,
from include/asm-generic/bug.h:13,
from arch/powerpc/include/asm/bug.h:127,
from include/linux/bug.h:4,
from include/linux/thread_info.h:11,
from include/asm-generic/preempt.h:4,
from arch/powerpc/include/generated/asm/preempt.h:1,
from include/linux/preempt.h:18,
from include/linux/spinlock.h:50,
from include/linux/wait.h:8,
from include/linux/fs.h:6,
from fs/hfsplus/hfsplus_fs.h:19,
from fs/hfsplus/xattr.c:9:
fs/hfsplus/xattr.c: In function 'hfsplus_init_header_node':
include/asm-generic/div64.h:43:28: warning: comparison of distinct pointer types lacks a cast [enabled by default]
(void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
^
fs/hfsplus/xattr.c:86:2: note: in expansion of macro 'do_div'
do_div(tmp, node_size);
^
Signed-off-by: Christian Kujau <lists@nerdbynature.de> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Sergei Antonov <saproj@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sergei Antonov [Fri, 6 Jun 2014 21:36:30 +0000 (14:36 -0700)]
hfsplus: coding style fix for declarations in hfsplus_fs.h
Some function declarations in hfsplus_fs.h were with argument names,
some without, and some were mixed. This patch adds argument names
everywhere, sorts functions in the order they appear in .c files, and moves
hfs_part_find() to a proper section.
Auto-formatting and sorting was done with:
cfunctions *.c | indent -linux | sed "s| \* | \*|"
Signed-off-by: Sergei Antonov <saproj@gmail.com> Cc: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Hin-Tak Leung <htl10@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sergei Antonov [Fri, 6 Jun 2014 21:36:28 +0000 (14:36 -0700)]
hfsplus: fix "unused node is not erased" error
Zero newly allocated extents in the catalog tree if volume attributes
tell us to. Not doing so we risk getting the "unused node is not
erased" error. See kHFSUnusedNodeFix flag in Apple's source code for
reference.
There was a previous commit clearing the node when it is freed: commit 899bed05e9f6 ("hfsplus: fix issue with unzeroed unused b-tree nodes").
But it did not handle newly allocated extents (this patch fixes it).
And it zeroed nodes in all trees unconditionally, which is overkill.
This patch adds a condition and also switches to 'tree->node_size' as a
simpler method of getting the length to zero.
Signed-off-by: Sergei Antonov <saproj@gmail.com> Cc: Anton Altaparmakov <aia21@cam.ac.uk> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Hin-Tak Leung <htl10@users.sourceforge.net> Cc: Kyle Laracey <kalaracey@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>