Shaohua Li [Wed, 24 Apr 2013 01:42:43 +0000 (11:42 +1000)]
raid5: make release_stripe lockless
release_stripe still has big lock contention. We just add the stripe to a llist
without taking device_lock. We let the raid5d thread to do the real stripe
release, which must hold device_lock anyway. In this way, release_stripe
doesn't hold any locks.
The side effect is the released stripes order is changed. But sounds not a big
deal, stripes are never handled in order. And I thought block layer can already
do nice request merge, which means order isn't that important.
I kept the unplug release batch, which is unnecessary with this patch from lock
contention avoid point of view, and actually if we delete it, the stripe_head
release_list and lru can share storage. But the unplug release batch is also
helpful for request merge. We probably can delay wakeup raid5d till unplug, but
I'm still afraid of the case which raid5d is running.
Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
Shaohua Li [Wed, 24 Apr 2013 01:42:43 +0000 (11:42 +1000)]
raid5: create multiple threads to handle stripes
This is a new tempt to make raid5 handle stripes in multiple threads, as
suggested by Neil to have maxium flexibility and better numa binding. It
basically is a combination of my first and second generation patches. By
default, no multiple thread is enabled (all stripes are handled by raid5d).
An example to enable multiple threads:
#echo 3 > /sys/block/md0/md/auxthread_number
This will create 3 auxiliary threads to handle stripes. The threads can run
on any cpus and handle stripes produced by any cpus.
#echo 1-3 > /sys/block/md0/md/auxth0/cpulist
This will bind auxiliary thread 0 to cpu 1-3, and this thread will only handle
stripes produced by cpu 1-3. User tool can further change the thread's
affinity, but the thread can only handle stripes produced by cpu 1-3 till the
sysfs entry is changed again.
If stripes produced by a CPU aren't handled by any auxiliary thread, such
stripes will be handled by raid5d. Otherwise, raid5d doesn't handle any
stripes.
Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Tue, 11 Jun 2013 05:08:03 +0000 (15:08 +1000)]
md/raid10: check In_sync flag in 'enough()'.
It isn't really enough to check that the rdev is present, we need to
also be sure that the device is still In_sync.
Doing this requires using rcu_dereference to access the rdev, and
holding the rcu_read_lock() to ensure the rdev doesn't disappear while
we look at it.
NeilBrown [Tue, 11 Jun 2013 04:57:09 +0000 (14:57 +1000)]
md/raid10: locking changes for 'enough()'.
As 'enough' accesses conf->prev and conf->geo, which can change
spontanously, it should guard against changes.
This can be done with device_lock as start_reshape holds device_lock
while updating 'geo' and end_reshape holds it while updating 'prev'.
So 'error' needs to hold 'device_lock'.
On the other hand, raid10_end_read_request knows which of the two it
really wants to access, and as it is an active request on that one,
the value cannot change underneath it.
So change _enough to take flag rather than a pointer, pass the
appropriate flag from raid10_end_read_request(), and remove the locking.
All other calls to 'enough' are made with reconfig_mutex held, so
neither 'prev' nor 'geo' can change.
Hannes Reinecke [Tue, 2 Apr 2013 06:38:55 +0000 (08:38 +0200)]
md: Wait for md_check_recovery before attempting device removal.
When a device has failed, it needs to be removed from the personality
module before it can be removed from the array as a whole.
The first step is performed by md_check_recovery() which is called
from the raid management thread.
So when a HOT_REMOVE ioctl arrives, wait briefly for md_check_recovery
to have run. This increases the chance that the ioctl will succeed.
Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Neil Brown <nfbrown@suse.de>
DM RAID: Fix raid_resume not reviving failed devices in all cases
DM RAID: Fix raid_resume not reviving failed devices in all cases
When a device fails in a RAID array, it is marked as Faulty. Later,
md_check_recovery is called which (through the call chain) calls
'hot_remove_disk' in order to have the personalities remove the device
from use in the array.
Sometimes, it is possible for the array to be suspended before the
personalities get their chance to perform 'hot_remove_disk'. This is
normally not an issue. If the array is deactivated, then the failed
device will be noticed when the array is reinstantiated. If the
array is resumed and the disk is still missing, md_check_recovery will
be called upon resume and 'hot_remove_disk' will be called at that
time. However, (for dm-raid) if the device has been restored,
a resume on the array would cause it to attempt to revive the device
by calling 'hot_add_disk'. If 'hot_remove_disk' had not been called,
a situation is then created where the device is thought to concurrently
be the replacement and the device to be replaced. Thus, the device
is first sync'ed with the rest of the array (because it is the replacement
device) and then marked Faulty and removed from the array (because
it is also the device being replaced).
The solution is to check and see if the device had properly been removed
before the array was suspended. This is done by seeing whether the
device's 'raid_disk' field is -1 - a condition that implies that
'md_check_recovery -> remove_and_add_spares (where raid_disk is set to -1)
-> hot_remove_disk' has been called. If 'raid_disk' is not -1, then
'hot_remove_disk' must be called to complete the removal of the previously
faulty device before it can be revived via 'hot_add_disk'.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
DM RAID: Add ability to restore transiently failed devices on resume
DM RAID: Add ability to restore transiently failed devices on resume
This patch adds code to the resume function to check over the devices
in the RAID array. If any are found to be marked as failed and their
superblocks can be read, an attempt is made to reintegrate them into
the array. This allows the user to refresh the array with a simple
suspend and resume of the array - rather than having to load a
completely new table, allocate and initialize all the structures and
throw away the old instantiation.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
Linus Torvalds [Thu, 13 Jun 2013 20:09:50 +0000 (13:09 -0700)]
Merge tag 'acpi-3.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fix from Rafael Wysocki:
"This is an alternative fix for the regression introduced in 3.9 whose
previous fix had to be reverted right before 3.10-rc5, because it
broke one of the Tony's machines.
In this one the check is confined to the ACPI video driver (which is
the only one causing the problem to happen in the first place) and the
Tony's box shouldn't even notice it.
- ACPI fix for an issue causing ACPI video driver to attempt to bind
to devices it shouldn't touch from Rafael J Wysocki."
* tag 'acpi-3.10-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / video: Do not bind to device objects with a scan handler
Linus Torvalds [Thu, 13 Jun 2013 20:08:51 +0000 (13:08 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:
"Another set of fixes, the biggest bit of this is yet another tweak to
the UEFI anti-bricking code; apparently we finally got some feedback
from Samsung as to what makes at least their systems fail. This set
should actually fix the boot regressions that some other systems (e.g.
SGI) have exhibited.
Other than that, there is a patch to avoid a panic with particularly
unhappy memory layouts and two minor protocol fixes which may or may
not be manifest bugs"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Fix typo in kexec register clearing
x86, relocs: Move __vvar_page from S_ABS to S_REL
Modify UEFI anti-bricking code
x86: Fix adjust_range_size_mask calling position
Linus Torvalds [Thu, 13 Jun 2013 19:36:42 +0000 (12:36 -0700)]
Merge branch 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fixes from Paul McKenney:
"I must confess that this past merge window was not RCU's best showing.
This series contains three more fixes for RCU regressions:
1. A fix to __DECLARE_TRACE_RCU() that causes it to act as an
interrupt from idle rather than as a task switch from idle.
This change is needed due to the recent use of _rcuidle()
tracepoints that can be invoked from interrupt handlers as well
as from idle. Without this fix, invoking _rcuidle() tracepoints
from interrupt handlers results in splats and (more seriously)
confusion on RCU's part as to whether a given CPU is idle or not.
This confusion can in turn result in too-short grace periods and
therefore random memory corruption.
2. A fix to a subtle deadlock that could result due to RCU doing
a wakeup while holding one of its rcu_node structure's locks.
Although the probability of occurrence is low, it really
does happen. The fix, courtesy of Steven Rostedt, uses
irq_work_queue() to avoid the deadlock.
3. A fix to a silent deadlock (invisible to lockdep) due to the
interaction of timeouts posted by RCU debug code enabled by
CONFIG_PROVE_RCU_DELAY=y, grace-period initialization, and CPU
hotplug operations. This will not occur in production kernels,
but really does occur in randconfig testing. Diagnosis courtesy
of Steven Rostedt"
* 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
rcu: Fix deadlock with CPU hotplug, RCU GP init, and timer migration
rcu: Don't call wakeup() with rcu_node structure ->lock held
trace: Allow idle-safe tracepoints to be called from irq
Linus Torvalds [Thu, 13 Jun 2013 18:02:31 +0000 (11:02 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Martin Schwidefsky:
"Three kvm related memory management fixes, a fix for show_trace, a fix
for early console output and a patch from Ben to help prevent compile
errors in regard to irq functions (or our lack thereof)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/pci: Implement IRQ functions if !PCI
s390/sclp: fix new line detection
s390/pgtable: make pgste lock an explicit barrier
s390/pgtable: Save pgste during modify_prot_start/commit
s390/dumpstack: fix address ranges for asynchronous and panic stack
s390/pgtable: Fix guest overindication for change bit
Linus Torvalds [Thu, 13 Jun 2013 17:18:33 +0000 (10:18 -0700)]
Merge tag 'asoc-v3.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound
Pull ASoC sound updates from Mark Brown:
"Takashi is travelling at the minute and it'd be good to get the
MAINTAINERS update in here merged so sending directly.
As well as the usual driver specifics we've got a couple of core fixes
here, one fixing capabilities for unidirectional streams and the other
fixing suspend while audio streams are active.
The suspend fix is a little involved but mostly as a result of
removing some special casing that was doing the wrong thing."
* tag 'asoc-v3.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound:
ASoC: tlv320aic3x: Remove deadlock from snd_soc_dapm_put_volsw_aic3x()
ASoC: dapm: Treat DAI widgets like AIF widgets for power
ASoC: arizona: Correct AEC loopback enable
ASoC: pcm: Require both CODEC and CPU support when declaring stream caps
MAINTAINERS: Remove myself from Wolfson maintainers
ASoC: wm8994: Ensure microphone detection state is reset on removal
ASoC: wm8994: Avoid leaking pm_runtime reference on removed jack race
ASoC: cs42l52: fix hp_gain_enum shift value.
ASoC: cs42l52: use correct PCM mixer TLV dB scale to match datasheet.
Linus Torvalds [Thu, 13 Jun 2013 17:13:29 +0000 (10:13 -0700)]
Merge tag 'md-3.10-fixes' of git://neil.brown.name/md
Pull md bugfixes from Neil Brown:
"A few bugfixes for md
Some tagged for -stable"
* tag 'md-3.10-fixes' of git://neil.brown.name/md:
md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
md: md_stop_writes() should always freeze recovery.
Josh Triplett [Thu, 13 Jun 2013 00:26:37 +0000 (17:26 -0700)]
turbostat: Increase output buffer size to accommodate C8-C10
On platforms with C8-C10 support, the additional C-states cause
turbostat to overrun its output buffer of 128 bytes per CPU. Increase
this to 256 bytes per CPU.
[ As a bugfix, this should go into 3.10; however, since the C8-C10
support didn't go in until after 3.9, this need not go into any stable
kernel. ]
Signed-off-by: Josh Triplett <josh@joshtriplett.org> Cc: Len Brown <len.brown@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
H. Peter Anvin [Thu, 13 Jun 2013 15:59:23 +0000 (08:59 -0700)]
Merge tag 'efi-urgent' into x86/urgent
* More tweaking to the EFI variable anti-bricking algorithm. Quite a
few users were reporting boot regressions in v3.9. This has now been
fixed with a more accurate "minimum storage requirement to avoid
bricking" value from Samsung (5K instead of 50%) and code to trigger
garbage collection when we near our limit - Matthew Garrett.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
H. Peter Anvin [Wed, 12 Jun 2013 14:37:43 +0000 (07:37 -0700)]
md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place
There are cases where the kernel will believe that the WRITE SAME
command is supported by a block device which does not, in fact,
support WRITE SAME. This currently happens for SATA drivers behind a
SAS controller, but there are probably a hundred other ways that can
happen, including drive firmware bugs.
After receiving an error for WRITE SAME the block layer will retry the
request as a plain write of zeroes, but mdraid will consider the
failure as fatal and consider the drive failed. This has the effect
that all the mirrors containing a specific set of data are each
offlined in very rapid succession resulting in data loss.
However, just bouncing the request back up to the block layer isn't
ideal either, because the whole initial request-retry sequence should
be inside the write bitmap fence, which probably means that md needs
to do its own conversion of WRITE SAME to write zero.
Until the failure scenario has been sorted out, disable WRITE SAME for
raid1, raid5, and raid10.
[neilb: added raid5]
This patch is appropriate for any -stable since 3.7 when write_same
support was added.
Cc: stable@vger.kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Wed, 12 Jun 2013 01:01:22 +0000 (11:01 +1000)]
md/raid1,raid10: use freeze_array in place of raise_barrier in various places.
Various places in raid1 and raid10 are calling raise_barrier when they
really should call freeze_array.
The former is only intended to be called from "make_request".
The later has extra checks for 'nr_queued' and makes a call to
flush_pending_writes(), so it is safe to call it from within the
management thread.
Using raise_barrier will sometimes deadlock. Using freeze_array
should not.
As 'freeze_array' currently expects one request to be pending (in
handle_read_error - the only previous caller), we need to pass
it the number of pending requests (extra) to ignore.
The deadlock was made particularly noticeable by commits 050b66152f87c7 (raid10) and 6b740b8d79252f13 (raid1) which
appeared in 3.4, so the fix is appropriate for any -stable
kernel since then.
This patch probably won't apply directly to some early kernels and
will need to be applied by hand.
Cc: stable@vger.kernel.org Reported-by: Alexander Lyakas <alex.bolshoy@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
Alex Lyakas [Tue, 4 Jun 2013 17:42:21 +0000 (20:42 +0300)]
md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it.
Without that fix, the following scenario could happen:
- RAID1 with drives A and B; drive B was freshly-added and is rebuilding
- Drive A fails
- WRITE request arrives to the array. It is failed by drive A, so
r1_bio is marked as R1BIO_WriteError, but the rebuilding drive B
succeeds in writing it, so the same r1_bio is marked as
R1BIO_Uptodate.
- r1_bio arrives to handle_write_finished, badblocks are disabled,
md_error()->error() does nothing because we don't fail the last drive
of raid1
- raid_end_bio_io() calls call_bio_endio()
- As a result, in call_bio_endio():
if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
clear_bit(BIO_UPTODATE, &bio->bi_flags);
this code doesn't clear the BIO_UPTODATE flag, and the whole master
WRITE succeeds, back to the upper layer.
So we returned success to the upper layer, even though we had written
the data onto the rebuilding drive only. But when we want to read the
data back, we would not read from the rebuilding drive, so this data
is lost.
[neilb - applied identical change to raid10 as well]
This bug can result in lost data, so it is suitable for any
-stable kernel.
Cc: stable@vger.kernel.org Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Wed, 8 May 2013 23:48:30 +0000 (09:48 +1000)]
md: md_stop_writes() should always freeze recovery.
__md_stop_writes() will currently sometimes freeze recovery.
So any caller must be ready for that to happen, and indeed they are.
However if __md_stop_writes() doesn't freeze_recovery, then
a recovery could start before mddev_suspend() is called, which
could be awkward. This can particularly cause problems or dm-raid.
So change __md_stop_writes() to always freeze recovery. This is safe
and more predicatable.
Reported-by: Brassow Jonathan <jbrassow@redhat.com> Tested-by: Brassow Jonathan <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
1) Fix dump iterator in nfnl_acct_dump() and ctnl_timeout_dump() to
dump all objects properly, from Pablo Neira Ayuso.
2) xt_TCPMSS must use the default MSS of 536 when no MSS TCP option is
present. Fix from Phil Oester.
3) qdisc_get_rtab() looks for an existing matching rate table and uses
that instead of creating a new one. However, it's key matching is
incomplete, it fails to check to make sure the ->data[] array is
identical too. Fix from Eric Dumazet.
4) ip_vs_dest_entry isn't fully initialized before copying back to
userspace, fix from Dan Carpenter.
5) Fix ubuf reference counting regression in vhost_net, from Jason
Wang.
6) When sock_diag dumps a socket filter back to userspace, we have to
translate it out of the kernel's internal representation first.
From Nicolas Dichtel.
7) davinci_mdio holds a spinlock while calling pm_runtime, which
sleeps. Fix from Sebastian Siewior.
8) Timeout check in sh_eth_check_reset is off by one, from Sergei
Shtylyov.
9) If sctp socket init fails, we can NULL deref during cleanup. Fix
from Daniel Borkmann.
10) netlink_mmap() does not propagate errors properly, from Patrick
McHardy.
11) Disable powersave and use minstrel by default in ath9k. From Sujith
Manoharan.
12) Fix a regression in that SOCK_ZEROCOPY is not set on tuntap sockets
which prevents vhost from being able to use zerocopy. From Jason
Wang.
13) Fix race between port lookup and TX path in team driver, from Jiri
Pirko.
14) Missing length checks in bluetooth L2CAP packet parsing, from Johan
Hedberg.
15) rtlwifi fails to connect to networking using any encryption method
other than WPA2. Fix from Larry Finger.
16) Fix iwlegacy build due to incorrect CONFIG_* ifdeffing for power
management stuff. From Yijing Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (35 commits)
b43: stop format string leaking into error msgs
ath9k: Use minstrel rate control by default
Revert "ath9k_hw: Update rx gain initval to improve rx sensitivity"
ath9k: Disable PowerSave by default
net: wireless: iwlegacy: fix build error for il_pm_ops
rtlwifi: Fix a false leak indication for PCI devices
wl12xx/wl18xx: scan all 5ghz channels
wl12xx: increase minimum singlerole firmware version required
wl12xx: fix minimum required firmware version for wl127x multirole
rtlwifi: rtl8192cu: Fix problem in connecting to WEP or WPA(1) networks
mwifiex: debugfs: Fix out of bounds array access
Bluetooth: Fix mgmt handling of power on failures
Bluetooth: Fix missing length checks for L2CAP signalling PDUs
Bluetooth: btmrvl: support Marvell Bluetooth device SD8897
Bluetooth: Fix checks for LE support on LE-only controllers
team: fix checks in team_get_first_port_txable_rcu()
team: move add to port list before port enablement
team: check return value of team_get_port_by_index_rcu() for NULL
tuntap: set SOCK_ZEROCOPY flag during open
netlink: fix error propagation in netlink_mmap()
...
Linus Torvalds [Wed, 12 Jun 2013 23:42:39 +0000 (16:42 -0700)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
"Outside of bcache (which really isn't super big), these are all
few-liners. There are a few important fixes in here:
- Fix blk pm sleeping when holding the queue lock
- A small collection of bcache fixes that have been done and tested
since bcache was included in this merge window.
- A fix for a raid5 regression introduced with the bio changes.
- Two important fixes for mtip32xx, fixing an oops and potential data
corruption (or hang) due to wrong bio iteration on stacked devices."
* 'for-linus' of git://git.kernel.dk/linux-block:
scatterlist: sg_set_buf() argument must be in linear mapping
raid5: Initialize bi_vcnt
pktcdvd: silence static checker warning
block: remove refs to XD disks from documentation
blkpm: avoid sleep when holding queue lock
mtip32xx: Correctly handle bio->bi_idx != 0 conditions
mtip32xx: Fix NULL pointer dereference during module unload
bcache: Fix error handling in init code
bcache: clarify free/available/unused space
bcache: drop "select CLOSURES"
bcache: Fix incompatible pointer type warning
The lockless reclaim hierarchy iterator currently has a misplaced
barrier that can lead to use-after-free crashes.
The reclaim hierarchy iterator consist of a sequence count and a
position pointer that are read and written locklessly, with memory
barriers enforcing ordering.
The write side sets the position pointer first, then updates the
sequence count to "publish" the new position. Likewise, the read side
must read the sequence count first, then the position. If the sequence
count is up to date, it's guaranteed that the position is up to date as
well:
writer: reader:
iter->position = position if iter->sequence == expected:
smp_wmb() smp_rmb()
iter->sequence = sequence position = iter->position
However, the read side barrier is currently misplaced, which can lead to
dereferencing stale position pointers that no longer point to valid
memory. Fix this.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Tejun Heo <tj@kernel.org> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Glauber Costa <glommer@parallels.com> Cc: <stable@kernel.org> [3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 12 Jun 2013 21:05:08 +0000 (14:05 -0700)]
frontswap: fix incorrect zeroing and allocation size for frontswap_map
The bitmap accessed by bitops must have enough size to hold the required
numbers of bits rounded up to a multiple of BITS_PER_LONG. And the
bitmap must not be zeroed by memset() if the number of bits cleared is
not a multiple of BITS_PER_LONG.
This fixes incorrect zeroing and allocation size for frontswap_map. The
incorrect zeroing part doesn't cause any problem because frontswap_map
is freed just after zeroing. But the wrongly calculated allocation size
may cause the problem.
For 32bit systems, the allocation size of frontswap_map is about twice
as large as required size. For 64bit systems, the allocation size is
smaller than requeired if the number of bits is not a multiple of
BITS_PER_LONG.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chen Gang [Wed, 12 Jun 2013 21:05:07 +0000 (14:05 -0700)]
kernel/audit_tree.c:audit_add_tree_rule(): protect `rule' from kill_rules()
audit_add_tree_rule() must set 'rule->tree = NULL;' firstly, to protect
the rule itself freed in kill_rules().
The reason is when it is killed, the 'rule' itself may have already
released, we should not access it. one example: we add a rule to an
inode, just at the same time the other task is deleting this inode.
The work flow for adding a rule:
audit_receive() -> (need audit_cmd_mutex lock)
audit_receive_skb() ->
audit_receive_msg() ->
audit_receive_filter() ->
audit_add_rule() ->
audit_add_tree_rule() -> (need audit_filter_mutex lock)
...
unlock audit_filter_mutex
get_tree()
...
iterate_mounts() -> (iterate all related inodes)
tag_mount() ->
tag_trunk() ->
create_trunk() -> (assume it is 1st rule)
fsnotify_add_mark() ->
fsnotify_add_inode_mark() -> (add mark to inode->i_fsnotify_marks)
...
get_tree(); (each inode will get one)
...
lock audit_filter_mutex
The work flow for deleting an inode:
__destroy_inode() ->
fsnotify_inode_delete() ->
__fsnotify_inode_delete() ->
fsnotify_clear_marks_by_inode() -> (get mark from inode->i_fsnotify_marks)
fsnotify_destroy_mark() ->
fsnotify_destroy_mark_locked() ->
audit_tree_freeing_mark() ->
evict_chunk() ->
...
tree->goner = 1
...
kill_rules() -> (assume current->audit_context == NULL)
call_rcu() -> (rule->tree != NULL)
audit_free_rule_rcu() ->
audit_free_rule()
...
audit_schedule_prune() -> (assume current->audit_context == NULL)
kthread_run() -> (need audit_cmd_mutex and audit_filter_mutex lock)
prune_one() -> (delete it from prue_list)
put_tree(); (match the original get_tree above)
Signed-off-by: Chen Gang <gang.chen@asianux.com> Cc: Eric Paris <eparis@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Wed, 12 Jun 2013 21:05:04 +0000 (14:05 -0700)]
mm: migration: add migrate_entry_wait_huge()
When we have a page fault for the address which is backed by a hugepage
under migration, the kernel can't wait correctly and do busy looping on
hugepage fault until the migration finishes. As a result, users who try
to kick hugepage migration (via soft offlining, for example) occasionally
experience long delay or soft lockup.
This is because pte_offset_map_lock() can't get a correct migration entry
or a correct page table lock for hugepage. This patch introduces
migration_entry_wait_huge() to solve this.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andi Kleen <andi@firstfloor.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@vger.kernel.org> [2.6.35+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/page_alloc.c: fix watermark check in __zone_watermark_ok()
The watermark check consists of two sub-checks. The first one is:
if (free_pages <= min + lowmem_reserve)
return false;
The check assures that there is minimal amount of RAM in the zone. If
CMA is used then the free_pages is reduced by the number of free pages
in CMA prior to the over-mentioned check.
if (!(alloc_flags & ALLOC_CMA))
free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
This prevents the zone from being drained from pages available for
non-movable allocations.
The second check prevents the zone from getting too fragmented.
for (o = 0; o < order; o++) {
free_pages -= z->free_area[o].nr_free << o;
min >>= 1;
if (free_pages <= min)
return false;
}
The field z->free_area[o].nr_free is equal to the number of free pages
including free CMA pages. Therefore the CMA pages are subtracted twice.
This may cause a false positive fail of __zone_watermark_ok() if the CMA
area gets strongly fragmented. In such a case there are many 0-order
free pages located in CMA. Those pages are subtracted twice therefore
they will quickly drain free_pages during the check against
fragmentation. The test fails even though there are many free non-cma
pages in the zone.
This patch fixes this issue by subtracting CMA pages only for a purpose of
(free_pages <= min + lowmem_reserve) check.
Laura said:
We were observing allocation failures of higher order pages (order 5 =
128K typically) under tight memory conditions resulting in driver
failure. The output from the page allocation failure showed plenty of
free pages of the appropriate order/type/zone and mostly CMA pages in
the lower orders.
For full disclosure, we still observed some page allocation failures
even after applying the patch but the number was drastically reduced and
those failures were attributed to fragmentation/other system issues.
Signed-off-by: Tomasz Stanislawski <t.stanislaws@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Tested-by: Laura Abbott <lauraa@codeaurora.org> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Cc: <stable@vger.kernel.org> [3.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kent Overstreet [Wed, 12 Jun 2013 21:04:59 +0000 (14:04 -0700)]
aio: fix io_destroy() regression by using call_rcu()
There was a regression introduced by 36f5588905c1 ("aio: refcounting
cleanup"), reported by Jens Axboe - the refcounting cleanup switched to
using RCU in the shutdown path, but the synchronize_rcu() was done in
the context of the io_destroy() syscall greatly increasing the time it
could block.
This patch switches it to call_rcu() and makes shutdown asynchronous
(more asynchronous than it was originally; before the refcount changes
io_destroy() would still wait on pending kiocbs).
Note that there's a global quota on the max outstanding kiocbs, and that
quota must be manipulated synchronously; otherwise io_setup() could
return -EAGAIN when there isn't quota available, and userspace won't
have any way of waiting until shutdown of the old kioctxs has finished
(besides busy looping).
So we release our quota before kioctx shutdown has finished, which
should be fine since the quota never corresponded to anything real
anyways.
Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Reported-by: Jens Axboe <axboe@kernel.dk> Tested-by: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Tested-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johan Hovold [Wed, 12 Jun 2013 21:04:56 +0000 (14:04 -0700)]
rtc-at91rm9200: add shadow interrupt mask
Add shadow interrupt-mask register which can be used on SoCs where the
actual hardware register is broken.
Note that some care needs to be taken to make sure the shadow mask
corresponds to the actual hardware state. The added overhead is not an
issue for the non-broken SoCs due to the relatively infrequent
interrupt-mask updates. We do, however, only use the shadow mask value
as a fall-back when it actually needed as there is still a theoretical
possibility that the mask is incorrect (see the code for details).
Signed-off-by: Johan Hovold <jhovold@gmail.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Cc: Douglas Gilbert <dgilbert@interlog.com> Cc: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com> Cc: Ludovic Desroches <ludovic.desroches@atmel.com> Cc: Robert Nelson <Robert.Nelson@digikey.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johan Hovold [Wed, 12 Jun 2013 21:04:52 +0000 (14:04 -0700)]
rtc-at91rm9200: add match-table compile guard
The members of Atmel's at91sam9x5 family (9x5) have a broken RTC
interrupt mask register (AT91_RTC_IMR). It does not reflect enabled
interrupts but instead always returns zero.
The kernel's rtc-at91rm9200 driver handles the RTC for the 9x5 family.
Currently when the date/time is set, an interrupt is generated and this
driver neglects to handle the interrupt. The kernel complains about the
un-handled interrupt and disables it henceforth. This not only breaks
the RTC function, but since that interrupt is shared (Atmel's SYS
interrupt) then other things break as well (e.g. the debug port no
longer accepts characters).
Tested on the at91sam9g25. Bug confirmed by Atmel.
This patch (of 5):
Add missing match-table compile guard.
Signed-off-by: Johan Hovold <jhovold@gmail.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Cc: Douglas Gilbert <dgilbert@interlog.com> Cc: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com> Cc: Ludovic Desroches <ludovic.desroches@atmel.com> Cc: Robert Nelson <Robert.Nelson@digikey.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rafael Aquini [Wed, 12 Jun 2013 21:04:49 +0000 (14:04 -0700)]
swap: avoid read_swap_cache_async() race to deadlock while waiting on discard I/O completion
read_swap_cache_async() can race against get_swap_page(), and stumble
across a SWAP_HAS_CACHE entry in the swap map whose page wasn't brought
into the swapcache yet.
This transient swap_map state is expected to be transitory, but the
actual placement of discard at scan_swap_map() inserts a wait for I/O
completion thus making the thread at read_swap_cache_async() to loop
around its -EEXIST case, while the other end at get_swap_page() is
scheduled away at scan_swap_map(). This can leave the system deadlocked
if the I/O completion happens to be waiting on the CPU waitqueue where
read_swap_cache_async() is busy looping and !CONFIG_PREEMPT.
This patch introduces a cond_resched() call to make the aforementioned
read_swap_cache_async() busy loop condition to bail out when necessary,
thus avoiding the subtle race window.
Signed-off-by: Rafael Aquini <aquini@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tony Lindgren [Wed, 12 Jun 2013 21:04:48 +0000 (14:04 -0700)]
drivers/rtc/rtc-twl.c: fix missing device_init_wakeup() when booted with device tree
When booted in legacy mode device_init_wakeup() gets called by
drivers/mfd/twl-core.c when the children are initialized. However, when
booted using device tree, the children are created with
of_platform_populate() instead add_children().
This means that the RTC driver will not have device_init_wakeup() set,
and we need to call it from the driver probe like RTC drivers typically
do.
Without this we cannot test PM wake-up events on omaps for cases where
there may not be any physical wake-up event.
Signed-off-by: Tony Lindgren <tony@atomide.com> Reported-by: Kevin Hilman <khilman@linaro.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Jingoo Han <jg1.han@samsung.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a new logical drive is added and the CCISS_REGNEWD ioctl is invoked
(as is normal with the Array Configuration Utility) the process will
hang as below. It attempts to acquire the same mutex twice, once in
do_ioctl() and once in cciss_unlocked_open(). The BKL was recursive,
the mutex isn't.
Linux version 3.10.0-rc2 (scameron@localhost.localdomain) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC) ) #1 SMP Fri May 24 14:32:12 CDT 2013
[...]
acu D 0000000000000001 0 3246 3191 0x00000080
Call Trace:
schedule+0x29/0x70
schedule_preempt_disabled+0xe/0x10
__mutex_lock_slowpath+0x17b/0x220
mutex_lock+0x2b/0x50
cciss_unlocked_open+0x2f/0x110 [cciss]
__blkdev_get+0xd3/0x470
blkdev_get+0x5c/0x1e0
register_disk+0x182/0x1a0
add_disk+0x17c/0x310
cciss_add_disk+0x13a/0x170 [cciss]
cciss_update_drive_info+0x39b/0x480 [cciss]
rebuild_lun_table+0x258/0x370 [cciss]
cciss_ioctl+0x34f/0x470 [cciss]
do_ioctl+0x49/0x70 [cciss]
__blkdev_driver_ioctl+0x28/0x30
blkdev_ioctl+0x200/0x7b0
block_ioctl+0x3c/0x40
do_vfs_ioctl+0x89/0x350
SyS_ioctl+0xa1/0xb0
system_call_fastpath+0x16/0x1b
This mutex usage was added into the ioctl path when the big kernel lock
was removed. As it turns out, these paths are all thread safe anyway
(or can easily be made so) and we don't want ioctl() to be single
threaded in any case.
Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mike Miller <mike.miller@hp.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During resume, we call hpet_rtc_timer_init after masking an irq bit in
hpet. This will cause the call to hpet_disable_rtc_channel to be undone
if RTC_AIE is the only bit not masked.
Allowing the cmos interrupt handler to run before resuming caused some
issues where the timer for the alarm was not removed. This would cause
other, later timers to not be cleared, so utilities such as hwclock
would time out when waiting for the update interrupt.
Use device_init_wakeup() instead of device_set_wakeup_capable() and move
it before rtc dev registering. This fixes alarmtimer not registered
when tps6586x rtc is the only wakeup compatible rtc in the system.
Signed-off-by: Dmitry Osipenko <digetx@gmail.com> Cc: Laxman Dewangan <ldewangan@nvidia.com> Cc: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Vagin [Wed, 12 Jun 2013 21:04:42 +0000 (14:04 -0700)]
memcg: don't initialize kmem-cache destroying work for root caches
struct memcg_cache_params has a union. Different parts of this union
are used for root and non-root caches. A part with destroying work is
used only for non-root caches.
Xiaowei.Hu [Wed, 12 Jun 2013 21:04:41 +0000 (14:04 -0700)]
ocfs2: ocfs2_prep_new_orphaned_file() should return ret
If an error occurs, for example an EIO in __ocfs2_prepare_orphan_dir,
ocfs2_prep_new_orphaned_file will release the inode_ac, then when the
caller of ocfs2_prep_new_orphaned_file gets a 0 return, it will refer to
a NULL ocfs2_alloc_context struct in the following functions. A kernel
panic happens.
Signed-off-by: "Xiaowei.Hu" <xiaowei.hu@oracle.com> Reviewed-by: shencanquan <shencanquan@huawei.com> Acked-by: Sunil Mushran <sunil.mushran@gmail.com> Cc: Joe Jin <joe.jin@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chen Gang [Wed, 12 Jun 2013 21:04:40 +0000 (14:04 -0700)]
lib/mpi/mpicoder.c: looping issue, need stop when equal to zero, found by 'EXTRA_FLAGS=-W'.
For 'while' looping, need stop when 'nbytes == 0', or will cause issue.
('nbytes' is size_t which is always bigger or equal than zero).
The related warning: (with EXTRA_CFLAGS=-W)
lib/mpi/mpicoder.c:40:2: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
Signed-off-by: Chen Gang <gang.chen@asianux.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: David Howells <dhowells@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Wed, 12 Jun 2013 21:04:39 +0000 (14:04 -0700)]
kmsg: honor dmesg_restrict sysctl on /dev/kmsg
The dmesg_restrict sysctl currently covers the syslog method for access
dmesg, however /dev/kmsg isn't covered by the same protections. Most
people haven't noticed because util-linux dmesg(1) defaults to using the
syslog method for access in older versions. With util-linux dmesg(1)
defaults to reading directly from /dev/kmsg.
To fix /dev/kmsg, let's compare the existing interfaces and what they
allow:
- /proc/kmsg allows:
- open (SYSLOG_ACTION_OPEN) if CAP_SYSLOG since it uses a destructive
single-reader interface (SYSLOG_ACTION_READ).
- everything, after an open.
- syslog syscall allows:
- anything, if CAP_SYSLOG.
- SYSLOG_ACTION_READ_ALL and SYSLOG_ACTION_SIZE_BUFFER, if
dmesg_restrict==0.
- nothing else (EPERM).
The use-cases were:
- dmesg(1) needs to do non-destructive SYSLOG_ACTION_READ_ALLs.
- sysklog(1) needs to open /proc/kmsg, drop privs, and still issue the
destructive SYSLOG_ACTION_READs.
AIUI, dmesg(1) is moving to /dev/kmsg, and systemd-journald doesn't
clear the ring buffer.
Based on the comments in devkmsg_llseek, it sounds like actions besides
reading aren't going to be supported by /dev/kmsg (i.e.
SYSLOG_ACTION_CLEAR), so we have a strict subset of the non-destructive
syslog syscall actions.
To this end, move the check as Josh had done, but also rename the
constants to reflect their new uses (SYSLOG_FROM_CALL becomes
SYSLOG_FROM_READER, and SYSLOG_FROM_FILE becomes SYSLOG_FROM_PROC).
SYSLOG_FROM_READER allows non-destructive actions, and SYSLOG_FROM_PROC
allows destructive actions after a capabilities-constrained
SYSLOG_ACTION_OPEN check.
- /dev/kmsg allows:
- open if CAP_SYSLOG or dmesg_restrict==0
- reading/polling, after open
Robin Holt [Wed, 12 Jun 2013 21:04:37 +0000 (14:04 -0700)]
reboot: rigrate shutdown/reboot to boot cpu
We recently noticed that reboot of a 1024 cpu machine takes approx 16
minutes of just stopping the cpus. The slowdown was tracked to commit f96972f2dc63 ("kernel/sys.c: call disable_nonboot_cpus() in
kernel_restart()").
The current implementation does all the work of hot removing the cpus
before halting the system. We are switching to just migrating to the
boot cpu and then continuing with shutdown/reboot.
This also has the effect of not breaking x86's command line parameter
for specifying the reboot cpu. Note, this code was shamelessly copied
from arch/x86/kernel/reboot.c with bits removed pertaining to the
reboot_cpu command line parameter.
Signed-off-by: Robin Holt <holt@sgi.com> Tested-by: Shawn Guo <shawn.guo@linaro.org> Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Russ Anderson <rja@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Srivatsa S. Bhat [Wed, 12 Jun 2013 21:04:36 +0000 (14:04 -0700)]
CPU hotplug: provide a generic helper to disable/enable CPU hotplug
There are instances in the kernel where we would like to disable CPU
hotplug (from sysfs) during some important operation. Today the freezer
code depends on this and the code to do it was kinda tailor-made for
that.
Restructure the code and make it generic enough to be useful for other
usecases too.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Robin Holt <holt@sgi.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Russ Anderson <rja@sgi.com> Cc: Robin Holt <holt@sgi.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Shawn Guo <shawn.guo@linaro.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Tue, 11 Jun 2013 18:56:52 +0000 (11:56 -0700)]
x86, relocs: Move __vvar_page from S_ABS to S_REL
The __vvar_page relocation should actually be listed in S_REL instead
of S_ABS. Oddly, this didn't always cause things to break, presumably
because there are no users for relocation information on 64 bits yet.
David S. Miller [Wed, 12 Jun 2013 20:35:24 +0000 (13:35 -0700)]
Merge branch 'wireless'
John W. Linville says:
====================
For now I have dropped the mac80211 tree from this request.
We are developing a little backlog of fixes and I would like to
avoid introducing any more uncertainty to this pull request for the
3.10 stream. All the other bits are the same as what was in the
2013-06-06 request, including the ath9k fixes intended to address
the problems observed by Linus w/ his Pixel, and a CVE fix for a
potential security issue in the b43 driver.
Regarding the wl12xx bits, Luca says:
"Here are three patches that I'd like to get into 3.10. Two of them, by
me, are related to the firmware version checks in our driver. Without
them, the firmwares fail to load. The other one, by Eliad, fixes a typo
bug in our 5GHz scanning code."
And as for the Bluetooth bits, Gustavo says:
"The following patches are important bug fixes for 3.10, plus the
support for a new device. We do have three fixes from Johan. The first
one is a fix to avoid LE-only devices to rely on the (inexistent)
extended features data. The second patch fixes length checks on
incoming L2CAP signalling PDUs so we can discard PDU whose size
doesn't match the one reported in the header. The last one fixes
the handling of power on failures, we now report proper errors to
mgmt when hci_dev_open()."
Along with that...
Larry Finger corrects an rtlwifi problem that caused some devices to
refuse to connect to non-WPA2 networks if the device had previously
assocated with a WPA2 network. He also adds a one-line fix to prevent
false reports from kmemleak.
Mark A. Greer fixes an out of bounds array access in mwifiex.
Felix Fietkau reverts an earlier ath9k initval patch that reduced rx
sensitivity in a number of ath9k devices with no corresponding benefit.
Kees Cook fixes a potential uid-0 to ring-0 escalation in b43
(CVE-2013-2852).
Sujith Manoharan turns-off powersave mode by default for ath9k, and
also defaults ath9k to use the minstrel_ht rate control algorithm.
Both of these are believed to contribute to greater stability/usability
of ath9k in real-world situations.
Yijing Wang fixes an iwlegacy build error for il_pm_ops if CONFIG_PM
is set but CONFIG_PM_SLEEP is not set.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 12 Jun 2013 18:48:14 +0000 (11:48 -0700)]
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
"Resurrect Alchemy platforms by invoking the WAIT instructions with
interrupts enabled. This still leaves the race condition between
testing TIF_NEED_RESCHED and the WAIT instruction for Alchemy
platforms which need a different fix than other MIPS platforms. But
at least it gets MIPS platforms flying again.
There are also fixes for two build errors (CONFIG_FTRACE=y with
CONFIG_DYNAMIC_FTRACE=n) and CONFIG_VIRTUALIZATION without CONFIG_KVM"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: ftrace: Add missing CONFIG_DYNAMIC_FTRACE
MIPS: include: mmu_context.h: Replace VIRTUALIZATION with KVM
MIPS: Alchemy: fix wait function
Linus Torvalds [Wed, 12 Jun 2013 15:29:11 +0000 (08:29 -0700)]
Merge tag 'trace-fixes-v3.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Yoshihiro Yunomae fixed a regression in the output format when using
one of the counter clocks.
The new multibuffer code changed the trace_clock file to update the
trace instances tr->clock_id but the actual traces still used the
value from the obsolete global variable trace_clock_id"
* tag 'trace-fixes-v3.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix outputting formats of x86-tsc and counter when use trace_clock
Linus Torvalds [Wed, 12 Jun 2013 15:28:19 +0000 (08:28 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph fixes from Sage Weil:
"There is a pair of fixes for double-frees in the recent bundle for
3.10, a couple of fixes for long-standing bugs (sleep while atomic and
an endianness fix), and a locking fix that can be triggered when osds
are going down"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
rbd: fix cleanup in rbd_add()
rbd: don't destroy ceph_opts in rbd_add()
ceph: ceph_pagelist_append might sleep while atomic
ceph: add cpu_to_le32() calls when encoding a reconnect capability
libceph: must hold mutex for reset_changed_osds()
Kees Cook [Fri, 10 May 2013 21:48:21 +0000 (14:48 -0700)]
b43: stop format string leaking into error msgs
The module parameter "fwpostfix" is userspace controllable, unfiltered,
and is used to define the firmware filename. b43_do_request_fw() populates
ctx->errors[] on error, containing the firmware filename. b43err()
parses its arguments as a format string. For systems with b43 hardware,
this could lead to a uid-0 to ring-0 escalation.
CVE-2013-2852
Signed-off-by: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Signed-off-by: John W. Linville <linville@tuxdriver.com>
The ath9k rate control algorithm has various architectural
issues that make it a poor fit in scenarios like congested
environments etc.
An example: https://bugzilla.redhat.com/show_bug.cgi?id=927191
Change the default to minstrel which is more robust in such cases.
The ath9k RC code is left in the driver for now, maybe it can
be removed altogether later on.
Cc: stable@vger.kernel.org Cc: Jouni Malinen <jouni@qca.qualcomm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sujith Manoharan <c_manoha@qca.qualcomm.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
This change reduces rx sensitivity with no apparent extra benefit.
It looks like it was meant for testing in a specific scenario,
but it was never properly validated.
Cc: rmanohar@qca.qualcomm.com Cc: stable@vger.kernel.org Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Almost all the DMA issues which have plagued ath9k (in station mode)
for years are related to PS. Disabling PS usually "fixes" the user's
connection stablility. Reports of DMA problems are still trickling in
and are sitting in the kernel bugzilla. Until the PS code in ath9k is
given a thorough review, disbale it by default. The slight increase
in chip power consumption is a small price to pay for improved link
stability.
Cc: stable@vger.kernel.org Signed-off-by: Sujith Manoharan <c_manoha@qca.qualcomm.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Yijing Wang <wangyijing@huawei.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: netdev@vger.kernel.org Cc: linux-wireless@vger.kernel.org Cc: Jingoo Han <jg1.han@samsung.com> Acked-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Eliad Peller [Tue, 7 May 2013 12:41:09 +0000 (15:41 +0300)]
wl12xx/wl18xx: scan all 5ghz channels
Due to a typo, the current code copies only sizeof(cmd->channels_2)
bytes, which is smaller than the correct sizeof(cmd->channels_5)
size, resulting in a partial scan (some channels are skipped).
Signed-off-by: Eliad Peller <eliad@wizery.com> Signed-off-by: Luciano Coelho <coelho@ti.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Luciano Coelho [Fri, 10 May 2013 07:19:38 +0000 (10:19 +0300)]
wl12xx: fix minimum required firmware version for wl127x multirole
There was a typo in commit 8675f9 (wlcore/wl12xx/wl18xx: verify
multi-role and single-role fw versions), which was causing the
multirole firmware for wl127x (WiLink6) to be rejected. The actual
minimum version needed for wl127x multirole is 6.5.7.0.42.
Reported-by: Levi Pearson <levipearson@gmail.com> Reported-by: Michael Scott <hashcode0f@gmail.com> Cc: stable@kernel.org # 3.9+ Signed-off-by: Luciano Coelho <coelho@ti.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Larry Finger [Thu, 30 May 2013 23:05:55 +0000 (18:05 -0500)]
rtlwifi: rtl8192cu: Fix problem in connecting to WEP or WPA(1) networks
Driver rtl8192cu can connect to WPA2 networks, but fails for any other
encryption method. The cause is a failure to set the rate control data
blocks. These changes fix https://bugzilla.redhat.com/show_bug.cgi?id=952793
and https://bugzilla.redhat.com/show_bug.cgi?id=761525.
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Signed-off-by: John W. Linville <linville@tuxdriver.com>
The panic is caused by the mwifiex_info_read() routine assuming that
there can only be four modes (0-3) which is an invalid assumption.
For example, when testing P2P, the mode is '8' (P2P_CLIENT) so the
code accesses data beyond the bounds of the bss_modes[] array which
causes the panic. Fix this by updating bss_modes[] to support the
current list of modes and adding a check to prevent the out-of-bounds
access from occuring in the future when more modes are added.
Signed-off-by: Mark A. Greer <mgreer@animalcreek.com> Acked-by: Bing Zhao <bzhao@marvell.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Johan Hedberg [Wed, 29 May 2013 06:51:29 +0000 (09:51 +0300)]
Bluetooth: Fix mgmt handling of power on failures
If hci_dev_open fails we need to ensure that the corresponding
mgmt_set_powered command gets an appropriate response. This patch fixes
the missing response by adding a new mgmt_set_powered_failed function
that's used to indicate a power on failure to mgmt. Since a situation
with the device being rfkilled may require special handling in user
space the patch uses a new dedicated mgmt status code for this.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Cc: stable@vger.kernel.org Acked-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Johan Hedberg [Tue, 28 May 2013 10:46:30 +0000 (13:46 +0300)]
Bluetooth: Fix missing length checks for L2CAP signalling PDUs
There has been code in place to check that the L2CAP length header
matches the amount of data received, but many PDU handlers have not been
checking that the data received actually matches that expected by the
specific PDU. This patch adds passing the length header to the specific
handler functions and ensures that those functions fail cleanly in the
case of an incorrect amount of data.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Johan Hedberg [Wed, 24 Apr 2013 10:05:32 +0000 (13:05 +0300)]
Bluetooth: Fix checks for LE support on LE-only controllers
LE-only controllers do not support extended features so any kind of host
feature bit checks do not make sense for them. This patch fixes code
used for both single-mode (LE-only) and dual-mode (BR/EDR/LE) to use the
HCI_LE_ENABLED flag instead of the "Host LE supported" feature bit for
LE support tests.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com> Acked-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk> Signed-off-by: John W. Linville <linville@tuxdriver.com>
HID: multitouch: prevent memleak with the allocated name
mt_free_input_name() was never called during .remove():
hid_hw_stop() removes the hid_input items in hdev->inputs, and so the
list is therefore empty after the call. In the end, we never free the
special names that has been allocated during .probe().
Restore the original name before freeing it to avoid acessing already
freed pointer.
This fixes a regression introduced by 49a5a827a ("HID: multitouch: append " Pen" to
the name of the stylus input")
Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Jiri Pirko [Sat, 8 Jun 2013 13:00:54 +0000 (15:00 +0200)]
team: move add to port list before port enablement
team_port_enable() adds port to port_hashlist. Reader sees port
in team_get_port_by_index_rcu() and returns it, but
team_get_first_port_txable_rcu() tries to go through port_list, where the
port is not inserted yet -> NULL pointer dereference.
Fix this by reordering port_list and port_hashlist insertion.
Panic is easily triggeable when txing packets and adding/removing port
in a loop.
Introduced by commit 3d249d4c "net: introduce ethernet teaming device"
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Sat, 8 Jun 2013 13:00:53 +0000 (15:00 +0200)]
team: check return value of team_get_port_by_index_rcu() for NULL
team_get_port_by_index_rcu() might return NULL due to race between port
removal and skb tx path. Panic is easily triggeable when txing packets
and adding/removing port in a loop.
introduced by commit 3d249d4ca "net: introduce ethernet teaming device"
and commit 753f993911b "team: introduce random mode" (for random mode)
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Sat, 8 Jun 2013 06:17:41 +0000 (14:17 +0800)]
tuntap: set SOCK_ZEROCOPY flag during open
Commit 54f968d6efdbf7dec36faa44fc11f01b0e4d1990
(tuntap: move socket to tun_file) forgets to set SOCK_ZEROCOPY flag, which will
prevent vhost_net from doing zercopy w/ tap. This patch fixes this by setting
it during file open.
Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 12 Jun 2013 06:07:21 +0000 (23:07 -0700)]
Merge branch 'fixes-3.10' of git://git.infradead.org/users/willy/linux-nvme
Pull NVMe fixes from Matthew Wilcox.
* 'fixes-3.10' of git://git.infradead.org/users/willy/linux-nvme:
NVMe: Add MSI support
NVMe: Use dma_set_mask() correctly
Return the result from user admin command IOCTL even in case of failure
NVMe: Do not cancel command multiple times
NVMe: fix error return code in nvme_submit_bio_queue()
NVMe: check for integer overflow in nvme_map_user_pages()
MAINTAINERS: update NVM EXPRESS DRIVER file list
NVMe: Fix a signedness bug in nvme_trans_modesel_get_mp
NVMe: Remove redundant version.h header include
Linus Torvalds [Tue, 11 Jun 2013 18:16:43 +0000 (11:16 -0700)]
Merge branch 'fixes' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm bugfixes from Gleb Natapov:
"There is one more fix for MIPS KVM ABI here, MIPS and PPC build
breakage fixes and a couple of PPC bug fixes"
* 'fixes' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm/ppc/booke64: Fix lazy ee handling in kvmppc_handle_exit()
kvm/ppc/booke: Hold srcu lock when calling gfn functions
kvm/ppc/booke64: Disable e6500 support
kvm/ppc/booke64: Fix AltiVec interrupt numbers and build breakage
mips/kvm: Use KVM_REG_MIPS and proper size indicators for *_ONE_REG
kvm: Add definition of KVM_REG_MIPS
KVM: add kvm_para_available to asm-generic/kvm_para.h
tracing: Fix outputting formats of x86-tsc and counter when use trace_clock
Outputting formats of x86-tsc and counter should be a raw format, but after
applying the patch(2b6080f28c7cc3efc8625ab71495aae89aeb63a0), the format was
changed to nanosec. This is because the global variable trace_clock_id was used.
When we use multiple buffers, clock_id of each sub-buffer should be used. Then,
this patch uses tr->clock_id instead of the global variable trace_clock_id.
[ Basically, this fixes a regression where the multibuffer code changed the
trace_clock file to update tr->clock_id but the traces still use the old
global trace_clock_id variable, negating the file's effect. The global
trace_clock_id variable is obsolete and removed. - SR ]
I did not hit this with the lksctp-tools functional tests, but with a
small, multi-threaded test program, that heavily allocates, binds,
listens and waits in accept on sctp sockets, and then randomly kills
some of them (no need for an actual client in this case to hit this).
Then, again, allocating, binding, etc, and then killing child processes.
This panic then only occurs when ``echo 1 > /proc/sys/net/sctp/auth_enable''
is set. The cause for that is actually very simple: in sctp_endpoint_init()
we enter the path of sctp_auth_init_hmacs(). There, we try to allocate
our crypto transforms through crypto_alloc_hash(). In our scenario,
it then can happen that crypto_alloc_hash() fails with -EINTR from
crypto_larval_wait(), thus we bail out and release the socket via
sk_common_release(), sctp_destroy_sock() and hit the NULL pointer
dereference as soon as we try to access members in the endpoint during
sctp_endpoint_free(), since endpoint at that time is still NULL. Now,
if we have that case, we do not need to do any cleanup work and just
leave the destruction handler.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
vhost_net_clear_ubuf_info didn't clear ubuf_info
after kfree, this could trigger double free.
Fix this and simplify this code to make it more robust: make sure
ubuf info is always freed through vhost_net_clear_ubuf_info.
Reported-by: Tommi Rantala <tt.rantala@gmail.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Bjørn Mork [Thu, 6 Jun 2013 10:57:02 +0000 (12:57 +0200)]
qmi_wwan/cdc_ether: let qmi_wwan handle the Huawei E1820
Another QMI speaking Qualcomm based device, which should be
driven by qmi_wwan, while cdc_ether should ignore it.
Like on other Huawei devices, the wwan function can appear
either as a single vendor specific interface or as a CDC ECM
class function using separate control and data interfaces.
The ECM control interface protocol is 0xff, likely in an
attempt to indicate that vendor specific management is
required.
In addition to the near standard CDC class, Huawei also add
vendor specific AT management commands to their firmwares.
This is probably an attempt to support non-Windows systems
using standard class drivers. Unfortunately, this part of
the firmware is often buggy. Linux is much better off using
whatever native vendor specific management protocol the
device offers, and Windows uses, whenever possible. This
means QMI in the case of Qualcomm based devices.
The E1820 has been verified to work fine with QMI.
Matching on interface number is necessary to distiguish the
wwan function from serial functions in the single interface
mode, as both function types will have class/subclass/function
set to ff/ff/ff.
The control interface number does not change in CDC ECM mode,
so the interface number matching rule is sufficient to handle
both modes. The cdc_ether blacklist entry is only relevant in
CDC ECM mode, but using a similar interface number based rule
helps document this as a transfer from one driver to another.
Other Huawei 02/06/ff devices are left with the cdc_ether driver
because we do not know whether they are based on Qualcomm chips.
The Huawei specific AT command management is known to be somewhat
hardware independent, and their usage of these class codes may
also be independent of the modem hardware.
Reported-by: Graham Inggs <graham.inggs@uct.ac.za> Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Airlie [Tue, 11 Jun 2013 09:38:27 +0000 (19:38 +1000)]
Merge tag 'drm-intel-fixes-2013-06-11' of git://people.freedesktop.org/~danvet/drm-intel into drm-fixes
Daniel writes:
Just tiny regression fixes here:
- Two fixes to fix sdvo hotplug which broke in the hpd storm detection
work.
- One fix to patch-up the sdvo lvds regression fixer from the last pull -
we need to prefer the vbt mode over edid modes.
* tag 'drm-intel-fixes-2013-06-11' of git://people.freedesktop.org/~danvet/drm-intel:
drm/i915: prefer VBT modes for SVDO-LVDS over EDID
drm/i915: Enable hotplug interrupts after querying hw capabilities.
drm/i915: Fix hotplug interrupt enabling for SDVOC
Sergei Shtylyov [Wed, 5 Jun 2013 19:54:01 +0000 (23:54 +0400)]
sh_eth: fix result of sh_eth_check_reset() on timeout
When the first loop in sh_eth_check_reset() runs to its end, 'cnt' is 0, so the
following check for 'cnt < 0' fails to catch the timeout. Fix the condition in
this check, so that the timeout is actually reported.
While at it, fix the grammar in the failure message...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/ti davinci_mdio: don't hold a spin lock while calling pm_runtime
was playing with suspend and run into this:
|BUG: sleeping function called from invalid context at drivers/base/power/runtime.c:891
|in_atomic(): 1, irqs_disabled(): 0, pid: 1963, name: bash
|6 locks held by bash/1963:
|CPU: 0 PID: 1963 Comm: bash Not tainted 3.10.0-rc4+ #50
|[<c0014fdc>] (unwind_backtrace+0x0/0xf8) from [<c0011da4>] (show_stack+0x10/0x14)
|[<c0011da4>] (show_stack+0x10/0x14) from [<c02e8680>] (__pm_runtime_idle+0xa4/0xac)
|[<c02e8680>] (__pm_runtime_idle+0xa4/0xac) from [<c0341158>] (davinci_mdio_suspend+0x6c/0x9c)
|[<c0341158>] (davinci_mdio_suspend+0x6c/0x9c) from [<c02e0628>] (platform_pm_suspend+0x2c/0x54)
|[<c02e0628>] (platform_pm_suspend+0x2c/0x54) from [<c02e52bc>] (dpm_run_callback.isra.3+0x2c/0x64)
|[<c02e52bc>] (dpm_run_callback.isra.3+0x2c/0x64) from [<c02e57e4>] (__device_suspend+0x100/0x22c)
|[<c02e57e4>] (__device_suspend+0x100/0x22c) from [<c02e67e8>] (dpm_suspend+0x68/0x230)
|[<c02e67e8>] (dpm_suspend+0x68/0x230) from [<c0072a20>] (suspend_devices_and_enter+0x68/0x350)
|[<c0072a20>] (suspend_devices_and_enter+0x68/0x350) from [<c0072f18>] (pm_suspend+0x210/0x24c)
|[<c0072f18>] (pm_suspend+0x210/0x24c) from [<c0071c74>] (state_store+0x6c/0xbc)
|[<c0071c74>] (state_store+0x6c/0xbc) from [<c02714dc>] (kobj_attr_store+0x14/0x20)
|[<c02714dc>] (kobj_attr_store+0x14/0x20) from [<c01341a0>] (sysfs_write_file+0x16c/0x19c)
|[<c01341a0>] (sysfs_write_file+0x16c/0x19c) from [<c00ddfe4>] (vfs_write+0xb4/0x190)
|[<c00ddfe4>] (vfs_write+0xb4/0x190) from [<c00de3a4>] (SyS_write+0x3c/0x70)
|[<c00de3a4>] (SyS_write+0x3c/0x70) from [<c000e2c0>] (ret_fast_syscall+0x0/0x48)
I don't see a reason why the pm_runtime call must be under the lock.
Further I don't understand why this is a spinlock and not mutex.
Cc: Mugunthan V N <mugunthanvnm@ti.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Mugunthan V N <mugunthanvnm@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Wood [Fri, 7 Jun 2013 00:16:30 +0000 (19:16 -0500)]
kvm/ppc/booke64: Disable e6500 support
The previous patch made 64-bit booke KVM build again, but Altivec
support is still not complete, and we can't prevent the guest from
turning on Altivec (which can corrupt host state until state
save/restore is implemented). Disable e6500 on KVM until this is
fixed.
Signed-off-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
Mihai Caraman [Fri, 7 Jun 2013 00:16:29 +0000 (19:16 -0500)]
kvm/ppc/booke64: Fix AltiVec interrupt numbers and build breakage
Interrupt numbers defined for Book3E follows IVORs definition. Align
BOOKE_INTERRUPT_ALTIVEC_UNAVAIL and BOOKE_INTERRUPT_ALTIVEC_ASSIST to this
rule which also fixes the build breakage.
IVORs 32 and 33 are shared so reflect this in the interrupts naming.
This fixes a build break for 64-bit booke KVM.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> Signed-off-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
Ben Greear [Thu, 6 Jun 2013 21:29:49 +0000 (14:29 -0700)]
Fix lockup related to stop_machine being stuck in __do_softirq.
The stop machine logic can lock up if all but one of the migration
threads make it through the disable-irq step and the one remaining
thread gets stuck in __do_softirq. The reason __do_softirq can hang is
that it has a bail-out based on jiffies timeout, but in the lockup case,
jiffies itself is not incremented.
To work around this, re-add the max_restart counter in __do_irq and stop
processing irqs after 10 restarts.
Thanks to Tejun Heo and Rusty Russell and others for helping me track
this down.
This was introduced in 3.9 by commit c10d73671ad3 ("softirq: reduce
latencies").
It may be worth looking into ath9k to see if it has issues with its irq
handler at a later date.
The hang stack traces look something like this:
------------[ cut here ]------------
WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x9c/0xa7()
Watchdog detected hard LOCKUP on cpu 2
Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
Pid: 23, comm: migration/2 Tainted: G C 3.9.4+ #11
Call Trace:
<NMI> warn_slowpath_common+0x85/0x9f
warn_slowpath_fmt+0x46/0x48
watchdog_overflow_callback+0x9c/0xa7
__perf_event_overflow+0x137/0x1cb
perf_event_overflow+0x14/0x16
intel_pmu_handle_irq+0x2dc/0x359
perf_event_nmi_handler+0x19/0x1b
nmi_handle+0x7f/0xc2
do_nmi+0xbc/0x304
end_repeat_nmi+0x1e/0x2e
<<EOE>>
cpu_stopper_thread+0xae/0x162
smpboot_thread_fn+0x258/0x260
kthread+0xc7/0xcf
ret_from_fork+0x7c/0xb0
---[ end trace 4947dfa9b0a4cec3 ]---
BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:17]
Modules linked in: ath9k ath9k_common ath9k_hw ath mac80211 cfg80211 nfsv4 auth_rpcgss nfs fscache nf_nat_ipv4 nf_nat veth 8021q garp stp mrp llc pktgen lockd sunrpc]
irq event stamp: 835637905
hardirqs last enabled at (835637904): __do_softirq+0x9f/0x257
hardirqs last disabled at (835637905): apic_timer_interrupt+0x6d/0x80
softirqs last enabled at (5654720): __do_softirq+0x1ff/0x257
softirqs last disabled at (5654725): irq_exit+0x5f/0xbb
CPU 1
Pid: 17, comm: migration/1 Tainted: G WC 3.9.4+ #11 To be filled by O.E.M. To be filled by O.E.M./To be filled by O.E.M.
RIP: tasklet_hi_action+0xf0/0xf0
Process migration/1
Call Trace:
<IRQ>
__do_softirq+0x117/0x257
irq_exit+0x5f/0xbb
smp_apic_timer_interrupt+0x8a/0x98
apic_timer_interrupt+0x72/0x80
<EOI>
printk+0x4d/0x4f
stop_machine_cpu_stop+0x22c/0x274
cpu_stopper_thread+0xae/0x162
smpboot_thread_fn+0x258/0x260
kthread+0xc7/0xcf
ret_from_fork+0x7c/0xb0
Signed-off-by: Ben Greear <greearb@candelatech.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Pekka Riikonen <priikone@iki.fi> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>