Most SIGP orders are handled partially in kernel and partially in
user space. In order to:
- Get a correct SIGP SET PREFIX handler that informs user space
- Avoid race conditions between concurrently executed SIGP orders
- Serialize SIGP orders per VCPU
We need to handle all "slow" SIGP orders in user space. The remaining
ones to be handled completely in kernel are:
- SENSE
- SENSE RUNNING
- EXTERNAL CALL
- EMERGENCY SIGNAL
- CONDITIONAL EMERGENCY SIGNAL
According to the PoP, they have to be fast. They can be executed
without conflicting with the actions of other pending/concurrently
executing orders (e.g. STOP vs. START).
This patch introduces a new capability that will - when enabled -
forward all but the mentioned SIGP orders to user space. The
instruction counters in the kernel are still updated.
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
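The split described in this commit message can be pictured as a simple classifier. This is a minimal Python sketch, not the kernel code: the order names come from the message, while the function name, the `capability_enabled` parameter and the string-based representation are illustrative assumptions.

```python
# Orders the commit message lists as staying completely in the kernel.
FAST_SIGP_ORDERS = frozenset({
    "SENSE",
    "SENSE RUNNING",
    "EXTERNAL CALL",
    "EMERGENCY SIGNAL",
    "CONDITIONAL EMERGENCY SIGNAL",
})


def forward_to_user_space(order: str, capability_enabled: bool) -> bool:
    """Return True if the order is forwarded to user space unconditionally.

    `capability_enabled` stands in for the new capability (hypothetical name).
    When it is off, nothing is force-forwarded and the previous mixed
    kernel/user handling applies, which this sketch does not model.
    """
    if not capability_enabled:
        return False
    return order not in FAST_SIGP_ORDERS
```

With the capability on, `forward_to_user_space("SET PREFIX", True)` is true while `forward_to_user_space("SENSE", True)` is false.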
KVM: s390: only one external call may be pending at a time
Only one external call may be pending at a vcpu at a time. For this
reason, we have to detect whether the SIGP external call interpretation
facility is available. If so, all external calls have to be injected
using this mechanism.
SIGP EXTERNAL CALL orders have to return whether another external
call is already pending. This check was missing until now.
Until now, SIGP SENSE also did not report in all conditions whether an
external call was pending.
If a SIGP EXTERNAL CALL irq is to be injected and one is already
pending, -EBUSY is returned.
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
This patch cleans up the SIGP SET PREFIX code.
A SIGP SET PREFIX irq may only be injected if the target vcpu is
stopped. Let's move the checking code into the injection code and
return -EBUSY if the target vcpu is not stopped.
Reviewed-by: Jens Freimann <jfrei@linux.vnet.ibm.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: a VCPU may only stop when no interrupts are left pending
As SIGP STOP is the interrupt with the lowest priority, it may only
result in a stop of the vcpu when no other interrupts are left pending.
To detect whether a non-stop irq is pending, we need a way to mask out
stop irqs from the general kvm_cpu_has_interrupt() function. For this
reason, the existing function (with an outdated name) is replaced by
kvm_s390_vcpu_has_irq(), which allows pending stop irqs to be masked out.
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
In order to get rid of the action_flags and to properly migrate pending SIGP
STOP irqs triggered e.g. by SIGP STOP AND STORE STATUS, we need to remember
whether to store the status when stopping.
For this reason, a new parameter (flags) for the SIGP STOP irq is introduced.
These flags further define details of the requested STOP and can be easily
migrated.
Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: forward hrtimer if guest ckc not pending yet
Patch 0759d0681cae ("KVM: s390: cleanup handle_wait by reusing
kvm_vcpu_block") changed the way pending guest clock comparator
interrupts are detected. It was assumed that as soon as the hrtimer
wakes up, the condition for the guest ckc is satisfied.
This is however only true as long as adjclock() doesn't speed
up the monotonic clock. Reason is that the hrtimer is based on
CLOCK_MONOTONIC, the guest clock comparator detection is based
on the raw TOD clock. If CLOCK_MONOTONIC runs faster than the
TOD clock, the hrtimer wakes the target VCPU up too early and
the target VCPU will not detect any pending interrupts, therefore
going back to sleep. It will never be woken up again because the
hrtimer has finished. The VCPU is stuck.
As a quick fix, we have to forward the hrtimer until the guest
clock comparator is really due, to guarantee properly timed wake
ups.
As the hrtimer callback might be triggered on another cpu, we
have to make sure that the timer is really stopped and not currently
executing the callback on another cpu. This can happen if the vcpu
thread is scheduled onto another physical cpu, but the timer base
is not migrated. So let's use hrtimer_cancel instead of try_to_cancel.
A proper fix might be to introduce a RAW based hrtimer.
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
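The "forward the hrtimer until the guest clock comparator is really due" idea can be sketched as follows. This is a simplified Python model under stated assumptions, not the kernel callback: the function names and the tuple return value are illustrative, and the comparison ignores the guest epoch offset. The only hardware fact used is that TOD clock bit 51 ticks once per microsecond, so 4096 TOD units make up 1000 ns.

```python
def tod_to_ns(tod_units: int) -> int:
    # 4096 TOD units per microsecond, i.e. one unit is 125/512 ns.
    return (tod_units * 125) >> 9


def on_timer_wakeup(ckc: int, tod_now: int):
    """Decide what to do when the wakeup timer fires.

    Returns ("wake", 0) if the clock comparator is due according to the
    TOD clock, otherwise ("restart", remaining_ns) so the timer can be
    forwarded by the time that is still missing.
    """
    if ckc <= tod_now:
        return ("wake", 0)
    return ("restart", tod_to_ns(ckc - tod_now))
```

If CLOCK_MONOTONIC ran fast and the timer fired 4096 TOD units early, `on_timer_wakeup` asks for a restart of 1000 ns instead of letting the VCPU go back to sleep with no timer armed.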
Dominik Dingel [Fri, 31 Oct 2014 13:10:41 +0000 (14:10 +0100)]
KVM: s390: Allow userspace to limit guest memory size
With commit c6c956b80bdf ("KVM: s390/mm: support gmap page tables with less
than 5 levels") we are able to define a limit for the guest memory size.
As we round up the guest size with respect to the levels of page tables,
we get to guest limits of: 2048 MB, 4096 GB, 8192 TB and 16384 PB.
We currently limit the guest size to 16 TB, which means we end up
creating a page table structure supporting guest sizes up to 8192 TB.
This patch introduces an interface that allows userspace to tune
this limit. This may bring performance improvements for small guests.
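The rounding described in this message can be illustrated with a small Python sketch. The four limits are the ones listed in the text; the function and constant names are hypothetical and the sketch does not mirror the actual interface.

```python
MB, GB, TB, PB = 1 << 20, 1 << 30, 1 << 40, 1 << 50

# Guest limits from the commit message, one per supported page-table depth.
LEVEL_LIMITS = [2048 * MB, 4096 * GB, 8192 * TB, 16384 * PB]


def effective_guest_limit(requested_bytes: int) -> int:
    """Round a requested guest size up to the next page-table limit."""
    for limit in LEVEL_LIMITS:
        if requested_bytes <= limit:
            return limit
    raise ValueError("requested size exceeds the largest supported limit")
```

A request of 16 TB rounds up to 8192 TB, which matches the behavior the message describes for the current default, while a 1 GB guest would fit under the 2048 MB limit.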
Dominik Dingel [Tue, 2 Dec 2014 15:53:21 +0000 (16:53 +0100)]
KVM: s390: move vcpu specific initalization to a later point
As a later patch will allow gmaps to be recreated with new limits,
we need to make sure that vcpus get their reference for that gmap
after they increased the online_vcpu counter, so there is no possible race.
While we are doing this, we can also simplify the vcpu_init function by
moving the ucontrol specifics to their own function.
This way we also start setting kvm_valid_regs for the ucontrol path.
Reviewed-by: Jens Freimann <jfrei@linux.vnet.ibm.com> Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Dominik Dingel [Thu, 4 Dec 2014 14:47:07 +0000 (15:47 +0100)]
KVM: remove unneeded return value of vcpu_postcreate
The return value of kvm_arch_vcpu_postcreate is not checked in its
caller. This is okay, because only x86 provides vcpu_postcreate right
now and it could only fail if vcpu_load failed. But that is not
possible during KVM_CREATE_VCPU (kvm_arch_vcpu_load is void, too), so
just get rid of the unchecked return value.
Nadav Amit [Thu, 25 Dec 2014 00:52:23 +0000 (02:52 +0200)]
KVM: x86: Access to LDT/GDT that wraparound is incorrect
When an access to a descriptor in the LDT/GDT wraps around outside long mode,
the address of the descriptor should be truncated to 32 bits. Citing Intel SDM 2.1.1.1
"Global and Local Descriptor Tables in IA-32e Mode": "GDTR and LDTR registers
are expanded to 64-bits wide in both IA-32e sub-modes (64-bit mode and
compatibility mode)."
So in other cases, we need to truncate. Create a new function that returns a
pointer to the descriptor table to avoid too much code duplication.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
[Wrap 64-bit check with #ifdef CONFIG_X86_64, to avoid a "right shift count
>= width of type" warning and consequent undefined behavior. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
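The truncation rule can be shown with a short Python sketch. It is an illustration only, not the emulator code: the function name and the `long_mode` flag are assumptions, and segment limit checks are left out.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def descriptor_address(table_base: int, selector: int, long_mode: bool) -> int:
    """Linear address of the descriptor named by `selector`.

    Selector bits 3..15 index the table and each slot is 8 bytes wide.
    Outside long mode the sum wraps at 32 bits.
    """
    index = selector >> 3
    addr = table_base + index * 8
    return addr & (MASK64 if long_mode else MASK32)
```

With a table base of 0xFFFFFFF8 and selector 0x10, the untruncated sum is 0x100000008; outside long mode the descriptor is read from 0x8 instead.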
Nadav Amit [Thu, 25 Dec 2014 00:52:22 +0000 (02:52 +0200)]
KVM: x86: Do not set access bit on accessed segments
When a segment is loaded, the segment accessed bit is written unconditionally.
In fact, it should only be written if the segment did not already have the
accessed bit set. In addition, skipping the write can improve performance.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Thu, 25 Dec 2014 00:52:21 +0000 (02:52 +0200)]
KVM: x86: POP [ESP] is not emulated correctly
According to Intel SDM: "If the ESP register is used as a base register for
addressing a destination operand in memory, the POP instruction computes the
effective address of the operand after it increments the ESP register."
The current emulation does not behave this way. The fix requires spending
another of the precious instruction flags and checking the flag in decode_modrm.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
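The ordering quoted from the SDM can be made concrete with a toy model. This Python sketch assumes a 32-bit stack and represents memory as a dict of address to value; the function name and the `disp` parameter are hypothetical and it is not the emulator's implementation.

```python
def pop_to_esp_based_operand(memory: dict, esp: int, disp: int = 0) -> int:
    """Emulate a 32-bit POP whose memory destination is [ESP + disp].

    The value is read from the old top of stack, ESP is incremented,
    and only then is the destination address computed from the new ESP.
    Returns the new ESP.
    """
    value = memory[esp]                 # read from the old top of stack
    esp = (esp + 4) & 0xFFFFFFFF        # increment first ...
    dest = (esp + disp) & 0xFFFFFFFF    # ... then compute the effective address
    memory[dest] = value
    return esp
```

Starting with ESP at 0x1000, the popped value lands at 0x1004 rather than 0x1000, which is the difference the commit fixes.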
Nadav Amit [Thu, 25 Dec 2014 00:52:19 +0000 (02:52 +0200)]
KVM: x86: JMP/CALL using call- or task-gate causes exception
The KVM emulator does not emulate JMP and CALL that target a call gate or a
task gate. This patch does not try to implement these scenarios as they are
presumably rare; yet it returns X86EMUL_UNHANDLEABLE error in such cases
instead of generating an exception.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Thu, 25 Dec 2014 00:52:18 +0000 (02:52 +0200)]
KVM: x86: fnstcw and fnstsw may cause spurious exception
Since the operand size of fnstcw and fnstsw is updated during the execution,
the emulation may cause spurious exceptions as it reads the memory beforehand.
Mark these instructions as Mov (since the previous value is ignored) and
DstMem16 to simplify the setting of the operand size.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Thu, 25 Dec 2014 00:52:17 +0000 (02:52 +0200)]
KVM: x86: pop sreg accesses only 2 bytes
Although pop sreg updates RSP according to the operand size, only 2 bytes are
read. The current behavior may result in incorrect #GP or #PF exceptions.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 2 Oct 2013 14:56:14 +0000 (16:56 +0200)]
KVM: x86: mmu: remove argument to kvm_init_shadow_mmu and kvm_init_shadow_ept_mmu
The initialization function in mmu.c can always use walk_mmu, which
is known to be vcpu->arch.mmu. Only init_kvm_nested_mmu is used to
initialize vcpu->arch.nested_mmu.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Marcelo Tosatti [Tue, 16 Dec 2014 14:08:15 +0000 (09:08 -0500)]
KVM: x86: add option to advance tscdeadline hrtimer expiration
For the hrtimer which emulates the tscdeadline timer in the guest,
add an option to advance expiration, and busy spin on VM-entry waiting
for the actual expiration time to elapse.
This allows achieving low latencies in cyclictest (or any scenario
which requires strict timing regarding timer expiration).
Reduces average cyclictest latency from 12us to 8us
on Core i5 desktop.
Note: this option requires tuning to find the appropriate value
for a particular hardware/guest combination. One method is to measure the
average delay between apic_timer_fn and VM-entry.
Another method is to start with 1000ns, and increase the value
in say 500ns increments until avg cyclictest numbers stop decreasing.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tiejun Chen [Mon, 22 Dec 2014 09:32:57 +0000 (10:32 +0100)]
kvm: x86: vmx: NULL out hwapic_isr_update() in case of !enable_apicv
In most cases, before calling hwapic_isr_update() we check whether
kvm_apic_vid_enabled() == 1, but actually,
kvm_apic_vid_enabled()
    -> kvm_x86_ops->vm_has_apicv()
        -> vmx_vm_has_apicv() or '0' in svm case
            -> return enable_apicv && irqchip_in_kernel(kvm)
So there is a small cost in calling vmx_vm_has_apicv() again inside
hwapic_isr_update(). Instead, just NULL out hwapic_isr_update() in
hardware_setup() in case of !enable_apicv and make all related code
follow this. Note that we don't check irqchip_in_kernel() as part of
that condition, since we should make sure that no caller runs
without an in-kernel irqchip anyway.
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nicholas Krause [Fri, 19 Dec 2014 02:13:22 +0000 (21:13 -0500)]
KVM: x86: Remove FIXMEs in emulate.c for the function,task_switch_32
Remove FIXME comments about needing fault addresses to be returned. These
are propagated from walk_addr_generic to gva_to_gpa and from there to
ops->read_std and ops->write_std.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: nVMX: consult PFEC_MASK and PFEC_MATCH when generating #PF VM-exit
When generating #PF VM-exit, check equality:
(PFEC & PFEC_MASK) == PFEC_MATCH
If they are equal, bit 14 of the exception bitmap decides whether a #PF
VM-exit is generated. If they are not equal, the inverse of bit 14 is used.
Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
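The decision rule from this message fits in a few lines. The following Python sketch is only an illustration of that rule; the function and parameter names are chosen for readability and are not taken from the patch.

```python
PF_VECTOR = 14  # page fault is exception vector 14


def page_fault_causes_vmexit(pfec: int, pfec_mask: int, pfec_match: int,
                             exception_bitmap: int) -> bool:
    """Apply the PFEC_MASK / PFEC_MATCH rule for #PF VM-exits."""
    bit14 = bool(exception_bitmap & (1 << PF_VECTOR))
    if (pfec & pfec_mask) == pfec_match:
        return bit14
    return not bit14
```

With mask and match both zero, the comparison always succeeds and bit 14 alone decides. Setting the mask to a bit the match does not contain inverts the meaning of bit 14 for faults that have that bit set in their error code.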
This patch improves the checks required by the Intel Software Developer Manual.
- SMM MSRs are not allowed.
- microcode MSRs are not allowed.
- check x2apic MSRs only when LAPIC is in x2apic mode.
- MSR switch areas must be aligned to 16 bytes.
- address of first and last byte in MSR switch areas should not set any bits
beyond the processor's physical-address width.
Also it adds warning messages on failures during MSR switch. These messages
are useful for people who debug their VMMs in nVMX.
Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
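The two address checks on MSR switch areas listed in this message can be sketched as below. This is a hedged Python illustration, not the patch itself: the function name is hypothetical, `maxphyaddr` stands for the processor's physical-address width in bits, and treating an empty area as valid is an assumption of the sketch. Each MSR entry occupies 16 bytes.

```python
MSR_ENTRY_SIZE = 16  # bytes per entry in an MSR switch area


def msr_switch_area_valid(addr: int, count: int, maxphyaddr: int) -> bool:
    """Check alignment and physical-address width of an MSR switch area."""
    if count == 0:
        return True  # assumption: nothing to check for an empty area
    if addr % 16 != 0:
        return False
    last_byte = addr + count * MSR_ENTRY_SIZE - 1
    # Neither the first nor the last byte may use bits beyond maxphyaddr.
    return (addr >> maxphyaddr) == 0 and (last_byte >> maxphyaddr) == 0
```

For a 36-bit physical-address width, an area that starts 16 bytes below the 64 GB boundary is valid with one entry but not with two, because the last byte of the second entry would cross the boundary.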
Wincy Van [Thu, 11 Dec 2014 05:52:58 +0000 (08:52 +0300)]
KVM: nVMX: Add nested msr load/restore algorithm
Several hypervisors need the MSR auto load/restore feature.
On nested entry, we read the MSRs from the VM-entry MSR load area specified
by L1 and load them via kvm_set_msr.
When a nested exit occurs, we get the MSRs via kvm_get_msr and write
them to L1's MSR store area. After this, we read the MSRs from the VM-exit
MSR load area and load them via kvm_set_msr.
Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Linus Torvalds [Mon, 5 Jan 2015 22:49:02 +0000 (14:49 -0800)]
Merge tag 'powerpc-3.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Pull powerpc fixes from Michael Ellerman:
- Wire up sys_execveat(). Tested on 32 & 64 bit.
- Fix for kdump on LE systems with cpus hot unplugged.
- Revert Anton's fix for "kernel BUG at kernel/smpboot.c:134!", this
broke other platforms, we'll do a proper fix for 3.20.
* tag 'powerpc-3.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
Revert "powerpc: Secondary CPUs must set cpu_callin_map after setting active and online"
powerpc/kdump: Ignore failure in enabling big endian exception during crash
powerpc: Wire up sys_execveat() syscall
Linus Torvalds [Sun, 4 Jan 2015 19:46:43 +0000 (11:46 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML fixes from Richard Weinberger:
"Two fixes for UML regressions. Nothing exciting"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
x86, um: actually mark system call tables readonly
um: Skip futex_atomic_cmpxchg_inatomic() test
Pavel Machek [Sun, 4 Jan 2015 19:01:23 +0000 (20:01 +0100)]
Revert "ARM: 7830/1: delay: don't bother reporting bogomips in /proc/cpuinfo"
Commit 9fc2105aeaaf ("ARM: 7830/1: delay: don't bother reporting
bogomips in /proc/cpuinfo") breaks audio in python, and probably
elsewhere, with message
Daniel Borkmann [Sat, 3 Jan 2015 12:11:10 +0000 (13:11 +0100)]
x86, um: actually mark system call tables readonly
Commit a074335a370e ("x86, um: Mark system call tables readonly") was
supposed to mark the sys_call_table in UML as RO by adding the const,
but it doesn't have the desired effect as it's nevertheless being placed
into the data section since __cacheline_aligned enforces sys_call_table
being placed into .data..cacheline_aligned instead. We need to use
the ____cacheline_aligned version instead to fix this issue.
Before:
$ nm -v arch/x86/um/sys_call_table_64.o | grep -1 "sys_call_table"
                 U sys_writev
0000000000000000 D sys_call_table
0000000000000000 D syscall_table_size
After:
$ nm -v arch/x86/um/sys_call_table_64.o | grep -1 "sys_call_table"
                 U sys_writev
0000000000000000 R sys_call_table
0000000000000000 D syscall_table_size
Fixes: a074335a370e ("x86, um: Mark system call tables readonly") Cc: H. Peter Anvin <hpa@zytor.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Richard Weinberger <richard@nod.at>
futex_atomic_cmpxchg_inatomic() does not work on UML because
it triggers a copy_from_user() in kernel context.
On UML copy_from_user() can only be used if the kernel was called
by a real user space process such that UML can use ptrace()
to fetch the value.
Reported-by: Miklos Szeredi <miklos@szeredi.hu> Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Richard Weinberger <richard@nod.at> Tested-by: Daniel Walter <d.walter@0x90.at>
Linus Torvalds [Fri, 2 Jan 2015 21:24:41 +0000 (13:24 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This is a set of three fixes: one to correct an abort path thinko
causing failures (and a panic) in USB on device misbehaviour, One to
fix an out of order issue in the fnic driver and one to match discard
expectations to qemu which otherwise cause Linux to behave badly as a
guest"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
SCSI: fix regression in scsi_send_eh_cmnd()
fnic: IOMMU Fault occurs when IO and abort IO is out of order
sd: tweak discard heuristics to work around QEMU SCSI issue
Linus Torvalds [Fri, 2 Jan 2015 20:57:20 +0000 (12:57 -0800)]
Merge tag 'sound-3.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Nothing too exciting as a new year's start here: most of fixes are for
ASoC, a boot crash fix on OMAP for deferred probe, a few driver
specific fixes (Intel, dwc, rockchip, rt5677), in addition to typo
fixes in kerneldoc comments for PCM"
* tag 'sound-3.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: pcm: Fix kerneldoc for params_*() functions
ASoC: rockchip: i2s: fix maxburst of dma data to 4
ASoC: rockchip: i2s: fix error defination of transmit data level
ASoC: Intel: correct the fixed free block allocation
ASoC: rt5677: fixed rt5677_dsp_vad_put rt5677_dsp_vad_get panic
ASoC: Intel: Fix BYTCR machine driver MODULE_ALIAS
ASoC: Intel: Fix BYTCR firmware name
ASoC: dwc: Iterate over all channels
ASoC: dwc: Ensure FIFOs are flushed to prevent channel swap
ASoC: Intel: Add I2C dependency to two new machines
ASoC: dapm: Remove snd_soc_of_parse_audio_routing() due to deferred probe
Linus Torvalds [Fri, 2 Jan 2015 20:07:50 +0000 (12:07 -0800)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull vhost cleanup and virtio bugfix
"There's a single change here, fixing a vhost bug where vhost
initialization fails due to used ring alignment check being too
strict"
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
vhost: relax used address alignment
virtio_ring: document alignment requirements
Linus Torvalds [Wed, 31 Dec 2014 22:52:18 +0000 (14:52 -0800)]
Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit
Pull audit fix from Paul Moore:
"One audit patch to resolve a panic/oops when recording filenames in
the audit log, see the mail archive link below.
The fix isn't as nice as I would like, as it involves an allocate/copy
of the filename, but it solves the problem and the overhead should
only affect users who have configured audit rules involving file
names.
We'll revisit this issue with future kernels in an attempt to make
this suck less, but in the meantime I think this fix should go into
the next release of v3.19-rcX.
Tobias Klauser [Wed, 31 Dec 2014 02:53:11 +0000 (10:53 +0800)]
nios2: Use preempt_schedule_irq
Follow aa0d53260596 ("ia64: Use preempt_schedule_irq") and use
preempt_schedule_irq instead of enabling/disabling interrupts and
messing around with PREEMPT_ACTIVE in the nios2 low-level preemption
code ourselves. Also get rid of the now needless re-check for
TIF_NEED_RESCHED, preempt_schedule_irq will already take care of
rescheduling.
This also fixes the following build error when building with
CONFIG_PREEMPT:
arch/nios2/kernel/built-in.o: In function `need_resched':
arch/nios2/kernel/entry.S:374: undefined reference to `PREEMPT_ACTIVE'
Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Ley Foon Tan <lftan@altera.com>
Linus Torvalds [Wed, 31 Dec 2014 01:13:13 +0000 (17:13 -0800)]
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Arnd Bergmann:
"A very small set of fixes for 3.19, as everyone was out.
The clocksource patch was something I missed for the merge window
after the change that broke arm64 was merged through arm-soc. The
other two patches are a fix for an undetected merge problem in mvebu
and a defconfig change to make some exynos boards work with the normal
multi_v7_defconfig"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
Add USB_EHCI_EXYNOS to multi_v7_defconfig
ARM: mvebu: Fix pinctrl configuration for Armada 370 DB
clocksource: arch_timer: Only use the virtual counter (CNTVCT) on arm64
Linus Torvalds [Wed, 31 Dec 2014 01:04:56 +0000 (17:04 -0800)]
Merge tag 'fbdev-fixes-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux
Pull fbdev fixes from Tomi Valkeinen:
- Fix regression with Nokia N900 display
- Fix crash on fbdev using freed __initdata logos
- Fix fb_deferred_io_fsync() return value.
* tag 'fbdev-fixes-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tomba/linux:
OMAPDSS: SDI: fix output port_num
video/fbdev: fix defio's fsync
video/logo: prevent use of logos after they have been freed
OMAPDSS: pll: NULL dereference in error handling
OMAPDSS: HDMI: remove double initializer entries
It's causing severe userspace breakage. Namely, all the utilities from
wireless-utils which are relying on CONFIG_WEXT (which means tools like
'iwconfig', 'iwlist', etc) are not working anymore. There is a 'iw'
utility in newer wireless-tools, which is supposed to be a replacement
for all the "deprecated" binaries, but it's far away from being
massively adopted.
Please see [1] for example of the userspace breakage this is causing.
In addition to that, Larry Finger reports [2] that this patch is also
causing ipw2200 driver being impossible to build.
To me this clearly shows that CONFIG_WEXT is far, far away from being
"deprecated enough" to be removed.
1) Fix double SKB free in bluetooth 6lowpan layer, from Jukka Rissanen.
2) Fix receive checksum handling in enic driver, from Govindarajulu
Varadarajan.
3) Fix NAPI poll list corruption in virtio_net and caif_virtio, from
Herbert Xu. Also, add code to detect drivers that have this mistake
in the future.
4) Fix doorbell endianness handling in mlx4 driver, from Amir Vadai.
5) Don't clobber IP6CB() before xfrm6_policy_check() is called in TCP
input path, from Nicolas Dichtel.
6) Fix MPLS action validation in openvswitch, from Pravin B Shelar.
7) Fix double SKB free in vxlan driver, also from Pravin.
8) When we scrub a packet, which happens when we are switching the
context of the packet (namespace, etc.), we should reset the
secmark. From Thomas Graf.
9) ->ndo_gso_check() needs to do more than return true/false, it also
has to allow the driver to clear netdev feature bits in order for
the caller to be able to proceed properly. From Jesse Gross.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (62 commits)
genetlink: A genl_bind() to an out-of-range multicast group should not WARN().
netlink/genetlink: pass network namespace to bind/unbind
ne2k-pci: Add pci_disable_device in error handling
bonding: change error message to debug message in __bond_release_one()
genetlink: pass multicast bind/unbind to families
netlink: call unbind when releasing socket
netlink: update listeners directly when removing socket
genetlink: pass only network namespace to genl_has_listeners()
netlink: rename netlink_unbind() to netlink_undo_bind()
net: Generalize ndo_gso_check to ndo_features_check
net: incorrect use of init_completion fixup
neigh: remove next ptr from struct neigh_table
net: xilinx: Remove unnecessary temac_property in the driver
net: phy: micrel: use generic config_init for KSZ8021/KSZ8031
net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding
openvswitch: fix odd_ptr_err.cocci warnings
Bluetooth: Fix accepting connections when not using mgmt
Bluetooth: Fix controller configuration with HCI_QUIRK_INVALID_BDADDR
brcmfmac: Do not crash if platform data is not populated
ipw2200: select CFG80211_WEXT
...
Alan Stern [Fri, 21 Nov 2014 15:44:49 +0000 (10:44 -0500)]
SCSI: fix regression in scsi_send_eh_cmnd()
Commit ac61d1955934 (scsi: set correct completion code in
scsi_send_eh_cmnd()) introduced a bug. It changed the stored return
value from a queuecommand call, but it didn't take into account that
the return value was used again later on. This patch fixes the bug by
changing the later usage.
There is a big comment in the middle of scsi_send_eh_cmnd() which
does a good job of explaining how the routine works. But it mentions
a "rtn = FAILURE" value that doesn't exist in the code. This patch
adjusts the code to match the comment (I assume the comment is right
and the code is wrong).
This fixes Bugzilla #88341.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Reported-by: Андрей Аладьев <aladjev.andrew@gmail.com> Tested-by: Андрей Аладьев <aladjev.andrew@gmail.com> Fixes: ac61d19559349e205dad7b5122b281419aa74a82 Acked-by: Hannes Reinecke <hare@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Takashi Iwai [Tue, 30 Dec 2014 15:17:13 +0000 (16:17 +0100)]
Merge tag 'asoc-fix-v3.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus
ASoC: Fixes for v3.19
A few fixes for v3.19, a few driver specifics and one core fix which
fixes a boot crash on OMAP if deferred probing kicks in due to
attempting to modify static data.
Currently we enable Exynos devices in the multi v7 defconfig, however, when
testing on my ODROID-U3, I noticed that USB was not working. Enabling this
option causes USB to work, which enables networking support as well since the
ODROID-U3 has networking on the USB bus.
[arnd] Support for odroid-u3 was added in 3.10, so it would be nice to
backport this fix at least that far.
Paul Moore [Tue, 30 Dec 2014 14:26:21 +0000 (09:26 -0500)]
audit: create private file name copies when auditing inodes
Unfortunately, while commit 4a928436 ("audit: correctly record file
names with different path name types") fixed a problem where we were
not recording filenames, it created a new problem by attempting to use
these file names after they had been freed. This patch resolves the
issue by creating a copy of the filename which the audit subsystem
frees after it is done with the string.
At some point it would be nice to resolve this issue with refcounts,
or something similar, instead of having to allocate/copy strings, but
that is almost surely beyond the scope of a -rcX patch so we'll defer
that for later. On the plus side, only audit users should be impacted
by the string copying.
Reported-by: Toralf Foerster <toralf.foerster@gmx.de> Signed-off-by: Paul Moore <pmoore@redhat.com>
fnic: IOMMU Fault occurs when IO and abort IO is out of order
When I/O is aborted by mid-layer, fnic FW will complete the I/O before
completing the abort task. In some cases abort request is completed before
the I/O, which could lead to inconsistent driver and firmware states.
In this case firmware reset would clear the inconsistent state.
Signed-off-by: Anil Chintalapati <achintal@cisco.com> Signed-off-by: Sesidhar Baddela <sebaddel@cisco.com> Signed-off-by: Hiral Shah <hishah@cisco.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
sd: tweak discard heuristics to work around QEMU SCSI issue
7985090aa020 changed the discard heuristics to give preference to the
WRITE SAME commands that (unlike UNMAP) guarantee deterministic results.
Ming Lei discovered that QEMU SCSI's WRITE SAME implementation
internally relied on limits that were only communicated for the UNMAP
case. And therefore discard commands backed by WRITE SAME would fail.
Tweak the heuristics so we still pick UNMAP in the LBPRZ=0 case and only
prefer the WRITE SAME variants if the device has the LBPRZ flag set.
Reported-by: Ming Lei <ming.lei@canonical.com> Tested-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
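The tweaked preference order can be sketched as a small decision function. This Python sketch only captures what the message states (UNMAP when LBPRZ=0, WRITE SAME preferred when LBPRZ is set); the function name, the boolean parameters, the returned strings and the fallback branches are assumptions and do not reproduce the sd driver's real logic.

```python
def pick_discard_mode(lbprz: bool, unmap_supported: bool,
                      write_same_supported: bool) -> str:
    """Choose a discard command following the tweaked heuristic."""
    if not lbprz and unmap_supported:
        return "UNMAP"        # LBPRZ=0: still pick UNMAP
    if write_same_supported:
        return "WRITE SAME"   # LBPRZ=1: prefer the WRITE SAME variants
    if unmap_supported:
        return "UNMAP"        # assumed fallback when WRITE SAME is missing
    return "DISABLED"         # assumed result when nothing is usable
```

A device that supports both commands but reports LBPRZ=0, as in the QEMU case described in the message, ends up with UNMAP.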
Tomi Valkeinen [Mon, 29 Dec 2014 07:57:11 +0000 (09:57 +0200)]
OMAPDSS: SDI: fix output port_num
After the commit ef691ff48bc8 (OMAPDSS: DT: Get source endpoint by
matching reg-id) we look for the SDI output using the port number.
However, the SDI driver doesn't set the port number, which causes the
SDI display to not initialize.
Fix this by setting the SDI port number to 1. We use a hardcoded value,
as SDI was used only on OMAP3 and it's always port number 1 there.
Reported-by: Aaro Koskinen <aaro.koskinen@iki.fi> Reported-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
Tomi Valkeinen [Fri, 19 Dec 2014 11:55:41 +0000 (13:55 +0200)]
video/fbdev: fix defio's fsync
fb_deferred_io_fsync() returns the value of schedule_delayed_work() as
an error code, but schedule_delayed_work() does not return an error. It
returns true/false depending on whether the work was already queued.
Fix this by ignoring the return value of schedule_delayed_work().
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Cc: stable@vger.kernel.org
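The bug pattern here is treating a "was it newly queued?" boolean as an error code. The Python sketch below models it with a stand-in for schedule_delayed_work(); both function names and the list-based queue are illustrative assumptions, not the fbdev code.

```python
def schedule_delayed_work_stub(queue: list, work: str) -> bool:
    """Stand-in for schedule_delayed_work().

    Returns False if the work was already queued and True if it was
    queued by this call. Neither value signals an error.
    """
    if work in queue:
        return False
    queue.append(work)
    return True


def deferred_io_fsync(queue: list, work: str) -> int:
    """Fixed variant: queue the work, ignore the boolean, report success."""
    schedule_delayed_work_stub(queue, work)
    return 0
```

Returning the boolean directly would make fsync report a nonzero value whenever the work was newly queued, although nothing went wrong.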
Linus Torvalds [Tue, 30 Dec 2014 05:09:57 +0000 (21:09 -0800)]
Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
"A set of three minor cifs fixes"
* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
cifs: make new inode cache when file type is different
Fix signed/unsigned pointer warning
Convert MessageID in smb2_hdr to LE
Linus Torvalds [Tue, 30 Dec 2014 04:43:10 +0000 (20:43 -0800)]
Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF & isofs fixes from Jan Kara:
"A couple of UDF fixes of handling of corrupted media and one iso9660
fix of the same"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Reduce repeated dereferences
udf: Check component length before reading it
udf: Check path length when reading symlink
udf: Verify symlink size before loading it
udf: Verify i_size when loading inode
isofs: Fix unchecked printing of ER records
Linus Torvalds [Tue, 30 Dec 2014 02:50:02 +0000 (18:50 -0800)]
Merge tag 'pm+acpi-3.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI material from Rafael J Wysocki:
"These are fixes (operating performance points library, cpufreq-dt
driver, cpufreq core, ACPI backlight, cpupower tool), cleanups
(cpuidle), new processor IDs for the RAPL (Running Average Power
Limit) power capping driver, and a modification of the generic power
domains framework allowing modular drivers to call one of its helper
functions.
Specifics:
- Fix for a potential NULL pointer dereference in the cpufreq core
due to an initialization race condition (Ethan Zhao).
- Fixes for abuse of the OPP (Operating Performance Points) API
related to RCU and other minor issues in the OPP library and the
cpufreq-dt driver (Dmitry Torokhov).
- cpuidle governors cleanup making them measure idle duration in a
better way without using the CPUIDLE_FLAG_TIME_INVALID flag which
allows that flag to be dropped from the ACPI cpuidle driver and
from the core too (Len Brown).
- New ACPI backlight blacklist entries for Samsung machines without a
working native backlight interface that need to use the ACPI
backlight instead (Aaron Lu).
- New CPU IDs of future Intel Xeon CPUs for the Intel RAPL power
capping driver (Jacob Pan).
- Generic power domains framework modification to export the
of_genpd_get_from_provider() function to modular drivers that will
allow future driver modifications to be based on the mainline (Amit
Daniel Kachhap).
- Two fixes for the cpupower tool (Michal Privoznik, Prarit
Bhargava)"
* tag 'pm+acpi-3.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / video: Add some Samsung models to disable_native_backlight list
tools / cpupower: Fix no idle state information return value
tools / cpupower: Correctly detect if running as root
cpufreq: fix a NULL pointer dereference in __cpufreq_governor()
cpufreq-dt: defer probing if OPP table is not ready
PM / OPP: take RCU lock in dev_pm_opp_get_opp_count
PM / OPP: fix warning in of_free_opp_table()
PM / OPP: add some lockdep annotations
powercap / RAPL: add IDs for future Xeon CPUs
PM / Domains: Export of_genpd_get_from_provider function
cpuidle / ACPI: remove unused CPUIDLE_FLAG_TIME_INVALID
cpuidle: ladder: Better idle duration measurement without using CPUIDLE_FLAG_TIME_INVALID
cpuidle: menu: Better idle duration measurement without using CPUIDLE_FLAG_TIME_INVALID
Linus Torvalds [Mon, 29 Dec 2014 21:30:50 +0000 (13:30 -0800)]
Merge tag 'spi-v3.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A few driver specific fixes here, the DMA burst size increase in the
spfi driver is a fix to make the hardware happier in some situations"
* tag 'spi-v3.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: img-spfi: Increase DMA burst size
spi: img-spfi: Enable controller before starting TX DMA
spi: sh-msiof: Add runtime PM lock in initializing
Linus Torvalds [Mon, 29 Dec 2014 21:24:38 +0000 (13:24 -0800)]
Merge tag 'regulator-v3.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull one regulator fix from Mark Brown:
"One fix here, a fix for the voltage mapping on one of the s2mps11
regulators which broke systems using it including apparently the
Gear 2 smartwatches"
* tag 'regulator-v3.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: s2mps11: Fix dw_mmc failure on Gear 2
Linus Torvalds [Mon, 29 Dec 2014 21:13:41 +0000 (13:13 -0800)]
Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux
Pull thermal management updates from Zhang Rui:
"First of all, the most important change is the thermal cpu cooling
fixes. The major fix here is to have proper sequencing between the
cpufreq layer and thermal cpu cooling registration. A side benefit of
this fix is an improvement in the thermal driver code: thermal
drivers that require cpu cooling do not need to check for the cpufreq
layer. The requirement now is to propagate the error code, if any,
while registering cpu cooling device. Thanks to Viresh for
implementing the required CPUfreq changes.
Second, a new driver is introduced for the int340x processor thermal
device. Given that int340x thermal is disabled by default, that this
processor thermal device is only available on limited platforms, and
that the driver does nothing but expose some thermal limitation
information for user space to use, I think it is safe to include
it in this pull request after missing 3.19-rc2.
- several small fixes and cleanups for int340x thermal drivers"
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (43 commits)
Thermal/int340x/int3403: Free acpi notification handler
Thermal/int340x/processor_thermal: Fix memory leak
Thermal/int340x/int3403: Fix memory leak
thermal: int340x: Introduce processor reporting device
thermal: int340x_thermal: drop owner assignment from platform_drivers
thermal: drop owner assignment from platform_drivers
thermal: cpu_cooling: document node in struct cpufreq_cooling_device
thermal/powerclamp: add ids for future xeon cpus
Thermal/int340x: Handle properly the case when _trt or _art acpi entry is missing
thermal: cpu_cooling: return ERR_PTR() for !CPU_THERMAL or !THERMAL_OF
thermal: cpu_cooling: small memory leak on error
thermal: ti-soc-thermal: Do not print error message in the EPROBE_DEFER case
thermal: db8500: Do not print error message in the EPROBE_DEFER case
thermal: imx: Do not print error message in the EPROBE_DEFER case
thermal: Fix cdev registration with THERMAL_NO_LIMIT on 64bit
drivers: thermal: Remove ARCH_HAS_BANDGAP dependency for samsung
thermal:core:fix: Check return code of the ->get_max_state() callback
thermal: cpu_cooling: update copyright tags
thermal: cpu_cooling: Use cpufreq_dev->freq_table for finding level/freq
thermal: cpu_cooling: Store frequencies in descending order
...
Michal Hocko [Mon, 29 Dec 2014 19:30:35 +0000 (20:30 +0100)]
mm: get rid of radix tree gfp mask for pagecache_get_page
Commit 2457aec63745 ("mm: non-atomically mark page accessed during page
cache allocation where possible") has added a separate parameter for
specifying gfp mask for radix tree allocations.
Not only is this less than optimal from an API point of view, because
it is error prone, it is also currently buggy:
grab_cache_page_write_begin uses GFP_KERNEL for the radix tree, so if
fgp_flags doesn't contain FGP_NOFS (mostly controlled by the fs via the
AOP_FLAG_NOFS flag) but the mapping_gfp_mask has __GFP_FS cleared, the
radix tree allocation does not obey the restriction and might
recurse into the filesystem and cause deadlocks. Unfortunately this is
the case for most filesystems, because only ext4 and gfs2 use
AOP_FLAG_NOFS.
Let's simply remove the radix_gfp_mask parameter because the allocation
context is the same for both the page cache and the radix tree. Just
make sure that the radix tree gets only the sane subset of the mask
(e.g. do not pass __GFP_WRITE).
In the long term it is preferable to convert the remaining users of
AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this
interface even further.
Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
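Illustration (not part of the commit): a minimal sketch of deriving the
radix tree mask from the page cache mask, so that a cleared FS bit stays
cleared and a write hint is never passed on. All DEMO_* flag values and
the mask itself are hypothetical; the real definitions live in the
kernel's gfp headers.

```c
#include <assert.h>

typedef unsigned int demo_gfp_t;

/* Hypothetical flag bits, for illustration only. */
#define DEMO_GFP_WAIT  0x01u
#define DEMO_GFP_IO    0x02u
#define DEMO_GFP_FS    0x04u
#define DEMO_GFP_WRITE 0x08u

/* The "sane subset": bits that are meaningful for a radix tree node allocation. */
#define DEMO_RADIX_TREE_MASK (DEMO_GFP_WAIT | DEMO_GFP_IO | DEMO_GFP_FS)

/*
 * One mask describes the allocation context for both the page and the
 * radix tree node; the radix tree merely sees a filtered copy of it.
 */
static demo_gfp_t demo_radix_tree_gfp(demo_gfp_t page_gfp)
{
	return page_gfp & DEMO_RADIX_TREE_MASK;
}
```

Because the radix tree mask is computed from the page mask, a caller
cannot accidentally allow filesystem recursion for one allocation while
forbidding it for the other.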
mmc: core: stop trying to switch width when only one bit is supported
mmc_select_bus_width() will try to switch to MMC_BUS_WIDTH_4 even if
MMC_CAP_4_BIT_DATA and MMC_CAP_8_BIT_DATA are not set in host->caps.
Return as soon as possible when those flags are not set.
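Illustration (not part of the commit): a user-space model of the early
return. The demo_* types, the DEMO_CAP_* bits and the switch_attempts
counter are hypothetical stand-ins for host->caps and the real bus
width switching code.

```c
#include <assert.h>

/* Hypothetical capability bits mirroring the 4-bit and 8-bit data caps. */
#define DEMO_CAP_4_BIT_DATA (1u << 0)
#define DEMO_CAP_8_BIT_DATA (1u << 1)

enum demo_bus_width {
	DEMO_WIDTH_1 = 1,
	DEMO_WIDTH_4 = 4,
	DEMO_WIDTH_8 = 8,
};

struct demo_host {
	unsigned int caps;
	int switch_attempts;	/* counts how often a switch was tried */
};

static enum demo_bus_width demo_select_bus_width(struct demo_host *host)
{
	/* Neither wide mode is supported: stay at one bit, try nothing. */
	if (!(host->caps & (DEMO_CAP_4_BIT_DATA | DEMO_CAP_8_BIT_DATA)))
		return DEMO_WIDTH_1;

	host->switch_attempts++;
	return (host->caps & DEMO_CAP_8_BIT_DATA) ? DEMO_WIDTH_8 : DEMO_WIDTH_4;
}
```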
virtio 1.0 only requires the used ring address to be 4 byte aligned,
but vhost required 8 bytes (the size of vring_used_elem).
Fix up vhost to match the spec.
Additionally, while vhost correctly requires 8 byte
alignment for the log, that is unrelated to the used ring:
it is a consequence of the log having u64 entries.
Tweak the code to make that clearer.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The host needs to know the vring element alignment requirements.
Simply using alignof on the structures doesn't work reliably: on some
platforms gcc has alignof(uint32_t) == 2.
Add macros for the alignment as specified in virtio 1.0 cs01, and
export them to userspace as well.
Acked-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
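Illustration (not part of the commit): a minimal sketch of checking
ring addresses against fixed alignment constants instead of alignof.
The macro names and the values 16/2/4 are written from memory of the
virtio 1.0 ring layout and should be treated as assumptions here; the
demo_* helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed alignment constants (descriptor table, available ring, used ring). */
#define VRING_DESC_ALIGN_SIZE  16
#define VRING_AVAIL_ALIGN_SIZE 2
#define VRING_USED_ALIGN_SIZE  4

/* align must be a power of two. */
static bool demo_is_aligned(uint64_t addr, uint64_t align)
{
	return (addr & (align - 1)) == 0;
}

/* True if all three ring addresses satisfy the per-ring alignment. */
static bool demo_vring_addrs_ok(uint64_t desc, uint64_t avail, uint64_t used)
{
	return demo_is_aligned(desc, VRING_DESC_ALIGN_SIZE) &&
	       demo_is_aligned(avail, VRING_AVAIL_ALIGN_SIZE) &&
	       demo_is_aligned(used, VRING_USED_ALIGN_SIZE);
}
```

A used ring at an address such as 0x1204 is 4 byte aligned but not
8 byte aligned, so it passes this check even though a check against
the size of a used element would reject it.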
Tomi Valkeinen [Thu, 18 Dec 2014 11:40:06 +0000 (13:40 +0200)]
video/logo: prevent use of logos after they have been freed
If the probe of an fb driver has been deferred due to missing
dependencies, and the probe is later run when a module is loaded, the
fbdev framework will try to find a logo to use.
However, the logos are __initdata, and have already been freed. This
sometimes causes page faults, if the logo memory is not mapped,
sometimes other random crashes, as the logo data is invalid, and
sometimes nothing, if the fbdev decides to reject the logo (e.g. the
random value depicting the logo's height is too big).
This patch adds a late_initcall function to mark the logos as freed. In
reality the logos are freed later, and fbdev probe may be run between
this late_initcall and the freeing of the logos. In that case we will
miss drawing the logo, even if it would be possible.
Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Cc: stable@vger.kernel.org
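Illustration (not part of the commit): a user-space model of the
"mark as freed" guard. All demo_* names are hypothetical; in the kernel
the marking function would be registered with late_initcall() rather
than called by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for logo data that lives in init memory. */
struct demo_logo {
	int width;
	int height;
};

static struct demo_logo demo_logo_data = { 80, 80 };
static bool demo_logos_freed;

/* Would run as a late initcall: from now on the logos are off limits. */
static int demo_logo_late_init(void)
{
	demo_logos_freed = true;
	return 0;
}

/* Returns NULL instead of a pointer into memory that may be gone. */
static const struct demo_logo *demo_find_logo(void)
{
	if (demo_logos_freed)
		return NULL;
	return &demo_logo_data;
}
```

The flag is set slightly before the memory is really released, which is
the conservative direction: a late probe may skip a logo it could still
have drawn, but it never reads freed data.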
Although this did fix the bug it was aimed at, it also broke secondary
startup on platforms that use give/take_timebase(). Unfortunately we
didn't detect that while it was in next.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Hari Bathini [Thu, 18 Dec 2014 18:06:55 +0000 (23:36 +0530)]
powerpc/kdump: Ignore failure in enabling big endian exception during crash
In an LE kernel, we currently have a hack for kexec that resets the
exception endian before starting a new kernel, as the kernel that is
loaded could be a big endian or a little endian kernel. In the kdump
case, resetting the exception endian fails when one or more cpus are
disabled. But we can ignore the failure and still go ahead, as in most
cases the crashkernel will be of the same endianness as the primary
kernel and resetting the endianness is not even needed in those cases.
This patch adds a new inline function to say if this is the kdump path.
This function is used at places where such a check is needed.
Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com>
[mpe: Rename to kdump_in_progress(), use bool, and edit comment] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
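Illustration (not part of the commit): a user-space model of ignoring
the failure only on the kdump path. Only the name kdump_in_progress()
comes from the text above; the demo_* names, the crashing-cpu variable
and the failure condition are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical: -1 while no crash is being handled, else the crashing cpu. */
static int demo_crashing_cpu = -1;

static inline bool demo_kdump_in_progress(void)
{
	return demo_crashing_cpu >= 0;
}

/* Hypothetical stand-in for the endian reset; fails if cpus are disabled. */
static int demo_reset_exception_endian(bool some_cpus_disabled)
{
	return some_cpus_disabled ? -1 : 0;
}

static int demo_prepare_for_new_kernel(bool some_cpus_disabled)
{
	int rc = demo_reset_exception_endian(some_cpus_disabled);

	/* A regular kexec must not continue; kdump goes ahead regardless. */
	if (rc && !demo_kdump_in_progress())
		return rc;
	return 0;
}
```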
Linus Torvalds [Sun, 28 Dec 2014 21:08:08 +0000 (13:08 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"The important fixes are for two bugs introduced by the merge window.
On top of this, add a couple of WARN_ONs and stop spamming dmesg on
pretty much every boot of a virtual machine"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: warn on more invariant breakage
kvm: fix sorting of memslots with base_gfn == 0
kvm: x86: drop severity of "generation wraparound" message
kvm: x86: vmx: reorder some msr writing
Paolo Bonzini [Sat, 27 Dec 2014 17:01:00 +0000 (18:01 +0100)]
kvm: fix sorting of memslots with base_gfn == 0
Before commit 0e60b0799fed (kvm: change memslot sorting rule from size
to GFN, 2014-12-01), the memslots' sorting key was npages, meaning
that a valid memslot couldn't have its sorting key equal to zero.
On the other hand, a valid memslot can have base_gfn == 0, and invalid
memslots are identified by base_gfn == npages == 0.
Because of this, commit 0e60b0799fed broke the invariant that invalid
memslots are at the end of the mslots array. When a memslot with
base_gfn == 0 was created, any invalid memslots before it were left
in place.
This can be fixed by changing the insertion to use a ">=" comparison
instead of "<=", but some care is needed to avoid breaking the case
of deleting a memslot; see the comment in update_memslots.
Thanks to Tiejun Chen for posting an initial patch for this bug.
Reported-by: Jamie Heilman <jamie@audible.transient.net> Reported-by: Andy Lutomirski <luto@amacapital.net> Tested-by: Jamie Heilman <jamie@audible.transient.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
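Illustration (not part of the commit): a simplified model of moving a
newly created slot towards the front of an array sorted by descending
base_gfn, with invalid slots (base_gfn == npages == 0) at the end. The
demo_* names and the inclusive switch are hypothetical, and the sketch
deliberately covers only slot creation, not the deletion case mentioned
above.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_slot {
	unsigned long base_gfn;
	unsigned long npages;	/* 0 means the slot is invalid */
};

/*
 * Move the valid slot at index i towards the front and return its final
 * index. With inclusive == true the comparison is ">=", so a slot with
 * base_gfn == 0 also passes every invalid slot in front of it.
 */
static int demo_sift_up(struct demo_slot *s, int i, bool inclusive)
{
	struct demo_slot moved = s[i];

	assert(moved.npages != 0);	/* creation only, see lead-in */

	while (i > 0) {
		unsigned long prev = s[i - 1].base_gfn;
		bool pass = inclusive ? moved.base_gfn >= prev
				      : moved.base_gfn > prev;

		if (!pass)
			break;
		s[i] = s[i - 1];
		i--;
	}
	s[i] = moved;
	return i;
}
```

For the array { {100,10}, {0,0}, {0,0}, {0,16} } the strict comparison
leaves the new slot at index 3, behind two invalid slots, while the
inclusive one moves it to index 1 and keeps the invalid slots last.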
Linus Torvalds [Sat, 27 Dec 2014 21:12:00 +0000 (13:12 -0800)]
Merge tag 'sound-3.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Just a couple of fixes for the new Intel Skylake HD-audio support"
* tag 'sound-3.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda_intel: apply the Seperate stream_tag for Skylake
ALSA: hda_controller: Separate stream_tag for input and output streams.
Tiejun Chen [Tue, 23 Dec 2014 08:21:11 +0000 (16:21 +0800)]
kvm: x86: vmx: reorder some msr writing
Commit 34a1cd60d17f, "x86: vmx: move some vmx setting from
vmx_init() to hardware_setup()", tried to refactor some code
specific to vmx hardware setup into hardware_setup(), but some
MSR writes depend on earlier settings such as enable_apicv,
enable_ept and so on.
Johannes Berg [Tue, 23 Dec 2014 20:00:06 +0000 (21:00 +0100)]
netlink/genetlink: pass network namespace to bind/unbind
Netlink families can exist in multiple namespaces, and for the most
part multicast subscriptions are per network namespace. Thus it only
makes sense to have bind/unbind notifications per network namespace.
To achieve this, pass the network namespace of a given client socket
to the bind/unbind functions.
Also do this in generic netlink, and there also make sure that any
bind for multicast groups that only exist in init_net is rejected.
Accepting such a bind isn't really a problem, since a client in a
different namespace will never receive any notifications from such
a group, but it can confuse the family if not rejected. (It would also
be possible to accept it silently, without telling the family, but
then it would also have to be ignored on unbind, so that families that
take any kind of action on bind/unbind don't do unnecessary work for
invalid clients like that.)
Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
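Illustration (not part of the commit): a user-space model of a bind
callback that receives the client's network namespace and rejects binds
to groups that only exist in the initial namespace. The demo_* types,
the init_net_only flag and the choice of -ENOENT are assumptions made
for this sketch, not the kernel's actual interface.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a network namespace. */
struct demo_net {
	int id;
};

static struct demo_net demo_init_net = { 0 };

struct demo_mcast_group {
	const char *name;
	bool init_net_only;	/* group exists only in the initial namespace */
	int bound_clients;	/* work the family does on bind/unbind */
};

/* The namespace of the client socket is passed in alongside the group. */
static int demo_bind(const struct demo_net *net, struct demo_mcast_group *grp)
{
	/* Such a client could never receive anything; reject it up front. */
	if (grp->init_net_only && net != &demo_init_net)
		return -ENOENT;

	grp->bound_clients++;
	return 0;
}
```

Rejecting the bind means the family never sees the invalid client, so
no matching special case is needed on unbind.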