Joerg Roedel [Wed, 14 Oct 2015 13:10:54 +0000 (15:10 +0200)]
kvm: svm: Only propagate next_rip when guest supports it
Currently we always write the next_rip of the shadow vmcb to
the guests vmcb when we emulate a vmexit. This could confuse
the guest when its cpuid indicated no support for the
next_rip feature.
Fix this by only propagating next_rip if the guest actually
supports it.
Cc: Bandan Das <bsd@redhat.com> Cc: Dirk Mueller <dmueller@suse.com> Tested-By: Dirk Mueller <dmueller@suse.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wanpeng Li [Tue, 13 Oct 2015 16:18:37 +0000 (09:18 -0700)]
KVM: nVMX: expose VPID capability to L1
Expose VPID capability to L1. For nested guests, we don't do anything
specific for single context invalidation. Hence, only advertise support
for global context invalidation. The major benefit of nested VPID comes
from having separate vpids when switching between L1 and L2, and also
when L2's vCPUs not sched in/out on L1.
Reviewed-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wanpeng Li [Tue, 13 Oct 2015 16:18:36 +0000 (09:18 -0700)]
KVM: nVMX: nested VPID emulation
VPID is used to tag address space and avoid a TLB flush. Currently L0 use
the same VPID to run L1 and all its guests. KVM flushes VPID when switching
between L1 and L2.
This patch advertises VPID to the L1 hypervisor, then address space of L1
and L2 can be separately treated and avoid TLB flush when swithing between
L1 and L2. For each nested vmentry, if vpid12 is changed, reuse shadow vpid
w/ an invvpid.
Performance:
run lmbench on L2 w/ 3.5 kernel.
Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
kernel Linux 3.5.0-1 1.2200 1.3700 1.4500 4.7800 2.3300 5.60000 2.88000 nested VPID
kernel Linux 3.5.0-1 1.2600 1.4300 1.5600 12.7 12.9 3.49000 7.46000 vanilla
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com> Reviewed-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm: fix waitqueue_active without memory barrier in virt/kvm/async_pf.c
async_pf_execute() seems to be missing a memory barrier which might
cause the waker to not notice the waiter and miss sending a wake_up as
in the following figure.
async_pf_execute kvm_vcpu_block
------------------------------------------------------------------------
spin_lock(&vcpu->async_pf.lock);
if (waitqueue_active(&vcpu->wq))
/* The CPU might reorder the test for
the waitqueue up here, before
prior writes complete */
prepare_to_wait(&vcpu->wq, &wait,
TASK_INTERRUPTIBLE);
/*if (kvm_vcpu_check_block(vcpu) < 0) */
/*if (kvm_arch_vcpu_runnable(vcpu)) { */
...
return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
!vcpu->arch.apf.halted)
|| !list_empty_careful(&vcpu->async_pf.done)
...
return 0;
list_add_tail(&apf->link,
&vcpu->async_pf.done);
spin_unlock(&vcpu->async_pf.lock);
waited = true;
schedule();
------------------------------------------------------------------------
The attached patch adds the missing memory barrier.
I found this issue when I was looking through the linux source code
for places calling waitqueue_active() before wake_up*(), but without
preceding memory barriers, after sending a patch to fix a similar
issue in drivers/tty/n_tty.c (Details about the original issue can be
found here: https://lkml.org/lkml/2015/9/28/849).
Signed-off-by: Kosuke Tatsukawa <tatsu@ab.jp.nec.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Radim Krčmář [Thu, 8 Oct 2015 18:30:00 +0000 (20:30 +0200)]
KVM: x86: don't notify userspace IOAPIC on edge EOI
On real hardware, edge-triggered interrupts don't set a bit in TMR,
which means that IOAPIC isn't notified on EOI. Do the same here.
Staying in guest/kernel mode after edge EOI is what we want for most
devices. If some bugs could be nicely worked around with edge EOI
notifications, we should invest in a better interface.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Radim Krčmář [Thu, 8 Oct 2015 18:23:34 +0000 (20:23 +0200)]
KVM: x86: fix edge EOI and IOAPIC reconfig race
KVM uses eoi_exit_bitmap to track vectors that need an action on EOI.
The problem is that IOAPIC can be reconfigured while an interrupt with
old configuration is pending and eoi_exit_bitmap only remembers the
newest configuration; thus EOI from the pending interrupt is not
recognized.
(Reconfiguration is not a problem for level interrupts, because IOAPIC
sends interrupt with the new configuration.)
For an edge interrupt with ACK notifiers, like i8254 timer; things can
happen in this order
1) IOAPIC inject a vector from i8254
2) guest reconfigures that vector's VCPU and therefore eoi_exit_bitmap
on original VCPU gets cleared
3) guest's handler for the vector does EOI
4) KVM's EOI handler doesn't pass that vector to IOAPIC because it is
not in that VCPU's eoi_exit_bitmap
5) i8254 stops working
A simple solution is to set the IOAPIC vector in eoi_exit_bitmap if the
vector is in PIR/IRR/ISR.
This creates an unwanted situation if the vector is reused by a
non-IOAPIC source, but I think it is so rare that we don't want to make
the solution more sophisticated. The simple solution also doesn't work
if we are reconfiguring the vector. (Shouldn't happen in the wild and
I'd rather fix users of ACK notifiers instead of working around that.)
The are no races because ioapic injection and reconfig are locked.
Fixes: b053b2aef25d ("KVM: x86: Add EOI exit bitmap inference")
[Before b053b2aef25d, this bug happened only with APICv.] Fixes: c7c9c56ca26f ("x86, apicv: add virtual interrupt delivery support") Cc: <stable@vger.kernel.org> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 14 Oct 2015 13:25:52 +0000 (15:25 +0200)]
KVM: x86: fix RSM into 64-bit protected mode
In order to get into 64-bit protected mode, you need to enable
paging while EFER.LMA=1. For this to work, CS.L must be 0.
Currently, we load the segments before CR0 and CR4, which means
that if RSM returns into 64-bit protected mode CS.L is already 1
and everything breaks.
Luckily, CS.L=0 is always the case when executing RSM, because it
is forbidden to execute RSM from 64-bit protected mode. Hence it
is enough to load CR0 and CR4 first, and only then the segments.
Fixes: 660a5d517aaab9187f93854425c4c63f4a09195c Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 13 Oct 2015 19:32:50 +0000 (21:32 +0200)]
Merge branch 'kvm-master' into HEAD
This merge brings in a couple important SMM fixes, which makes it
easier to test latest KVM with unrestricted_guest=0 and to test
the in-progress work on SMM support in the firmware.
Paolo Bonzini [Mon, 12 Oct 2015 11:56:27 +0000 (13:56 +0200)]
KVM: x86: map/unmap private slots in __x86_set_memory_region
Otherwise, two copies (one of them never populated and thus bogus)
are allocated for the regular and SMM address spaces. This breaks
SMM with EPT but without unrestricted guest support, because the
SMM copy of the identity page map is all zeros.
By moving the allocation to the caller we also remove the last
vestiges of kernel-allocated memory regions (not accessible anymore
in userspace since commit b74a07beed0e, "KVM: Remove kernel-allocated
memory regions", 2010-06-21); that is a nice bonus.
Paolo Bonzini [Mon, 12 Oct 2015 11:38:32 +0000 (13:38 +0200)]
KVM: x86: build kvm_userspace_memory_region in x86_set_memory_region
The next patch will make x86_set_memory_region fill the
userspace_addr. Since the struct is not used untouched
anymore, it makes sense to build it in x86_set_memory_region
directly; it also simplifies the callers.
KVM: s390: factor out reading of the guest TOD clock
Let's factor this out and always use get_tod_clock_fast() when
reading the guest TOD.
STORE CLOCK FAST does not do serialization and, therefore, might
result in some fuzziness between different processors in a way
that subsequent calls on different CPUs might have time stamps that
are earlier. This semantics is fine though for all KVM use cases.
To make it obvious that the new function has STORE CLOCK FAST
semantics we name it kvm_s390_get_tod_clock_fast.
With this patch, we only have a handful of places were we
have to care about STP sync (using preempt_disable() logic).
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: factor out and fix setting of guest TOD clock
Let's move that whole logic into one function. We now always use unsigned
values when calculating the epoch (to avoid over/underflow defined).
Also, we always have to get all VCPUs out of SIE before doing the update
to avoid running differing VCPUs with different TODs.
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: switch to get_tod_clock() and fix STP sync races
Nobody except early.c makes use of store_tod_clock() to handle the
cc. So if we would get a cc != 0, we would be in more trouble.
Let's replace all users with get_tod_clock(). Returning a cc
on an ioctl sounded strange either way.
We can now also easily move the get_tod_clock() call into the
preempt_disable() section. This is in fact necessary to make the
STP sync work as expected. Otherwise the host TOD could change
and we would end up with a wrong epoch calculation.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: correctly handle injection of pgm irqs and per events
PER events can always co-exist with other program interrupts.
For now, we always overwrite all program interrupt parameters when
injecting any type of program interrupt.
Let's handle that correctly by only overwriting the relevant portion of
the program interrupt parameters. Therefore we can now inject PER events
and ordinary program interrupts concurrently, resulting in no loss of
program interrupts. This will especially by helpful when manually detecting
PER events later - as both types might be triggered during one SIE exit.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: simplify in-kernel program irq injection
The main reason to keep program injection in kernel separated until now
was that we were able to do some checking, if really only the owning
thread injects program interrupts (via waitqueue_active(li->wq)).
This BUG_ON was never triggered and the chances of really hitting it, if
another thread injected a program irq to another vcpu, were very small.
Let's drop this check and turn kvm_s390_inject_program_int() and
kvm_s390_inject_prog_irq() into simple inline functions that makes use of
kvm_s390_inject_vcpu().
__must_check can be dropped as they are implicitely given by
kvm_s390_inject_vcpu(), to avoid ugly long function prototypes.
Reviewed-by: Jens Freimann <jfrei@linux.vnet.ibm.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Let's get rid of the local variable and exit directly if we found
any pending interrupt. This is not only faster, but also better
readable.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: kvm_arch_vcpu_runnable already cares about timer interrupts
We can remove that double check.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: set interception requests for all floating irqs
No need to separate pending and floating irqs when setting interception
requests. Let's do it for all equally.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: disabled wait cares about machine checks, not PER
We don't care about program event recording irqs (synchronous
program irqs) but asynchronous irqs when checking for disabled
wait. Machine checks were missing.
Let's directly switch to the functions we have for that purpose
instead of testing once again for magic bits.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
This patch adds the routine to update IRTE for posted-interrupts
when guest changes the interrupt configuration.
Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
[Squashed in automatically generated patch from the build robot
"KVM: x86: vcpu_to_pi_desc() can be static" - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: Define a new interface kvm_intr_is_single_vcpu()
This patch defines a new interface kvm_intr_is_single_vcpu(),
which can returns whether the interrupt is for single-CPU or not.
It is used by VT-d PI, since now we only support single-CPU
interrupts, For lowest-priority interrupts, if user configures
it via /proc/irq or uses irqbalance to make it single-CPU, we
can use PI to deliver the interrupts to it. Full functionality
of lowest-priority support will be added later.
Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: Add some helper functions for Posted-Interrupts
This patch adds some helper functions to manipulate the
Posted-Interrupts Descriptor.
Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
[Make the new functions inline. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: Add an arch specific hooks in 'struct kvm_kernel_irqfd'
This patch adds an arch specific hooks 'arch_update' in
'struct kvm_kernel_irqfd'. On Intel side, it is used to
update the IRTE when VT-d posted-interrupts is used.
Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
They make possible to specialize the KVM IRQ bypass consumer in
case CONFIG_KVM_HAVE_IRQ_BYPASS is set.
Signed-off-by: Eric Auger <eric.auger@linaro.org>
[Add weak implementations of the callbacks. - Feng] Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Eric Auger [Fri, 18 Sep 2015 14:29:42 +0000 (22:29 +0800)]
KVM: create kvm_irqfd.h
Move _irqfd_resampler and _irqfd struct declarations in a new
public header: kvm_irqfd.h. They are respectively renamed into
kvm_kernel_irqfd_resampler and kvm_kernel_irqfd. Those datatypes
will be used by architecture specific code, in the context of
IRQ bypass manager integration.
Signed-off-by: Eric Auger <eric.auger@linaro.org> Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Alex Williamson [Fri, 18 Sep 2015 14:29:39 +0000 (22:29 +0800)]
virt: IRQ bypass manager
When a physical I/O device is assigned to a virtual machine through
facilities like VFIO and KVM, the interrupt for the device generally
bounces through the host system before being injected into the VM.
However, hardware technologies exist that often allow the host to be
bypassed for some of these scenarios. Intel Posted Interrupts allow
the specified physical edge interrupts to be directly injected into a
guest when delivered to a physical processor while the vCPU is
running. ARM IRQ Forwarding allows forwarded physical interrupts to
be directly deactivated by the guest.
The IRQ bypass manager here is meant to provide the shim to connect
interrupt producers, generally the host physical device driver, with
interrupt consumers, generally the hypervisor, in order to configure
these bypass mechanism. To do this, we base the connection on a
shared, opaque token. For KVM-VFIO this is expected to be an
eventfd_ctx since this is the connection we already use to connect an
eventfd to an irqfd on the in-kernel path. When a producer and
consumer with matching tokens is found, callbacks via both registered
participants allow the bypass facilities to be automatically enabled.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Eric Auger <eric.auger@linaro.org> Tested-by: Eric Auger <eric.auger@linaro.org> Tested-by: Feng Wu <feng.wu@intel.com> Signed-off-by: Feng Wu <feng.wu@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Newer KVM won't be exposing PVCLOCK_COUNTS_FROM_ZERO anymore.
The purpose of that flags was to start counting system time from 0 when
the KVM clock has been initialized.
We can achieve the same by selecting one read as the initial point.
A simple subtraction will work unless the KVM clock count overflows
earlier (has smaller width) than scheduler's cycle count. We should be
safe till x86_128.
Because PVCLOCK_COUNTS_FROM_ZERO was enabled only on new hypervisors,
setting sched clock as stable based on PVCLOCK_TSC_STABLE_BIT might
regress on older ones.
I presume we don't need to change kvm_clock_read instead of introducing
kvm_sched_clock_read. A problem could arise in case sched_clock is
expected to return the same value as get_cycles, but we should have
merged those clocks in that case.
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
[Rewrite to use vmcs_set_secondary_exec_control. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: VMX: simplify invpcid handling in vmx_cpuid_update()
If vmx_invpcid_supported() is true, second execution control
filed must be supported and SECONDARY_EXEC_ENABLE_INVPCID
must have already been set in current vmcs by
vmx_secondary_exec_control()
If vmx_invpcid_supported() is false, no need to clear
SECONDARY_EXEC_ENABLE_INVPCID
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass PCOMMIT CPU feature to guest to enable PCOMMIT instruction
Currently we do not catch pcommit instruction for L1 guest and
allow L1 to catch this instruction for L2 if, as required by the spec,
L1 can enumerate the PCOMMIT instruction via CPUID:
| IA32_VMX_PROCBASED_CTLS2[53] (which enumerates support for the
| 1-setting of PCOMMIT exiting) is always the same as
| CPUID.07H:EBX.PCOMMIT[bit 22]. Thus, software can set PCOMMIT exiting
| to 1 if and only if the PCOMMIT instruction is enumerated via CPUID
The spec can be found at
https://software.intel.com/sites/default/files/managed/0d/53/319433-022.pdf
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HV_X64_MSR_VP_RUNTIME msr used by guest to get
"the time the virtual processor consumes running guest code,
and the time the associated logical processor spends running
hypervisor code on behalf of that guest."
Calculation of this time is performed by task_cputime_adjusted()
for vcpu task.
Necessary to support loading of winhv.sys in guest, which in turn is
required to support Windows VMBus.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Gleb Natapov <gleb@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm/x86: Hyper-V HV_X64_MSR_VP_INDEX export for QEMU.
Insert Hyper-V HV_X64_MSR_VP_INDEX into msr's emulated list,
so QEMU can set Hyper-V features cpuid HV_X64_MSR_VP_INDEX_AVAILABLE
bit correctly. KVM emulation part is in place already.
Necessary to support loading of winhv.sys in guest, which in turn is
required to support Windows VMBus.
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Gleb Natapov <gleb@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jason Wang [Tue, 15 Sep 2015 06:41:59 +0000 (14:41 +0800)]
kvm: add capability for any-length ioeventfds
Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jason Wang [Tue, 15 Sep 2015 06:41:58 +0000 (14:41 +0800)]
kvm: add tracepoint for fast mmio
Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jason Wang [Tue, 25 Aug 2015 09:05:46 +0000 (17:05 +0800)]
kvm: use kmalloc() instead of kzalloc() during iodev register/unregister
All fields of kvm_io_range were initialized or copied explicitly
afterwards. So switch to use kmalloc().
Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Steve Rutherford [Thu, 30 Jul 2015 09:27:16 +0000 (11:27 +0200)]
KVM: x86: Add support for local interrupt requests from userspace
In order to enable userspace PIC support, the userspace PIC needs to
be able to inject local interrupts even when the APICs are in the
kernel.
KVM_INTERRUPT now supports sending local interrupts to an APIC when
APICs are in the kernel.
The ready_for_interrupt_request flag is now only set when the CPU/APIC
will immediately accept and inject an interrupt (i.e. APIC has not
masked the PIC).
When the PIC wishes to initiate an INTA cycle with, say, CPU0, it
kicks CPU0 out of the guest, and renedezvous with CPU0 once it arrives
in userspace.
When the CPU/APIC unmasks the PIC, a KVM_EXIT_IRQ_WINDOW_OPEN is
triggered, so that userspace has a chance to inject a PIC interrupt
if it had been pending.
Overall, this design can lead to a small number of spurious userspace
renedezvous. In particular, whenever the PIC transistions from low to
high while it is masked and whenever the PIC becomes unmasked while
it is low.
Note: this does not buffer more than one local interrupt in the
kernel, so the VMM needs to enter the guest in order to complete
interrupt injection before injecting an additional interrupt.
Compiles for x86.
Can pass the KVM Unit Tests.
Signed-off-by: Steve Rutherford <srutherford@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Steve Rutherford [Thu, 30 Jul 2015 06:32:35 +0000 (23:32 -0700)]
KVM: x86: Add EOI exit bitmap inference
In order to support a userspace IOAPIC interacting with an in kernel
APIC, the EOI exit bitmaps need to be configurable.
If the IOAPIC is in userspace (i.e. the irqchip has been split), the
EOI exit bitmaps will be set whenever the GSI Routes are configured.
In particular, for the low MSI routes are reservable for userspace
IOAPICs. For these MSI routes, the EOI Exit bit corresponding to the
destination vector of the route will be set for the destination VCPU.
The intention is for the userspace IOAPICs to use the reservable MSI
routes to inject interrupts into the guest.
This is a slight abuse of the notion of an MSI Route, given that MSIs
classically bypass the IOAPIC. It might be worthwhile to add an
additional route type to improve clarity.
Compile tested for Intel x86.
Signed-off-by: Steve Rutherford <srutherford@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Steve Rutherford [Thu, 30 Jul 2015 06:21:41 +0000 (23:21 -0700)]
KVM: x86: Add KVM exit for IOAPIC EOIs
Adds KVM_EXIT_IOAPIC_EOI which allows the kernel to EOI
level-triggered IOAPIC interrupts.
Uses a per VCPU exit bitmap to decide whether or not the IOAPIC needs
to be informed (which is identical to the EOI_EXIT_BITMAP field used
by modern x86 processors, but can also be used to elide kvm IOAPIC EOI
exits on older processors).
[Note: A prototype using ResampleFDs found that decoupling the EOI
from the VCPU's thread made it possible for the VCPU to not see a
recent EOI after reentering the guest. This does not match real
hardware.]
Compile tested for Intel x86.
Signed-off-by: Steve Rutherford <srutherford@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Steve Rutherford [Thu, 30 Jul 2015 06:21:40 +0000 (23:21 -0700)]
KVM: x86: Split the APIC from the rest of IRQCHIP.
First patch in a series which enables the relocation of the
PIC/IOAPIC to userspace.
Adds capability KVM_CAP_SPLIT_IRQCHIP;
KVM_CAP_SPLIT_IRQCHIP enables the construction of LAPICs without the
rest of the irqchip.
Compile tested for x86.
Signed-off-by: Steve Rutherford <srutherford@google.com> Suggested-by: Andrew Honig <ahonig@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 30 Jul 2015 08:32:16 +0000 (10:32 +0200)]
KVM: x86: unify handling of interrupt window
The interrupt window is currently checked twice, once in vmx.c/svm.c and
once in dm_request_for_irq_injection. The only difference is the extra
check for kvm_arch_interrupt_allowed in dm_request_for_irq_injection,
and the different return value (EINTR/KVM_EXIT_INTR for vmx.c/svm.c vs.
0/KVM_EXIT_IRQ_WINDOW_OPEN for dm_request_for_irq_injection).
However, dm_request_for_irq_injection is basically dead code! Revive it
by removing the checks in vmx.c and svm.c's vmexit handlers, and
fixing the returned values for the dm_request_for_irq_injection case.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 29 Jul 2015 08:43:18 +0000 (10:43 +0200)]
KVM: x86: store IOAPIC-handled vectors in each VCPU
We can reuse the algorithm that computes the EOI exit bitmap to figure
out which vectors are handled by the IOAPIC. The only difference
between the two is for edge-triggered interrupts other than IRQ8
that have no notifiers active; however, the IOAPIC does not have to
do anything special for these interrupts anyway.
This again limits the interactions between the IOAPIC and the LAPIC,
making it easier to move the former to userspace.
Inspired by a patch from Steve Rutherford.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 29 Jul 2015 13:03:06 +0000 (15:03 +0200)]
KVM: x86: set TMR when the interrupt is accepted
Do not compute TMR in advance. Instead, set the TMR just before the interrupt
is accepted into the IRR. This limits the coupling between IOAPIC and LAPIC.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Dirk Müller [Thu, 1 Oct 2015 11:43:42 +0000 (13:43 +0200)]
Use WARN_ON_ONCE for missing X86_FEATURE_NRIPS
The cpu feature flags are not ever going to change, so warning
everytime can cause a lot of kernel log spam
(in our case more than 10GB/hour).
The warning seems to only occur when nested virtualization is
enabled, so it's probably triggered by a KVM bug. This is a
sensible and safe change anyway, and the KVM bug fix might not
be suitable for stable releases anyway.
Cc: stable@vger.kernel.org Signed-off-by: Dirk Mueller <dmueller@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 1 Oct 2015 11:20:22 +0000 (13:20 +0200)]
Revert "KVM: SVM: use NPT page attributes"
This reverts commit 3c2e7f7de3240216042b61073803b61b9b3cfb22.
Initializing the mapping from MTRR to PAT values was reported to
fail nondeterministically, and it also caused extremely slow boot
(due to caching getting disabled---bug 103321) with assigned devices.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Reported-by: Sebastian Schuette <dracon@ewetel.net> Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 28 Sep 2015 10:26:31 +0000 (12:26 +0200)]
x86/x2apic: Make stub functions available even if !CONFIG_X86_LOCAL_APIC
Some CONFIG_X86_X2APIC functions, especially x2apic_enabled(), are not
declared if !CONFIG_X86_LOCAL_APIC. However, the same stubs that work
for !CONFIG_X86_X2APIC are okay even if there is no local APIC support
at all.
Avoid the introduction of #ifdefs by moving the x2apic declarations
completely outside the CONFIG_X86_LOCAL_APIC block. (Unfortunately,
diff generation messes up the actual change that this patch makes).
There is no semantic change because CONFIG_X86_X2APIC depends on
CONFIG_X86_LOCAL_APIC.
Revert "KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR"
Shifting pvclock_vcpu_time_info.system_time on write to KVM system time
MSR is a change of ABI. Probably only 2.6.16 based SLES 10 breaks due
to its custom enhancements to kvmclock, but KVM never declared the MSR
only for one-shot initialization. (Doc says that only one write is
needed.)
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"Two bugfixes from Andy addressing at least some of the subtle NMI
related wreckage which has been reported by Sasha Levin"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi/64: Fix a paravirt stack-clobbering bug in the NMI code
x86/paravirt: Replace the paravirt nop with a bona fide empty function
Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"Our first real batch of fixes this release cycle. Nothing really
concerning, and diffstat is a bit inflated due to some DT contents
moving around on STi platforms.
There's a collection of them here:
- A fixup for a build breakage that hits on arm64 allmodconfig in
QCOM SCM firmware drivers
- MMC fixes for OMAP that had quite a bit of breakage this merge
window.
- Misc build/warning fixes on PXA and OMAP
- A couple of minor fixes for Beagleboard X15 which is now starting
to see a few more users in the wild"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (31 commits)
ARM: sti: dt: adapt DT to fix probe/bind issues in DRM driver
ARM: dts: fix omap2+ address translation for pbias
firmware: qcom: scm: Add function stubs for ARM64
ARM: dts: am57xx-beagle-x15: use palmas-usb for USB2
ARM: omap2plus_defconfig: enable GPIO_PCA953X
ARM: dts: omap5-uevm.dts: fix i2c5 pinctrl offsets
ARM: OMAP2+: AM43XX: Enable autoidle for clks in am43xx_init_late
ARM: dts: am57xx-beagle-x15: Update Phy supplies
ARM: pxa: balloon3: Fix build error
ARM: dts: Fixup model name for HP t410 dts
ARM: dts: DRA7: fix a typo in ethernet
ARM: omap2plus_defconfig: make PCF857x built-in
ARM: dts: Use ti,pbias compatible string for pbias
ARM: OMAP5: Cleanup options for SoC only build
ARM: DRA7: Select missing options for SoC only build
ARM: OMAP2+: board-generic: Remove stale of_irq macros
ARM: OMAP4+: PM: erratum is used by OMAP5 and DRA7 as well
ARM: dts: omap3-igep: Move eth IRQ pinmux to IGEPv2 common dtsi
ARM: dts: am57xx-beagle-x15: Add wakeup irq for mcp79410
ARM: dts: am335x-phycore-som: Fix mpu voltage
...
Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
"Four fixes from testing at the recent SMB3 Plugfest including two
important authentication ones (one fixes authentication problems to
some popular servers when clock times differ more than two hours
between systems, the other fixes Kerberos authentication for SMB3)"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
fix encryption error checks on mount
[SMB3] Fix sec=krb5 on smb3 mounts
cifs: use server timestamp for ntlmv2 authentication
disabling oplocks/leases via module parm enable_oplocks broken for SMB3
Olof Johansson [Sun, 27 Sep 2015 05:22:31 +0000 (22:22 -0700)]
Merge tag 'omap-for-v4.3/fixes-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into fixes
Fixes for omaps for v4.3-rc cycle:
- Two more patches to fix most of the MMC regressions with the
PBIAS regulator changes. At least two MMC driver related issues
still seems to remain for omap3 legacy booting and omap4 duovero.
Note that the dts changes depend on a recent regulator fix, and
are based on the regulator commit now in mainline kernel
- Enable autoidle for am43xx clocks to prevent clocks from staying
always on
- Fix i2c5 pinctrl offsets for omap5-uevm
- Enable PCA953X as that's needed for HDMI to work on omap5
- Update phy supplies for beagle x15 beta board
- Use palmas-usb for on beagle x15 to start using the related
driver that recently got merged
* tag 'omap-for-v4.3/fixes-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
ARM: dts: fix omap2+ address translation for pbias
ARM: dts: am57xx-beagle-x15: use palmas-usb for USB2
ARM: omap2plus_defconfig: enable GPIO_PCA953X
ARM: dts: omap5-uevm.dts: fix i2c5 pinctrl offsets
ARM: OMAP2+: AM43XX: Enable autoidle for clks in am43xx_init_late
ARM: dts: am57xx-beagle-x15: Update Phy supplies
regulator: pbias: program pbias register offset in pbias driver
ARM: omap2plus_defconfig: Enable MUSB DMA support
ARM: DRA752: Add ID detect for ES2.0
ARM: OMAP3: vc: fix 'or' always true warning
ARM: OMAP2+: Fix booting if no timer parent clock is available
ARM: OMAP2+: omap-device: fix race deferred probe of omap_hsmmc vs omap_device_late_init
Pull SCSI target fixes from Nicholas Bellinger:
"This includes a iser-target series from Jenny + Sagi @ Mellanox that
addresses the few remaining active I/O shutdown bugs, along with a
patch to support zero-copy for immediate data payloads that gives a
nice performance improvement for small block WRITEs.
Also included are some recent >= v4.2 regression bug-fixes. The most
notable is a RCU conversion regression for SPC-3 PR registrations, and
recent removal of obsolete RFC-3720 markers that introduced a login
regression bug with MSFT iSCSI initiators.
Thanks to everyone who has been testing + reporting bugs for v4.x"
* git://git.kernel.org/pub/scm/linux/kernel/git/nab/target-pending:
iscsi-target: Avoid OFMarker + IFMarker negotiation
target: Make TCM_WRITE_PROTECT failure honor D_SENSE bit
target: Fix target_sense_desc_format NULL pointer dereference
target: Propigate backend read-only to core_tpg_add_lun
target: Fix PR registration + APTPL RCU conversion regression
iser-target: Skip data copy if all the command data comes as immediate
iser-target: Change the recv buffers posting logic
iser-target: Fix pending connections handling in target stack shutdown sequnce
iser-target: Remove np_ prefix from isert_np members
iser-target: Remove unused variables
iser-target: Put the reference on commands waiting for unsol data
iser-target: remove command with state ISTATE_REMOVE
Merge tag 'usb-4.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some USB driver fixes for 4.3-rc3.
There's the usual assortment of new device ids, combined with xhci and
gadget driver fixes. Full details in the shortlog. All of these have
been in linux-next with no reported problems"
* tag 'usb-4.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (34 commits)
MAINTAINERS: remove amd5536udc USB gadget driver maintainer
USB: whiteheat: fix potential null-deref at probe
xhci: init command timeout timer earlier to avoid deleting it uninitialized
xhci: change xhci 1.0 only restrictions to support xhci 1.1
usb: xhci: exit early in xhci_setup_device() if we're halted or dying
usb: xhci: stop everything on the first call to xhci_stop
usb: xhci: Clear XHCI_STATE_DYING on start
usb: xhci: lock mutex on xhci_stop
xhci: Move xhci_pme_quirk() behind #ifdef CONFIG_PM
xhci: give command abortion one more chance before killing xhci
usb: Use the USB_SS_MULT() macro to get the burst multiplier.
usb: dwc3: gadget: Fix BUG in RT config
usb: musb: fix cppi channel teardown for isoch transfer
usb: phy: isp1301: Export I2C module alias information
usb: gadget: drop null test before destroy functions
usb: gadget: dummy_hcd: in transfer(), return data sent, not limit
usb: gadget: dummy_hcd: fix rescan logic for transfer
usb: gadget: dummy_hcd: fix unneeded else-if condition
usb: gadget: dummy_hcd: emulate sending zlp in packet logic
usb: musb: dsps: fix polling in device-only mode
...
Merge tag 'tty-4.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull serial driver fix from Greg KH:
"Here is one serial driver fix for 4.3-rc3 that resolves a module
loading issue due to splitting up of the 8250 driver into smaller
pieces. It's been in linux-next with no reported problems"
* tag 'tty-4.3-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
tty: serial: Add missing module license for 8250_base.ko