Suggested-by: Bandan Das <bsd@redhat.com> Suggested-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 1 Apr 2015 12:25:33 +0000 (14:25 +0200)]
KVM: x86: advertise KVM_CAP_X86_SMM
... and we're done. :)
Because SMBASE is usually relocated above 1M on modern chipsets, and
SMM handlers might indeed rely on 4G segment limits, we only expose it
if KVM is able to run the guest in big real mode. This includes any
of VMX+emulate_invalid_guest_state, VMX+unrestricted_guest, or SVM.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 18 May 2015 13:03:39 +0000 (15:03 +0200)]
KVM: x86: add SMM to the MMU role, support SMRAM address space
This is now very simple to do. The only interesting part is a simple
trick to find the right memslot in gfn_to_rmap, retrieving the address
space from the spte role word. The same trick is used in the auditing
code.
The comment on top of union kvm_mmu_page_role has been stale forever,
so remove it. Speaking of stale code, remove pad_for_nice_hex_output
too: it was splitting the "access" bitfield across two bytes and thus
had effectively turned into pad_for_ugly_hex_output.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 18 May 2015 11:33:16 +0000 (13:33 +0200)]
KVM: x86: work on all available address spaces
This patch has no semantic change, but it prepares for the introduction
of a second address space for system management mode.
A new function x86_set_memory_region (and the "slots_lock taken"
counterpart __x86_set_memory_region) is introduced in order to
operate on all address spaces when adding or deleting private
memory slots.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 8 Apr 2015 13:39:23 +0000 (15:39 +0200)]
KVM: x86: use vcpu-specific functions to read/write/translate GFNs
We need to hide SMRAM from guests not running in SMM. Therefore,
all uses of kvm_read_guest* and kvm_write_guest* must be changed to
check whether the VCPU is in system management mode and use a
different set of memslots. Switch from kvm_* to the newly-introduced
kvm_vcpu_*, which call into kvm_arch_vcpu_memslots_id.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Sun, 17 May 2015 11:58:53 +0000 (13:58 +0200)]
KVM: add vcpu-specific functions to read/write/translate GFNs
We need to hide SMRAM from guests not running in SMM. Therefore, all
uses of kvm_read_guest* and kvm_write_guest* must be changed to use
different address spaces, depending on whether the VCPU is in system
management mode. We need to introduce a new family of functions for
this purpose.
For now, the VCPU-based functions have the same behavior as the
existing per-VM ones, they just accept a different type for the
first argument. Later however they will be changed to use one of many
"struct kvm_memslots" stored in struct kvm, through an architecture hook.
VM-based functions will unconditionally use the first memslots pointer.
Whenever possible, this patch introduces slot-based functions with an
__ prefix, with two wrappers for generic and vcpu-based actions.
The exceptions are kvm_read_guest and kvm_write_guest, which are copied
into the new functions kvm_vcpu_read_guest and kvm_vcpu_write_guest.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 5 May 2015 09:50:23 +0000 (11:50 +0200)]
KVM: x86: save/load state on SMM switch
The big ugly one. This patch adds support for switching in and out of
system management mode, respectively upon receiving KVM_REQ_SMI and upon
executing a RSM instruction. Both 32- and 64-bit formats are supported
for the SMM state save area.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Jun 2015 08:41:21 +0000 (10:41 +0200)]
KVM: x86: latch INITs while in system management mode
Do not process INITs immediately while in system management mode, keep
it instead in apic->pending_events. Tell userspace if an INIT is
pending when they issue GET_VCPU_EVENTS, and similarly handle the
new field in SET_VCPU_EVENTS.
Note that the same treatment should be done while in VMX non-root mode.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 7 May 2015 09:36:11 +0000 (11:36 +0200)]
KVM: x86: stubs for SMM support
This patch adds the interface between x86.c and the emulator: the
SMBASE register, a new emulator flag, the RSM instruction. It also
adds a new request bit that will be used by the KVM_SMI ioctl.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 1 Apr 2015 13:06:40 +0000 (15:06 +0200)]
KVM: x86: API changes for SMM support
This patch includes changes to the external API for SMM support.
Userspace can predicate the availability of the new fields and
ioctls on a new capability, KVM_CAP_X86_SMM, which is added at the end
of the patch series.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 1 Apr 2015 16:18:53 +0000 (18:18 +0200)]
KVM: x86: pass the whole hflags field to emulator and back
The hflags field will contain information about system management mode
and will be useful for the emulator. Pass the entire field rather than
just the guest-mode information.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 8 Apr 2015 13:30:38 +0000 (15:30 +0200)]
KVM: x86: pass host_initiated to functions that read MSRs
SMBASE is only readable from SMM for the VCPU, but it must be always
accessible if userspace is accessing it. Thus, all functions that
read MSRs are changed to accept a struct msr_data; the host_initiated
and index fields are pre-initialized, while the data field is filled
on return.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 5 May 2015 10:08:55 +0000 (12:08 +0200)]
KVM: x86: introduce num_emulated_msrs
We will want to filter away MSR_IA32_SMBASE from the emulated_msrs if
the host CPU does not support SMM virtualization. Introduce the
logic to do that, and also move paravirt MSRs to emulated_msrs for
simplicity and to get rid of KVM_SAVE_MSRS_BEGIN.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Jun 2015 07:51:50 +0000 (09:51 +0200)]
kvm: x86: default legacy PCI device assignment support to "n"
VFIO has proved itself a much better option than KVM's built-in
device assignment. It is mature, provides better isolation because
it enforces ACS, and even the userspace code is being tested on
a wider variety of hardware these days than the legacy support.
Disable legacy device assignment by default.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Let's remove "kvm-s390" from our printk messages and make use
of pr_fmt instead.
Also replace one printk() occurrence by a equivalent pr_warn
on the way.
Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: call exit_sie() directly on vcpu block/request
Thinking about it, I can't find a real use case where we want
to block a VCPU and not kick it out of SIE. (except if we want
to do the same in batch for multiple VCPUs - but that's a micro
optimization)
So let's simply perform the exit_sie() calls directly when setting
the other magic block bits in the SIE.
Otherwise e.g. kvm_s390_set_tod_low() still has other VCPUs running
after that call, working with a wrong epoch.
Fixes: 27406cd50c ("KVM: s390: provide functions for blocking all CPUs") Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
However, it turns out that kvmclock does provide a stable
sched_clock callback. So, let the scheduler know this which
in turn makes NOHZ_FULL work in the guest.
Marcelo Tosatti [Thu, 28 May 2015 23:20:39 +0000 (20:20 -0300)]
x86: kvmclock: add flag to indicate pvclock counts from zero
Setting sched clock stable for kvmclock causes the printk timestamps
to not start from zero, which is different from baremetal and
can possibly break userspace. Add a flag to indicate that
hypervisor sets clock base at zero when kvmclock is initialized.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Andrew Morton [Wed, 27 May 2015 09:53:06 +0000 (11:53 +0200)]
arch/x86/kvm/mmu.c: work around gcc-4.4.4 bug
arch/x86/kvm/mmu.c: In function 'kvm_mmu_pte_write':
arch/x86/kvm/mmu.c:4256: error: unknown field 'cr0_wp' specified in initializer
arch/x86/kvm/mmu.c:4257: error: unknown field 'cr4_pae' specified in initializer
arch/x86/kvm/mmu.c:4257: warning: excess elements in union initializer
...
gcc-4.4.4 (at least) has issues when using anonymous unions in
initializers.
Fixes: edc90b7dc4ceef6 ("KVM: MMU: fix SMAP virtualization") Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Radim Krčmář [Fri, 22 May 2015 16:45:11 +0000 (18:45 +0200)]
KVM: x86: use correct APIC ID on x2APIC transition
SDM April 2015, 10.12.5 State Changes From xAPIC Mode to x2APIC Mode
• Any APIC ID value written to the memory-mapped local APIC ID register
is not preserved.
Fix it by sourcing vcpu_id (= initial APIC ID) instead of memory-mapped
APIC ID. Proper use of apic functions would result in two calls to
recalculate_apic_map(), so this patch makes a new helper.
Signed-off-by: Radim KrÄ\8dmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 19 May 2015 14:29:22 +0000 (16:29 +0200)]
KVM: x86: pass struct kvm_mmu_page to account/unaccount_shadowed
Prepare for multiple address spaces this way, since a VCPU is not available
where unaccount_shadowed is called. We will get to the right kvm_memslots
struct through the role field in struct kvm_mmu_page.
Paolo Bonzini [Tue, 19 May 2015 14:09:04 +0000 (16:09 +0200)]
KVM: remove __gfn_to_pfn
Most of the function that wrap it can be rewritten without it, except
for gfn_to_pfn_prot. Just inline it into gfn_to_pfn_prot, and rewrite
the other function on top of gfn_to_pfn_memslot*.
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 19 May 2015 14:01:50 +0000 (16:01 +0200)]
KVM: pass kvm_memory_slot to gfn_to_page_many_atomic
The memory slot is already available from gfn_to_memslot_dirty_bitmap.
Isn't it a shame to look it up again? Plus, it makes gfn_to_page_many_atomic
agnostic of multiple VCPU address spaces.
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 18 May 2015 11:20:23 +0000 (13:20 +0200)]
KVM: add "new" argument to kvm_arch_commit_memory_region
This lets the function access the new memory slot without going through
kvm_memslots and id_to_memslot. It will simplify the code when more
than one address space will be supported.
Unfortunately, the "const"ness of the new argument must be casted
away in two places. Fixing KVM to accept const struct kvm_memory_slot
pointers would require modifications in pretty much all architectures,
and is left for later.
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 18 May 2015 11:59:39 +0000 (13:59 +0200)]
KVM: const-ify uses of struct kvm_userspace_memory_region
Architecture-specific helpers are not supposed to muck with
struct kvm_userspace_memory_region contents. Add const to
enforce this.
In order to eliminate the only write in __kvm_set_memory_region,
the cleaning of deleted slots is pulled up from update_memslots
to __kvm_set_memory_region.
Paolo Bonzini [Sun, 17 May 2015 09:41:37 +0000 (11:41 +0200)]
KVM: introduce kvm_alloc/free_memslots
kvm_alloc_memslots is extracted out of previously scattered code
that was in kvm_init_memslots_id and kvm_create_vm.
kvm_free_memslot and kvm_free_memslots are new names of
kvm_free_physmem and kvm_free_physmem_slot, but they also take
an explicit pointer to struct kvm_memslots.
This will simplify the transition to multiple address spaces,
each represented by one pointer to struct kvm_memslots.
Liang Li [Wed, 20 May 2015 20:41:25 +0000 (04:41 +0800)]
kvm/fpu: Enable eager restore kvm FPU for MPX
The MPX feature requires eager KVM FPU restore support. We have verified
that MPX cannot work correctly with the current lazy KVM FPU restore
mechanism. Eager KVM FPU restore should be enabled if the MPX feature is
exposed to VM.
Signed-off-by: Yang Zhang <yang.z.zhang@intel.com> Signed-off-by: Liang Li <liang.z.li@intel.com>
[Also activate the FPU on AMD processors. - Paolo] Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm: fix crash in kvm_vcpu_reload_apic_access_page
memslot->userfault_addr is set by the kernel with a mmap executed
from the kernel but the userland can still munmap it and lead to the
below oops after memslot->userfault_addr points to a host virtual
address that has no vma or mapping.
Nicholas Krause [Wed, 20 May 2015 04:24:10 +0000 (00:24 -0400)]
kvm: x86: Make functions that have no external callers static
This makes the functions kvm_guest_cpu_init and kvm_init_debugfs
static now due to having no external callers outside their
declarations in the file, kvm.c.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 2 Apr 2015 09:20:48 +0000 (11:20 +0200)]
KVM: export __gfn_to_pfn_memslot, drop gfn_to_pfn_async
gfn_to_pfn_async is used in just one place, and because of x86-specific
treatment that place will need to look at the memory slot. Hence inline
it into try_async_pf and export __gfn_to_pfn_memslot.
The patch also switches the subsequent call to gfn_to_pfn_prot to use
__gfn_to_pfn_memslot. This is a small optimization. Finally, remove
the now-unused async argument of __gfn_to_pfn.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Wed, 13 May 2015 06:42:27 +0000 (14:42 +0800)]
KVM: MMU: fix MTRR update
Currently, whenever guest MTRR registers are changed
kvm_mmu_reset_context is called to switch to the new root shadow page
table, however, it's useless since:
1) the cache type is not cached into shadow page's attribute so that
the original root shadow page will be reused
2) the cache type is set on the last spte, that means we should sync
the last sptes when MTRR is changed
This patch fixs this issue by drop all the spte in the gfn range which
is being updated by MTRR
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Wed, 13 May 2015 06:42:19 +0000 (14:42 +0800)]
KVM: MMU: fix decoding cache type from MTRR
There are some bugs in current get_mtrr_type();
1: bit 1 of mtrr_state->enabled is corresponding bit 11 of
IA32_MTRR_DEF_TYPE MSR which completely control MTRR's enablement
that means other bits are ignored if it is cleared
2: the fixed MTRR ranges are controlled by bit 0 of
mtrr_state->enabled (bit 10 of IA32_MTRR_DEF_TYPE)
3: if MTRR is disabled, UC is applied to all of physical memory rather
than mtrr_state->def_type
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Mon, 11 May 2015 14:55:21 +0000 (22:55 +0800)]
KVM: MMU: fix SMAP virtualization
KVM may turn a user page to a kernel page when kernel writes a readonly
user page if CR0.WP = 1. This shadow page entry will be reused after
SMAP is enabled so that kernel is allowed to access this user page
Fix it by setting SMAP && !CR0.WP into shadow page's role and reset mmu
once CR4.SMAP is updated
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Tue, 28 Apr 2015 10:06:01 +0000 (13:06 +0300)]
KVM: x86: Fix zero iterations REP-string
When a REP-string is executed in 64-bit mode with an address-size prefix,
ECX/EDI/ESI are used as counter and pointers. When ECX is initially zero, Intel
CPUs clear the high 32-bits of RCX, and recent Intel CPUs update the high bits
of the pointers in MOVS/STOS. This behavior is specific to Intel according to
few experiments.
As one may guess, this is an undocumented behavior. Yet, it is observable in
the guest, since at least VMX traps REP-INS/OUTS even when ECX=0. Note that
VMware appears to get it right. The behavior can be observed using the
following code:
Nadav Amit [Tue, 28 Apr 2015 10:06:00 +0000 (13:06 +0300)]
KVM: x86: Fix update RCX/RDI/RSI on REP-string
When REP-string instruction is preceded with an address-size prefix,
ECX/EDI/ESI are used as the operation counter and pointers. When they are
updated, the high 32-bits of RCX/RDI/RSI are cleared, similarly to the way they
are updated on every 32-bit register operation. Fix it.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Sun, 19 Apr 2015 18:12:59 +0000 (21:12 +0300)]
KVM: x86: Fix DR7 mask on task-switch while debugging
If the host sets hardware breakpoints to debug the guest, and a task-switch
occurs in the guest, the architectural DR7 will not be updated. The effective
DR7 would be updated instead.
This fix puts the DR7 update during task-switch emulation, so it now uses the
standard DR setting mechanism instead of the one that was previously used. As a
bonus, the update of DR7 will now be effective for AMD as well.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Mon, 11 May 2015 14:55:21 +0000 (22:55 +0800)]
KVM: MMU: fix SMAP virtualization
KVM may turn a user page to a kernel page when kernel writes a readonly
user page if CR0.WP = 1. This shadow page entry will be reused after
SMAP is enabled so that kernel is allowed to access this user page
Fix it by setting SMAP && !CR0.WP into shadow page's role and reset mmu
once CR4.SMAP is updated
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 2 Apr 2015 09:04:05 +0000 (11:04 +0200)]
KVM: MMU: fix CR4.SMEP=1, CR0.WP=0 with shadow pages
smep_andnot_wp is initialized in kvm_init_shadow_mmu and shadow pages
should not be reused for different values of it. Thus, it has to be
added to the mask in kvm_mmu_pte_write.
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Thu, 7 May 2015 08:20:15 +0000 (16:20 +0800)]
KVM: MMU: fix smap permission check
Current permission check assumes that RSVD bit in PFEC is always zero,
however, it is not true since MMIO #PF will use it to quickly identify
MMIO access
Fix it by clearing the bit if walking guest page table is needed
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paul Mackerras [Wed, 29 Apr 2015 04:49:07 +0000 (14:49 +1000)]
KVM: PPC: Book3S HV: Fix list traversal in error case
This fixes a regression introduced in commit 25fedfca94cf, "KVM: PPC:
Book3S HV: Move vcore preemption point up into kvmppc_run_vcpu", which
leads to a user-triggerable oops.
In the case where we try to run a vcore on a physical core that is
not in single-threaded mode, or the vcore has too many threads for
the physical core, we iterate the list of runnable vcpus to make
each one return an EBUSY error to userspace. Since this involves
taking each vcpu off the runnable_threads list for the vcore, we
need to use list_for_each_entry_safe rather than list_for_each_entry
to traverse the list. Otherwise the kernel will crash with an oops
message like this:
Fixes: 25fedfca94cfbf2461314c6c34ef58e74a31b025 Signed-off-by: Paul Mackerras <paulus@samba.org> Reviewed-by: Alexander Graf <agraf@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Our implementation will never trigger interception code 12 as the
responsible setting is never enabled - and never will be.
The handler is dead code. Let's get rid of it.
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: factor out and optimize floating irq VCPU kick
This patch factors out the search for a floating irq destination
VCPU as well as the kicking of the found VCPU. The search is optimized
in the following ways:
1. stopped VCPUs can't take any floating interrupts, so try to find an
operating one. We have to take care of the special case where all
VCPUs are stopped and we don't have any valid destination.
2. use online_vcpus, not KVM_MAX_VCPU. This speeds up the search
especially if KVM_MAX_VCPU is increased one day. As these VCPU
objects are initialized prior to increasing online_vcpus, we can be
sure that they exist.
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Reviewed-by: Jens Freimann <jfrei@linux.vnet.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: optimize interrupt handling round trip time
We can avoid checking guest control registers and guest PSW as well
as all the masking and calculations on the interrupt masks when
no interrupts are pending.
Also, the check for IRQ_PEND_COUNT can be removed, because we won't
enter the while loop if no interrupts are pending and invalid interrupt
types can't be injected.
KVM: s390: provide functions for blocking all CPUs
Some updates to the control blocks need to be done in a way that
ensures that no CPU is within SIE. Provide wrappers around the
s390_vcpu_block functions and adopt the TOD migration code to
update in a guaranteed fashion. Also rename these functions to
have the kvm_s390_ prefix as everything else.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
exit_sie_sync is used to kick CPUs out of SIE and prevent reentering at
any point in time. This is used to reload the prefix pages and to
set the IBS stuff in a way that guarantees that after this function
returns we are no longer in SIE. All current users trigger KVM requests.
The request must be set before we block the CPUs to avoid races. Let's
make this implicit by adding the request into a new function
kvm_s390_sync_requests that replaces exit_sie_sync and split out
s390_vcpu_block and s390_vcpu_unblock, that can be used to keep
CPUs out of SIE independent of requests.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
KVM: s390: fix external call injection without sigp interpretation
Commit ea5f49692575 ("KVM: s390: only one external call may be pending
at a time") introduced a bug on machines that don't have SIGP
interpretation facility installed.
The injection of an external call will now always fail with -EBUSY
(if none is already pending).
This leads to the following symptoms:
- An external call will be injected but with the wrong "src cpu id",
as this id will not be remembered.
- The target vcpu will not be woken up, therefore the guest will hang if
it cannot deal with unexpected failures of the SIGP EXTERNAL CALL
instruction.
- If an external call is already pending, -EBUSY will not be reported.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Jens Freimann <jfrei@linux.vnet.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Cc: stable@vger.kernel.org # v4.0 Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Paolo Bonzini [Thu, 2 Apr 2015 09:04:05 +0000 (11:04 +0200)]
KVM: MMU: fix CR4.SMEP=1, CR0.WP=0 with shadow pages
smep_andnot_wp is initialized in kvm_init_shadow_mmu and shadow pages
should not be reused for different values of it. Thus, it has to be
added to the mask in kvm_mmu_pte_write.
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Thu, 7 May 2015 08:20:15 +0000 (16:20 +0800)]
KVM: MMU: fix smap permission check
Current permission check assumes that RSVD bit in PFEC is always zero,
however, it is not true since MMIO #PF will use it to quickly identify
MMIO access
Fix it by clearing the bit if walking guest page table is needed
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Heiko Carstens [Tue, 5 May 2015 07:39:08 +0000 (09:39 +0200)]
KVM: remove pointless cpu hotplug messages
On cpu hotplug only KVM emits an unconditional message that its notifier
has been called. It certainly can be assumed that calling cpu hotplug
notifiers work, therefore there is no added value if KVM prints a message.
If an error happens on cpu online KVM will still emit a warning.
So let's remove this superfluous message.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jan Kiszka [Mon, 4 May 2015 06:32:32 +0000 (08:32 +0200)]
KVM: nVMX: Fix host crash when loading MSRs with userspace irqchip
vcpu->arch.apic is NULL when a userspace irqchip is active. But instead
of letting the test incorrectly depend on in-kernel irqchip mode,
open-code it to catch also userspace x2APICs.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Julia Lawall [Mon, 27 Apr 2015 20:35:34 +0000 (22:35 +0200)]
KVM: x86: drop unneeded null test
If the null test is needed, the call to cancel_delayed_work_sync would have
already crashed. Normally, the destroy function should only be called
if the init function has succeeded, in which case ioapic is not null.
Problem found using Coccinelle.
Suggested-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
PAT should be 0007_0406_0007_0406h on RESET and not modified on INIT.
VMX used a wrong value (host's PAT) and while SVM used the right one,
it never got to arch.pat.
This is not an issue with QEMU as it will force the correct value.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently KVM will clear the FPU bits in CR0.TS in the VMCS, and trap to
re-load them every time the guest accesses the FPU after a switch back into
the guest from the host.
This patch copies the x86 task switch semantics for FPU loading, with the
FPU loaded eagerly after first use if the system uses eager fpu mode,
or if the guest uses the FPU frequently.
In the latter case, after loading the FPU for 255 times, the fpu_counter
will roll over, and we will revert to loading the FPU on demand, until
it has been established that the guest is still actively using the FPU.
This mirrors the x86 task switch policy, which seems to work.
Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
James Sullivan [Thu, 19 Mar 2015 01:26:04 +0000 (19:26 -0600)]
kvm: x86: Deliver MSI IRQ to only lowest prio cpu if msi_redir_hint is true
An MSI interrupt should only be delivered to the lowest priority CPU
when it has RH=1, regardless of the delivery mode. Modified
kvm_is_dm_lowest_prio() to check for either irq->delivery_mode == APIC_DM_LOWPRI
or irq->msi_redir_hint.
Moved kvm_is_dm_lowest_prio() into lapic.h and renamed to
kvm_lowest_prio_delivery().
Changed a check in kvm_irq_delivery_to_apic_fast() from
irq->delivery_mode == APIC_DM_LOWPRI to kvm_is_dm_lowest_prio().
Signed-off-by: James Sullivan <sullivan.james.f@gmail.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
James Sullivan [Thu, 19 Mar 2015 01:26:03 +0000 (19:26 -0600)]
kvm: x86: Extended struct kvm_lapic_irq with msi_redir_hint for MSI delivery
Extended struct kvm_lapic_irq with bool msi_redir_hint, which will
be used to determine if the delivery of the MSI should target only
the lowest priority CPU in the logical group specified for delivery.
(In physical dest mode, the RH bit is not relevant). Initialized the value
of msi_redir_hint to true when RH=1 in kvm_set_msi_irq(), and initialized
to false in all other cases.
Added value of msi_redir_hint to a debug message dump of an IRQ in
apic_send_ipi().
Signed-off-by: James Sullivan <sullivan.james.f@gmail.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Mon, 13 Apr 2015 11:34:08 +0000 (14:34 +0300)]
KVM: x86: INIT and reset sequences are different
x86 architecture defines differences between the reset and INIT sequences.
INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU,
MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID and BSP.
References (from Intel SDM):
"If the MP protocol has completed and a BSP is chosen, subsequent INITs (either
to a specific processor or system wide) do not cause the MP protocol to be
repeated." [8.4.2: MP Initialization Protocol Requirements and Restrictions]
[Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT]
"If the processor is reset by asserting the INIT# pin, the x87 FPU state is not
changed." [9.2: X87 FPU INITIALIZATION]
"The state of the local APIC following an INIT reset is the same as it is after
a power-up or hardware reset, except that the APIC ID and arbitration ID
registers are not affected." [10.4.7.3: Local APIC State After an INIT Reset
("Wait-for-SIPI" State)]
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1428924848-28212-1-git-send-email-namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Sun, 12 Apr 2015 22:53:41 +0000 (01:53 +0300)]
KVM: x86: Support for disabling quirks
Introducing KVM_CAP_DISABLE_QUIRKS for disabling x86 quirks that were previous
created in order to overcome QEMU issues. Those issue were mostly result of
invalid VM BIOS. Currently there are two quirks that can be disabled:
1. KVM_QUIRK_LINT0_REENABLED - LINT0 was enabled after boot
2. KVM_QUIRK_CD_NW_CLEARED - CD and NW are cleared after boot
These two issues are already resolved in recent releases of QEMU, and would
therefore be disabled by QEMU.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1428879221-29996-1-git-send-email-namit@cs.technion.ac.il>
[Report capability from KVM_CHECK_EXTENSION too. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Several kvm architectures disable interrupts before kvm_guest_enter.
kvm_guest_enter then uses local_irq_save/restore to disable interrupts
again or for the first time. Lets provide underscore versions of
kvm_guest_{enter|exit} that assume being called locked.
kvm_guest_enter now disables interrupts for the full function and
thus we can remove the check for preemptible.
This patch then adopts s390/kvm to use local_irq_disable/enable calls
which are slighty cheaper that local_irq_save/restore and call these
new functions.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
However, it turns out that kvmclock does provide a stable
sched_clock callback. So, let the scheduler know this which
in turn makes NOHZ_FULL work in the guest.
Linus Torvalds [Mon, 4 May 2015 01:23:53 +0000 (18:23 -0700)]
Merge tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Some miscellaneous bug fixes and some final on-disk and ABI changes
for ext4 encryption which provide better security and performance"
* tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix growing of tiny filesystems
ext4: move check under lock scope to close a race.
ext4: fix data corruption caused by unwritten and delayed extents
ext4 crypto: remove duplicated encryption mode definitions
ext4 crypto: do not select from EXT4_FS_ENCRYPTION
ext4 crypto: add padding to filenames before encrypting
ext4 crypto: simplify and speed up filename encryption
Linus Torvalds [Mon, 4 May 2015 01:15:48 +0000 (18:15 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"One intel fix, one rockchip fix, and a bunch of radeon fixes for some
regressions from audio rework and vm stability"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/i915/chv: Implement WaDisableShadowRegForCpd
drm/radeon: fix userptr return value checking (v2)
drm/radeon: check new address before removing old one
drm/radeon: reset BOs address after clearing it.
drm/radeon: fix lockup when BOs aren't part of the VM on release
drm/radeon: add SI DPM quirk for Sapphire R9 270 Dual-X 2G GDDR5
drm/radeon: adjust pll when audio is not enabled
drm/radeon: only enable audio streams if the monitor supports it
drm/radeon: only mark audio as connected if the monitor supports it (v3)
drm/radeon/audio: don't enable packets until the end
drm/radeon: drop dce6_dp_enable
drm/radeon: fix ordering of AVI packet setup
drm/radeon: Use drm_calloc_ab for CS relocs
drm/rockchip: fix error check when getting irq
MAINTAINERS: add entry for Rockchip drm drivers
Dave Airlie [Sun, 3 May 2015 22:56:47 +0000 (08:56 +1000)]
Merge tag 'drm-intel-fixes-2015-04-30' of git://anongit.freedesktop.org/drm-intel into drm-fixes
Just a single intel fix
* tag 'drm-intel-fixes-2015-04-30' of git://anongit.freedesktop.org/drm-intel:
drm/i915/chv: Implement WaDisableShadowRegForCpd
Dave Airlie [Sun, 3 May 2015 22:56:27 +0000 (08:56 +1000)]
Merge branch 'drm-next0420' of https://github.com/markyzq/kernel-drm-rockchip into drm-fixes
one fix and maintainers update
* 'drm-next0420' of https://github.com/markyzq/kernel-drm-rockchip:
drm/rockchip: fix error check when getting irq
MAINTAINERS: add entry for Rockchip drm drivers
Linus Torvalds [Sun, 3 May 2015 20:22:32 +0000 (13:22 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This is three logical fixes (as 5 patches).
The 3ware class of drivers were causing an oops with multiqueue by
tearing down the command mappings after completing the command (where
the variables in the command used to tear down the mapping were
no-longer valid). There's also a fix for the qnap iscsi target which
was choking on us sending it commands that were too long and a fix for
the reworked aha1542 allocating GFP_KERNEL under a lock"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
3w-9xxx: fix command completion race
3w-xxxx: fix command completion race
3w-sas: fix command completion race
aha1542: Allocate memory before taking a lock
SCSI: add 1024 max sectors black list flag
Linus Torvalds [Sun, 3 May 2015 17:49:04 +0000 (10:49 -0700)]
Merge branch 'next' of git://git.infradead.org/users/vkoul/slave-dma
Pull slave dmaengine fixes from Vinod Koul:
"Here are the fixes in dmaengine subsystem for rc2:
- privatecnt fix for slave dma request API by Christopher
- warn fix for PM ifdef in usb-dmac by Geert
- fix hardware dependency for xgene by Jean"
* 'next' of git://git.infradead.org/users/vkoul/slave-dma:
dmaengine: increment privatecnt when using dma_get_any_slave_channel
dmaengine: xgene: Set hardware dependency
dmaengine: usb-dmac: Protect PM-only functions to kill warning