Bharat Bhushan [Wed, 6 Aug 2014 06:38:51 +0000 (12:08 +0530)]
KVM: PPC: BOOKE: allow debug interrupt at "debug level"
Debug interrupt can be either "critical level" or "debug level".
There are separate set of save/restore registers used for different level.
Example: DSRR0/DSRR1 are used for "debug level" and CSRR0/CSRR1
are used for critical level debug interrupt.
Using CPU_FTR_DEBUG_LVL_EXC to decide which interrupt level to be used.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
kvm: Make init_rmode_identity_map() return 0 on success.
In init_rmode_identity_map(), there two variables indicating the return
value, r and ret, and it return 0 on error, 1 on success. The function
is only called by vmx_create_vcpu(), and ret is redundant.
This patch removes the redundant variable, and makes init_rmode_identity_map()
return 0 on success, -errno on failure.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm: Remove ept_identity_pagetable from struct kvm_arch.
kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But
it is never used to refer to the page at all.
In vcpu initialization, it indicates two things:
1. indicates if ept page is allocated
2. indicates if a memory slot for identity page is initialized
Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept
identity pagetable is initialized. So we can remove ept_identity_pagetable.
NOTE: In the original code, ept identity pagetable page is pinned in memroy.
As a result, it cannot be migrated/hot-removed. After this patch, since
kvm_arch->ept_identity_pagetable is removed, ept identity pagetable page
is no longer pinned in memory. And it can be migrated/hot-removed.
Will Deacon [Tue, 2 Sep 2014 09:27:36 +0000 (10:27 +0100)]
KVM: VFIO: register kvm_device_ops dynamically
Now that we have a dynamic means to register kvm_device_ops, use that
for the VFIO kvm device, instead of relying on the static table.
This is achieved by a module_init call to register the ops with KVM.
Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Alex Williamson <Alex.Williamson@redhat.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that we have a dynamic means to register kvm_device_ops, use that
for the ARM VGIC, instead of relying on the static table.
Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Will Deacon [Tue, 2 Sep 2014 09:27:33 +0000 (10:27 +0100)]
KVM: device: add simple registration mechanism for kvm_device_ops
kvm_ioctl_create_device currently has knowledge of all the device types
and their associated ops. This is fairly inflexible when adding support
for new in-kernel device emulations, so move what we currently have out
into a table, which can support dynamic registration of ops by new
drivers for virtual hardware.
Cc: Alex Williamson <Alex.Williamson@redhat.com> Cc: Alex Graf <agraf@suse.de> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, we call ioapic_service() immediately when we find the irq is still
active during eoi broadcast. But for real hardware, there's some delay between
the EOI writing and irq delivery. If we do not emulate this behavior, and
re-inject the interrupt immediately after the guest sends an EOI and re-enables
interrupts, a guest might spend all its time in the ISR if it has a broken
handler for a level-triggered interrupt.
Such livelock actually happens with Windows guests when resuming from
hibernation.
As there's no way to recognize the broken handle from new raised ones, this patch
delays an interrupt if 10.000 consecutive EOIs found that the interrupt was
still high. The guest can then make a little forward progress, until a proper
IRQ handler is set or until some detection routine in the guest (such as
Linux's note_interrupt()) recognizes the situation.
Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Zhang Haoyu <zhanghy@sangfor.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 11 Sep 2014 09:51:02 +0000 (11:51 +0200)]
KVM: x86: make apic_accept_irq tracepoint more generic
Initially the tracepoint was added only to the APIC_DM_FIXED case,
also because it reported coalesced interrupts that only made sense
for that case. However, the coalesced argument is not used anymore
and tracing other delivery modes is useful, so hoist the call out
of the switch statement.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 11 Sep 2014 09:09:33 +0000 (11:09 +0200)]
Merge tag 'kvm-s390-next-20140910' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-next
KVM: s390: Fixes and features for next (3.18)
1. Crypto/CPACF support: To enable the MSA4 instructions we have to
provide a common control structure for each SIE control block
2. Two cleanups found by a static code checker: one redundant assignment
and one useless if
3. Fix the page handling of the diag10 ballooning interface. If the
guest freed the pages at absolute 0 some checks and frees were
incorrect
4. Limit guests to 16TB
5. Add __must_check to interrupt injection code
KVM: s390/cmm: Fix prefix handling for diag 10 balloon
The old handling of prefix pages was broken in the diag10 ballooner.
We now rely on gmap_discard to check for start > end and do a
slow path if the prefix swap pages are affected:
1. discard the pages from start to prefix
2. discard the absolute 0 pages
3. discard the pages after prefix swap to end
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com>
KVM: s390: get rid of constant condition in ipte_unlock_simple
Due to the earlier check we know that ipte_lock_count must be 0.
No need to add a useless if. Let's make clear that we are going
to always wakeup when we execute that code.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Currently we fill up a full 5 level page table to hold the guest
mapping. Since commit "support gmap page tables with less than 5
levels" we can do better.
Having more than 4 TB might be useful for some testing scenarios,
so let's just limit ourselves to 16TB guest size.
Having more than that is totally untested as I do not have enough
swap space/memory.
We continue to allow ucontrol the full size.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Tony Krowiak [Fri, 27 Jun 2014 18:46:01 +0000 (14:46 -0400)]
KVM: CPACF: Enable MSA4 instructions for kvm guest
We have to provide a per guest crypto block for the CPUs to
enable MSA4 instructions. According to icainfo on z196 or
later this enables CCM-AES-128, CMAC-AES-128, CMAC-AES-192
and CMAC-AES-256.
Signed-off-by: Tony Krowiak <akrowiak@linux.vnet.ibm.com> Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Reviewed-by: Michael Mueller <mimu@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
[split MSA4/protected key into two patches]
Alex Bennée [Tue, 9 Sep 2014 16:27:19 +0000 (17:27 +0100)]
KVM: fix api documentation of KVM_GET_EMULATED_CPUID
It looks like when this was initially merged it got accidentally included
in the following section. I've just moved it back in the correct section
and re-numbered it as other ioctls have been added since.
Alex Bennée [Tue, 9 Sep 2014 16:27:18 +0000 (17:27 +0100)]
KVM: document KVM_SET_GUEST_DEBUG api
In preparation for working on the ARM implementation I noticed the debug
interface was missing from the API document. I've pieced together the
expected behaviour from the code and commit messages written it up as
best I can.
The expression `vcpu->spin_loop.in_spin_loop' is always true,
because it is evaluated only when the condition
`!vcpu->spin_loop.in_spin_loop' is false.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 2 Sep 2014 11:23:06 +0000 (13:23 +0200)]
KVM: x86: propagate exception from permission checks on the nested page fault
Currently, if a permission error happens during the translation of
the final GPA to HPA, walk_addr_generic returns 0 but does not fill
in walker->fault. To avoid this, add an x86_exception* argument
to the translate_gpa function, and let it fill in walker->fault.
The nested_page_fault field will be true, since the walk_mmu is the
nested_mmu and translate_gpu instead operates on the "outer" (NPT)
instance.
Reported-by: Valentine Sinitsyn <valentine.sinitsyn@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 4 Sep 2014 17:46:15 +0000 (19:46 +0200)]
KVM: x86: skip writeback on injection of nested exception
If a nested page fault happens during emulation, we will inject a vmexit,
not a page fault. However because writeback happens after the injection,
we will write ctxt->eip from L2 into the L1 EIP. We do not write back
if an instruction caused an interception vmexit---do the same for page
faults.
Paolo Bonzini [Tue, 2 Sep 2014 11:24:12 +0000 (13:24 +0200)]
KVM: x86: reserve bit 8 of non-leaf PDPEs and PML4Es in 64-bit mode on AMD
Bit 8 would be the "global" bit, which does not quite make sense for non-leaf
page table entries. Intel ignores it; AMD ignores it in PDEs, but reserves it
in PDPEs and PML4Es. The SVM test is relying on this behavior, so enforce it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Mon, 18 Aug 2014 22:46:07 +0000 (15:46 -0700)]
kvm: x86: fix stale mmio cache bug
The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
up to userspace:
(1) Guest accesses gpa X without a memory slot. The gfn is cached in
struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
the SPTE write-execute-noread so that future accesses cause
EPT_MISCONFIGs.
(2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
covering the page just accessed.
(3) Guest attempts to read or write to gpa X again. On Intel, this
generates an EPT_MISCONFIG. The memory slot generation number that
was incremented in (2) would normally take care of this but we fast
path mmio faults through quickly_check_mmio_pf(), which only checks
the per-vcpu mmio cache. Since we hit the cache, KVM passes a
KVM_EXIT_MMIO up to userspace.
This patch fixes the issue by using the memslot generation number
to validate the mmio cache.
Cc: stable@vger.kernel.org Signed-off-by: David Matlack <dmatlack@google.com>
[xiaoguangrong: adjust the code to make it simpler for stable-tree fix.] Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Tested-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
David Matlack [Mon, 18 Aug 2014 22:46:06 +0000 (15:46 -0700)]
kvm: fix potentially corrupt mmio cache
vcpu exits and memslot mutations can run concurrently as long as the
vcpu does not aquire the slots mutex. Thus it is theoretically possible
for memslots to change underneath a vcpu that is handling an exit.
If we increment the memslot generation number again after
synchronize_srcu_expedited(), vcpus can safely cache memslot generation
without maintaining a single rcu_dereference through an entire vm exit.
And much of the x86/kvm code does not maintain a single rcu_dereference
of the current memslots during each exit.
We can prevent the following case:
vcpu (CPU 0) | thread (CPU 1)
--------------------------------------------+--------------------------
1 vm exit |
2 srcu_read_unlock(&kvm->srcu) |
3 decide to cache something based on |
old memslots |
4 | change memslots
| (increments generation)
5 | synchronize_srcu(&kvm->srcu);
6 retrieve generation # from new memslots |
7 tag cache with new memslot generation |
8 srcu_read_unlock(&kvm->srcu) |
... |
<action based on cache occurs even |
though the caching decision was based |
on the old memslots> |
... |
<action *continues* to occur until next |
memslot generation change, which may |
be never> |
|
By incrementing the generation after synchronizing with kvm->srcu readers,
we ensure that the generation retrieved in (6) will become invalid soon
after (8).
Keeping the existing increment is not strictly necessary, but we
do keep it and just move it for consistency from update_memslots to
install_new_memslots. It invalidates old cached MMIOs immediately,
instead of having to wait for the end of synchronize_srcu_expedited,
which makes the code more clearly correct in case CPU 1 is preempted
right after synchronize_srcu() returns.
To avoid halving the generation space in SPTEs, always presume that the
low bit of the generation is zero when reconstructing a generation number
out of an SPTE. This effectively disables MMIO caching in SPTEs during
the call to synchronize_srcu_expedited. Using the low bit this way is
somewhat like a seqcount---where the protected thing is a cache, and
instead of retrying we can simply punt if we observe the low bit to be 1.
Cc: stable@vger.kernel.org Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 20 Aug 2014 12:29:21 +0000 (14:29 +0200)]
KVM: do not bias the generation number in kvm_current_mmio_generation
The next patch will give a meaning (a la seqcount) to the low bit of the
generation number. Ensure that it matches between kvm->memslots->generation
and kvm_current_mmio_generation().
Cc: stable@vger.kernel.org Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 29 Aug 2014 16:56:01 +0000 (18:56 +0200)]
KVM: x86: use guest maxphyaddr to check MTRR values
The check introduced in commit d7a2a246a1b5 (KVM: x86: #GP when attempts to write reserved bits of Variable Range MTRRs, 2014-08-19)
will break if the guest maxphyaddr is higher than the host's (which
sometimes happens depending on your hardware and how QEMU is
configured).
To fix this, use cpuid_maxphyaddr similar to how the APIC_BASE MSR
does already.
Reported-by: Jan Kiszka <jan.kiszka@siemens.com> Tested-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Radim Krčmář [Thu, 28 Aug 2014 13:13:02 +0000 (15:13 +0200)]
KVM: static inline empty kvm_arch functions
Using static inline is going to save few bytes and cycles.
For example on powerpc, the difference is 700 B after stripping.
(5 kB before)
This patch also deals with two overlooked empty functions:
kvm_arch_flush_shadow was not removed from arch/mips/kvm/mips.c 2df72e9bc KVM: split kvm_arch_flush_shadow
and kvm_arch_sched_in never made it into arch/ia64/kvm/kvm-ia64.c. e790d9ef6 KVM: add kvm_arch_sched_in
Signed-off-by: Radim KrÄ\8dmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 29 Aug 2014 12:01:17 +0000 (14:01 +0200)]
KVM: forward declare structs in kvm_types.h
Opaque KVM structs are useful for prototypes in asm/kvm_host.h, to avoid
"'struct foo' declared inside parameter list" warnings (and consequent
breakage due to conflicting types).
Move them from individual files to a generic place in linux/kvm_types.h.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Alex Williamson [Fri, 11 Jul 2014 17:56:31 +0000 (11:56 -0600)]
KVM: x86 emulator: emulate MOVNTDQ
Windows 8.1 guest with NVIDIA driver and GPU fails to boot with an
emulation failure. The KVM spew suggests the fault is with lack of
movntdq emulation (courtesy of Paolo):
Nadav Amit [Fri, 29 Aug 2014 08:26:55 +0000 (11:26 +0300)]
KVM: vmx: VMXOFF emulation in vm86 should cause #UD
Unlike VMCALL, the instructions VMXOFF, VMLAUNCH and VMRESUME should cause a UD
exception in real-mode or vm86. However, the emulator considers all these
instructions the same for the matter of mode checks, and emulation upon exit
due to #UD exception.
As a result, the hypervisor behaves incorrectly on vm86 mode. VMXOFF, VMLAUNCH
or VMRESUME cause on vm86 exit due to #UD. The hypervisor then emulates these
instruction and inject #GP to the guest instead of #UD.
This patch creates a new group for these instructions and mark only VMCALL as
an instruction which can be emulated.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 26 Aug 2014 11:27:46 +0000 (13:27 +0200)]
KVM: x86: fix some sparse warnings
Sparse reports the following easily fixed warnings:
arch/x86/kvm/vmx.c:8795:48: sparse: Using plain integer as NULL pointer
arch/x86/kvm/vmx.c:2138:5: sparse: symbol vmx_read_l1_tsc was not declared. Should it be static?
arch/x86/kvm/vmx.c:6151:48: sparse: Using plain integer as NULL pointer
arch/x86/kvm/vmx.c:8851:6: sparse: symbol vmx_sched_in was not declared. Should it be static?
arch/x86/kvm/svm.c:2162:5: sparse: symbol svm_read_l1_tsc was not declared. Should it be static?
Wanpeng Li [Thu, 21 Aug 2014 11:46:50 +0000 (19:46 +0800)]
KVM: nVMX: nested TPR shadow/threshold emulation
This patch fix bug https://bugzilla.kernel.org/show_bug.cgi?id=61411
TPR shadow/threshold feature is important to speed up the Windows guest.
Besides, it is a must feature for certain VMM.
We map virtual APIC page address and TPR threshold from L1 VMCS. If
TPR_BELOW_THRESHOLD VM exit is triggered by L2 guest and L1 interested
in, we inject it into L1 VMM for handling.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
[Add PAGE_ALIGNED check, do not write useless virtual APIC page address
if TPR shadowing is disabled. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Christoffer Dall [Tue, 26 Aug 2014 12:00:38 +0000 (14:00 +0200)]
KVM: Unconditionally export KVM_CAP_USER_NMI
The idea between capabilities and the KVM_CHECK_EXTENSION ioctl is that
userspace can, at run-time, determine if a feature is supported or not.
This allows KVM to being supporting a new feature with a new kernel
version without any need to update user space. Unfortunately, since the
definition of KVM_CAP_USER_NMI was guarded by #ifdef
__KVM_HAVE_USER_NMI, such discovery still required a user space update.
Therefore, unconditionally export KVM_CAP_USER_NMI and change the
the typo in the comment for the IOCTL number definition as well.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Christoffer Dall [Tue, 26 Aug 2014 12:00:37 +0000 (14:00 +0200)]
KVM: Unconditionally export KVM_CAP_READONLY_MEM
The idea between capabilities and the KVM_CHECK_EXTENSION ioctl is that
userspace can, at run-time, determine if a feature is supported or not.
This allows KVM to being supporting a new feature with a new kernel
version without any need to update user space. Unfortunately, since the
definition of KVM_CAP_READONLY_MEM was guarded by #ifdef
__KVM_HAVE_READONLY_MEM, such discovery still required a user space
update.
Therefore, unconditionally export KVM_CAP_READONLY_MEM and change the
in-kernel conditional to rely on __KVM_HAVE_READONLY_MEM.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: s390/mm: fix up indentation of set_guest_storage_key
commit ab3f285f227f ("KVM: s390/mm: try a cow on read only pages for
key ops")' misaligned a code block. Let's fixup the indentation.
Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 26 Aug 2014 12:31:44 +0000 (14:31 +0200)]
Merge tag 'kvm-s390-next-20140825' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: Fixes and features for 3.18 part 1
1. The usual cleanups: get rid of duplicate code, use defines, factor
out the sync_reg handling, additional docs for sync_regs, better
error handling on interrupt injection
2. We use KVM_REQ_TLB_FLUSH instead of open coding tlb flushes
3. Additional registers for kvm_run sync regs. This is usually not
needed in the fast path due to eventfd/irqfd, but kvm stat claims
that we reduced the overhead of console output by ~50% on my system
4. A rework of the gmap infrastructure. This is the 2nd step towards
host large page support (after getting rid of the storage key
dependency). We introduces two radix trees to store the guest-to-host
and host-to-guest translations. This gets us rid of most of
the page-table walks in the gmap code. Only one in __gmap_link is left,
this one is required to link the shadow page table to the process page
table. Finally this contains the plumbing to support gmap page tables
with less than 5 levels.
KVM: s390/mm: support gmap page tables with less than 5 levels
Add an addressing limit to the gmap address spaces and only allocate
the page table levels that are needed for the given limit. The limit
is fixed and can not be changed after a gmap has been created.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390/mm: use radix trees for guest to host mappings
Store the target address for the gmap segments in a radix tree
instead of using invalid segment table entries. gmap_translate
becomes a simple radix_tree_lookup, gmap_fault is split into the
address translation with gmap_translate and the part that does
the linking of the gmap shadow page table with the process page
table.
A second radix tree is used to keep the pointers to the segment
table entries for segments that are mapped in the guest address
space. On unmap of a segment the pointer is retrieved from the
radix tree and is used to carry out the segment invalidation in
the gmap shadow page table. As the radix tree can only store one
pointer, each host segment may only be mapped to exactly one
guest location.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Paolo Bonzini [Mon, 25 Aug 2014 13:37:00 +0000 (15:37 +0200)]
Merge tag 'kvm-s390-20140825' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
Here are two fixes for s390 KVM code that prevent:
1. a malicious user to trigger a kernel BUG
2. a malicious user to change the storage key of read-only pages
KVM: s390/mm: cleanup gmap function arguments, variable names
Make the order of arguments for the gmap calls more consistent,
if the gmap pointer is passed it is always the first argument.
In addition distinguish between guest address and user address
by naming the variables gaddr for a guest address and vmaddr for
a user address.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Jens Freimann [Mon, 11 Aug 2014 13:39:43 +0000 (15:39 +0200)]
KVM: s390: don't use kvm lock in interrupt injection code
The kvm lock protects us against vcpus going away, but they only go
away when the virtual machine is shut down. We don't need this
mutex here, so let's get rid of it.
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com> Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: return -EFAULT if lowcore is not mapped during irq delivery
Currently we just kill the userspace process and exit the thread
immediatly without making sure that we don't hold any locks etc.
Improve this by making KVM_RUN return -EFAULT if the lowcore is not
mapped during interrupt delivery. To achieve this we need to pass
the return code of guest memory access routines used in interrupt
delivery all the way back to the KVM_RUN ioctl.
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com> Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: implement KVM_REQ_TLB_FLUSH and make use of it
Use the KVM_REQ_TLB_FLUSH request in order to trigger tlb flushes instead
of manipulating the SIE control block whenever we need it. Also trigger it for
a control register sync directly instead of (ab)using kvm_s390_set_prefix().
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: synchronize more registers with kvm_run
In order to reduce the number of syscalls when dropping to user space, this
patch enables the synchronization of the following "registers" with kvm_run:
- ARCH0: CPU timer, clock comparator, TOD programmable register,
guest breaking-event register, program parameter
- PFAULT: pfault parameters (token, select, compare)
The registers are grouped to reduce the overhead when syncing.
As this grows the number of sync registers quite a bit, let's move the code
synchronizing registers with kvm_run from kvm_arch_vcpu_ioctl_run() into
separate helper routines.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
This patch clarifies that kvm_dirty_regs are just a hint to the kernel and
that the kernel might just ignore some flags and sync the values (like done for
acrs and gprs now).
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390/mm: try a cow on read only pages for key ops
The PFMF instruction handler blindly wrote the storage key even if
the page was mapped R/O in the host. Lets try a COW before continuing
and bail out in case of errors.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Cc: stable@vger.kernel.org
In the early days, we had some special handling for the
KVM_EXIT_S390_SIEIC exit, but this was gone in 2009 with commit d7b0b5eb3000 (KVM: s390: Make psw available on all exits, not
just a subset).
Now this switch statement is just a sanity check for userspace
not messing with the kvm_run structure. Unfortunately, this
allows userspace to trigger a kernel BUG. Let's just remove
this switch statement.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Cc: stable@vger.kernel.org
Radim Krčmář [Thu, 21 Aug 2014 16:08:08 +0000 (18:08 +0200)]
KVM: VMX: dynamise PLE window
Window is increased on every PLE exit and decreased on every sched_in.
The idea is that we don't want to PLE exit if there is no preemption
going on.
We do this with sched_in() because it does not hold rq lock.
There are two new kernel parameters for changing the window:
ple_window_grow and ple_window_shrink
ple_window_grow affects the window on PLE exit and ple_window_shrink
does it on sched_in; depending on their value, the window is modifier
like this: (ple_window is kvm_intel's global)
Radim Krčmář [Thu, 21 Aug 2014 16:08:05 +0000 (18:08 +0200)]
KVM: add kvm_arch_sched_in
Introduce preempt notifiers for architecture specific code.
Advantage over creating a new notifier in every arch is slightly simpler
code and guaranteed call order with respect to kvm_sched_in.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We dont have to wait for a grace period if there is no oldpid that
we are going to free. putpid also checks for NULL, so this patch
only fences synchronize_rcu.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 20 Aug 2014 08:08:23 +0000 (10:08 +0200)]
KVM: emulate: warn on invalid or uninitialized exception numbers
These were reported when running Jailhouse on AMD processors.
Initialize ctxt->exception.vector with an invalid exception number,
and warn if it remained invalid even though the emulator got
an X86EMUL_PROPAGATE_FAULT return code.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Wed, 20 Aug 2014 10:25:52 +0000 (13:25 +0300)]
KVM: x86: Clarify PMU related features bit manipulation
kvm_pmu_cpuid_update makes a lot of bit manuiplation operations, when in fact
there are already unions that can be used instead. Changing the bit
manipulation to the union for clarity. This patch does not change the
functionality.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wanpeng Li [Wed, 20 Aug 2014 07:31:53 +0000 (15:31 +0800)]
KVM: vmx: fix ept reserved bits for 1-GByte page
EPT misconfig handler in kvm will check which reason lead to EPT
misconfiguration after vmexit. One of the reasons is that an EPT
paging-structure entry is configured with settings reserved for
future functionality. However, the handler can't identify if
paging-structure entry of reserved bits for 1-GByte page are
configured, since PDPTE which point to 1-GByte page will reserve
bits 29:12 instead of bits 7:3 which are reserved for PDPTE that
references an EPT Page Directory. This patch fix it by reserve
bits 29:12 for 1-GByte page.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Mon, 18 Aug 2014 21:03:00 +0000 (00:03 +0300)]
KVM: x86: recalculate_apic_map after enabling apic
Currently, recalculate_apic_map ignores vcpus whose lapic is software disabled
through the spurious interrupt vector. However, once it is re-enabled, the map
is not recalculated. Therefore, if the guest OS configured DFR while lapic is
software-disabled, the map may be incorrect. This patch recalculates apic map
after software enabling the lapic.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Mon, 18 Aug 2014 19:42:13 +0000 (22:42 +0300)]
KVM: x86: Clear apic tsc-deadline after deadline
Intel SDM 10.5.4.1 says "When the timer generates an interrupt, it disarms
itself and clears the IA32_TSC_DEADLINE MSR".
This patch clears the MSR upon timer interrupt delivery which delivered on
deadline mode. Since the MSR may be reconfigured while an interrupt is
pending, causing the new value to be overriden, pending timer interrupts are
checked before setting a new deadline.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wanpeng Li [Tue, 19 Aug 2014 09:04:40 +0000 (17:04 +0800)]
KVM: x86: #GP when attempts to write reserved bits of Variable Range MTRRs
Section 11.11.2.3 of the SDM mentions "All other bits in the IA32_MTRR_PHYSBASEn
and IA32_MTRR_PHYSMASKn registers are reserved; the processor generates a
general-protection exception(#GP) if software attempts to write to them". This
patch do it in kvm.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wanpeng Li [Tue, 19 Aug 2014 09:04:39 +0000 (17:04 +0800)]
KVM: x86: fix check legal type of Variable Range MTRRs
The first entry in each pair(IA32_MTRR_PHYSBASEn) defines the base
address and memory type for the range; the second entry(IA32_MTRR_PHYSMASKn)
contains a mask used to determine the address range. The legal values
for the type field of IA32_MTRR_PHYSBASEn are 0,1,4,5, and 6. However,
IA32_MTRR_PHYSMASKn don't have type field. This patch avoid check if
the type field is legal for IA32_MTRR_PHYSMASKn.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Monam Agarwal [Sat, 22 Mar 2014 06:58:10 +0000 (12:28 +0530)]
arch/x86: Use RCU_INIT_POINTER(x, NULL) in kvm/vmx.c
Here rcu_assign_pointer() is ensuring that the
initialization of a structure is carried out before storing a pointer
to that structure.
So, rcu_assign_pointer(p, NULL) can always safely be converted to
RCU_INIT_POINTER(p, NULL).
Signed-off-by: Monam Agarwal <monamagarwal123@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wanpeng Li [Mon, 18 Aug 2014 09:50:28 +0000 (17:50 +0800)]
KVM: x86: drop fpu_activate hook
The only user of the fpu_activate hook was dropped in commit 2d04a05bd7e9 (KVM: x86 emulator: emulate CLTS internally, 2011-04-20).
vmx_fpu_activate and svm_fpu_activate are still called on #NM (and for
Intel CLTS), but never from common code; hence, there's no need for
a hook.
Reviewed-by: Yang Zhang <yang.z.zhang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wei Huang [Wed, 13 Aug 2014 16:06:14 +0000 (12:06 -0400)]
KVM: SVM: add rdmsr support for AMD event registers
Current KVM only supports RDMSR for K7_EVNTSEL0 and K7_PERFCTR0
MSRs. Reading the rest MSRs will trigger KVM to inject #GP into
guest VM. This causes a warning message "Failed to access perfctr
msr (MSR c0010001 is ffffffffffffffff)" on AMD host. This patch
adds RDMSR support for all K7_EVNTSELn and K7_PERFCTRn registers
and thus supresses the warning message.
Signed-off-by: Wei Huang <wehuang@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Chen Gang [Fri, 8 Aug 2014 15:37:59 +0000 (23:37 +0800)]
virt/kvm/assigned-dev.c: Set 'dev->irq_source_id' to '-1' after free it
As a generic function, deassign_guest_irq() assumes it can be called
even if assign_guest_irq() is not be called successfully (which can be
triggered by ioctl from user mode, indirectly).
So for assign_guest_irq() failure process, need set 'dev->irq_source_id'
to -1 after free 'dev->irq_source_id', or deassign_guest_irq() may free
it again.
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SeaBIOS has a limit on the number of MTRRs that it can handle,
and this patch exceeded the limit. Better revert it.
Thanks to Nadav Amit for debugging the cause.
Cc: stable@nongnu.org Reported-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 18 Aug 2014 11:15:51 +0000 (13:15 +0200)]
KVM: x86: do not check CS.DPL against RPL during task switch
This reverts the check added by commit 5045b468037d (KVM: x86: check CS.DPL
against RPL during task switch, 2014-05-15). Although the CS.DPL=CS.RPL
check is mentioned in table 7-1 of the SDM as causing a #TSS exception,
it is not mentioned in table 6-6 that lists "invalid TSS conditions"
which cause #TSS exceptions. In fact it causes some tests to fail, which
pass on bare-metal.
Keep the rest of the commit, since we will find new uses for it in 3.18.
Reported-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Wed, 13 Aug 2014 13:50:13 +0000 (16:50 +0300)]
KVM: x86: Avoid emulating instructions on #UD mistakenly
Commit d40a6898e5 mistakenly caused instructions which are not marked as
EmulateOnUD to be emulated upon #UD exception. The commit caused the check of
whether the instruction flags include EmulateOnUD to never be evaluated. As a
result instructions whose emulation is broken may be emulated. This fix moves
the evaluation of EmulateOnUD so it would be evaluated.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
[Tweak operand order in &&, remove EmulateOnUD where it's now superfluous.
- Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
PC, KVM, CMA: Fix regression caused by wrong get_order() use
fc95ca7284bc54953165cba76c3228bd2cdb9591 claims that there is no
functional change but this is not true as it calls get_order() (which
takes bytes) where it should have called order_base_2() and the kernel
stops on VM_BUG_ON().
This replaces get_order() with order_base_2() (round-up version of ilog2).
Suggested-by: Paul Mackerras <paulus@samba.org> Cc: Alexander Graf <agraf@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm: iommu: fix the third parameter of kvm_iommu_put_pages (CVE-2014-3601)
The third parameter of kvm_iommu_put_pages is wrong,
It should be 'gfn - slot->base_gfn'.
By making gfn very large, malicious guest or userspace can cause kvm to
go to this error path, and subsequently to pass a huge value as size.
Alternatively if gfn is small, then pages would be pinned but never
unpinned, causing host memory leak and local DOS.
Passing a reasonable but large value could be the most dangerous case,
because it would unpin a page that should have stayed pinned, and thus
allow the device to DMA into arbitrary memory. However, this cannot
happen because of the condition that can trigger the error:
- out of memory (where you can't allocate even a single page)
should not be possible for the attacker to trigger
- when exceeding the iommu's address space, guest pages after gfn
will also exceed the iommu's address space, and inside
kvm_iommu_put_pages() the iommu_iova_to_phys() will fail. The
page thus would not be unpinned at all.
Reported-by: Jack Morgenstein <jackm@mellanox.com> Cc: stable@vger.kernel.org Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Linus Torvalds [Sat, 16 Aug 2014 15:32:27 +0000 (09:32 -0600)]
Merge branch 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86
Pull x86 platform driver updates from Matthew Garrett:
"A moderate number of changes, but nothing awfully significant.
A lot of const cleanups, some reworking and additions to the rfkill
quirks in the asus driver, a new driver for generating falling laptop
events on Toshibas and some misc fixes.
Maybe vendors have stopped inventing things"
* 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86: (41 commits)
platform/x86: Enable build support for toshiba_haps
Documentation: Add file about toshiba_haps module
platform/x86: Toshiba HDD Active Protection Sensor
asus-nb-wmi: Add wapf4 quirk for the U32U
alienware-wmi: make hdmi_mux enabled on case-by-case basis
ideapad-laptop: Constify DMI table and other r/o variables
asus-nb-wmi.c: Rename x401u quirk to wapf4
compal-laptop: correct invalid hwmon name
toshiba_acpi: Add Qosmio X75-A to the alt keymap dmi list
toshiba_acpi: Add extra check to backlight code
Fix log message about future removal of interface
ideapad-laptop: Disable touchpad interface on Yoga models
asus-nb-wmi: Add wapf4 quirk for the X550CC
intel_ips: Make ips_mcp_limits variables static
thinkpad_acpi: Mark volume_alsa_control_{vol,mute} as __initdata
fujitsu-laptop: Mark fujitsu_dmi_table[] DMI table as __initconst
hp-wmi: Add missing __init annotations to initialization code
hp_accel: Constify ACPI and DMI tables
fujitsu-tablet: Mark DMI callbacks as __init code
dell-laptop: Mark dell_quirks[] DMI table as __initconst
...
Linus Torvalds [Sat, 16 Aug 2014 15:25:34 +0000 (09:25 -0600)]
Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
Pull idle update from Len Brown:
"Two Intel-platform-specific updates to intel_idle, and a cosmetic
tweak to the turbostat utility"
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
tools/power turbostat: tweak whitespace in output format
intel_idle: Broadwell support
intel_idle: Disable Baytrail Core and Module C6 auto-demotion
Linus Torvalds [Sat, 16 Aug 2014 15:24:41 +0000 (09:24 -0600)]
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module fix from Rusty Russell:
"Nasty potential bug if someone uses a known module param with an
invalid value (we don't fail unknown module params any more, just
warn)"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
module: Clean up ro/nx after early module load failures
Linus Torvalds [Sat, 16 Aug 2014 15:06:55 +0000 (09:06 -0600)]
Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs updates from Chris Mason:
"These are all fixes I'd like to get out to a broader audience.
The biggest of the bunch is Mark's quota fix, which is also in the
SUSE kernel, and makes our subvolume quotas dramatically more
accurate.
I've been running xfstests with these against your current git
overnight, but I'm queueing up longer tests as well"
* 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: disable strict file flushes for renames and truncates
Btrfs: fix csum tree corruption, duplicate and outdated checksums
Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch
Btrfs: fix compressed write corruption on enospc
btrfs: correctly handle return from ulist_add
btrfs: qgroup: account shared subtrees during snapshot delete
Btrfs: read lock extent buffer while walking backrefs
Btrfs: __btrfs_mod_ref should always use no_quota
btrfs: adjust statfs calculations according to raid profiles
Linus Torvalds [Sat, 16 Aug 2014 14:58:47 +0000 (08:58 -0600)]
Merge tag 'locks-v3.17-2' of git://git.samba.org/jlayton/linux
Pull file locking bugfixes from Jeff Layton:
"Most of these patches are to fix a long-standing regression that crept
in when the BKL was removed from the file-locking code. The code was
converted to use a conventional spinlock, but some fl_release_private
ops can block and you can end up sleeping inside the lock.
There's also a patch to make /proc/locks show delegations as 'DELEG'"
* tag 'locks-v3.17-2' of git://git.samba.org/jlayton/linux:
locks: update Locking documentation to clarify fl_release_private behavior
locks: move locks_free_lock calls in do_fcntl_add_lease outside spinlock
locks: defer freeing locks in locks_delete_lock until after i_lock has been dropped
locks: don't reuse file_lock in __posix_lock_file
locks: don't call locks_release_private from locks_copy_lock
locks: show delegations as "DELEG" in /proc/locks
Linus Torvalds [Sat, 16 Aug 2014 14:56:27 +0000 (08:56 -0600)]
Merge git://git.kvack.org/~bcrl/aio-next
Pull aio updates from Ben LaHaise.
* git://git.kvack.org/~bcrl/aio-next:
aio: use iovec array rather than the single one
aio: fix some comments
aio: use the macro rather than the inline magic number
aio: remove the needless registration of ring file's private_data
aio: remove no longer needed preempt_disable()
aio: kill the misleading rcu read locks in ioctx_add_table() and kill_ioctx()
aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()