Alexander Graf [Thu, 31 Jul 2014 08:21:59 +0000 (10:21 +0200)]
KVM: PPC: PR: Handle FSCR feature deselects
We handle FSCR feature bits (well, TAR only really today) lazily when the guest
starts using them. So when a guest activates the bit and later uses that feature
we enable it for real in hardware.
However, when the guest stops using that bit we don't stop setting it in
hardware. That means we can potentially lose a trap that the guest expects to
happen because it thinks a feature is not active.
This patch adds support to drop TAR when then guest turns it off in FSCR. While
at it it also restricts FSCR access to 64bit systems - 32bit ones don't have it.
Now that we have properly split load/store instruction emulation and generic
instruction emulation, we can move the generic one from kvm.ko to kvm-pr.ko
on book3s_64.
This reduces the attack surface and amount of code loaded on HV KVM kernels.
KVM: PPC: BOOKEHV: rename e500hv_spr to bookehv_spr
This are not specific to e500hv but applicable for bookehv
(As per comment from Scott Wood on my patch
"kvm: ppc: bookehv: Added wrapper macros for shadow registers")
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
Alexander Graf [Wed, 18 Jun 2014 19:56:55 +0000 (21:56 +0200)]
KVM: PPC: Expose helper functions for data/inst faults
We're going to implement guest code interpretation in KVM for some rare
corner cases. This code needs to be able to inject data and instruction
faults into the guest when it encounters them.
Expose generic APIs to do this in a reasonably subarch agnostic fashion.
Alexander Graf [Wed, 18 Jun 2014 12:53:49 +0000 (14:53 +0200)]
KVM: PPC: Separate loadstore emulation from priv emulation
Today the instruction emulator can get called via 2 separate code paths. It
can either be called by MMIO emulation detection code or by privileged
instruction traps.
This is bad, as both code paths prepare the environment differently. For MMIO
emulation we already know the virtual address we faulted on, so instructions
there don't have to actually fetch that information.
Split out the two separate use cases into separate files.
Alexander Graf [Fri, 20 Jun 2014 12:43:36 +0000 (14:43 +0200)]
KVM: PPC: Handle magic page in kvmppc_ld/st
We use kvmppc_ld and kvmppc_st to emulate load/store instructions that may as
well access the magic page. Special case it out so that we can properly access
it.
Alexander Graf [Fri, 20 Jun 2014 12:17:30 +0000 (14:17 +0200)]
KVM: PPC: Use kvm_read_guest in kvmppc_ld
We have a nice and handy helper to read from guest physical address space,
so we should make use of it in kvmppc_ld as we already do for its counterpart
in kvmppc_st.
Alexander Graf [Fri, 20 Jun 2014 11:58:16 +0000 (13:58 +0200)]
KVM: PPC: Move kvmppc_ld/st to common code
We have enough common infrastructure now to resolve GVA->GPA mappings at
runtime. With this we can move our book3s specific helpers to load / store
in guest virtual address space to common code as well.
Alexander Graf [Fri, 20 Jun 2014 11:52:36 +0000 (13:52 +0200)]
KVM: PPC: Implement kvmppc_xlate for all targets
We have a nice API to find the translated GPAs of a GVA including protection
flags. So far we only use it on Book3S, but there's no reason the same shouldn't
be used on BookE as well.
Implement a kvmppc_xlate() version for BookE and clean it up to make it more
readable in general.
KVM: PPC: BOOK3S: HV: Update compute_tlbie_rb to handle 16MB base page
When calculating the lower bits of AVA field, use the shift
count based on the base page size. Also add the missing segment
size and remove stale comment.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
Alexander Graf [Mon, 14 Jul 2014 16:55:19 +0000 (18:55 +0200)]
KVM: PPC: Book3S: Provide different CAPs based on HV or PR mode
With Book3S KVM we can create both PR and HV VMs in parallel on the same
machine. That gives us new challenges on the CAPs we return - both have
different capabilities.
When we get asked about CAPs on the kvm fd, there's nothing we can do. We
can try to be smart and assume we're running HV if HV is available, PR
otherwise. However with the newly added VM CHECK_EXTENSION we can now ask
for capabilities directly on a VM which knows whether it's PR or HV.
With this patch I can successfully expose KVM PVINFO data to user space
in the PR case, fixing magic page mapping for PAPR guests.
Signed-off-by: Alexander Graf <agraf@suse.de> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Alexander Graf [Mon, 14 Jul 2014 16:33:08 +0000 (18:33 +0200)]
KVM: Allow KVM_CHECK_EXTENSION on the vm fd
The KVM_CHECK_EXTENSION is only available on the kvm fd today. Unfortunately
on PPC some of the capabilities change depending on the way a VM was created.
So instead we need a way to expose capabilities as VM ioctl, so that we can
see which VM type we're using (HV or PR). To enable this, add the
KVM_CHECK_EXTENSION ioctl to our vm ioctl portfolio.
Signed-off-by: Alexander Graf <agraf@suse.de> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Alexander Graf [Mon, 14 Jul 2014 16:27:35 +0000 (18:27 +0200)]
KVM: Rename and add argument to check_extension
In preparation to make the check_extension function available to VM scope
we add a struct kvm * argument to the function header and rename the function
accordingly. It will still be called from the /dev/kvm fd, but with a NULL
argument for struct kvm *.
Signed-off-by: Alexander Graf <agraf@suse.de> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Stewart Smith [Fri, 18 Jul 2014 04:18:43 +0000 (14:18 +1000)]
Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
The POWER8 processor has a Micro Partition Prefetch Engine, which is
a fancy way of saying "has way to store and load contents of L2 or
L2+MRU way of L3 cache". We initiate the storing of the log (list of
addresses) using the logmpp instruction and start restore by writing
to a SPR.
The logmpp instruction takes parameters in a single 64bit register:
- starting address of the table to store log of L2/L2+L3 cache contents
- 32kb for L2
- 128kb for L2+L3
- Aligned relative to maximum size of the table (32kb or 128kb)
- Log control (no-op, L2 only, L2 and L3, abort logout)
We should abort any ongoing logging before initiating one.
To initiate restore, we write to the MPPR SPR. The format of what to write
to the SPR is similar to the logmpp instruction parameter:
- starting address of the table to read from (same alignment requirements)
- table size (no data, until end of table)
- prefetch rate (from fastest possible to slower. about every 8, 16, 24 or
32 cycles)
The idea behind loading and storing the contents of L2/L3 cache is to
reduce memory latency in a system that is frequently swapping vcores on
a physical CPU.
The best case scenario for doing this is when some vcores are doing very
cache heavy workloads. The worst case is when they have about 0 cache hits,
so we just generate needless memory operations.
This implementation just does L2 store/load. In my benchmarks this proves
to be useful.
Benchmark 1:
- 16 core POWER8
- 3x Ubuntu 14.04LTS guests (LE) with 8 VCPUs each
- No split core/SMT
- two guests running sysbench memory test.
sysbench --test=memory --num-threads=8 run
- one guest running apache bench (of default HTML page)
ab -n 490000 -c 400 http://localhost/
This benchmark aims to measure performance of real world application (apache)
where other guests are cache hot with their own workloads. The sysbench memory
benchmark does pointer sized writes to a (small) memory buffer in a loop.
In this benchmark with this patch I can see an improvement both in requests
per second (~5%) and in mean and median response times (again, about 5%).
The spread of minimum and maximum response times were largely unchanged.
benchmark 2:
- Same VM config as benchmark 1
- all three guests running sysbench memory benchmark
This benchmark aims to see if there is a positive or negative affect to this
cache heavy benchmark. Although due to the nature of the benchmark (stores) we
may not see a difference in performance, but rather hopefully an improvement
in consistency of performance (when vcore switched in, don't have to wait
many times for cachelines to be pulled in)
The results of this benchmark are improvements in consistency of performance
rather than performance itself. With this patch, the few outliers in duration
go away and we get more consistent performance in each guest.
benchmark 3:
- same 3 guests and CPU configuration as benchmark 1 and 2.
- two idle guests
- 1 guest running STREAM benchmark
This scenario also saw performance improvement with this patch. On Copy and
Scale workloads from STREAM, I got 5-6% improvement with this patch. For
Add and triad, it was around 10% (or more).
benchmark 4:
- same 3 guests as previous benchmarks
- two guests running sysbench --memory, distinctly different cache heavy
workload
- one guest running STREAM benchmark.
Stewart Smith [Fri, 18 Jul 2014 04:18:42 +0000 (14:18 +1000)]
Split out struct kvmppc_vcore creation to separate function
No code changes, just split it out to a function so that with the addition
of micro partition prefetch buffer allocation (in subsequent patch) looks
neater and doesn't require excessive indentation.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
Paul Mackerras [Sat, 19 Jul 2014 07:59:36 +0000 (17:59 +1000)]
KVM: PPC: Book3S: Make kvmppc_ld return a more accurate error indication
At present, kvmppc_ld calls kvmppc_xlate, and if kvmppc_xlate returns
any error indication, it returns -ENOENT, which is taken to mean an
HPTE not found error. However, the error could have been a segment
found (no SLB entry) or a permission error. Similarly,
kvmppc_pte_to_hva currently does permission checking, but any error
from it is taken by kvmppc_ld to mean that the access is an emulated
MMIO access. Also, kvmppc_ld does no execute permission checking.
This fixes these problems by (a) returning any error from kvmppc_xlate
directly, (b) moving the permission check from kvmppc_pte_to_hva
into kvmppc_ld, and (c) adding an execute permission check to kvmppc_ld.
This is similar to what was done for kvmppc_st() by commit 82ff911317c3
("KVM: PPC: Deflect page write faults properly in kvmppc_st").
Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
Paul Mackerras [Sat, 19 Jul 2014 07:59:35 +0000 (17:59 +1000)]
KVM: PPC: Book3S PR: Take SRCU read lock around RTAS kvm_read_guest() call
This does for PR KVM what c9438092cae4 ("KVM: PPC: Book3S HV: Take SRCU
read lock around kvm_read_guest() call") did for HV KVM, that is,
eliminate a "suspicious rcu_dereference_check() usage!" warning by
taking the SRCU lock around the call to kvmppc_rtas_hcall().
It also fixes a return of RESUME_HOST to return EMULATE_FAIL instead,
since kvmppc_h_pr() is supposed to return EMULATE_* values.
Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: stable@vger.kernel.org Signed-off-by: Alexander Graf <agraf@suse.de>
Unfortunately, the LPCR got defined as a 32-bit register in the
one_reg interface. This is unfortunate because KVM allows userspace
to control the DPFD (default prefetch depth) field, which is in the
upper 32 bits. The result is that DPFD always get set to 0, which
reduces performance in the guest.
We can't just change KVM_REG_PPC_LPCR to be a 64-bit register ID,
since that would break existing userspace binaries. Instead we define
a new KVM_REG_PPC_LPCR_64 id which is 64-bit. Userspace can still use
the old KVM_REG_PPC_LPCR id, but it now only modifies those fields in
the bottom 32 bits that userspace can modify (ILE, TC and AIL).
If userspace uses the new KVM_REG_PPC_LPCR_64 id, it can modify DPFD
as well.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: stable@vger.kernel.org Signed-off-by: Alexander Graf <agraf@suse.de>
Alexander Graf [Fri, 25 Jul 2014 08:38:59 +0000 (10:38 +0200)]
KVM: PPC: Remove 440 support
The 440 target hasn't been properly functioning for a few releases and
before I was the only one who fixes a very serious bug that indicates to
me that nobody used it before either.
Furthermore KVM on 440 is slow to the extent of unusable.
We don't have to carry along completely unused code. Remove 440 and give
us one less thing to worry about.
KVM: PPC: Remove comment saying SPRG1 is used for vcpu pointer
Scott Wood pointed out that We are no longer using SPRG1 for vcpu pointer,
but using SPRN_SPRG_THREAD <=> SPRG3 (thread->vcpu). So this comment
is not valid now.
Note: SPRN_SPRG3R is not supported (do not see any need as of now),
and if we want to support this in future then we have to shift to using
SPRG1 for VCPU pointer.
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
Mihai Caraman [Wed, 23 Jul 2014 16:06:22 +0000 (19:06 +0300)]
KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
On book3e, KVM uses load external pid (lwepx) dedicated instruction to read
guest last instruction on the exit path. lwepx exceptions (DTLB_MISS, DSI
and LRAT), generated by loading a guest address, needs to be handled by KVM.
These exceptions are generated in a substituted guest translation context
(EPLC[EGS] = 1) from host context (MSR[GS] = 0).
Currently, KVM hooks only interrupts generated from guest context (MSR[GS] = 1),
doing minimal checks on the fast path to avoid host performance degradation.
lwepx exceptions originate from host state (MSR[GS] = 0) which implies
additional checks in DO_KVM macro (beside the current MSR[GS] = 1) by looking
at the Exception Syndrome Register (ESR[EPID]) and the External PID Load Context
Register (EPLC[EGS]). Doing this on each Data TLB miss exception is obvious
too intrusive for the host.
Read guest last instruction from kvmppc_load_last_inst() by searching for the
physical address and kmap it. This address the TODO for TLB eviction and
execute-but-not-read entries, and allow us to get rid of lwepx until we are
able to handle failures.
A simple stress benchmark shows a 1% sys performance degradation compared with
previous approach (lwepx without failure handling):
time for i in `seq 1 10000`; do /bin/echo > /dev/null; done
real 0m 8.85s
user 0m 4.34s
sys 0m 4.48s
vs
real 0m 8.84s
user 0m 4.36s
sys 0m 4.44s
A solution to use lwepx and to handle its exceptions in KVM would be to temporary
highjack the interrupt vector from host. This imposes additional synchronizations
for cores like FSL e6500 that shares host IVOR registers between hardware threads.
This optimized solution can be later developed on top of this patch.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
Mihai Caraman [Wed, 23 Jul 2014 16:06:21 +0000 (19:06 +0300)]
KVM: PPC: Allow kvmppc_get_last_inst() to fail
On book3e, guest last instruction is read on the exit path using load
external pid (lwepx) dedicated instruction. This load operation may fail
due to TLB eviction and execute-but-not-read entries.
This patch lay down the path for an alternative solution to read the guest
last instruction, by allowing kvmppc_get_lat_inst() function to fail.
Architecture specific implmentations of kvmppc_load_last_inst() may read
last guest instruction and instruct the emulation layer to re-execute the
guest in case of failure.
Make kvmppc_get_last_inst() definition common between architectures.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
Mihai Caraman [Wed, 23 Jul 2014 16:06:20 +0000 (19:06 +0300)]
KVM: PPC: Book3s: Remove kvmppc_read_inst() function
In the context of replacing kvmppc_ld() function calls with a version of
kvmppc_get_last_inst() which allow to fail, Alex Graf suggested this:
"If we get EMULATE_AGAIN, we just have to make sure we go back into the guest.
No need to inject an ISI into the guest - it'll do that all by itself.
With an error returning kvmppc_get_last_inst we can just use completely
get rid of kvmppc_read_inst() and only use kvmppc_get_last_inst() instead."
As a intermediate step get rid of kvmppc_read_inst() and only use kvmppc_ld()
instead.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
Mihai Caraman [Wed, 23 Jul 2014 16:06:18 +0000 (19:06 +0300)]
KVM: PPC: e500mc: Revert "add load inst fixup"
The commit 1d628af7 "add load inst fixup" made an attempt to handle
failures generated by reading the guest current instruction. The fixup
code that was added works by chance hiding the real issue.
Load external pid (lwepx) instruction, used by KVM to read guest
instructions, is executed in a subsituted guest translation context
(EPLC[EGS] = 1). In consequence lwepx's TLB error and data storage
interrupts need to be handled by KVM, even though these interrupts
are generated from host context (MSR[GS] = 0) where lwepx is executed.
Currently, KVM hooks only interrupts generated from guest context
(MSR[GS] = 1), doing minimal checks on the fast path to avoid host
performance degradation. As a result, the host kernel handles lwepx
faults searching the faulting guest data address (loaded in DEAR) in
its own Logical Partition ID (LPID) 0 context. In case a host translation
is found the execution returns to the lwepx instruction instead of the
fixup, the host ending up in an infinite loop.
Revert the commit "add load inst fixup". lwepx issue will be addressed
in a subsequent patch without needing fixup code.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
kvm: ppc: bookehv: Added wrapper macros for shadow registers
There are shadow registers like, GSPRG[0-3], GSRR0, GSRR1 etc on
BOOKE-HV and these shadow registers are guest accessible.
So these shadow registers needs to be updated on BOOKE-HV.
This patch adds new macro for get/set helper of shadow register .
Signed-off-by: Bharat Bhushan <Bharat.Bhushan@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
Alexander Graf [Sun, 13 Jul 2014 14:37:12 +0000 (16:37 +0200)]
KVM: PPC: Book3S: Make magic page properly 4k mappable
The magic page is defined as a 4k page of per-vCPU data that is shared
between the guest and the host to accelerate accesses to privileged
registers.
However, when the host is using 64k page size granularity we weren't quite
as strict about that rule anymore. Instead, we partially treated all of the
upper 64k as magic page and mapped only the uppermost 4k with the actual
magic contents.
This works well enough for Linux which doesn't use any memory in kernel
space in the upper 64k, but Mac OS X got upset. So this patch makes magic
page actually stay in a 4k range even on 64k page size hosts.
This patch fixes magic page usage with Mac OS X (using MOL) on 64k PAGE_SIZE
hosts for me.
Alexander Graf [Fri, 11 Jul 2014 00:58:58 +0000 (02:58 +0200)]
KVM: PPC: Book3S: Add hack for split real mode
Today we handle split real mode by mapping both instruction and data faults
into a special virtual address space that only exists during the split mode
phase.
This is good enough to catch 32bit Linux guests that use split real mode for
copy_from/to_user. In this case we're always prefixed with 0xc0000000 for our
instruction pointer and can map the user space process freely below there.
However, that approach fails when we're running KVM inside of KVM. Here the 1st
level last_inst reader may well be in the same virtual page as a 2nd level
interrupt handler.
It also fails when running Mac OS X guests. Here we have a 4G/4G split, so a
kernel copy_from/to_user implementation can easily overlap with user space
addresses.
The architecturally correct way to fix this would be to implement an instruction
interpreter in KVM that kicks in whenever we go into split real mode. This
interpreter however would not receive a great amount of testing and be a lot of
bloat for a reasonably isolated corner case.
So I went back to the drawing board and tried to come up with a way to make
split real mode work with a single flat address space. And then I realized that
we could get away with the same trick that makes it work for Linux:
Whenever we see an instruction address during split real mode that may collide,
we just move it higher up the virtual address space to a place that hopefully
does not collide (keep your fingers crossed!).
That approach does work surprisingly well. I am able to successfully run
Mac OS X guests with KVM and QEMU (no split real mode hacks like MOL) when I
apply a tiny timing probe hack to QEMU. I'd say this is a win over even more
broken split real mode :).
Alexander Graf [Thu, 10 Jul 2014 17:22:03 +0000 (19:22 +0200)]
KVM: PPC: Book3S: Stop PTE lookup on write errors
When a page lookup failed because we're not allowed to write to the page, we
should not overwrite that value with another lookup on the second PTEG which
will return "page not found". Instead, we should just tell the caller that we
had a permission problem.
This fixes Mac OS X guests looping endlessly in page lookup code for me.
Alexander Graf [Thu, 10 Jul 2014 17:19:35 +0000 (19:19 +0200)]
KVM: PPC: Deflect page write faults properly in kvmppc_st
When we have a page that we're not allowed to write to, xlate() will already
tell us -EPERM on lookup of that page. With the code as is we change it into
a "page missing" error which a guest may get confused about. Instead, just
tell the caller about the -EPERM directly.
This fixes Mac OS X guests when run with DCBZ32 emulation.
Mihai Caraman [Fri, 4 Jul 2014 08:17:28 +0000 (11:17 +0300)]
KVM: PPC: e500: Emulate power management control SPR
For FSL e6500 core the kernel uses power management SPR register (PWRMGTCR0)
to enable idle power down for cores and devices by setting up the idle count
period at boot time. With the host already controlling the power management
configuration the guest could simply benefit from it, so emulate guest request
as a general store.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
Alexander Graf [Wed, 11 Jun 2014 08:37:52 +0000 (10:37 +0200)]
KVM: PPC: Book3S HV: Access XICS in BE
On the exit path from the guest we check what type of interrupt we received
if we received one. This means we're doing hardware access to the XICS interrupt
controller.
However, when running on a little endian system, this access is byte reversed.
So let's make sure to swizzle the bytes back again and virtually make XICS
accesses big endian.
Alexander Graf [Wed, 11 Jun 2014 08:36:17 +0000 (10:36 +0200)]
KVM: PPC: Book3S HV: Access host lppaca and shadow slb in BE
Some data structures are always stored in big endian. Among those are the LPPACA
fields as well as the shadow slb. These structures might be shared with a
hypervisor.
So whenever we access those fields, make sure we do so in big endian byte order.
Alexander Graf [Wed, 11 Jun 2014 08:07:40 +0000 (10:07 +0200)]
PPC: Add asm helpers for BE 32bit load/store
From assembly code we might not only have to explicitly BE access 64bit values,
but sometimes also 32bit ones. Add helpers that allow for easy use of lwzx/stwx
in their respective byte-reverse or native form.
Signed-off-by: Alexander Graf <agraf@suse.de> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Mihai Caraman [Mon, 30 Jun 2014 12:54:58 +0000 (15:54 +0300)]
KVM: PPC: e500: Fix default tlb for victim hint
Tlb search operation used for victim hint relies on the default tlb set by the
host. When hardware tablewalk support is enabled in the host, the default tlb is
TLB1 which leads KVM to evict the bolted entry. Set and restore the default tlb
when searching for victim hint.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> Reviewed-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
This adds support for the H_SET_MODE hcall. This hcall is a
multiplexer that has several functions, some of which are called
rarely, and some which are potentially called very frequently.
Here we add support for the functions that set the debug registers
CIABR (Completed Instruction Address Breakpoint Register) and
DAWR/DAWRX (Data Address Watchpoint Register and eXtension),
since they could be updated by the guest as often as every context
switch.
This also adds a kvmppc_power8_compatible() function to test to see
if a guest is compatible with POWER8 or not. The CIABR and DAWR/X
only exist on POWER8.
Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
Paul Mackerras [Mon, 2 Jun 2014 01:03:00 +0000 (11:03 +1000)]
KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled
This adds code to check that when the KVM_CAP_PPC_ENABLE_HCALL
capability is used to enable or disable in-kernel handling of an
hcall, that the hcall is actually implemented by the kernel.
If not an EINVAL error is returned.
This also checks the default-enabled list of hcalls and prints a
warning if any hcall there is not actually implemented.
Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
Paul Mackerras [Mon, 2 Jun 2014 01:02:59 +0000 (11:02 +1000)]
KVM: PPC: Book3S: Controls for in-kernel sPAPR hypercall handling
This provides a way for userspace controls which sPAPR hcalls get
handled in the kernel. Each hcall can be individually enabled or
disabled for in-kernel handling, except for H_RTAS. The exception
for H_RTAS is because userspace can already control whether
individual RTAS functions are handled in-kernel or not via the
KVM_PPC_RTAS_DEFINE_TOKEN ioctl, and because the numeric value for
H_RTAS is out of the normal sequence of hcall numbers.
Hcalls are enabled or disabled using the KVM_ENABLE_CAP ioctl for the
KVM_CAP_PPC_ENABLE_HCALL capability on the file descriptor for the VM.
The args field of the struct kvm_enable_cap specifies the hcall number
in args[0] and the enable/disable flag in args[1]; 0 means disable
in-kernel handling (so that the hcall will always cause an exit to
userspace) and 1 means enable. Enabling or disabling in-kernel
handling of an hcall is effective across the whole VM.
The ability for KVM_ENABLE_CAP to be used on a VM file descriptor
on PowerPC is new, added by this commit. The KVM_CAP_ENABLE_CAP_VM
capability advertises that this ability exists.
When a VM is created, an initial set of hcalls are enabled for
in-kernel handling. The set that is enabled is the set that have
an in-kernel implementation at this point. Any new hcall
implementations from this point onwards should not be added to the
default set without a good reason.
No distinction is made between real-mode and virtual-mode hcall
implementations; the one setting controls them both.
Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
Mihai Caraman [Wed, 18 Jun 2014 07:15:22 +0000 (10:15 +0300)]
KVM: PPC: e500mc: Enhance tlb invalidation condition on vcpu schedule
On vcpu schedule, the condition checked for tlb pollution is too loose.
The tlb entries of a vcpu become polluted (vs stale) only when a different
vcpu within the same logical partition runs in-between. Optimize the tlb
invalidation condition keeping last_vcpu per logical partition id.
With the new invalidation condition, a guest shows 4% performance improvement
on P5020DS while running a memory stress application with the cpu oversubscribed,
the other guest running a cpu intensive workload.
Guest - old invalidation condition
real 3.89
user 3.87
sys 0.01
Guest - enhanced invalidation condition
real 3.75
user 3.73
sys 0.01
Host
real 3.70
user 1.85
sys 0.00
The memory stress application accesses 4KB pages backed by 75% of available
TLB0 entries:
for (i = 0; i < ITERATIONS; i++)
for (j = 0; j < ENTRIES; j++)
bar = foo[j][0];
return 0;
}
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> Reviewed-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
Alexander Graf [Mon, 16 Jun 2014 12:37:53 +0000 (14:37 +0200)]
KVM: PPC: Book3S PR: Fix ABIv2 on LE
We switched to ABIv2 on Little Endian systems now which gets rid of the
dotted function names. Branch to the actual functions when we see such
a system.
Alexander Graf [Wed, 11 Jun 2014 15:13:55 +0000 (17:13 +0200)]
KVM: PPC: Book3s HV: Fix tlbie compile error
Some compilers complain about uninitialized variables in the compute_tlbie_rb
function. When you follow the code path you'll realize that we'll never get
to that point, but the compiler isn't all that smart.
So just default to 4k page sizes for everything, making the compiler happy
and the code slightly easier to read.
Signed-off-by: Alexander Graf <agraf@suse.de> Acked-by: Paul Mackerras <paulus@samba.org>
Alexander Graf [Sun, 8 Jun 2014 23:16:32 +0000 (01:16 +0200)]
KVM: PPC: Book3s PR: Disable AIL mode with OPAL
When we're using PR KVM we must not allow the CPU to take interrupts
in virtual mode, as the SLB does not contain host kernel mappings
when running inside the guest context.
To make sure we get good performance for non-KVM tasks but still
properly functioning PR KVM, let's just disable AIL whenever a vcpu
is scheduled in.
This is fundamentally different from how we deal with AIL on pSeries
type machines where we disable AIL for the whole machine as soon as
a single KVM VM is up.
The reason for that is easy - on pSeries we do not have control over
per-cpu configuration of AIL. We also don't want to mess with CPU hotplug
races and AIL configuration, so setting it per CPU is easier and more
flexible.
This patch fixes running PR KVM on POWER8 bare metal for me.
Signed-off-by: Alexander Graf <agraf@suse.de> Acked-by: Paul Mackerras <paulus@samba.org>
virtual time base register is a per VM, per cpu register that needs
to be saved and restored on vm exit and entry. Writing to VTB is not
allowed in the privileged mode.
KVM: PPC: BOOK3S: PR: Fix PURR and SPURR emulation
We use time base for PURR and SPURR emulation with PR KVM since we
are emulating a single threaded core. When using time base
we need to make sure that we don't accumulate time spent in the host
in PURR and SPURR value.
Also we don't need to emulate mtspr because both the registers are
hypervisor resource.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
Jan Kiszka [Sun, 29 Jun 2014 15:12:43 +0000 (17:12 +0200)]
KVM: SVM: Fix CPL export via SS.DPL
We import the CPL via SS.DPL since ae9fedc793. However, we fail to
export it this way so far. This caused spurious guest crashes, e.g. of
Linux when accessing the vmport from guest user space which triggered
register saving/restoring to/from host user space.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Heiko Carstens [Thu, 5 Jun 2014 11:22:49 +0000 (13:22 +0200)]
KVM: s390: add sie.h uapi header file to Kbuild and remove header dependency
sie.h was missing in arch/s390/include/uapi/asm/Kbuild and therefore missed
the "make headers_check" target.
If added it reveals that also arch/s390/include/asm/sigp.h would become uapi.
This is something we certainly do not want. So remove that dependency as well.
The header file was merged with ceae283bb2e0176c "KVM: s390: add sie exit
reasons tables", therefore we never had a kernel release with this commit and
can still change anything.
Nadav Amit [Wed, 18 Jun 2014 14:19:26 +0000 (17:19 +0300)]
KVM: vmx: vmx instructions handling does not consider cs.l
VMX instructions use 32-bit operands in 32-bit mode, and 64-bit operands in
64-bit mode. The current implementation is broken since it does not use the
register operands correctly, and always uses 64-bit for reads and writes.
Moreover, write to memory in vmwrite only considers long-mode, so it ignores
cs.l. This patch fixes this behavior.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Wed, 18 Jun 2014 14:19:25 +0000 (17:19 +0300)]
KVM: vmx: handle_cr ignores 32/64-bit mode
On 32-bit mode only bits [31:0] of the CR should be used for setting the CR
value. Otherwise, the host may incorrectly assume the value is invalid if bits
[63:32] are not zero. Moreover, the CR is currently being read twice when CR8
is used. Last, nested mov-cr exiting is modified to handle the CR value
correctly as well.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Wed, 18 Jun 2014 14:19:24 +0000 (17:19 +0300)]
KVM: x86: Hypercall handling does not considers opsize correctly
Currently, the hypercall handling routine only considers LME as an indication
to whether the guest uses 32/64-bit mode. This is incosistent with hyperv
hypercalls handling and against the common sense of considering cs.l as well.
This patch uses is_64_bit_mode instead of is_long_mode for that matter. In
addition, the result is masked in respect to the guest execution mode. Last, it
changes kvm_hv_hypercall to use is_64_bit_mode as well to simplify the code.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Wed, 18 Jun 2014 14:19:23 +0000 (17:19 +0300)]
KVM: x86: check DR6/7 high-bits are clear only on long-mode
When the guest sets DR6 and DR7, KVM asserts the high 32-bits are clear, and
otherwise injects a #GP exception. This exception should only be injected only
if running in long-mode.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jan Kiszka [Mon, 16 Jun 2014 11:59:43 +0000 (13:59 +0200)]
KVM: nVMX: Allow to disable VM_{ENTRY_LOAD,EXIT_SAVE}_DEBUG_CONTROLS
Allow L1 to "leak" its debug controls into L2, i.e. permit cleared
VM_{ENTRY_LOAD,EXIT_SAVE}_DEBUG_CONTROLS. This requires to manually
transfer the state of DR7 and IA32_DEBUGCTLMSR from L1 into L2 as both
run on different VMCS.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jan Kiszka [Mon, 16 Jun 2014 11:59:41 +0000 (13:59 +0200)]
KVM: nVMX: Allow to disable CR3 access interception
We already have this control enabled by exposing a broken
MSR_IA32_VMX_PROCBASED_CTLS value. This will properly advertise our
capability once the value is fixed by clearing the right bits in
MSR_IA32_VMX_TRUE_PROCBASED_CTLS. We also have to ensure to test the
right value on L2 entry.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jan Kiszka [Mon, 16 Jun 2014 11:59:40 +0000 (13:59 +0200)]
KVM: nVMX: Advertise support for MSR_IA32_VMX_TRUE_*_CTLS
We already implemented them but failed to advertise them. Currently they
all return the identical values to the capability MSRs they are
augmenting. So there is no change in exposed features yet.
Drop related comments at this chance that are partially incorrect and
redundant anyway.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Jan Kiszka [Thu, 12 Jun 2014 17:40:32 +0000 (19:40 +0200)]
KVM: x86: Fix constant value of VM_{EXIT_SAVE,ENTRY_LOAD}_DEBUG_CONTROLS
The spec says those controls are at bit position 2 - makes 4 as value.
The impact of this mistake is effectively zero as we only use them to
ensure that these features are set at position 2 (or, previously, 1) in
MSR_IA32_VMX_{EXIT,ENTRY}_CTLS - which is and will be always true
according to the spec.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Sun, 15 Jun 2014 13:13:00 +0000 (16:13 +0300)]
KVM: x86: emulation of dword cmov on long-mode should clear [63:32]
Even if the condition of cmov is not satisfied, bits[63:32] should be cleared.
This is clearly stated in Intel's CMOVcc documentation. The solution is to
reassign the destination onto itself if the condition is unsatisfied. For that
matter the original destination value needs to be read.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Sun, 15 Jun 2014 13:12:59 +0000 (16:12 +0300)]
KVM: x86: Inter-privilege level ret emulation is not implemeneted
Return unhandlable error on inter-privilege level ret instruction. This is
since the current emulation does not check the privilege level correctly when
loading the CS, and does not pop RSP/SS as needed.
Cc: stable@vger.kernel.org Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Sun, 15 Jun 2014 13:12:58 +0000 (16:12 +0300)]
KVM: x86: Wrong emulation on 'xadd X, X'
The emulator does not emulate the xadd instruction correctly if the two
operands are the same. In this (unlikely) situation the result should be the
sum of X and X (2X) when it is currently X. The solution is to first perform
writeback to the source, before writing to the destination. The only
instruction which should be affected is xadd, as the other instructions that
perform writeback to the source use the extended accumlator (e.g., RAX:RDX).
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Mon, 2 Jun 2014 15:34:11 +0000 (18:34 +0300)]
KVM: x86: smsw emulation is incorrect in 64-bit mode
In 64-bit mode, when the destination is a register, the assignment is done
according to the operand size. Otherwise (memory operand or no 64-bit mode), a
16-bit assignment is performed.
Currently, 16-bit assignment is always done to the destination.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Mon, 2 Jun 2014 15:34:09 +0000 (18:34 +0300)]
KVM: x86: rdpmc emulation checks the counter incorrectly
The rdpmc emulation checks that the counter (ECX) is not higher than 2, without
taking into considerations bits 30:31 role (e.g., bit 30 marks whether the
counter is fixed). The fix uses the pmu information for checking the validity
of the pmu counter.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Mon, 2 Jun 2014 15:34:07 +0000 (18:34 +0300)]
KVM: x86: cmpxchg emulation should compare in reverse order
The current implementation of cmpxchg does not update the flags correctly,
since the accumulator should be compared with the destination and not the other
way around. The current implementation does not update the flags correctly.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>