Alexander Graf [Tue, 29 Apr 2014 14:48:44 +0000 (16:48 +0200)]
KVM: PPC: Book3S PR: Handle Facility interrupt and FSCR
POWER8 introduced a new interrupt type called "Facility unavailable interrupt"
which contains its status message in a new register called FSCR.
Handle these exits and try to emulate instructions for unhandled facilities.
Follow-on patches enable KVM to expose specific facilities into the guest.
Alexander Graf [Fri, 25 Apr 2014 14:07:21 +0000 (16:07 +0200)]
KVM: PPC: Book3S PR: Emulate TIR register
In parallel to the Processor ID Register (PIR) threaded POWER8 also adds a
Thread ID Register (TIR). Since PR KVM doesn't emulate more than one thread
per core, we can just always expose 0 here.
Alexander Graf [Thu, 24 Apr 2014 11:52:01 +0000 (13:52 +0200)]
KVM: PPC: Book3S PR: Do dcbz32 patching with big endian instructions
When the host CPU we're running on doesn't support dcbz32 itself, but the
guest wants to have dcbz only clear 32 bytes of data, we loop through every
executable mapped page to search for dcbz instructions and patch them with
a special privileged instruction that we emulate as dcbz32.
The only guests that want to see dcbz act as 32byte are book3s_32 guests, so
we don't have to worry about little endian instruction ordering. So let's
just always search for big endian dcbz instructions, also when we're on a
little endian host.
Alexander Graf [Thu, 24 Apr 2014 11:46:24 +0000 (13:46 +0200)]
KVM: PPC: Make shared struct aka magic page guest endian
The shared (magic) page is a data structure that contains often used
supervisor privileged SPRs accessible via memory to the user to reduce
the number of exits we have to take to read/write them.
When we actually share this structure with the guest we have to maintain
it in guest endianness, because some of the patch tricks only work with
native endian load/store operations.
Since we only share the structure with either host or guest in little
endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv.
For booke, the shared struct stays big endian. For book3s_64 hv we maintain
the struct in host native endian, since it never gets shared with the guest.
For book3s_64 pr we introduce a variable that tells us which endianness the
shared struct is in and route every access to it through helper inline
functions that evaluate this variable.
Alexander Graf [Thu, 24 Apr 2014 11:39:16 +0000 (13:39 +0200)]
KVM: PPC: PR: Fill pvinfo hcall instructions in big endian
We expose a blob of hypercall instructions to user space that it gives to
the guest via device tree again. That blob should contain a stream of
instructions necessary to do a hypercall in big endian, as it just gets
passed into the guest and old guests use them straight away.
Alexander Graf [Thu, 24 Apr 2014 11:10:33 +0000 (13:10 +0200)]
KVM: PPC: Book3S PR: PAPR: Access RTAS in big endian
When the guest does an RTAS hypercall it keeps all RTAS variables inside a
big endian data structure.
To make sure we don't have to bother about endianness inside the actual RTAS
handlers, let's just convert the whole structure to host endian before we
call our RTAS handlers and back to big endian when we return to the guest.
Alexander Graf [Thu, 24 Apr 2014 11:09:15 +0000 (13:09 +0200)]
KVM: PPC: Book3S PR: PAPR: Access HTAB in big endian
The HTAB on PPC is always in big endian. When we access it via hypercalls
on behalf of the guest and we're running on a little endian host, we need
to make sure we swap the bits accordingly.
Alexander Graf [Thu, 24 Apr 2014 10:57:11 +0000 (12:57 +0200)]
KVM: PPC: Book3S_64 PR: Access shadow slb in big endian
The "shadow SLB" in the PACA is shared with the hypervisor, so it has to
be big endian. We access the shadow SLB during world switch, so let's make
sure we access it in big endian even when we're on a little endian host.
Alexander Graf [Thu, 24 Apr 2014 10:54:54 +0000 (12:54 +0200)]
KVM: PPC: Book3S_64 PR: Access HTAB in big endian
The HTAB is always big endian. We access the guest's HTAB using
copy_from/to_user, but don't yet take care of the fact that we might
be running on an LE host.
Wrap all accesses to the guest HTAB with big endian accessors.
Alexander Graf [Thu, 24 Apr 2014 10:51:44 +0000 (12:51 +0200)]
KVM: PPC: Book3S_32: PR: Access HTAB in big endian
The HTAB is always big endian. We access the guest's HTAB using
copy_from/to_user, but don't yet take care of the fact that we might
be running on an LE host.
Wrap all accesses to the guest HTAB with big endian accessors.
Alexander Graf [Thu, 24 Apr 2014 10:48:19 +0000 (12:48 +0200)]
KVM: PPC: Book3S: PR: Fix C/R bit setting
Commit 9308ab8e2d made C/R HTAB updates go byte-wise into the target HTAB.
However, it didn't update the guest's copy of the HTAB, but instead the
host local copy of it.
Write to the guest's HTAB instead.
Signed-off-by: Alexander Graf <agraf@suse.de> CC: Paul Mackerras <paulus@samba.org> Acked-by: Paul Mackerras <paulus@samba.org>
KVM: PPC: BOOK3S: PR: Fix WARN_ON with debug options on
With debug option "sleep inside atomic section checking" enabled we get
the below WARN_ON during a PR KVM boot. This is because upstream now
have PREEMPT_COUNT enabled even if we have preempt disabled. Fix the
warning by adding preempt_disable/enable around floating point and altivec
enable.
Alexander Graf [Thu, 17 Apr 2014 11:25:33 +0000 (13:25 +0200)]
KVM: PPC: E500: Add dcbtls emulation
The dcbtls instruction is able to lock data inside the L1 cache.
We don't want to give the guest actual access to hardware cache locks,
as that could influence other VMs on the same system. But we can tell
the guest that its locking attempt failed.
By implementing the instruction we at least don't give the guest a
program exception which it definitely does not expect.
Alexander Graf [Thu, 17 Apr 2014 10:53:13 +0000 (12:53 +0200)]
KVM: PPC: E500: Ignore L1CSR1_ICFI,ICLFR
The L1 instruction cache control register contains bits that indicate
that we're still handling a request. Mask those out when we set the SPR
so that a read doesn't assume we're still doing something.
Dave Hansen [Fri, 16 May 2014 19:45:15 +0000 (12:45 -0700)]
x86: fix page fault tracing when KVM guest support enabled
I noticed on some of my systems that page fault tracing doesn't
work:
cd /sys/kernel/debug/tracing
echo 1 > events/exceptions/enable
cat trace;
# nothing shows up
I eventually traced it down to CONFIG_KVM_GUEST. At least in a
KVM VM, enabling that option breaks page fault tracing, and
disabling fixes it. I tried on some old kernels and this does
not appear to be a regression: it never worked.
There are two page-fault entry functions today. One when tracing
is on and another when it is off. The KVM code calls do_page_fault()
directly instead of calling the traced version:
I'm also having problems with the page fault tracing on bare
metal (same symptom of no trace output). I'm unsure if it's
related.
Steven had an alternative to this which has zero overhead when
tracing is off where this includes the standard noops even when
tracing is disabled. I'm unconvinced that the extra complexity
of his apporach:
Paolo Bonzini [Wed, 14 May 2014 07:39:49 +0000 (09:39 +0200)]
KVM: x86: get CPL from SS.DPL
CS.RPL is not equal to the CPL in the few instructions between
setting CR0.PE and reloading CS. And CS.DPL is also not equal
to the CPL for conforming code segments.
However, SS.DPL *is* always equal to the CPL except for the weird
case of SYSRET on AMD processors, which sets SS.DPL=SS.RPL from the
value in the STAR MSR, but force CPL=3 (Intel instead forces
SS.DPL=SS.RPL=CPL=3).
So this patch:
- modifies SVM to update the CPL from SS.DPL rather than CS.RPL;
the above case with SYSRET is not broken further, and the way
to fix it would be to pass the CPL to userspace and back
- modifies VMX to always return the CPL from SS.DPL (except
forcing it to 0 if we are emulating real mode via vm86 mode;
in vm86 mode all DPLs have to be 3, but real mode does allow
privileged instructions). It also removes the CPL cache,
which becomes a duplicate of the SS access rights cache.
This fixes doing KVM_IOCTL_SET_SREGS exactly after setting
CR0.PE=1 but before CS has been reloaded.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 15 May 2014 15:56:57 +0000 (17:56 +0200)]
KVM: x86: use new CS.RPL as CPL during task switch
During task switch, all of CS.DPL, CS.RPL, SS.DPL must match (in addition
to all the other requirements) and will be the new CPL. So far this
worked by carefully setting the CS selector and flag before doing the
task switch; setting CS.selector will already change the CPL.
However, this will not work once we get the CPL from SS.DPL, because
then you will have to set the full segment descriptor cache to change
the CPL. ctxt->ops->cpl(ctxt) will then return the old CPL during the
task switch, and the check that SS.DPL == CPL will fail.
Temporarily assume that the CPL comes from CS.RPL during task switch
to a protected-mode task. This is the same approach used in QEMU's
emulation code, which (until version 2.0) manually tracks the CPL.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Fri, 16 May 2014 21:02:40 +0000 (23:02 +0200)]
Merge tag 'kvm-s390-20140516' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-next
1. Correct locking for lazy storage key handling
A test loop with multiple CPUs triggered a race in the lazy storage
key handling as introduced by commit 934bc131efc3e4be6a52f7dd6c4dbf
(KVM: s390: Allow skeys to be enabled for the current process). This
race should not happen with Linux guests, but let's fix it anyway.
Patch touches !/kvm/ code, but is from the s390 maintainer.
2. Better handling of broken guests
If we detect a program check loop we stop the guest instead of
wasting CPU cycles.
3. Better handling on MVPG emulation
The move page handling is improved to be architecturally correct.
3. Trace point rework
Let's rework the kvm trace points to have a common header file (for
later perf usage) and provided a table based instruction decoder.
4. Interpretive execution of SIGP external call
Let the hardware handle most cases of SIGP external call (IPI) and
wire up the fixup code for the corner cases.
5. Initial preparations for the IBC facility
Prepare the code to handle instruction blocking
KVM: s390: interpretive execution of SIGP EXTERNAL CALL
If the sigp interpretation facility is installed, most SIGP EXTERNAL CALL
operations will be interpreted instead of intercepted. A partial execution
interception will occurr at the sending cpu only if the target cpu is in the
wait state ("W" bit in the cpuflags set). Instruction interception will only
happen in error cases (e.g. cpu addr invalid).
As a sending cpu might set the external call interrupt pending flags at the
target cpu at every point in time, we can't handle this kind of interrupt using
our kvm interrupt injection mechanism. The injection will be done automatically
by the SIE when preparing the start of the target cpu.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> CC: Thomas Huth <thuth@linux.vnet.ibm.com>
[Adopt external call injection to check for sigp interpretion] Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: Use intercept_insn decoder in trace event
The current trace definition doesn't work very well with the perf tool.
Perf shows a "insn_to_mnemonic not found" message. Let's handle the
decoding completely in a parseable format.
Signed-off-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
This patch defines tables of reasons for exiting from SIE mode
in a new sie.h header file. Tables contain SIE intercepted codes,
intercepted instructions and program interruptions codes.
Signed-off-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Use the new helper function kvm_arch_fault_in_page() for faulting-in
the guest pages and only inject addressing errors when we've really
hit a bad address (and return other error codes to userspace instead).
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Thomas Huth [Tue, 6 May 2014 15:20:16 +0000 (17:20 +0200)]
KVM: s390: Introduce helper function for faulting-in a guest page
Rework the function kvm_arch_fault_in_sync() to become a proper helper
function for faulting-in a guest page. Now it takes the guest address as
a parameter and does not ignore the possible error code from gmap_fault()
anymore (which could cause undetected error conditions before).
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Thomas Huth [Thu, 17 Apr 2014 07:57:10 +0000 (09:57 +0200)]
KVM: s390: Avoid endless loops of specification exceptions
If the new PSW for program interrupts is invalid, the VM ends up
in an endless loop of specification exceptions. Since there is not
much left we can do in this case, we should better drop to userspace
instead so that the crash can be reported to the user.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Thomas Huth [Thu, 17 Apr 2014 07:10:40 +0000 (09:10 +0200)]
KVM: s390: Improve is_valid_psw()
As a program status word is also invalid (and thus generates an
specification exception) if the instruction address is not even,
we should test this in is_valid_psw(), too. This patch also exports
the function so that it becomes available for other parts of the
S390 KVM code as well.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Use the mm semaphore to serialize multiple invocations of s390_enable_skey.
The second CPU faulting on a storage key operation needs to wait for the
completion of the page table update. Taking the mm semaphore writable
has the positive side-effect that it prevents any host faults from
taking place which does have implications on keys vs PGSTE.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
kvm: x86: emulate monitor and mwait instructions as nop
Treat monitor and mwait instructions as nop, which is architecturally
correct (but inefficient) behavior. We do this to prevent misbehaving
guests (e.g. OS X <= 10.7) from crashing after they fail to check for
monitor/mwait availability via cpuid.
Since mwait-based idle loops relying on these nop-emulated instructions
would keep the host CPU pegged at 100%, do NOT advertise their presence
via cpuid, to prevent compliant guests from using them inadvertently.
Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It seems that it's easy to implement the EOI assist
on top of the PV EOI feature: simply convert the
page address to the format expected by PV EOI.
Notes:
-"No EOI required" is set only if interrupt injected
is edge triggered; this is true because level interrupts are going
through IOAPIC which disables PV EOI.
In any case, if guest triggers EOI the bit will get cleared on exit.
-For migration, set of HV_X64_MSR_APIC_ASSIST_PAGE sets
KVM_PV_EOI_EN internally, so restoring HV_X64_MSR_APIC_ASSIST_PAGE
seems sufficient
In any case, bit is cleared on exit so worst case it's never re-enabled
-no handling of PV EOI data is performed at HV_X64_MSR_EOI write;
HV_X64_MSR_EOI is a separate optimization - it's an X2APIC
replacement that lets you do EOI with an MSR and not IO.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Wed, 7 May 2014 12:32:50 +0000 (15:32 +0300)]
KVM: x86: Mark bit 7 in long-mode PDPTE according to 1GB pages support
In long-mode, bit 7 in the PDPTE is not reserved only if 1GB pages are
supported by the CPU. Currently the bit is considered by KVM as always
reserved.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nadav Amit [Wed, 7 May 2014 12:32:49 +0000 (15:32 +0300)]
KVM: vmx: handle_dr does not handle RSP correctly
The RSP register is not automatically cached, causing mov DR instruction with
RSP to fail. Instead the regular register accessing interface should be used.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Bandan Das [Tue, 6 May 2014 06:19:18 +0000 (02:19 -0400)]
KVM: nVMX: move vmclear and vmptrld pre-checks to nested_vmx_check_vmptr
Some checks are common to all, and moreover,
according to the spec, the check for whether any bits
beyond the physical address width are set are also
applicable to all of them
Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Bandan Das [Tue, 6 May 2014 06:19:17 +0000 (02:19 -0400)]
KVM: nVMX: fail on invalid vmclear/vmptrld pointer
The spec mandates that if the vmptrld or vmclear
address is equal to the vmxon region pointer, the
instruction should fail with error "VMPTRLD with
VMXON pointer" or "VMCLEAR with VMXON pointer"
Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Bandan Das [Tue, 6 May 2014 06:19:16 +0000 (02:19 -0400)]
KVM: nVMX: additional checks on vmxon region
Currently, the vmxon region isn't used in the nested case.
However, according to the spec, the vmxon instruction performs
additional sanity checks on this region and the associated
pointer. Modify emulated vmxon to better adhere to the spec
requirements
Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 6 May 2014 15:20:37 +0000 (17:20 +0200)]
Merge tag 'kvm-s390-20140506' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-next
1. Fixes an error return code for the breakpoint setup
2. External interrupt fixes
2.1. Some interrupt conditions like cpu timer or clock comparator
stay pending even after the interrupt is injected. If the external
new PSW is enabled for interrupts this will result in an endless
loop. Usually this indicates a programming error in the guest OS.
Lets detect such situations and go to userspace. We will provide
a QEMU patch that sets the guest in panicked/crashed state to avoid
wasting CPU cycles.
2.2 Resend external interrupts back to the guest if the HW could
not do it.
-
Thomas Huth [Wed, 15 Jan 2014 15:46:07 +0000 (16:46 +0100)]
KVM: s390: Fix external interrupt interception
The external interrupt interception can only occur in rare cases, e.g.
when the PSW of the interrupt handler has a bad value. The old handler
for this interception simply ignored these events (except for increasing
the exit_external_interrupt counter), but for proper operation we either
have to inject the interrupts manually or we should drop to userspace in
case of errors.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com> Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Dan Carpenter [Sat, 3 May 2014 20:18:11 +0000 (23:18 +0300)]
KVM: s390: return -EFAULT if copy_from_user() fails
When copy_from_user() fails, this code returns the number of bytes
remaining instead of a negative error code. The positive number is
returned to the user but otherwise it is harmless.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: x86: improve the usability of the 'kvm_pio' tracepoint
This patch moves the 'kvm_pio' tracepoint to emulator_pio_in_emulated()
and emulator_pio_out_emulated(), and it adds an argument (a pointer to
the 'pio_data'). A single 8-bit or 16-bit or 32-bit data item is fetched
from 'pio_data' (depending on 'size'), and the value is included in the
trace record ('val'). If 'count' is greater than one, this is indicated
by the string "(...)" in the trace output.
When starting lots of dataplane devices the bootup takes very long on
Christian's s390 with irqfd patches. With larger setups he is even
able to trigger some timeouts in some components. Turns out that the
KVM_SET_GSI_ROUTING ioctl takes very long (strace claims up to 0.1 sec)
when having multiple CPUs. This is caused by the synchronize_rcu and
the HZ=100 of s390. By changing the code to use a private srcu we can
speed things up. This patch reduces the boot time till mounting root
from 8 to 2 seconds on my s390 guest with 100 disks.
Uses of hlist_for_each_entry_rcu, hlist_add_head_rcu, hlist_del_init_rcu
are fine because they do not have lockdep checks (hlist_for_each_entry_rcu
uses rcu_dereference_raw rather than rcu_dereference, and write-sides
do not do rcu lockdep at all).
Note that we're hardly relying on the "sleepable" part of srcu. We just
want SRCU's faster detection of grace periods.
Testing was done by Andrew Theurer using netperf tests STREAM, MAERTS
and RR. The difference between results "before" and "after" the patch
has mean -0.2% and standard deviation 0.6%. Using a paired t-test on the
data points says that there is a 2.5% probability that the patch is the
cause of the performance difference (rather than a random fluctuation).
(Restricting the t-test to RR, which is the most likely to be affected,
changes the numbers to respectively -0.3% mean, 0.7% stdev, and 8%
probability that the numbers actually say something about the patch.
The probability increases mostly because there are fewer data points).
Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # s390 Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 30 Apr 2014 10:29:41 +0000 (12:29 +0200)]
Merge tag 'kvm-s390-20140429' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-next
1. Guest handling fixes
The handling of MVPG, PFMF and Test Block is fixed to better follow
the architecture. None of these fixes is critical for any current
Linux guests, but let's play safe.
2. Optimization for single CPU guests
We can enable the IBS facility if only one VCPU is running (!STOPPED
state). We also enable this optimization for guest > 1 VCPU as soon
as all but one VCPU is in stopped state. Thus will help guests that
have tools like cpuplugd (from s390-utils) that do dynamic offline/
online of CPUs.
3. NOTES
There is one non-s390 change in include/linux/kvm_host.h that
introduces 2 defines for VCPU requests:
define KVM_REQ_ENABLE_IBS 23
define KVM_REQ_DISABLE_IBS 24
This patch enables the IBS facility when a single VCPU is running.
The facility is dynamically turned on/off as soon as other VCPUs
enter/leave the stopped state.
When this facility is operating, some instructions can be executed
faster for single-cpu guests.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
This patch introduces two new functions to set/clear the CPUSTAT_STOPPED bit and
makes use of it at all applicable places. These functions prepare the additional
execution of code when starting/stopping a vcpu.
The CPUSTAT_STOPPED bit should not be touched outside of these functions.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Thomas Huth [Mon, 9 Sep 2013 15:58:38 +0000 (17:58 +0200)]
KVM: s390: Fixes for PFMF
Add a check for low-address protection to the PFMF handler and
convert real-addresses to absolute if necessary, as it is defined
in the Principles of Operations specification.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Thomas Huth [Mon, 3 Mar 2014 22:34:42 +0000 (23:34 +0100)]
KVM: s390: Add a function for checking the low-address protection
The s390 architecture has a special protection mechanism that can
be used to prevent write access to the vital data in the low-core
memory area. This patch adds a new helper function that can be used
to check for such write accesses and in case of protection, it also
sets up the exception data accordingly.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
When the guest executes the MVPG instruction with DAT disabled,
and the source or destination page is not mapped in the host,
the so-called partial execution interception occurs. We need to
handle this event by setting up a mapping for the corresponding
user pages.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: async_pf: change async_pf_execute() to use get_user_pages(tsk => NULL)
async_pf_execute() passes tsk == current to gup(), this is doesn't
hurt but unnecessary and misleading. "tsk" is only used to account
the number of faults and current is the random workqueue thread.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
Now we can flush all the TLBs out of the mmu lock without TLB corruption when
write-proect the sptes, it is because:
- we have marked large sptes readonly instead of dropping them that means we
just change the spte from writable to readonly so that we only need to care
the case of changing spte from present to present (changing the spte from
present to nonpresent will flush all the TLBs immediately), in other words,
the only case we need to care is mmu_spte_update()
- in mmu_spte_update(), we haved checked
SPTE_HOST_WRITEABLE | PTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, that
means it does not depend on PT_WRITABLE_MASK anymore
KVM: MMU: flush tlb if the spte can be locklessly modified
Relax the tlb flush condition since we will write-protect the spte out of mmu
lock. Note lockless write-protection only marks the writable spte to readonly
and the spte can be writable only if both SPTE_HOST_WRITEABLE and
SPTE_MMU_WRITEABLE are set (that are tested by spte_is_locklessly_modifiable)
Currently, kvm zaps the large spte if write-protected is needed, the later
read can fault on that spte. Actually, we can make the large spte readonly
instead of making them un-present, the page fault caused by read access can
be avoided
The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter. This removes the need for the return value.
This version has fixed the issue reported in 6b73a9606, the reason of that
issue is that fast_page_fault() directly sets the readonly large spte to
writable but only dirty the first page into the dirty-bitmap that means
other pages are missed. Fixed it by only the normal sptes (on the
PT_PAGE_TABLE_LEVEL level) can be fast fixed
KVM: MMU: properly check last spte in fast_page_fault()
Using sp->role.level instead of @level since @level is not got from the
page table hierarchy
There is no issue in current code since the fast page fault currently only
fixes the fault caused by dirty-log that is always on the last level
(level = 1)
This patch makes the code more readable and avoids potential issue in the
further development
Nadav Amit [Fri, 18 Apr 2014 00:35:10 +0000 (03:35 +0300)]
KVM: x86: IN instruction emulation should ignore REP-prefix
The IN instruction is not be affected by REP-prefix as INS is. Therefore, the
emulation should ignore the REP prefix as well. The current emulator
implementation tries to perform writeback when IN instruction with REP-prefix
is emulated. This causes it to perform wrong memory write or spurious #GP
exception to be injected to the guest.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Nadav Amit [Fri, 18 Apr 2014 00:35:09 +0000 (03:35 +0300)]
KVM: x86: Fix CR3 reserved bits
According to Intel specifications, PAE and non-PAE does not have any reserved
bits. In long-mode, regardless to PCIDE, only the high bits (above the
physical address) are reserved.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Nadav Amit [Fri, 18 Apr 2014 00:35:08 +0000 (03:35 +0300)]
KVM: x86: Fix wrong/stuck PMU when guest does not use PMI
If a guest enables a performance counter but does not enable PMI, the
hypervisor currently does not reprogram the performance counter once it
overflows. As a result the host performance counter is kept with the original
sampling period which was configured according to the value of the guest's
counter when the counter was enabled.
Such behaviour can cause very bad consequences. The most distrubing one can
cause the guest not to make any progress at all, and keep exiting due to host
PMI before any guest instructions is exeucted. This situation occurs when the
performance counter holds a very high value when the guest enables the
performance counter. As a result the host's sampling period is configured to be
very short. The host then never reconfigures the sampling period and get stuck
at entry->PMI->exit loop. We encountered such a scenario in our experiments.
The solution is to reprogram the counter even if the guest does not use PMI.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Bandan Das [Sat, 19 Apr 2014 22:17:45 +0000 (18:17 -0400)]
KVM: nVMX: Ack and write vector info to intr_info if L1 asks us to
This feature emulates the "Acknowledge interrupt on exit" behavior.
We can safely emulate it for L1 to run L2 even if L0 itself has it
disabled (to run L1).
Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Bandan Das [Sat, 19 Apr 2014 22:17:44 +0000 (18:17 -0400)]
KVM: nVMX: Don't advertise single context invalidation for invept
For single context invalidation, we fall through to global
invalidation in handle_invept() except for one case - when
the operand supplied by L1 is different from what we have in
vmcs12. However, typically hypervisors will only call invept
for the currently loaded eptp, so the condition will
never be true.
Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Merge tag 'kvm-s390-20140422' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into queue
Lazy storage key handling
-------------------------
Linux does not use the ACC and F bits of the storage key. Newer Linux
versions also do not use the storage keys for dirty and reference
tracking. We can optimize the guest handling for those guests for faults
as well as page-in and page-out by simply not caring about the guest
visible storage key. We trap guest storage key instruction to enable
those keys only on demand.
Migration bitmap
Until now s390 never provided a proper dirty bitmap. Let's provide a
proper migration bitmap for s390. We also change the user dirty tracking
to a fault based mechanism. This makes the host completely independent
from the storage keys. Long term this will allow us to back guest memory
with large pages.
per-VM device attributes
------------------------
To avoid the introduction of new ioctls, let's provide the
attribute semanantic also on the VM-"device".
Userspace controlled CMMA
-------------------------
The CMMA assist is changed from "always on" to "on if requested" via
per-VM device attributes. In addition a callback to reset all usage
states is provided.
Proper guest DAT handling for intercepts
----------------------------------------
While instructions handled by SIE take care of all addressing aspects,
KVM/s390 currently does not care about guest address translation of
intercepts. This worked out fine, because
- the s390 Linux kernel has a 1:1 mapping between kernel virtual<->real
for all pages up to memory size
- intercepts happen only for a small amount of cases
- all of these intercepts happen to be in the kernel text for current
distros
Of course we need to be better for other intercepts, kernel modules etc.
We provide the infrastructure and rework all in-kernel intercepts to work
on logical addresses (paging etc) instead of real ones. The code has
been running internally for several months now, so it is time for going
public.
GDB support
-----------
We provide breakpoints, single stepping and watchpoints.
Fixes/Cleanups
--------------
- Improve program check delivery
- Factor out the handling of transactional memory on program checks
- Use the existing define __LC_PGM_TDB
- Several cleanups in the lowcore structure
- Documentation
NOTES
-----
- All patches touching base s390 are either ACKed or written by the s390
maintainers
- One base KVM patch "KVM: add kvm_is_error_gpa() helper"
- One patch introduces the notion of VM device attributes
Michael Mueller [Thu, 13 Mar 2014 11:16:45 +0000 (12:16 +0100)]
KVM: s390: Factor out handle_itdb to handle TX aborts
Factor out the new function handle_itdb(), which copies the ITDB into
guest lowcore to fully handle a TX abort.
Signed-off-by: Michael Mueller <mimu@linux.vnet.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: no timer interrupts when single-stepping a guest
When a guest is single-stepped, we want to disable timer interrupts. Otherwise,
the guest will continuously execute the external interrupt handler and make
debugging of code where timer interrupts are enabled almost impossible.
The delivery of timer interrupts can be enforced in such sections by setting a
breakpoint and continuing execution.
In order to disable timer interrupts, they are disabled in the control register
of the guest just before SIE entry and are suppressed in the interrupt
check/delivery methods.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: move timer interrupt checks into own functions
This patch moves the checks for enabled timer (clock-comparator) interrupts and pending
timer interrupts into own functions, making the code better readable and easier to
maintain.
The method kvm_cpu_has_pending_timer is filled with life.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
This patch adds support to debug the guest using the PER facility on s390.
Single-stepping, hardware breakpoints and hardware watchpoints are supported. In
order to use the PER facility of the guest without it noticing it, the control
registers of the guest have to be patched and access to them has to be
intercepted(stctl, stctg, lctl, lctlg).
All PER program interrupts have to be intercepted and only the relevant PER
interrupts for the guest have to be given back. Special care has to be taken
about repeated exits on the same hardware breakpoint. The intervention of the
host in the guests PER configuration is not fully transparent. PER instruction
nullification can not be used by the guest and too many storage alteration
events may be reported to the guest (if it is activated for special address
ranges only) when the host concurrently debugging it.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: kernel header addition for guest debugging
This patch adds the structs to the kernel headers needed to pass information
from/to userspace in order to debug a guest on s390 with hardware support.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: deliver program irq parameters and use correct ilc
When a program interrupt was to be delivered until now, no program interrupt
parameters were stored in the low-core of the target vcpu.
This patch enables the delivery of those program interrupt parameters, takes
care of concurrent PER events which can be injected in addition to any program
interrupt and uses the correct instruction length code (depending on the
interception code) for the injection of program interrupts.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: extract irq parameters of intercepted program irqs
Whenever a program interrupt is intercepted, some parameters are stored in the
sie control block. These parameters have to be extracted in order to be
reinjected correctly. This patch also takes care of intercepted PER events which
can occurr in addition to any program interrupt.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Jens Freimann [Mon, 10 Feb 2014 09:55:37 +0000 (10:55 +0100)]
s390: fix name of lowcore field at offset 0xa3
According to the Principles of Operation, at offset 0xA3
in the lowcore we have the "Architectural-Mode identification",
not an "access identification".
Heiko Carstens [Wed, 1 Jan 2014 15:55:48 +0000 (16:55 +0100)]
KVM: s390: convert handle_tpi()
Convert handle_tpi() to new guest access functions.
The code now sets up a structure which is copied with a single call to
guest space instead of issuing several separate guest access calls.
This is necessary since the to be copied data may cross a page boundary.
If a protection exception happens while accessing any of the pages, the
instruction is suppressed and may not have modified any memory contents.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>