Marc Zyngier [Tue, 6 Dec 2016 14:34:22 +0000 (14:34 +0000)]
arm64: KVM: pmu: Reset PMSELR_EL0.SEL to a sane value before entering the guest
The ARMv8 architecture allows the cycle counter to be configured
by setting PMSELR_EL0.SEL==0x1f and then accessing PMXEVTYPER_EL0,
hence accessing PMCCFILTR_EL0. But it disallows the use of
PMSELR_EL0.SEL==0x1f to access the cycle counter itself through
PMXEVCNTR_EL0.
Linux itself doesn't violate this rule, but we may end up with
PMSELR_EL0.SEL being set to 0x1f when we enter a guest. If that
guest accesses PMXEVCNTR_EL0, the access may UNDEF at EL1,
despite the guest not having done anything wrong.
In order to avoid this unfortunate course of events (haha!), let's
sanitize PMSELR_EL0 on guest entry. This ensures that the guest
won't explode unexpectedly.
Cc: stable@vger.kernel.org #4.6+ Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
KVM: arm/arm64: timer: Check for properly initialized timer on init
When the arch timer code fails to initialize (for example because the
memory mapped timer doesn't work, which is currently seen with the AEM
model), then KVM just continues happily with a final result that KVM
eventually does a NULL pointer dereference of the uninitialized cycle
counter.
Check directly for this in the init path and give the user a reasonable
error in this case.
Cc: Shih-Wei Li <shihwei@cs.columbia.edu> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Andre Przywara [Wed, 16 Nov 2016 17:57:16 +0000 (17:57 +0000)]
KVM: arm/arm64: vgic-v2: Limit ITARGETSR bits to number of VCPUs
The GICv2 spec says in section 4.3.12 that a "CPU targets field bit that
corresponds to an unimplemented CPU interface is RAZ/WI."
Currently we allow the guest to write any value in there and it can
read that back.
Mask the written value with the proper CPU mask to be spec compliant.
Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Petr Mladek [Wed, 19 Oct 2016 11:50:47 +0000 (13:50 +0200)]
KVM: x86: Handle the kthread worker using the new API
Use the new API to create and destroy the "kvm-pit" kthread
worker. The API hides some implementation details.
In particular, kthread_create_worker() allocates and initializes
struct kthread_worker. It runs the kthread the right way
and stores task_struct into the worker structure.
kthread_destroy_worker() flushes all pending works, stops
the kthread and frees the structure.
This patch does not change the existing behavior except for
dynamically allocating struct kthread_worker and storing
only the pointer of this structure.
It is compile tested only because I did not find an easy
way how to run the code. Well, it should be pretty safe
given the nature of the change.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Message-Id: <1476877847-11217-1-git-send-email-pmladek@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ladi Prosek [Wed, 30 Nov 2016 15:03:11 +0000 (16:03 +0100)]
KVM: nVMX: check host CR3 on vmentry and vmexit
This commit adds missing host CR3 checks. Before entering guest mode, the value
of CR3 is checked for reserved bits. After returning, nested_vmx_load_cr3 is
called to set the new CR3 value and check and load PDPTRs.
Ladi Prosek [Wed, 30 Nov 2016 15:03:10 +0000 (16:03 +0100)]
KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry
Loading CR3 as part of emulating vmentry is different from regular CR3 loads,
as implemented in kvm_set_cr3, in several ways.
* different rules are followed to check CR3 and it is desirable for the caller
to distinguish between the possible failures
* PDPTRs are not loaded if PAE paging and nested EPT are both enabled
* many MMU operations are not necessary
This patch introduces nested_vmx_load_cr3 suitable for CR3 loads as part of
nested vmentry and vmexit, and makes use of it on the nested vmentry path.
Ladi Prosek [Wed, 30 Nov 2016 15:03:09 +0000 (16:03 +0100)]
KVM: nVMX: propagate errors from prepare_vmcs02
It is possible that prepare_vmcs02 fails to load the guest state. This
patch adds the proper error handling for such a case. L1 will receive
an INVALID_STATE vmexit with the appropriate exit qualification if it
happens.
A failure to set guest CR3 is the only error propagated from prepare_vmcs02
at the moment.
Ladi Prosek [Wed, 30 Nov 2016 15:03:08 +0000 (16:03 +0100)]
KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
KVM does not correctly handle L1 hypervisors that emulate L2 real mode with
PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor populates guest
PDPTE VMCS fields and leaves guest CR3 uninitialized because it is not used
(see 26.3.2.4 Loading Page-Directory-Pointer-Table Entries). KVM always
dereferences CR3 and tries to load PDPTEs if PAE is on. This leads to two
related issues:
1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
overwritten in ept_load_pdptrs because the registers are believed to have
been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2 is
running with PAE enabled but PDPTRs have been set up by L1.
2) When L2 is about to enable paging and loads its CR3, we, again, attempt
to load PDPTEs in load_pdptrs called from kvm_set_cr3. There are no guarantees
that this will succeed (it's just a CR3 load, paging is not enabled yet) and
if it doesn't, kvm_set_cr3 returns early without persisting the CR3 which is
then lost and L2 crashes right after it enables paging.
This patch replaces the kvm_set_cr3 call with a simple register write if PAE
and EPT are both on. CR3 is not to be interpreted in this case.
David Matlack [Wed, 30 Nov 2016 02:14:10 +0000 (18:14 -0800)]
KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry
vmx_set_cr0() modifies GUEST_EFER and "IA-32e mode guest" in the current
VMCS. Call vmx_set_efer() after vmx_set_cr0() so that emulated VM-entry
is more faithful to VMCS12.
This patch correctly causes VM-entry to fail when "IA-32e mode guest" is
1 and GUEST_CR0.PG is 0. Previously this configuration would succeed and
"IA-32e mode guest" would silently be disabled by KVM.
Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
David Matlack [Wed, 30 Nov 2016 02:14:09 +0000 (18:14 -0800)]
KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID
MSR_IA32_CR{0,4}_FIXED1 define which bits in CR0 and CR4 are allowed to
be 1 during VMX operation. Since the set of allowed-1 bits is the same
in and out of VMX operation, we can generate these MSRs entirely from
the guest's CPUID. This lets userspace avoiding having to save/restore
these MSRs.
This patch also initializes MSR_IA32_CR{0,4}_FIXED1 from the CPU's MSRs
by default. This is a saner than the current default of -1ull, which
includes bits that the host CPU does not support.
Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
David Matlack [Wed, 30 Nov 2016 02:14:08 +0000 (18:14 -0800)]
KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation
KVM emulates MSR_IA32_VMX_CR{0,4}_FIXED1 with the value -1ULL, meaning
all CR0 and CR4 bits are allowed to be 1 during VMX operation.
This does not match real hardware, which disallows the high 32 bits of
CR0 to be 1, and disallows reserved bits of CR4 to be 1 (including bits
which are defined in the SDM but missing according to CPUID). A guest
can induce a VM-entry failure by setting these bits in GUEST_CR0 and
GUEST_CR4, despite MSR_IA32_VMX_CR{0,4}_FIXED1 indicating they are
valid.
Since KVM has allowed all bits to be 1 in CR0 and CR4, the existing
checks on these registers do not verify must-be-0 bits. Fix these checks
to identify must-be-0 bits according to MSR_IA32_VMX_CR{0,4}_FIXED1.
This patch should introduce no change in behavior in KVM, since these
MSRs are still -1ULL.
Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
David Matlack [Wed, 30 Nov 2016 02:14:07 +0000 (18:14 -0800)]
KVM: nVMX: support restore of VMX capability MSRs
The VMX capability MSRs advertise the set of features the KVM virtual
CPU can support. This set of features varies across different host CPUs
and KVM versions. This patch aims to addresses both sources of
differences, allowing VMs to be migrated across CPUs and KVM versions
without guest-visible changes to these MSRs. Note that cross-KVM-
version migration is only supported from this point forward.
When the VMX capability MSRs are restored, they are audited to check
that the set of features advertised are a subset of what KVM and the
CPU support.
Since the VMX capability MSRs are read-only, they do not need to be on
the default MSR save/restore lists. The userspace hypervisor can set
the values of these MSRs or read them from KVM at VCPU creation time,
and restore the same value after every save/restore.
Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
David Matlack [Wed, 30 Nov 2016 02:14:06 +0000 (18:14 -0800)]
KVM: nVMX: generate non-true VMX MSRs based on true versions
The "non-true" VMX capability MSRs can be generated from their "true"
counterparts, by OR-ing the default1 bits. The default1 bits are fixed
and defined in the SDM.
Since we can generate the non-true VMX MSRs from the true versions,
there's no need to store both in struct nested_vmx. This also lets
userspace avoid having to restore the non-true MSRs.
Note this does not preclude emulating MSR_IA32_VMX_BASIC[55]=0. To do so,
we simply need to set all the default1 bits in the true MSRs (such that
the true MSRs and the generated non-true MSRs are equal).
Signed-off-by: David Matlack <dmatlack@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Kyle Huey [Tue, 29 Nov 2016 20:40:40 +0000 (12:40 -0800)]
KVM: x86: Add kvm_skip_emulated_instruction and use it.
kvm_skip_emulated_instruction calls both
kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
skipping the emulated instruction and generating a trap if necessary.
Replacing skip_emulated_instruction calls with
kvm_skip_emulated_instruction is straightforward, except for:
- ICEBP, which is already inside a trap, so avoid triggering another trap.
- Instructions that can trigger exits to userspace, such as the IO insns,
MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
take precedence. The singlestep will be triggered again on the next
instruction, which is the current behavior.
- Task switch instructions which would require additional handling (e.g.
the task switch bit) and are instead left alone.
- Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
which do not trigger singlestep traps as mentioned previously.
Kyle Huey [Tue, 29 Nov 2016 20:40:39 +0000 (12:40 -0800)]
KVM: VMX: Move skip_emulated_instruction out of nested_vmx_check_vmcs12
We can't return both the pass/fail boolean for the vmcs and the upcoming
continue/exit-to-userspace boolean for skip_emulated_instruction out of
nested_vmx_check_vmcs, so move skip_emulated_instruction out of it instead.
Additionally, VMENTER/VMRESUME only trigger singlestep exceptions when
they advance the IP to the following instruction, not when they a) succeed,
b) fail MSR validation or c) throw an exception. Add a separate call to
skip_emulated_instruction that will later not be converted to the variant
that checks the singlestep flag.
Kyle Huey [Tue, 29 Nov 2016 20:40:38 +0000 (12:40 -0800)]
KVM: VMX: Reorder some skip_emulated_instruction calls
The functions being moved ahead of skip_emulated_instruction here don't
need updated IPs, and skipping the emulated instruction at the end will
make it easier to return its value.
Kyle Huey [Tue, 29 Nov 2016 20:40:37 +0000 (12:40 -0800)]
KVM: x86: Add a return value to kvm_emulate_cpuid
Once skipping the emulated instruction can potentially trigger an exit to
userspace (via KVM_GUESTDBG_SINGLESTEP) kvm_emulate_cpuid will need to
propagate a return value.
Paul Mackerras [Thu, 1 Dec 2016 03:03:46 +0000 (14:03 +1100)]
KVM: PPC: Book3S: Move prototypes for KVM functions into kvm_ppc.h
This moves the prototypes for functions that are only called from
assembler code out of asm/asm-prototypes.h into asm/kvm_ppc.h.
The prototypes were added in commit ebe4535fbe7a ("KVM: PPC:
Book3S HV: sparse: prototypes for functions called from assembler",
2016-10-10), but given that the functions are KVM functions,
having them in a KVM header will be better for long-term
maintenance.
Radim Krčmář [Tue, 29 Nov 2016 13:26:55 +0000 (14:26 +0100)]
Merge branch 'kvm-ppc-next' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
PPC KVM update for 4.10:
* Support for KVM guests on POWER9 using the hashed page table MMU.
* Updates and improvements to the halt-polling support on PPC, from
Suraj Jitindar Singh.
* An optimization to speed up emulated MMIO, from Yongji Xie.
* Various other minor cleanups.
Radim Krčmář [Tue, 29 Nov 2016 13:25:58 +0000 (14:25 +0100)]
Merge tag 'kvm-s390-next-4.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux
KVM: s390: Changes for 4.10 (via kvm/next)
Two small optimizations to not do register reloading in
vcpu_put/get, instead do it in the ioctl path. This reduces
the overhead for schedule-intense workload that does not
exit to QEMU. (e.g. KVM guest with eventfd/irqfd that
does a lot of context switching with vhost or iothreads).
There is currently no documentation about the halt polling capabilities
of the kvm module. Add some documentation describing the mechanism as well
as the module parameters to all better understanding of how halt polling
should be used and the effect of tuning the module parameters.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
KVM: PPC: Decrease the powerpc default halt poll max value
KVM_HALT_POLL_NS_DEFAULT is an arch specific constant which sets the
default value of the halt_poll_ns kvm module parameter which determines
the global maximum halt polling interval.
The current value for powerpc is 500000 (500us) which means that any
repetitive workload with a period of less than that can drive the cpu
usage to 100% where it may have been mostly idle without halt polling.
This presents the possibility of a large increase in power usage with
a comparatively small performance benefit.
Reduce the default to 10000 (10us) and a user can tune this themselves
to set their affinity for halt polling based on the trade off between power
and performance which they are willing to make.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
KVM: PPC: Book3S HV: Add check for module parameter halt_poll_ns
The kvm module parameter halt_poll_ns defines the global maximum halt
polling interval and can be dynamically changed by writing to the
/sys/module/kvm/parameters/halt_poll_ns sysfs file. However in kvm-hv
this module parameter value is only ever checked when we grow the current
polling interval for the given vcore. This means that if we decrease the
halt_poll_ns value below the current polling interval we won't see any
effect unless we try to grow the polling interval above the new max at some
point or it happens to be shrunk below the halt_poll_ns value.
Update the halt polling code so that we always check for a new module param
value of halt_poll_ns and set the current halt polling interval to it if
it's currently greater than the new max. This means that it's redundant to
also perform this check in the grow_halt_poll_ns() function now.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
KVM: PPC: Book3S HV: Use generic kvm module parameters
The previous patch exported the variables which back the module parameters
of the generic kvm module. Now use these variables in the kvm-hv module
so that any change to the generic module parameters will also have the
same effect for the kvm-hv module. This removes the duplication of the
kvm module parameters which was redundant and should reduce confusion when
tuning them.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The kvm module has the parameters halt_poll_ns, halt_poll_ns_grow, and
halt_poll_ns_shrink. Halt polling was recently added to the powerpc kvm-hv
module and these parameters were essentially duplicated for that. There is
no benefit to this duplication and it can lead to confusion when trying to
tune halt polling.
Thus move the definition of these variables to kvm_host.h and export them.
This will allow the kvm-hv module to use the same module parameters by
accessing these variables, which will be implemented in the next patch,
meaning that they will no longer be duplicated.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Tom Lendacky [Wed, 23 Nov 2016 17:01:38 +0000 (12:01 -0500)]
kvm: svm: Add support for additional SVM NPF error codes
AMD hardware adds two additional bits to aid in nested page fault handling.
Bit 32 - NPF occurred while translating the guest's final physical address
Bit 33 - NPF occurred while translating the guest page tables
The guest page tables fault indicator can be used as an aid for nested
virtualization. Using V0 for the host, V1 for the first level guest and
V2 for the second level guest, when both V1 and V2 are using nested paging
there are currently a number of unnecessary instruction emulations. When
V2 is launched shadow paging is used in V1 for the nested tables of V2. As
a result, KVM marks these pages as RO in the host nested page tables. When
V2 exits and we resume V1, these pages are still marked RO.
Every nested walk for a guest page table is treated as a user-level write
access and this causes a lot of NPFs because the V1 page tables are marked
RO in the V0 nested tables. While executing V1, when these NPFs occur KVM
sees a write to a read-only page, emulates the V1 instruction and unprotects
the page (marking it RW). This patch looks for cases where we get a NPF due
to a guest page table walk where the page was marked RO. It immediately
unprotects the page and resumes the guest, leading to far fewer instruction
emulations when nested virtualization is used.
David Gibson [Wed, 23 Nov 2016 05:14:07 +0000 (16:14 +1100)]
KVM: PPC: Correctly report KVM_CAP_PPC_ALLOC_HTAB
At present KVM on powerpc always reports KVM_CAP_PPC_ALLOC_HTAB as enabled.
However, the ioctl() it advertises (KVM_PPC_ALLOCATE_HTAB) only actually
works on KVM HV. On KVM PR it will fail with ENOTTY.
QEMU already has a workaround for this, so it's not breaking things in
practice, but it would be better to advertise this correctly.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
David Gibson [Wed, 23 Nov 2016 05:14:06 +0000 (16:14 +1100)]
KVM: PPC: Move KVM_PPC_PVINFO_FLAGS_EV_IDLE definition next to its structure
The KVM_PPC_PVINFO_FLAGS_EV_IDLE macro defines a bit for use in the flags
field of struct kvm_ppc_pvinfo. However, changes since that was introduced
have moved it away from that structure definition, which is confusing.
Move it back next to the structure it belongs with.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Paul Mackerras [Thu, 24 Nov 2016 03:10:43 +0000 (14:10 +1100)]
KVM: PPC: Book3S HV: Fix compilation with unusual configurations
This adds the "again" parameter to the dummy version of
kvmppc_check_passthru(), so that it matches the real version.
This fixes compilation with CONFIG_BOOK3S_64_HV set but
CONFIG_KVM_XICS=n.
This includes asm/smp.h in book3s_hv_builtin.c to fix compilation
with CONFIG_SMP=n. The explicit inclusion is necessary to provide
definitions of hard_smp_processor_id() and get_hard_smp_processor_id()
in UP configs.
KVM: PPC: Book3S HV: Update kvmppc_set_arch_compat() for ISA v3.00
The function kvmppc_set_arch_compat() is used to determine the value of the
processor compatibility register (PCR) for a guest running in a given
compatibility mode. There is currently no support for v3.00 of the ISA.
Add support for v3.00 of the ISA which adds an ISA v2.07 compatilibity mode
to the PCR.
We also add a check to ensure the processor we are running on is capable of
emulating the chosen processor (for example a POWER7 cannot emulate a
POWER8, similarly with a POWER8 and a POWER9).
Based on work by: Paul Mackerras <paulus@ozlabs.org>
[paulus@ozlabs.org - moved dummy PCR_ARCH_300 definition here; set
guest_pcr_bit when arch_compat == 0, added comment.]
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Paul Mackerras [Fri, 18 Nov 2016 06:43:30 +0000 (17:43 +1100)]
KVM: PPC: Book3S HV: Treat POWER9 CPU threads as independent subcores
With POWER9, each CPU thread has its own MMU context and can be
in the host or a guest independently of the other threads; there is
still however a restriction that all threads must use the same type
of address translation, either radix tree or hashed page table (HPT).
Since we only support HPT guests on a HPT host at this point, we
can treat the threads as being independent, and avoid all of the
work of coordinating the CPU threads. To make this simpler, we
introduce a new threads_per_vcore() function that returns 1 on
POWER9 and threads_per_subcore on POWER7/8, and use that instead
of threads_per_subcore or threads_per_core in various places.
This also changes the value of the KVM_CAP_PPC_SMT capability on
POWER9 systems from 4 to 1, so that userspace will not try to
create VMs with multiple vcpus per vcore. (If userspace did create
a VM that thought it was in an SMT mode, the VM might try to use
the msgsndp instruction, which will not work as expected. In
future it may be possible to trap and emulate msgsndp in order to
allow VMs to think they are in an SMT mode, if only for the purpose
of allowing migration from POWER8 systems.)
With all this, we can now run guests on POWER9 as long as the host
is running with HPT translation. Since userspace currently has no
way to request radix tree translation for the guest, the guest has
no choice but to use HPT translation.
Paul Mackerras [Tue, 22 Nov 2016 03:30:14 +0000 (14:30 +1100)]
KVM: PPC: Book3S HV: Enable hypervisor virtualization interrupts while in guest
The new XIVE interrupt controller on POWER9 can direct external
interrupts to the hypervisor or the guest. The interrupts directed to
the hypervisor are controlled by an LPCR bit called LPCR_HVICE, and
come in as a "hypervisor virtualization interrupt". This sets the
LPCR bit so that hypervisor virtualization interrupts can occur while
we are in the guest. We then also need to cope with exiting the guest
because of a hypervisor virtualization interrupt.
Paul Mackerras [Fri, 18 Nov 2016 03:34:07 +0000 (14:34 +1100)]
KVM: PPC: Book3S HV: Use stop instruction rather than nap on POWER9
POWER9 replaces the various power-saving mode instructions on POWER8
(doze, nap, sleep and rvwinkle) with a single "stop" instruction, plus
a register, PSSCR, which controls the depth of the power-saving mode.
This replaces the use of the nap instruction when threads are idle
during guest execution with the stop instruction, and adds code to
set PSSCR to a value which will allow an SMT mode switch while the
thread is idle (given that the core as a whole won't be idle in these
cases).
Paul Mackerras [Thu, 17 Nov 2016 22:02:08 +0000 (09:02 +1100)]
KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9
POWER9 includes a new interrupt controller, called XIVE, which is
quite different from the XICS interrupt controller on POWER7 and
POWER8 machines. KVM-HV accesses the XICS directly in several places
in order to send and clear IPIs and handle interrupts from PCI
devices being passed through to the guest.
In order to make the transition to XIVE easier, OPAL firmware will
include an emulation of XICS on top of XIVE. Access to the emulated
XICS is via OPAL calls. The one complication is that the EOI
(end-of-interrupt) function can now return a value indicating that
another interrupt is pending; in this case, the XIVE will not signal
an interrupt in hardware to the CPU, and software is supposed to
acknowledge the new interrupt without waiting for another interrupt
to be delivered in hardware.
This adapts KVM-HV to use the OPAL calls on machines where there is
no XICS hardware. When there is no XICS, we look for a device-tree
node with "ibm,opal-intc" in its compatible property, which is how
OPAL indicates that it provides XICS emulation.
In order to handle the EOI return value, kvmppc_read_intr() has
become kvmppc_read_one_intr(), with a boolean variable passed by
reference which can be set by the EOI functions to indicate that
another interrupt is pending. The new kvmppc_read_intr() keeps
calling kvmppc_read_one_intr() until there are no more interrupts
to process. The return value from kvmppc_read_intr() is the
largest non-zero value of the returns from kvmppc_read_one_intr().
Paul Mackerras [Thu, 17 Nov 2016 21:47:08 +0000 (08:47 +1100)]
KVM: PPC: Book3S HV: Use msgsnd for IPIs to other cores on POWER9
On POWER9, the msgsnd instruction is able to send interrupts to
other cores, as well as other threads on the local core. Since
msgsnd is generally simpler and faster than sending an IPI via the
XICS, we use msgsnd for all IPIs sent by KVM on POWER9.
Paul Mackerras [Thu, 17 Nov 2016 21:28:51 +0000 (08:28 +1100)]
KVM: PPC: Book3S HV: Adapt TLB invalidations to work on POWER9
POWER9 adds new capabilities to the tlbie (TLB invalidate entry)
and tlbiel (local tlbie) instructions. Both instructions get a
set of new parameters (RIC, PRS and R) which appear as bits in the
instruction word. The tlbiel instruction now has a second register
operand, which contains a PID and/or LPID value if needed, and
should otherwise contain 0.
This adapts KVM-HV's usage of tlbie and tlbiel to work on POWER9
as well as older processors. Since we only handle HPT guests so
far, we need RIC=0 PRS=0 R=0, which ends up with the same instruction
word as on previous processors, so we don't need to conditionally
execute different instructions depending on the processor.
The local flush on first entry to a guest in book3s_hv_rmhandlers.S
is a loop which depends on the number of TLB sets. Rather than
using feature sections to set the number of iterations based on
which CPU we're on, we now work out this number at VM creation time
and store it in the kvm_arch struct. That will make it possible to
get the number from the device tree in future, which will help with
compatibility with future processors.
Since mmu_partition_table_set_entry() does a global flush of the
whole LPID, we don't need to do the TLB flush on first entry to the
guest on each processor. Therefore we don't set all bits in the
tlb_need_flush bitmap on VM startup on POWER9.
Paul Mackerras [Fri, 18 Nov 2016 02:11:42 +0000 (13:11 +1100)]
KVM: PPC: Book3S HV: Add new POWER9 guest-accessible SPRs
This adds code to handle two new guest-accessible special-purpose
registers on POWER9: TIDR (thread ID register) and PSSCR (processor
stop status and control register). They are context-switched
between host and guest, and the guest values can be read and set
via the one_reg interface.
The PSSCR contains some fields which are guest-accessible and some
which are only accessible in hypervisor mode. We only allow the
guest-accessible fields to be read or set by userspace.
Paul Mackerras [Wed, 16 Nov 2016 11:33:27 +0000 (22:33 +1100)]
KVM: PPC: Book3S HV: Adjust host/guest context switch for POWER9
Some special-purpose registers that were present and accessible
by guests on POWER8 no longer exist on POWER9, so this adds
feature sections to ensure that we don't try to context-switch
them when going into or out of a guest on POWER9. These are
all relatively obscure, rarely-used registers, but we had to
context-switch them on POWER8 to avoid creating a covert channel.
They are: SPMC1, SPMC2, MMCRS, CSIGR, TACR, TCSCR, and ACOP.
Paul Mackerras [Wed, 16 Nov 2016 11:25:20 +0000 (22:25 +1100)]
KVM: PPC: Book3S HV: Set partition table rather than SDR1 on POWER9
On POWER9, the SDR1 register (hashed page table base address) is no
longer used, and instead the hardware reads the HPT base address
and size from the partition table. The partition table entry also
contains the bits that specify the page size for the VRMA mapping,
which were previously in the LPCR. The VPM0 bit of the LPCR is
now reserved; the processor now always uses the VRMA (virtual
real-mode area) mechanism for guest real-mode accesses in HPT mode,
and the RMO (real-mode offset) mechanism has been dropped.
When entering or exiting the guest, we now only have to set the
LPIDR (logical partition ID register), not the SDR1 register.
There is also no requirement now to transition via a reserved
LPID value.
Paul Mackerras [Wed, 16 Nov 2016 05:57:24 +0000 (16:57 +1100)]
KVM: PPC: Book3S HV: Adapt to new HPTE format on POWER9
This adapts the KVM-HV hashed page table (HPT) code to read and write
HPT entries in the new format defined in Power ISA v3.00 on POWER9
machines. The new format moves the B (segment size) field from the
first doubleword to the second, and trims some bits from the AVA
(abbreviated virtual address) and ARPN (abbreviated real page number)
fields. As far as possible, the conversion is done when reading or
writing the HPT entries, and the rest of the code continues to use
the old format.
Michael Ellerman [Tue, 22 Nov 2016 03:50:39 +0000 (14:50 +1100)]
powerpc/reg: Add definition for LPCR_PECE_HVEE
ISA 3.0 defines a new PECE (Power-saving mode Exit Cause Enable) field
in the LPCR (Logical Partitioning Control Register), called
LPCR_PECE_HVEE (Hypervisor Virtualization Exit Enable).
KVM code will need to know about this bit, so add a definition for it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Paul Mackerras [Mon, 21 Nov 2016 05:00:58 +0000 (16:00 +1100)]
powerpc/64: Provide functions for accessing POWER9 partition table
POWER9 requires the host to set up a partition table, which is a
table in memory indexed by logical partition ID (LPID) which
contains the pointers to page tables and process tables for the
host and each guest.
This factors out the initialization of the partition table into
a single function. This code was previously duplicated between
hash_utils_64.c and pgtable-radix.c.
This provides a function for setting a partition table entry,
which is used in early MMU initialization, and will be used by
KVM whenever a guest is created. This function includes a tlbie
instruction which will flush all TLB entries for the LPID and
all caches of the partition table entry for the LPID, across the
system.
This also moves a call to memblock_set_current_limit(), which was
in radix_init_partition_table(), but has nothing to do with the
partition table. By analogy with the similar code for hash, the
call gets moved to near the end of radix__early_init_mmu(). It
now gets called when running as a guest, whereas previously it
would only be called if the kernel is running as the host.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
KVM: s390: handle floating point registers in the run ioctl not in vcpu_put/load
Right now we switch the host fprs/vrs in kvm_arch_vcpu_load and switch
back in kvm_arch_vcpu_put. This process is already optimized
since commit 9977e886cbbc7 ("s390/kernel: lazy restore fpu registers")
avoiding double save/restores on schedule. We still reload the pointers
and test the guest fpc on each context switch, though.
We can minimize the cost of vcpu_load/put by doing the test in the
VCPU_RUN ioctl itself. As most VCPU threads almost never exit to
userspace in the common fast path, this allows to avoid this overhead
for the common case (eventfd driven I/O, all exits including sleep
handled in the kernel) - making kvm_arch_vcpu_load/put basically
disappear in perf top.
Also adapt the fpu get/set ioctls.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: handle access registers in the run ioctl not in vcpu_put/load
Right now we save the host access registers in kvm_arch_vcpu_load
and load them in kvm_arch_vcpu_put. Vice versa for the guest access
registers. On schedule this means, that we load/save access registers
multiple times.
e.g. VCPU_RUN with just one reschedule and then return does
[from user space via VCPU_RUN]
- save the host registers in kvm_arch_vcpu_load (via ioctl)
- load the guest registers in kvm_arch_vcpu_load (via ioctl)
- do guest stuff
- decide to schedule/sleep
- save the guest registers in kvm_arch_vcpu_put (via sched)
- load the host registers in kvm_arch_vcpu_put (via sched)
- save the host registers in switch_to (via sched)
- schedule
- return
- load the host registers in switch_to (via sched)
- save the host registers in kvm_arch_vcpu_load (via sched)
- load the guest registers in kvm_arch_vcpu_load (via sched)
- do guest stuff
- decide to go to userspace
- save the guest registers in kvm_arch_vcpu_put (via ioctl)
- load the host registers in kvm_arch_vcpu_put (via ioctl)
[back to user space]
As the kernel does not use access registers, we can avoid
this reloading and simply piggy back on switch_to (let it save
the guest values instead of host values in thread.acrs) by
moving the host/guest switch into the VCPU_RUN ioctl function.
We now do
[from user space via VCPU_RUN]
- save the host registers in kvm_arch_vcpu_ioctl_run
- load the guest registers in kvm_arch_vcpu_ioctl_run
- do guest stuff
- decide to schedule/sleep
- save the guest registers in switch_to
- schedule
- return
- load the guest registers in switch_to (via sched)
- do guest stuff
- decide to go to userspace
- save the guest registers in kvm_arch_vcpu_ioctl_run
- load the host registers in kvm_arch_vcpu_ioctl_run
This seems to save about 10% of the vcpu_put/load functions
according to perf.
As vcpu_load no longer switches the acrs, We can also loading
the acrs in kvm_arch_vcpu_ioctl_set_sregs.
Suggested-by: Fan Zhang <zhangfan@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Bandan Das [Tue, 15 Nov 2016 06:36:18 +0000 (01:36 -0500)]
kvm: x86: don't print warning messages for unimplemented msrs
Change unimplemented msrs messages to use pr_debug.
If CONFIG_DYNAMIC_DEBUG is set, then these messages can be
enabled at run time or else -DDEBUG can be used at compile
time to enable them. These messages will still be printed if
ignore_msrs=1.
Signed-off-by: Bandan Das <bsd@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Jim Mattson [Wed, 9 Nov 2016 17:50:11 +0000 (09:50 -0800)]
kvm: x86: CPUID.01H:EDX.APIC[bit 9] should mirror IA32_APIC_BASE[11]
From the Intel SDM, volume 3, section 10.4.3, "Enabling or Disabling the
Local APIC,"
When IA32_APIC_BASE[11] is 0, the processor is functionally equivalent
to an IA-32 processor without an on-chip APIC. The CPUID feature flag
for the APIC (see Section 10.4.2, "Presence of the Local APIC") is
also set to 0.
Signed-off-by: Jim Mattson <jmattson@google.com>
[Changed subject tag from nVMX to x86.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Paul Mackerras [Wed, 16 Nov 2016 05:43:28 +0000 (16:43 +1100)]
KVM: PPC: Book3S HV: Don't lose hardware R/C bit updates in H_PROTECT
The hashed page table MMU in POWER processors can update the R
(reference) and C (change) bits in a HPTE at any time until the
HPTE has been invalidated and the TLB invalidation sequence has
completed. In kvmppc_h_protect, which implements the H_PROTECT
hypercall, we read the HPTE, modify the second doubleword,
invalidate the HPTE in memory, do the TLB invalidation sequence,
and then write the modified value of the second doubleword back
to memory. In doing so we could overwrite an R/C bit update done
by hardware between when we read the HPTE and when the TLB
invalidation completed. To fix this we re-read the second
doubleword after the TLB invalidation and OR in the (possibly)
new values of R and C. We can use an OR since hardware only ever
sets R and C, never clears them.
This race was found by code inspection. In principle this bug could
cause occasional guest memory corruption under host memory pressure.
Fixes: a8606e20e41a ("KVM: PPC: Handle some PAPR hcalls in the kernel", 2011-06-29) Cc: stable@vger.kernel.org # v3.19+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Paul Mackerras [Mon, 7 Nov 2016 04:09:58 +0000 (15:09 +1100)]
KVM: PPC: Book3S HV: Save/restore XER in checkpointed register state
When switching from/to a guest that has a transaction in progress,
we need to save/restore the checkpointed register state. Although
XER is part of the CPU state that gets checkpointed, the code that
does this saving and restoring doesn't save/restore XER.
This fixes it by saving and restoring the XER. To allow userspace
to read/write the checkpointed XER value, we also add a new ONE_REG
specifier.
The visible effect of this bug is that the guest may see its XER
value being corrupted when it uses transactions.
Fixes: e4e38121507a ("KVM: PPC: Book3S HV: Add transactional memory support") Fixes: 0a8eccefcb34 ("KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest exit") Cc: stable@vger.kernel.org # v3.15+ Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Yongji Xie [Fri, 4 Nov 2016 05:55:12 +0000 (13:55 +0800)]
KVM: PPC: Book3S HV: Add a per vcpu cache for recently page faulted MMIO entries
This keeps a per vcpu cache for recently page faulted MMIO entries.
On a page fault, if the entry exists in the cache, we can avoid some
time-consuming paths, for example, looking up HPT, locking HPTE twice
and searching mmio gfn from memslots, then directly call
kvmppc_hv_emulate_mmio().
In current implenment, we limit the size of cache to four. We think
it's enough to cover the high-frequency MMIO HPTEs in most case.
For example, considering the case of using virtio device, for virtio
legacy devices, one HPTE could handle notifications from up to
1024 (64K page / 64 byte Port IO register) devices, so one cache entry
is enough; for virtio modern devices, we always need one HPTE to handle
notification for each device because modern device would use a 8M MMIO
register to notify host instead of Port IO register, typically the
system's configuration should not exceed four virtio devices per
vcpu, four cache entry is also enough in this case. Of course, if needed,
we could also modify the macro to a module parameter in the future.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Yongji Xie [Fri, 4 Nov 2016 05:55:11 +0000 (13:55 +0800)]
KVM: PPC: Book3S HV: Clear the key field of HPTE when the page is paged out
Currently we mark a HPTE for emulated MMIO with HPTE_V_ABSENT bit
set as well as key 0x1f. However, those HPTEs may be conflicted with
the HPTE for real guest RAM page HPTE with key 0x1f when the page
get paged out.
This patch clears the key field of HPTE when the page is paged out,
then recover it when HPTE is re-established.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Radim Krčmář [Wed, 9 Nov 2016 18:07:06 +0000 (19:07 +0100)]
KVM: x86: emulate FXSAVE and FXRSTOR
Internal errors were reported on 16 bit fxsave and fxrstor with ipxe.
Old Intels don't have unrestricted_guest, so we have to emulate them.
The patch takes advantage of the hardware implementation.
AMD and Intel differ in saving and restoring other fields in first 32
bytes. A test wrote 0xff to the fxsave area, 0 to upper bits of MCSXR
in the fxsave area, executed fxrstor, rewrote the fxsave area to 0xee,
and executed fxsave:
Intel (Nehalem):
7f 1f 7f 7f ff 00 ff 07 ff ff ff ff ff ff 00 00
ff ff ff ff ff ff 00 00 ff ff 00 00 ff ff 00 00
Intel (Haswell -- deprecated FPU CS and FPU DS):
7f 1f 7f 7f ff 00 ff 07 ff ff ff ff 00 00 00 00
ff ff ff ff 00 00 00 00 ff ff 00 00 ff ff 00 00
AMD (Opteron 2300-series):
7f 1f 7f 7f ff 00 ee ee ee ee ee ee ee ee ee ee
ee ee ee ee ee ee ee ee ff ff 00 00 ff ff 02 00
fxsave/fxrstor will only be emulated on early Intels, so KVM can't do
much to improve the situation.
Jiang Biao [Mon, 7 Nov 2016 00:57:16 +0000 (08:57 +0800)]
kvm: x86: remove unused but set variable
The local variable *gpa_offset* is set but not used afterwards,
which make the compiler issue a warning with option
-Wunused-but-set-variable. Remove it to avoid the warning.
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
He Chen [Fri, 11 Nov 2016 09:25:35 +0000 (17:25 +0800)]
x86/cpuid: Provide get_scattered_cpuid_leaf()
Sparse populated CPUID leafs are collected in a software provided leaf to
avoid bloat of the x86_capability array, but there is no way to rebuild the
real leafs (e.g. for KVM CPUID enumeration) other than rereading the CPUID
leaf from the CPU. While this is possible it is problematic as it does not
take software disabled features into account. If a feature is disabled on
the host it should not be exposed to a guest either.
Add get_scattered_cpuid_leaf() which rebuilds the leaf from the scattered
cpuid table information and the active CPU features.
[ tglx: Rewrote changelog ]
Signed-off-by: He Chen <he.chen@linux.intel.com> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Luwei Kang <luwei.kang@intel.com> Cc: kvm@vger.kernel.org Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Piotr Luc <Piotr.Luc@intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Link: http://lkml.kernel.org/r/1478856336-9388-3-git-send-email-he.chen@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Paul Mackerras [Fri, 11 Nov 2016 05:55:03 +0000 (16:55 +1100)]
powerpc/64: Simplify adaptation to new ISA v3.00 HPTE format
This changes the way that we support the new ISA v3.00 HPTE format.
Instead of adapting everything that uses HPTE values to handle either
the old format or the new format, depending on which CPU we are on,
we now convert explicitly between old and new formats if necessary
in the low-level routines that actually access HPTEs in memory.
This limits the amount of code that needs to know about the new
format and makes the conversions explicit. This is OK because the
old format contains all the information that is in the new format.
This also fixes operation under a hypervisor, because the H_ENTER
hypercall (and other hypercalls that deal with HPTEs) will continue
to require the HPTE value to be supplied in the old format. At
present the kernel will not boot in HPT mode on POWER9 under a
hypervisor.
This fixes and partially reverts commit 50de596de8be
("powerpc/mm/hash: Add support for Power9 Hash", 2016-04-29).
Fixes: 50de596de8be ("powerpc/mm/hash: Add support for Power9 Hash") Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Vladimir Murzin [Wed, 2 Nov 2016 11:55:34 +0000 (11:55 +0000)]
ARM: KVM: Support vGICv3 ITS
This patch allows to build and use vGICv3 ITS in 32-bit mode.
Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Vladimir Murzin [Wed, 2 Nov 2016 11:55:33 +0000 (11:55 +0000)]
KVM: arm64: vgic-its: Fix compatibility with 32-bit
Evaluate GITS_BASER_ENTRY_SIZE once as an int data (GITS_BASER<n>'s
Entry Size is 5-bit wide only), so when used as divider no reference
to __aeabi_uldivmod is generated when build for AArch32.
Use unsigned long long for GITS_BASER_PAGE_SIZE_* since they are
used in conjunction with 64-bit data.
Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Linus Torvalds [Sun, 13 Nov 2016 18:28:53 +0000 (10:28 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"ARM fixes. There are a couple pending x86 patches but they'll have to
wait for next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: arm/arm64: vgic: Kick VCPUs when queueing already pending IRQs
KVM: arm/arm64: vgic: Prevent access to invalid SPIs
arm/arm64: KVM: Perform local TLB invalidation when multiplexing vcpus on a single CPU
Linus Torvalds [Sun, 13 Nov 2016 18:26:05 +0000 (10:26 -0800)]
Merge branch 'media-fixes' (patches from Mauro)
Merge media fixes from Mauro Carvalho Chehab:
"This contains two patches fixing problems with my patch series meant
to make USB drivers to work again after the DMA on stack changes.
The last patch on this series is actually not related to DMA on stack.
It solves a longstanding bug affecting module unload, causing
module_put() to be called twice. It was reported by the user who
reported and tested the issues with the gp8psk driver with the DMA
fixup patches. As we're late at -rc cycle, maybe you prefer to not
apply it right now. If this is the case, I'll add to the pile of
patches for 4.10.
Exceptionally this time, I'm sending the patches via e-mail, because
I'm on another trip, and won't be able to use the usual procedure
until Monday. Also, it is only three patches, and you followed already
the discussions about the first one"
Linus Torvalds [Sun, 13 Nov 2016 18:24:08 +0000 (10:24 -0800)]
Merge tag 'char-misc-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc fixes from Greg KH:
"Here are three small driver fixes for some reported issues for
4.9-rc5.
One for the hyper-v subsystem, fixing up a naming issue that showed up
in 4.9-rc1, one mei driver fix, and one fix for parallel ports,
resolving a reported regression.
All have been in linux-next with no reported issues"
* tag 'char-misc-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
ppdev: fix double-free of pp->pdev->name
vmbus: make sysfs names consistent with PCI
mei: bus: fix received data size check in NFC fixup
Linus Torvalds [Sun, 13 Nov 2016 18:22:07 +0000 (10:22 -0800)]
Merge tag 'driver-core-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg KH:
"Here are two driver core fixes for 4.9-rc5.
The first resolves an issue with some drivers not liking to be unbound
and bound again (if CONFIG_DEBUG_TEST_DRIVER_REMOVE is enabled), which
solves some reported problems with graphics and storage drivers. The
other resolves a smatch error with the 4.9-rc1 driver core changes
around this feature.
Both have been in linux-next with no reported issues"
* tag 'driver-core-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
driver core: fix smatch warning on dev->bus check
driver core: skip removal test for non-removable drivers
Linus Torvalds [Sun, 13 Nov 2016 18:13:33 +0000 (10:13 -0800)]
Merge tag 'staging-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging/IIO fixes from Grek KH:
"Here are a few small staging and iio driver fixes for reported issues.
The last one was cherry-picked from my -next branch to resolve a build
warning that Arnd fixed, in his quest to be able to turn
-Wmaybe-uninitialized back on again. That patch, and all of the
others, have been in linux-next for a while with no reported issues"
* tag 'staging-4.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
iio: maxim_thermocouple: detect invalid storage size in read()
staging: nvec: remove managed resource from PS2 driver
Revert "staging: nvec: ps2: change serio type to passthrough"
drivers: staging: nvec: remove bogus reset command for PS/2 interface
staging: greybus: arche-platform: fix device reference leak
staging: comedi: ni_tio: fix buggy ni_tio_clock_period_ps() return value
staging: sm750fb: Fix bugs introduced by early commits
iio: hid-sensors: Increase the precision of scale to fix wrong reading interpretation.
iio: orientation: hid-sensor-rotation: Add PM function (fix non working driver)
iio: st_sensors: fix scale configuration for h3lis331dl
staging: iio: ad5933: avoid uninitialized variable in error case
Linus Torvalds [Sun, 13 Nov 2016 18:09:04 +0000 (10:09 -0800)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull more block fixes from Jens Axboe:
"Since I mistakenly left out the lightnvm regression fix yesterday and
the aoeblk seems adequately tested at this point, might as well send
out another pull to make -rc5"
* 'for-linus' of git://git.kernel.dk/linux-block:
aoe: fix crash in page count manipulation
lightnvm: invalid offset calculation for lba_shift
Linus Torvalds [Sun, 13 Nov 2016 18:07:08 +0000 (10:07 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"The megaraid_sas patch in here fixes a major regression in the last
fix set that made all megaraid_sas cards unusable. It turns out no-one
had actually tested such an "obvious" fix, sigh. The fix for the fix
has been tested ...
The next most serious is the vmw_pvscsi abort problem which basically
means that aborts don't work on the vmware paravirt devices and error
handling always escalates to reset.
The rest are an assortment of missed reference counting in certain
paths and corner case bugs that show up on some architectures"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: megaraid_sas: fix macro MEGASAS_IS_LOGICAL to avoid regression
scsi: qla2xxx: fix invalid DMA access after command aborts in PCI device remove
scsi: qla2xxx: do not queue commands when unloading
scsi: libcxgbi: fix incorrect DDP resource cleanup
scsi: qla2xxx: Fix scsi scan hang triggered if adapter fails during init
scsi: scsi_dh_alua: Fix a reference counting bug
scsi: vmw_pvscsi: return SUCCESS for successful command aborts
scsi: mpt3sas: Fix for block device of raid exists even after deleting raid disk
scsi: scsi_dh_alua: fix missing kref_put() in alua_rtpg_work()
Linus Torvalds [Sun, 13 Nov 2016 18:04:55 +0000 (10:04 -0800)]
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"The typical collection of minor bug fixes in clk drivers. We don't
have anything in the core framework here, just driver fixes.
There's a boot fix for Samsung devices and a safety measure for qoriq
to prevent CPUs from running too fast. There's also a fix for i.MX6Q
to properly handle audio clock rates. We also have some "that's
obviously wrong" fixes like bad NULL pointer checks in the MPP driver
and a poor usage of __pa in the xgene clk driver that are fixed here"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: mmp: pxa910: fix return value check in pxa910_clk_init()
clk: mmp: pxa168: fix return value check in pxa168_clk_init()
clk: mmp: mmp2: fix return value check in mmp2_clk_init()
clk: qoriq: Don't allow CPU clocks higher than starting value
clk: imx: fix integer overflow in AV PLL round rate
clk: xgene: Don't call __pa on ioremaped address
clk/samsung: Use CLK_OF_DECLARE_DRIVER initialization method for CLKOUT
clk: rockchip: don't return NULL when failing to register ddrclk branch
The DVB binding schema at the DVB core assumes that the frontend is a
separate driver. Faling to do that causes OOPS when the module is
removed, as it tries to do a symbol_put_addr on an internal symbol,
causing craches like:
Commit bc29131ecb10 ("[media] gp8psk: don't do DMA on stack") fixed the
usage of DMA on stack, but the memcpy was wrong for gp8psk_usb_in_op().
Fix it.
From Derek's email:
"Fix confirmed using 2 different Skywalker models with
HD mpeg4, SD mpeg2."
Suggested-by: Johannes Stezenbach <js@linuxtv.org> Fixes: bc29131ecb10 ("[media] gp8psk: don't do DMA on stack") Tested-by: Derek <user.vdr@gmail.com> Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnd Bergmann [Tue, 25 Oct 2016 15:55:04 +0000 (17:55 +0200)]
iio: maxim_thermocouple: detect invalid storage size in read()
As found by gcc -Wmaybe-uninitialized, having a storage_bytes value other
than 2 or 4 will result in undefined behavior:
drivers/iio/temperature/maxim_thermocouple.c: In function 'maxim_thermocouple_read':
drivers/iio/temperature/maxim_thermocouple.c:141:5: error: 'ret' may be used uninitialized in this function [-Werror=maybe-uninitialized]
This probably cannot happen, but returning -EINVAL here is appropriate
and makes gcc happy and the code more robust.
Fixes: 231147ee77f3 ("iio: maxim_thermocouple: Align 16 bit big endian value of raw reads") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jonathan Cameron <jic23@kernel.org>
(cherry picked from commit 32cb7d27e65df9daa7cee8f1fdf7b259f214bee2) Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Jens Axboe [Sat, 12 Nov 2016 01:28:50 +0000 (18:28 -0700)]
aoe: fix crash in page count manipulation
aoeblk contains some mysterious code, that wants to elevate the bio
vec page counts while it's under IO. That is not needed, it's
fragile, and it's causing kernel oopses for some.
Reported-by: Tested-by: Don Koch <kochd@us.ibm.com> Tested-by: Tested-by: Don Koch <kochd@us.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Matias Bjørling [Thu, 10 Nov 2016 11:26:57 +0000 (12:26 +0100)]
lightnvm: invalid offset calculation for lba_shift
The ns->lba_shift assumes its value to be the logarithmic of the
LA size. A previous patch duplicated the lba_shift calculation into
lightnvm. It prematurely also subtracted a 512byte shift, which commonly
is applied per-command. The 512byte shift being subtracted twice led to
data loss when restoring the logical to physical mapping table from
device and when issuing I/O commands using rrpc.
Fix offset by removing the 512byte shift subtraction when calculating
lba_shift.
Fixes: b0b4e09c1ae7 "lightnvm: control life of nvm_dev in driver" Reported-by: Javier González <javier@cnexlabs.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>