KVM: halt_polling: provide a way to qualify wakeups during poll
Some wakeups should not be considered a sucessful poll. For example on
s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
would be considered runnable - letting all vCPUs poll all the time for
transactional like workload, even if one vCPU would be enough.
This can result in huge CPU usage for large guests.
This patch lets architectures provide a way to qualify wakeups if they
should be considered a good/bad wakeups in regard to polls.
For s390 the implementation will fence of halt polling for anything but
known good, single vCPU events. The s390 implementation for floating
interrupts does a wakeup for one vCPU, but the interrupt will be delivered
by whatever CPU checks first for a pending interrupt. We prefer the
woken up CPU by marking the poll of this CPU as "good" poll.
This code will also mark several other wakeup reasons like IPI or
expired timers as "good". This will of course also mark some events as
not sucessful. As KVM on z runs always as a 2nd level hypervisor,
we prefer to not poll, unless we are really sure, though.
This patch successfully limits the CPU usage for cases like uperf 1byte
transactional ping pong workload or wakeup heavy workload like OLTP
while still providing a proper speedup.
This also introduced a new vcpu stat "halt_poll_no_tuning" that marks
wakeups that are considered not good for polling.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version) Cc: David Matlack <dmatlack@google.com> Cc: Wanpeng Li <kernellwp@gmail.com>
[Rename config symbol. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paul Mackerras [Wed, 4 May 2016 11:07:52 +0000 (21:07 +1000)]
KVM: PPC: Book3S HV: Re-enable XICS fast path for irqfd-generated interrupts
Commit c9a5eccac1ab ("kvm/eventfd: add arch-specific set_irq",
2015-10-16) added the possibility for architecture-specific code
to handle the generation of virtual interrupts in atomic context
where possible, without having to schedule a work function.
Since we can easily generate virtual interrupts on XICS without
having to do anything worse than take a spinlock, we define a
kvm_arch_set_irq_inatomic() for XICS. We also remove kvm_set_msi()
since it is not used any more.
The one slightly tricky thing is that with the new interface, we
don't get told whether the interrupt is an MSI (or other edge
sensitive interrupt) vs. level-sensitive. The difference as far
as interrupt generation is concerned is that for LSIs we have to
set the asserted flag so it will continue to fire until it is
explicitly cleared.
In fact the XICS code gets told which interrupts are LSIs by userspace
when it configures the interrupt via the KVM_DEV_XICS_GRP_SOURCES
attribute group on the XICS device. To store this information, we add
a new "lsi" field to struct ics_irq_state. With that we can also do a
better job of returning accurate values when reading the attribute
group.
Alex Williamson [Thu, 5 May 2016 17:58:35 +0000 (11:58 -0600)]
kvm: Conditionally register IRQ bypass consumer
If we don't support a mechanism for bypassing IRQs, don't register as
a consumer. This eliminates meaningless dev_info()s when the connect
fails between producer and consumer, such as on AMD systems where
kvm_x86_ops->update_pi_irte is not implemented
Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Alex Williamson [Thu, 5 May 2016 17:58:29 +0000 (11:58 -0600)]
irqbypass: Disallow NULL token
A NULL token is meaningless and can only lead to unintended problems.
Error on registration with a NULL token, ignore de-registrations with
a NULL token.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Greg Kurz [Mon, 9 May 2016 16:13:37 +0000 (18:13 +0200)]
kvm: introduce KVM_MAX_VCPU_ID
The KVM_MAX_VCPUS define provides the maximum number of vCPUs per guest, and
also the upper limit for vCPU ids. This is okay for all archs except PowerPC
which can have higher ids, depending on the cpu/core/thread topology. In the
worst case (single threaded guest, host with 8 threads per core), it limits
the maximum number of vCPUS to KVM_MAX_VCPUS / 8.
This patch separates the vCPU numbering from the total number of vCPUs, with
the introduction of KVM_MAX_VCPU_ID, as the maximal valid value for vCPU ids
plus one.
The corresponding KVM_CAP_MAX_VCPU_ID allows userspace to validate vCPU ids
before passing them to KVM_CREATE_VCPU.
This patch only implements KVM_MAX_VCPU_ID with a specific value for PowerPC.
Other archs continue to return KVM_MAX_VCPUS instead.
Suggested-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Greg Kurz [Mon, 9 May 2016 16:11:54 +0000 (18:11 +0200)]
KVM: remove NULL return path for vcpu ids >= KVM_MAX_VCPUS
Commit c896939f7cff ("KVM: use heuristic for fast VCPU lookup by id") added
a return path that prevents vcpu ids to exceed KVM_MAX_VCPUS. This is a
problem for powerpc where vcpu ids can grow up to 8*KVM_MAX_VCPUS.
This patch simply reverses the logic so that we only try fast path if the
vcpu id can be tried as an index in kvm->vcpus[]. The slow path is not
affected by the change.
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 11 May 2016 20:37:37 +0000 (22:37 +0200)]
Merge tag 'kvm-arm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/ARM Changes for Linux v4.7
Reworks our stage 2 page table handling to have page table manipulation
macros separate from those of the host systems as the underlying
hardware page tables can be configured to be noticably different in
layout from the stage 1 page tables used by the host.
Adds 16K page size support based on the above.
Adds a generic firmware probing layer for the timer and GIC so that KVM
initializes using the same logic based on both ACPI and FDT.
Finally adds support for hardware updating of the access flag.
Gavin Shan [Wed, 11 May 2016 01:15:55 +0000 (11:15 +1000)]
KVM: PPC: Book3S HV: Fix build error in book3s_hv.c
When CONFIG_KVM_XICS is enabled, CPU_UP_PREPARE and other macros for
CPU states in linux/cpu.h are needed by arch/powerpc/kvm/book3s_hv.c.
Otherwise, build error as below is seen:
gwshan@gwshan:~/sandbox/l$ make arch/powerpc/kvm/book3s_hv.o
:
CC arch/powerpc/kvm/book3s_hv.o
arch/powerpc/kvm/book3s_hv.c: In function ‘kvmppc_cpu_notify’:
arch/powerpc/kvm/book3s_hv.c:3072:7: error: ‘CPU_UP_PREPARE’ \
undeclared (first use in this function)
This fixes the issue introduced by commit <6f3bb80944> ("KVM: PPC:
Book3S HV: kvmppc_host_rm_ops - handle offlining CPUs").
Paul Mackerras [Thu, 5 May 2016 06:17:10 +0000 (16:17 +1000)]
KVM: PPC: Fix emulated MMIO sign-extension
When the guest does a sign-extending load instruction (such as lha
or lwa) to an emulated MMIO location, it results in a call to
kvmppc_handle_loads() in the host. That function sets the
vcpu->arch.mmio_sign_extend flag and calls kvmppc_handle_load()
to do the rest of the work. However, kvmppc_handle_load() sets
the mmio_sign_extend flag to 0 unconditionally, so the sign
extension never gets done.
To fix this, we rename kvmppc_handle_load to __kvmppc_handle_load
and add an explicit parameter to indicate whether sign extension
is required. kvmppc_handle_load() and kvmppc_handle_loads() then
become 1-line functions that just call __kvmppc_handle_load()
with the extra parameter.
Reported-by: Bin Lu <lblulb@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
Until now, when we connect gdb to the QEMU gdb-server, the
single-step mode is not managed.
This patch adds this, only for kvm-pr:
If KVM_GUESTDBG_SINGLESTEP is set, we enable single-step trace bit in the
MSR (MSR_SE) just before the __kvmppc_vcpu_run(), and disable it just after.
In kvmppc_handle_exit_pr, instead of routing the interrupt to
the guest, we return to host, with KVM_EXIT_DEBUG reason.
Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Reviewed-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
Paolo Bonzini [Tue, 10 May 2016 14:37:38 +0000 (16:37 +0200)]
Merge tag 'kvm-s390-next-4.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: features and fixes for 4.7 part2
- Use hardware provided information about facility bits that do not
need any hypervisor activitiy
- Add missing documentation for KVM_CAP_S390_RI
- Some updates/fixes for handling cpu models and facilities
Reading the KVM_CAP_MIPS_FPU capability returns cpu_has_fpu, however
this uses smp_processor_id() to read the current CPU capabilities (since
some old MIPS systems could have FPUs present on only a subset of CPUs).
We don't support any such systems, so work around the warning by using
raw_cpu_has_fpu instead.
We should probably instead claim not to support FPU at all if any one
CPU is lacking an FPU, but this should do for now.
Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim KrÄ\8dmář" <rkrcmar@redhat.com> Cc: linux-mips@linux-mips.org Cc: kvm@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are a couple of places in KVM fault handling code which implicitly
use smp_processor_id() via kvm_mips_get_kernel_asid() and
kvm_mips_get_user_asid() from preemptable context. This is unsafe as a
preemption could cause the guest kernel ASID to be changed, resulting in
a host TLB entry being written with the wrong ASID.
Fix by disabling preemption around the kvm_mips_get_*_asid() call and
the corresponding kvm_mips_host_tlb_write().
Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim KrÄ\8dmář" <rkrcmar@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-mips@linux-mips.org Cc: kvm@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
James Hogan [Fri, 22 Apr 2016 09:38:46 +0000 (10:38 +0100)]
MIPS: KVM: Fix timer IRQ race when writing CP0_Compare
Writing CP0_Compare clears the timer interrupt pending bit
(CP0_Cause.TI), but this wasn't being done atomically. If a timer
interrupt raced with the write of the guest CP0_Compare, the timer
interrupt could end up being pending even though the new CP0_Compare is
nowhere near CP0_Count.
We were already updating the hrtimer expiry with
kvm_mips_update_hrtimer(), which used both kvm_mips_freeze_hrtimer() and
kvm_mips_resume_hrtimer(). Close the race window by expanding out
kvm_mips_update_hrtimer(), and clearing CP0_Cause.TI and setting
CP0_Compare between the freeze and resume. Since the pending timer
interrupt should not be cleared when CP0_Compare is written via the KVM
user API, an ack argument is added to distinguish the source of the
write.
James Hogan [Fri, 22 Apr 2016 09:38:45 +0000 (10:38 +0100)]
MIPS: KVM: Fix timer IRQ race when freezing timer
There's a particularly narrow and subtle race condition when the
software emulated guest timer is frozen which can allow a guest timer
interrupt to be missed.
This happens due to the hrtimer expiry being inexact, so very
occasionally the freeze time will be after the moment when the emulated
CP0_Count transitions to the same value as CP0_Compare (so an IRQ should
be generated), but before the moment when the hrtimer is due to expire
(so no IRQ is generated). The IRQ won't be generated when the timer is
resumed either, since the resume CP0_Count will already match CP0_Compare.
With VZ guests in particular this is far more likely to happen, since
the soft timer may be frozen frequently in order to restore the timer
state to the hardware guest timer. This happens after 5-10 hours of
guest soak testing, resulting in an overflow in guest kernel timekeeping
calculations, hanging the guest. A more focussed test case to
intentionally hit the race (with the help of a new hypcall to cause the
timer state to migrated between hardware & software) hits the condition
fairly reliably within around 30 seconds.
Instead of relying purely on the inexact hrtimer expiry to determine
whether an IRQ should be generated, read the guest CP0_Compare and
directly check whether the freeze time is before or after it. Only if
CP0_Count is on or after CP0_Compare do we check the hrtimer expiry to
determine whether the last IRQ has already been generated (which will
have pushed back the expiry by one timer period).
kvm: arm64: Enable hardware updates of the Access Flag for Stage 2 page tables
The ARMv8.1 architecture extensions introduce support for hardware
updates of the access and dirty information in page table entries. With
VTCR_EL2.HA enabled (bit 21), when the CPU accesses an IPA with the
PTE_AF bit cleared in the stage 2 page table, instead of raising an
Access Flag fault to EL2 the CPU sets the actual page table entry bit
(10). To ensure that kernel modifications to the page table do not
inadvertently revert a bit set by hardware updates, certain Stage 2
software pte/pmd operations must be performed atomically.
The main user of the AF bit is the kvm_age_hva() mechanism. The
kvm_age_hva_handler() function performs a "test and clear young" action
on the pte/pmd. This needs to be atomic in respect of automatic hardware
updates of the AF bit. Since the AF bit is in the same position for both
Stage 1 and Stage 2, the patch reuses the existing
ptep_test_and_clear_young() functionality if
__HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG is defined. Otherwise, the
existing pte_young/pte_mkold mechanism is preserved.
The kvm_set_s2pte_readonly() (and the corresponding pmd equivalent) have
to perform atomic modifications in order to avoid a race with updates of
the AF bit. The arm64 implementation has been re-written using
exclusives.
Currently, kvm_set_s2pte_writable() (and pmd equivalent) take a pointer
argument and modify the pte/pmd in place. However, these functions are
only used on local variables rather than actual page table entries, so
it makes more sense to follow the pte_mkwrite() approach for stage 1
attributes. The change to kvm_s2pte_mkwrite() makes it clear that these
functions do not modify the actual page table entries.
The (pte|pmd)_mkyoung() uses on Stage 2 entries (setting the AF bit
explicitly) do not need to be modified since hardware updates of the
dirty status are not supported by KVM, so there is no possibility of
losing such information.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
KVM: s390: Populate mask of non-hypervisor managed facility bits
When a guest is initializing, KVM provides facility bits that can be
successfully used by the guest. It's done by applying
kvm_s390_fac_list_mask mask on host facility bits stored by the STFLE
instruction. Facility bits can be one of two kinds: it's either a
hypervisor managed bit or non-hypervisor managed.
The hardware provides information which bits need special handling.
Let's automatically passthrough to guests new facility bits, that
don't require hypervisor support.
Signed-off-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com> Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Eric Farman <farman@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Let's add hypervisor-managed facility-apportionment indications field to
SCLP structs. KVM will use it to reduce maintenance cost of
Non-Hypervisor-Managed facility bits.
Signed-off-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com> Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reviewed-by: Eric Farman <farman@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: Enable all facility bits that are known good for passthrough
Some facility bits are in a range that is defined to be "ok for guests
without any necessary hypervisor changes". Enable those bits.
Signed-off-by: Alexander Yarygin <yarygin@linux.vnet.ibm.com> Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
We forgot to document that capability, let's add documentation.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Some hardware variants will round the ibc value up/down themselves,
others will report a validity intercept. Let's always round it up/down.
This patch will also make sure that the ibc is set to 0 in case we don't
have ibc support (lowest_ibc == 0).
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
We only have one cpuid for all VCPUs, so let's directly use the one in the
cpu model. Also always store it directly as u64, no need for struct cpuid.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: enable SRS only if enabled for the guest
If we don't have SIGP SENSE RUNNING STATUS enabled for the guest, let's
not enable interpretation so we can correctly report an invalid order.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
KVM: s390: support NQ only if the facility is enabled for the guest
While we can not fully fence of the Nonquiescing Key-Setting facility,
we should as try our best to hide it.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
We should never inject an exception after we manually rewound the PSW
(to retry the ESSA instruction in this case). This will mess up the PSW.
So this never worked and therefore never really triggered.
Looking at the details, we don't even have to perform any validity checks.
1. Bits 52-63 of an entry are stored as 0 by the hardware.
2. We are dealing with absolute addresses but only check for the prefix
starting at address 0. This isn't correct and doesn't make much sense,
cpus could still zap the prefix of other cpus. But as prefix pages
cannot be swapped out without a notifier being called for the affected
VCPU, a zap can never remove a protected prefix.
Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com> Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Wanpeng Li [Tue, 3 May 2016 03:43:10 +0000 (11:43 +0800)]
kvm: robustify steal time record
Guest should only trust data to be valid when version haven't changed
before and after reads of steal time. Besides not changing, it has to
be an even number. Hypervisor may write an odd number to version field
to indicate that an update is in progress.
kvm_steal_clock() in guest has already done the read side, make write
side in hypervisor more robust by following the above rule.
Reviewed-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
KVM: arm/arm64: vgic: Rely on the GIC driver to parse the firmware tables
Currently, the firmware tables are parsed 2 times: once in the GIC
drivers, the other time when initializing the vGIC. It means code
duplication and make more tedious to add the support for another
firmware table (like ACPI).
Use the recently introduced helper gic_get_kvm_info() to get
information about the virtual GIC.
With this change, the virtual GIC becomes agnostic to the firmware
table and KVM will be able to initialize the vGIC on ACPI.
Signed-off-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
KVM: arm/arm64: arch_timer: Rely on the arch timer to parse the firmware tables
The firmware table is currently parsed by the virtual timer code in
order to retrieve the virtual timer interrupt. However, this is already
done by the arch timer driver.
To avoid code duplication, use the newly function arch_timer_get_kvm_info()
which return all the information required by the virtual timer code.
Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
irqchip/gic-v3: Parse and export virtual GIC information
Fill up the recently introduced gic_kvm_info with the hardware
information used for virtualization.
Signed-off-by: Julien Grall <julien.grall@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jason Cooper <jason@lakedaemon.net> Cc: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
irqchip/gic-v3: Gather all ACPI specific data in a single structure
The ACPI code requires to use global variables in order to collect
information from the tables.
To make clear those variables are ACPI specific, gather all of them in a
single structure.
Furthermore, even if some of the variables are not marked with
__initdata, they are all only used during the initialization. Therefore,
the new variable, which hold the structure, can be marked with
__initdata.
Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
irqchip/gic-v2: Parse and export virtual GIC information
For now, the firmware tables are parsed 2 times: once in the GIC
drivers, the other timer when initializing the vGIC. It means code
duplication and make more tedious to add the support for another
firmware table (like ACPI).
Introduce a new structure and set of helpers to get/set the virtual GIC
information. Also fill up the structure for GICv2.
Signed-off-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
irqchip/gic-v2: Gather ACPI specific data in a single structure
The ACPI code requires to use global variables in order to collect
information from the tables.
For now, a single global variable is used, but more will be added in a
subsequent patch. To make clear they are ACPI specific, gather all the
information in a single structure.
Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Christofer Dall <christoffer.dall@linaro.org> Acked-by: Hanjun Guo <hanjun.guo@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
clocksource: arm_arch_timer: Extend arch_timer_kvm_info to get the virtual IRQ
Currently, the firmware table is parsed by the virtual timer code in
order to retrieve the virtual timer interrupt. However, this is already
done by the arch timer driver.
To avoid code duplication, extend arch_timer_kvm_info to get the virtual
IRQ.
Note that the KVM code will be modified in a subsequent patch.
Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Bruce Rogers [Thu, 28 Apr 2016 20:49:21 +0000 (14:49 -0600)]
KVM: x86: fix ordering of cr0 initialization code in vmx_cpu_reset
Commit d28bc9dd25ce reversed the order of two lines which initialize cr0,
allowing the current (old) cr0 value to mess up vcpu initialization.
This was observed in the checks for cr0 X86_CR0_WP bit in the context of
kvm_mmu_reset_context(). Besides, setting vcpu->arch.cr0 after vmx_set_cr0()
is completely redundant. Change the order back to ensure proper vcpu
initialization.
The combination of booting with ovmf firmware when guest vcpus > 1 and kvm's
ept=N option being set results in a VM-entry failure. This patch fixes that.
Fixes: d28bc9dd25ce ("KVM: x86: INIT and reset sequences are different") Cc: stable@vger.kernel.org Signed-off-by: Bruce Rogers <brogers@suse.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Marc Zyngier [Thu, 28 Apr 2016 15:16:31 +0000 (16:16 +0100)]
arm/arm64: KVM: Enforce Break-Before-Make on Stage-2 page tables
The ARM architecture mandates that when changing a page table entry
from a valid entry to another valid entry, an invalid entry is first
written, TLB invalidated, and only then the new entry being written.
The current code doesn't respect this, directly writing the new
entry and only then invalidating TLBs. Let's fix it up.
Cc: <stable@vger.kernel.org> Reported-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Suzuki K Poulose [Thu, 17 Mar 2016 14:29:24 +0000 (14:29 +0000)]
arm64: kvm: Add support for 16K pages
Now that we can handle stage-2 page tables independent
of the host page table levels, wire up the 16K page
support.
Cc: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Suzuki K Poulose [Tue, 22 Mar 2016 17:01:21 +0000 (17:01 +0000)]
kvm-arm: Cleanup stage2 pgd handling
Now that we don't have any fake page table levels for arm64,
cleanup the common code to get rid of the dead code.
Cc: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Suzuki K Poulose [Wed, 23 Mar 2016 12:22:33 +0000 (12:22 +0000)]
kvm: arm64: Get rid of fake page table levels
On arm64, the hardware supports concatenation of upto 16 tables,
at entry level for stage2 translations and we make use that whenever
possible. This could lead to reduced number of translation levels than
the normal (stage1 table) table. Also, since the IPA(40bit) is smaller
than the some of the supported VA_BITS (e.g, 48bit), there could be
different number of levels in stage-1 vs stage-2 tables. To reuse the
kernel host page table walker for stage2 we have been using a fake
software page table level, not known to the hardware. But with 16K
translations, there could be upto 2 fake software levels (with 48bit VA
and 40bit IPA), which complicates the code. Hence, we want to get rid of
the hack.
Now that we have explicit accessors for hyp vs stage2 page tables,
define the stage2 walker helpers accordingly based on the actual
table used by the hardware.
Once we know the number of translation levels used by the hardware,
it is merely a job of defining the helpers based on whether a
particular level is folded or not, looking at the number of levels.
Some facts before we calculate the translation levels:
1) Smallest page size supported by arm64 is 4K.
2) The minimum number of bits resolved at any page table level
is (PAGE_SHIFT - 3) at intermediate levels.
Both of them implies, minimum number of bits required for a level
change is 9.
Since we can concatenate upto 16 tables at stage2 entry, the total
number of page table levels used by the hardware for resolving N bits
is same as that for (N - 4) bits (with concatenation), as there cannot
be a level in between (N, N-4) as per the above rules.
With the current IPA limit (40bit), for all supported translations
and VA_BITS, we have the following condition (even for 36bit VA with
16K page size):
CONFIG_PGTABLE_LEVELS >= STAGE2_PGTABLE_LEVELS.
So, for e.g, if PUD is present in stage2, it is present in the hyp(host).
Hence, we fall back to the host definition if we find that a level is not
folded. Otherwise we redefine it accordingly. A build time check is added
to make sure the above condition holds. If this condition breaks in future,
we can rearrange the host level helpers and fix our code easily.
Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoffer Dall <christoffer.dall@linaro.org> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Suzuki K Poulose [Tue, 22 Mar 2016 17:14:25 +0000 (17:14 +0000)]
kvm-arm: Cleanup kvm_* wrappers
Now that we have switched to explicit page table routines,
get rid of the obsolete kvm_* wrappers.
Also, kvm_tlb_flush_vmid_by_ipa is now called only on stage2
page tables, hence get rid of the redundant check.
Cc: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Suzuki K Poulose [Wed, 23 Mar 2016 12:08:02 +0000 (12:08 +0000)]
kvm-arm: Add stage2 page table modifiers
Now that the hyp page table is handled by different set of
routines, rename the original shared routines to stage2 handlers.
Also make explicit use of the stage2 page table helpers.
unmap_range has been merged to existing unmap_stage2_range.
Cc: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Suzuki K Poulose [Tue, 22 Mar 2016 18:56:21 +0000 (18:56 +0000)]
kvm-arm: Add explicit hyp page table modifiers
We have common routines to modify hyp and stage2 page tables
based on the 'kvm' parameter. For a smoother transition to
using separate routines for each, duplicate the routines
and modify the copy to work on hyp.
Marks the forked routines with _hyp_ and gets rid of the
kvm parameter which is no longer needed and is NULL for hyp.
Also, gets rid of calls to kvm_tlb_flush_by_vmid_ipa() calls
from the hyp versions. Uses explicit host page table accessors
instead of the kvm_* page table helpers.
Suggested-by: Christoffer Dall <christoffer.dall@linaro.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Suzuki K Poulose [Tue, 22 Mar 2016 18:33:45 +0000 (18:33 +0000)]
kvm-arm: Use explicit stage2 helper routines
We have stage2 page table helpers for both arm and arm64. Switch to
the stage2 helpers for routines that only deal with stage2 page table.
Cc: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Introduce hyp_pxx_table_empty helpers for checking whether
a given table entry is empty. This will be used explicitly
once we switch to explicit routines for hyp page table walk.
Acked-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Introduce stage2 page table helpers for arm64. With the fake
page table level still in place, the stage2 table has the same
number of levels as that of the host (and hyp), so they all
fallback to the host version.
Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Introduce hyp_pxx_table_empty helpers for checking whether
a given table entry is empty. This will be used explicitly
once we switch to explicit routines for hyp page table walk.
Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Define the page table helpers for walking the stage2 pagetable
for arm. Since both hyp and stage2 have the same number of levels,
as that of the host we reuse the host helpers.
The exceptions are the p.d_addr_end routines which have to deal
with IPA > 32bit, hence we use the open coded version of their host helpers
which supports 64bit.
Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Suzuki K Poulose [Tue, 22 Mar 2016 14:06:47 +0000 (14:06 +0000)]
kvm-arm: Remove kvm_pud_huge()
Get rid of kvm_pud_huge() which falls back to pud_huge. Use
pud_huge instead.
Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
kvm-arm: Replace kvm_pmd_huge with pmd_thp_or_huge
Both arm and arm64 now provides a helper, pmd_thp_or_huge()
to check if the given pmd represents a huge page. Use that
instead of our own custom check.
Suggested-by: Mark Rutland <mark.rutland@arm.com> Cc: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Suzuki K Poulose [Tue, 15 Mar 2016 10:46:34 +0000 (10:46 +0000)]
arm64: Introduce pmd_thp_or_huge
Add a helper to determine if a given pmd represents a huge page
either by hugetlb or thp, as we have for arm. This will be used
by KVM MMU code.
Suggested-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Will Deacon <will.deacon@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
kvm arm: Move fake PGD handling to arch specific files
Rearrange the code for fake pgd handling, which is applicable
only for arm64. This will later be removed once we introduce
the stage2 page table walker macros.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
We share most of the bits for VTCR_EL2 for different page sizes,
except for the TG0 value and the entry level value. This patch
makes the definitions a bit more cleaner to reflect this fact.
Also cleans up the VTTBR_X calculation. No functional changes.
Cc: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
arm64: Reuse TCR field definitions for EL1 and EL2
TCR_EL1, TCR_EL2 and VTCR_EL2, all share some field positions
(TG0, ORGN0, IRGN0 and SH0) and their corresponding value definitions.
This patch makes the TCR_EL1 definitions reusable and uses them for TCR_EL2
and VTCR_EL2 fields.
This also fixes a bug where we assume TG0 in {V}TCR_EL2 is 1bit field.
Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Acked-by: Will Deacon <will.deacon@arm.com> Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Paolo Bonzini [Thu, 10 Mar 2016 15:30:22 +0000 (16:30 +0100)]
KVM: add missing memory barrier in kvm_{make,check}_request
kvm_make_request and kvm_check_request imply a producer-consumer
relationship; add implicit memory barriers to them. There was indeed
already a place that was adding an explicit smp_mb() to order between
kvm_check_request and the processing of the request. That memory
barrier can be removed (as an added benefit, kvm_check_request can use
smp_mb__after_atomic which is free on x86).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Fri, 25 Mar 2016 13:19:38 +0000 (21:19 +0800)]
KVM: MMU: skip obsolete sp in for_each_gfn_*()
The obsolete sp should not be used on current vCPUs and should not hurt
vCPU's running, so skip it from for_each_gfn_sp() and
for_each_gfn_indirect_valid_sp()
The side effort is we will double check role.invalid in kvm_mmu_get_page()
but i think it is okay as role is well cached
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Liang Chen [Wed, 16 Mar 2016 11:33:16 +0000 (19:33 +0800)]
KVM: x86: optimize steal time calculation
Since accumulate_steal_time is now only called in record_steal_time, it
doesn't quite make sense to put the delta calculation in a separate
function. The function could be called thousands of times before guest
enables the steal time MSR (though the compiler may optimize out this
function call). And after it's enabled, the MSR enable bit is tested twice
every time. Removing the accumulate_steal_time function also avoids the
necessity of having the accum_steal field.
Halil Pasic [Mon, 25 Jan 2016 18:10:40 +0000 (19:10 +0100)]
KVM: s390: add clear I/O irq operation for FLIC
Introduce a FLIC operation for clearing I/O interrupts for a subchannel.
Rationale: According to the platform specification, pending I/O
interruption requests have to be revoked in certain situations. For
instance, according to the Principles of Operation (page 17-27), a
subchannel put into the installed parameters initialized state is in the
same state as after an I/O system reset (just parameters possibly changed).
This implies that any I/O interrupts for that subchannel are no longer
pending (as I/O system resets clear I/O interrupts). Therefore, we need an
interface to clear pending I/O interrupts.
Signed-off-by: Halil Pasic <pasic@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Halil Pasic [Thu, 25 Feb 2016 12:26:32 +0000 (13:26 +0100)]
KVM: s390: document FLIC behavior on unsupported
FLIC behavior deviates from the API documentation in reporting EINVAL
instead of ENXIO for KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR when the group
or attribute is unknown/unsupported. Unfortunately this can not be fixed
for historical reasons. Let us at least have it documented.
Signed-off-by: Halil Pasic <pasic@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
"ARM fixes:
- Wrong indentation in the PMU code from the merge window
- A long-time bug occuring with running ntpd on the host, candidate
for stable
- Properly handle (and warn about) the unsupported configuration of
running on systems with less than 40 bits of PA space
- More fixes to the PM and hotplug notifier stuff from the merge
window
x86:
- leak of guest xcr0 (typically shows up as SIGILL)
- new maintainer (who is sending the pull request too)
- fix for merge window regression
- fix for guest CPUID"
Paolo Bonzini points out:
"For the record, this tag is signed by me because I prepared the pull
request. Further pull requests for 4.6 will be signed and sent out by
Radim directly"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: mask CPUID(0xD,0x1).EAX against host value
kvm: x86: do not leak guest xcr0 into host interrupt handlers
KVM: MMU: fix permission_fault()
KVM: new maintainer on the block
arm64: KVM: unregister notifiers in hyp mode teardown path
arm64: KVM: Warn when PARange is less than 40 bits
KVM: arm/arm64: Handle forward time correction gracefully
arm64: KVM: Add braces to multi-line if statement in virtual PMU code
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
Pull HID fixes from Jiri Kosina:
- fix for how scaling linearization is computed in wiimote driver, by
Cyan Ogilvie
- endless retry loop fix in generic USB HID core reset-resume handling,
by Alan Stern
- two functional fixes affecting particular devices, and oops fix for
wacom driver, by Jason Gerecke
- multitouch slot numbering fix from Gabriele Mazzotta
- a couple more small fixes on top
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: wacom: Support switching from vendor-defined device mode on G9 and G11
HID: wacom: Initialize hid_data.inputmode to -1
HID: microsoft: add support for 3 more devices
HID: multitouch: Synchronize MT frame on reset_resume
HID: wacom: fix Bamboo ONE oops
HID: lenovo: Don't use stack variables for DMA buffers
HID: usbhid: fix inconsistent reset/resume/reset-resume behavior
HID: wiimote: Fix wiimote mp scale linearization
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k
Pull m68k update from Geert Uytterhoeven.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
m68k/defconfig: Update defconfigs for v4.6-rc2
m68k: Wire up preadv2 and pwritev2
Merge tag 'arc-4.6-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Pull ARC fixes from Vineet Gupta:
- fix Kconfig splat due to pcie rework
- make ethernet work again on axs103
- provide fb_pgprotect() for future video driver integration
* tag 'arc-4.6-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: [plat-axs103] Enable loop block devices
Revert "ARC: [plat-axs10x] add Ethernet PHY description in .dts"
arc: Add our own implementation of fb_pgprotect()
ARC: Don't source drivers/pci/pcie/Kconfig ourselves
Merge tag 'mmc-v4.6-rc1' of git://git.linaro.org/people/ulf.hansson/mmc
Pull MMC fixes from Ulf Hansson:
"Here are a couple of mmc fixes intended for v4.6 rc3:
MMC host:
- sdhci: Fix regression setting power on Trats2 board
- sdhci-pci: Add support and PCI IDs for more Broxton host controllers"
* tag 'mmc-v4.6-rc1' of git://git.linaro.org/people/ulf.hansson/mmc:
mmc: sdhci-pci: Add support and PCI IDs for more Broxton host controllers
mmc: sdhci: Fix regression setting power on Trats2 board
Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Some bugfixes from I2C:
- fix a uevent triggered boot problem by removing a useless debug
print
- fix sysfs-attributes of the new i2c-demux-pinctrl driver to follow
standard kernel behaviour
- fix a potential division-by-zero error (needed two takes)"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: jz4780: really prevent potential division by zero
Revert "i2c: jz4780: prevent potential division by zero"
i2c: jz4780: prevent potential division by zero
i2c: mux: demux-pinctrl: Update docs to new sysfs-attributes
i2c: mux: demux-pinctrl: Clean up sysfs attributes
i2c: prevent endless uevent loop with CONFIG_I2C_DEBUG_CORE
It's broken: it makes ext4 return an error at an invalid point, causing
the readdir wrappers to write the the position of the last successful
directory entry into the position field, which means that the next
readdir will now return that last successful entry _again_.
You can only return fatal errors (that terminate the readdir directory
walk) from within the filesystem readdir functions, the "normal" errors
(that happen when the readdir buffer fills up, for example) happen in
the iterorator where we know the position of the actual failing entry.
I do have a very different patch that does the "signal_pending()"
handling inside the iterator function where it is allowable, but while
that one passes all the sanity checks, I screwed up something like four
times while emailing it out, so I'm not going to commit it today.
So my track record is not good enough, and the stars will have to align
better before that one gets committed. And it would be good to get some
review too, of course, since celestial alignments are always an iffy
debugging model.
IOW, let's just revert the commit that caused the problem for now.
David Matlack [Wed, 30 Mar 2016 19:24:47 +0000 (12:24 -0700)]
kvm: x86: do not leak guest xcr0 into host interrupt handlers
An interrupt handler that uses the fpu can kill a KVM VM, if it runs
under the following conditions:
- the guest's xcr0 register is loaded on the cpu
- the guest's fpu context is not loaded
- the host is using eagerfpu
Note that the guest's xcr0 register and fpu context are not loaded as
part of the atomic world switch into "guest mode". They are loaded by
KVM while the cpu is still in "host mode".
Usage of the fpu in interrupt context is gated by irq_fpu_usable(). The
interrupt handler will look something like this:
if (irq_fpu_usable()) {
kernel_fpu_begin();
[... code that uses the fpu ...]
kernel_fpu_end();
}
As long as the guest's fpu is not loaded and the host is using eager
fpu, irq_fpu_usable() returns true (interrupted_kernel_fpu_idle()
returns true). The interrupt handler proceeds to use the fpu with
the guest's xcr0 live.
kernel_fpu_begin() saves the current fpu context. If this uses
XSAVE[OPT], it may leave the xsave area in an undesirable state.
According to the SDM, during XSAVE bit i of XSTATE_BV is not modified
if bit i is 0 in xcr0. So it's possible that XSTATE_BV[i] == 1 and
xcr0[i] == 0 following an XSAVE.
kernel_fpu_end() restores the fpu context. Now if any bit i in
XSTATE_BV == 1 while xcr0[i] == 0, XRSTOR generates a #GP. The
fault is trapped and SIGSEGV is delivered to the current process.
Only pre-4.2 kernels appear to be vulnerable to this sequence of
events. Commit 653f52c ("kvm,x86: load guest FPU context more eagerly")
from 4.2 forces the guest's fpu to always be loaded on eagerfpu hosts.
This patch fixes the bug by keeping the host's xcr0 loaded outside
of the interrupts-disabled region where KVM switches into guest mode.
Cc: stable@vger.kernel.org Suggested-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: David Matlack <dmatlack@google.com>
[Move load after goto cancel_injection. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Xiao Guangrong [Fri, 25 Mar 2016 13:19:35 +0000 (21:19 +0800)]
KVM: MMU: fix permission_fault()
kvm-unit-tests complained about the PFEC is not set properly, e.g,:
test pte.rw pte.d pte.nx pde.p pde.rw pde.pse user fetch: FAIL: error code 15
expected 5
Dump mapping: address: 0x123400000000
------L4: 3e95007
------L3: 3e96007
------L2: 2000083
It's caused by the reason that PFEC returned to guest is copied from the
PFEC triggered by shadow page table
This patch fixes it and makes the logic of updating errcode more clean
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
[Do not assume pfec.p=1. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge branch 'parisc-4.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
"Since commit 0de798584bde ("parisc: Use generic extable search and
sort routines") module loading is boken on parisc, because the parisc
module loader wasn't prepared for the new R_PARISC_PCREL32 relocations.
In addition, due to that breakage, Mikulas Patocka noticed that
handling exceptions from modules probably never worked on parisc. It
was just masked by the fact that exceptions from modules don't happen
during normal use.
This patch series fixes those issues and survives the tests of the
lib/test_user_copy kernel module test. Some patches are tagged for
stable"
* 'parisc-4.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Update comment regarding relative extable support
parisc: Unbreak handling exceptions from kernel modules
parisc: Fix kernel crash with reversed copy_from_user()
parisc: Avoid function pointers for kernel exception routines
parisc: Handle R_PARISC_PCREL32 relocations in kernel modules
Merge tag 'gpio-v4.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
"Here is a set of four GPIO fixes. The two fixes to the core are
serious as they are regressing minor architectures.
Core fixes:
- Defer GPIO device setup until after gpiolib is initialized.
It turns out that a few very tightly integrated GPIO platform
drivers initialize so early (befor core_initcall()) so that the
gpiolib isn't even initialized itself. That limits what the
library can do, and we cannot reference uninitialized fields until
later.
Defer some of the initialization until right after the gpiolib is
initialized in these (rare) cases.
- As a consequence: do not use devm_* resources when allocating the
states in the initial set-up of the gpiochip.
Driver fixes:
- In ACPI retrieveal: ignore GpioInt when looking for output GPIOs.
- Fix legacy builds on the PXA without a backing pin controller.
- Use correct datatype on pca953x register writes"
* tag 'gpio-v4.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio: pca953x: Use correct u16 value for register word write
gpiolib: Defer gpio device setup until after gpiolib initialization
gpiolib: Do not use devm functions when registering gpio chip
gpio: pxa: fix legacy non pinctrl aware builds
gpio / ACPI: ignore GpioInt() GPIOs when requesting GPIO_OUT_*
Merge tag 'usb-4.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some USB fixes and new device ids for 4.6-rc3.
Nothing major, the normal USB gadget fixes and usb-serial driver ids,
along with some other fixes mixed in. All except the USB serial ids
have been tested in linux-next, the id additions should be fine as
they are 'trivial'"
* tag 'usb-4.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (25 commits)
USB: option: add "D-Link DWM-221 B1" device id
USB: serial: cp210x: Adding GE Healthcare Device ID
USB: serial: ftdi_sio: Add support for ICP DAS I-756xU devices
usb: dwc3: keystone: drop dma_mask configuration
usb: gadget: udc-core: remove manual dma configuration
usb: dwc3: pci: add ID for one more Intel Broxton platform
usb: renesas_usbhs: fix to avoid using a disabled ep in usbhsg_queue_done()
usb: dwc2: do not override forced dr_mode in gadget setup
usb: gadget: f_midi: unlock on error
USB: digi_acceleport: do sanity checking for the number of ports
USB: cypress_m8: add endpoint sanity check
USB: mct_u232: add sanity checking in probe
usb: fix regression in SuperSpeed endpoint descriptor parsing
USB: usbip: fix potential out-of-bounds write
usb: renesas_usbhs: disable TX IRQ before starting TX DMAC transfer
usb: renesas_usbhs: avoid NULL pointer derefernce in usbhsf_pkt_handler()
usb: gadget: f_midi: Fixed a bug when buflen was smaller than wMaxPacketSize
usb: phy: qcom-8x16: fix regulator API abuse
usb: ch9: Fix SSP Device Cap wFunctionalitySupport type
usb: gadget: composite: Access SSP Dev Cap fields properly
...
Merge tag 'staging-4.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging and IIO driver fixes from Greg KH:
"Here are some IIO driver fixes, along with two staging driver fixes
for 4.6-rc3.
One staging driver patch reverts the deletion of a driver that
happened in 4.6-rc1. We thought that laptop.org was dead, but it's
still alive and kicking, and has users that were mad we broke their
hardware by deleting a driver for their machines. So that driver is
added back and everyone is happy again.
All of these have been in linux-next for a while with no reported
issues"
* tag 'staging-4.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
Revert "Staging: olpc_dcon: Remove obsolete driver"
staging/rdma/hfi1: select CRC32
iio: gyro: bmg160: fix buffer read values
iio: gyro: bmg160: fix endianness when reading axes
iio: accel: bmc150: fix endianness when reading axes
iio: st_magn: always define ST_MAGN_TRIGGER_SET_STATE
iio: fix config watermark initial value
iio: health: max30100: correct FIFO check condition
iio: imu: Fix inv_mpu6050 dependencies
iio: adc: Fix build error of missing devm_ioremap_resource on UM
iio: light: apds9960: correct FIFO check condition
iio: adc: max1363: correct reference voltage
iio: adc: max1363: add missing adc to max1363_id
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This is a set of eight fixes.
Two are trivial gcc-6 updates (brace additions and unused variable
removal). There's a couple of cxlflash regressions, a correction for
sd being overly chatty on revalidation (causing excess log increases).
A VPD issue which could crash USB devices because they seem very
intolerant to VPD inquiries, an ALUA deadlock fix and a mpt3sas buffer
overrun fix"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: Do not attach VPD to devices that don't support it
sd: Fix excessive capacity printing on devices with blocks bigger than 512 bytes
scsi_dh_alua: Fix a recently introduced deadlock
scsi: Declare local symbols static
cxlflash: Move to exponential back-off when cmd_room is not available
cxlflash: Fix regression issue with re-ordering patch
mpt3sas: Don't overreach ioc->reply_post[] during initialization
aacraid: add missing curly braces
- fix error handling (Guoqing)
- fix a crash when a disk is hotremoved (me)
- fix a dead loop (Wei Fang)"
* tag 'md/4.6-rc2-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md/bitmap: clear bitmap if bitmap_create failed
MD: add rdev reference for super write
md: fix a trivial typo in comments
md:raid1: fix a dead loop when read from a WriteMostly disk
Merge tag 'pm+acpi-4.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI fixes from Rafael Wysocki:
"Fixes for some issues discovered after recent changes and for some
that have just been found lately regardless of those changes
(intel_pstate, intel_idle, PM core, mailbox/pcc, turbostat) plus
support for some new CPU models (intel_idle, Intel RAPL driver,
turbostat) and documentation updates (intel_pstate, PM core).
Specifics:
- intel_pstate fixes for two issues exposed by the recent switch over
from using timers and for one issue introduced during the 4.4 cycle
plus new comments describing data structures used by the driver
(Rafael Wysocki, Srinivas Pandruvada).
- intel_idle fixes related to CPU offline/online (Richard Cochran).
- intel_idle support (new CPU IDs and state definitions mostly) for
Skylake-X and Kabylake processors (Len Brown).
- PCC mailbox driver fix for an out-of-bounds memory access that may
cause the kernel to panic() (Shanker Donthineni).
- New (missing) CPU ID for one apparently overlooked Haswell model in
the Intel RAPL power capping driver (Srinivas Pandruvada).
- Fix for the PM core's wakeup IRQs framework to make it work after
wakeup settings reconfiguration from sysfs (Grygorii Strashko).
- Runtime PM documentation update to make it describe what needs to
be done during device removal more precisely (Krzysztof Kozlowski).
- Stale comment removal cleanup in the cpufreq-dt driver (Viresh
Kumar).
- turbostat utility fixes and support for Broxton, Skylake-X and
Kabylake processors (Len Brown)"
* tag 'pm+acpi-4.6-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (28 commits)
PM / wakeirq: fix wakeirq setting after wakup re-configuration from sysfs
tools/power turbostat: work around RC6 counter wrap
tools/power turbostat: initial KBL support
tools/power turbostat: initial SKX support
tools/power turbostat: decode BXT TSC frequency via CPUID
tools/power turbostat: initial BXT support
tools/power turbostat: print IRTL MSRs
tools/power turbostat: SGX state should print only if --debug
intel_idle: Add KBL support
intel_idle: Add SKX support
intel_idle: Clean up all registered devices on exit.
intel_idle: Propagate hot plug errors.
intel_idle: Don't overreact to a cpuidle registration failure.
intel_idle: Setup the timer broadcast only on successful driver load.
intel_idle: Avoid a double free of the per-CPU data.
intel_idle: Fix dangling registration on error path.
intel_idle: Fix deallocation order on the driver exit path.
intel_idle: Remove redundant initialization calls.
intel_idle: Fix a helper function's return value.
intel_idle: remove useless return from void function.
...
1) Stale SKB data pointer access across pskb_may_pull() calls in L2TP,
from Haishuang Yan.
2) Fix multicast frame handling in mac80211 AP code, from Felix
Fietkau.
3) mac80211 station hashtable insert errors not handled properly, fix
from Johannes Berg.
4) Fix TX descriptor count limit handling in e1000, from Alexander
Duyck.
5) Revert a buggy netdev refcount fix in netpoll, from Bjorn Helgaas.
6) Must assign rtnl_link_ops of the device before registering it, fix
in ip6_tunnel from Thadeu Lima de Souza Cascardo.
7) Memory leak fix in tc action net exit, from WANG Cong.
8) Add missing AF_KCM entries to name tables, from Dexuan Cui.
9) Fix regression in GRE handling of csums wrt. FOU, from Alexander
Duyck.
10) Fix memory allocation alignment and congestion map corruption in
RDS, from Shamir Rabinovitch.
11) Fix default qdisc regression in tuntap driver, from Jason Wang.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
bridge, netem: mark mailing lists as moderated
tuntap: restore default qdisc
mpls: find_outdev: check for err ptr in addition to NULL check
ipv6: Count in extension headers in skb->network_header
RDS: fix congestion map corruption for PAGE_SIZE > 4k
RDS: memory allocated must be align to 8
GRE: Disable segmentation offloads w/ CSUM and we are encapsulated via FOU
net: add the AF_KCM entries to family name tables
MAINTAINERS: intel-wired-lan list is moderated
lib/test_bpf: Add additional BPF_ADD tests
lib/test_bpf: Add test to check for result of 32-bit add that overflows
lib/test_bpf: Add tests for unsigned BPF_JGT
lib/test_bpf: Fix JMP_JSET tests
VSOCK: Detach QP check should filter out non matching QPs.
stmmac: fix adjust link call in case of a switch is attached
af_packet: tone down the Tx-ring unsupported spew.
net_sched: fix a memory leak in tc action
samples/bpf: Enable powerpc support
samples/bpf: Use llc in PATH, rather than a hardcoded value
samples/bpf: Fix build breakage with map_perf_test_user.c
...
Merge branch 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"These are bug fixes, including a really old fsync bug, and a few trace
points to help us track down problems in the quota code"
* 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix file/data loss caused by fsync after rename and new inode
btrfs: Reset IO error counters before start of device replacing
btrfs: Add qgroup tracing
Btrfs: don't use src fd for printk
btrfs: fallback to vmalloc in btrfs_compare_tree
btrfs: handle non-fatal errors in btrfs_qgroup_inherit()
btrfs: Output more info for enospc_debug mount option
Btrfs: fix invalid reference in replace_path
Btrfs: Improve FL_KEEP_SIZE handling in fallocate
Merge tag 'for-linus-4.6-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs fixes from Mike Marshall:
"Orangefs cleanups and a strncpy vulnerability fix.
Cleanups:
- remove an unused variable from orangefs_readdir.
- clean up printk wrapper used for ofs "gossip" debugging.
- clean up truncate ctime and mtime setting in inode.c
- remove a useless null check found by coccinelle.
- optimize some memcpy/memset boilerplate code.
- remove some useless sanity checks from xattr.c
Fix:
- fix a potential strncpy vulnerability"
* tag 'for-linus-4.6-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: remove unused variable
orangefs: Add KERN_<LEVEL> to gossip_<level> macros
orangefs: strncpy -> strscpy
orangefs: clean up truncate ctime and mtime setting
Orangefs: fix ifnullfree.cocci warnings
Orangefs: optimize boilerplate code.
Orangefs: xattr.c cleanup
Merge tag 'iommu-fixes-v4.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU fixes from Joerg Roedel:
- compile-time fixes (warnings and failures)
- a bug in iommu core code which could cause the group->domain pointer
to be falsly cleared
- fix in scatterlist handling of the ARM common DMA-API code
- stall detection fix for the Rockchip IOMMU driver
* tag 'iommu-fixes-v4.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
iommu/vt-d: Silence an uninitialized variable warning
iommu/rockchip: Fix "is stall active" check
iommu: Don't overwrite domain pointer when there is no default_domain
iommu/dma: Restore scatterlist offsets correctly
iommu: provide of_xlate pointer unconditionally
David S. Miller [Fri, 8 Apr 2016 20:41:28 +0000 (16:41 -0400)]
Merge tag 'mac80211-for-davem-2016-04-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
For the current RC series, we have the following fixes:
* TDLS fixes from Arik and Ilan
* rhashtable fixes from Ben and myself
* documentation fixes from Luis
* U-APSD fixes from Emmanuel
* a TXQ fix from Felix
* and a compiler warning suppression from Jeff
====================
Signed-off-by: David S. Miller <davem@davemloft.net>