git.karo-electronics.de Git - linux-beck.git/log

KVM: Remove SMEP bit from CR4_RESERVED_BITS

This patch removes SMEP bit from CR4_RESERVED_BITS.

Signed-off-by: Yang, Wei <wei.y.yang@intel.com>
Signed-off-by: Shan, Haitao <haitao.shan@intel.com>
Signed-off-by: Li, Xin <xin.li@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: nVMX: Fix bug preventing more than two levels of nesting

The nested VMX feature is supposed to fully emulate VMX for the guest. This
(theoretically) not only allows it to run its own guests, but also also
to further emulate VMX for its own guests, and allow arbitrarily deep nesting.

This patch fixes a bug (discovered by Kevin Tian) in handling a VMLAUNCH
by L2, which prevented deeper nesting.

Deeper nesting now works (I only actually tested L3), but is currently
*absurdly* slow, to the point of being unusable.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: Fixup documentation section numbering

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: fold decode_cache into x86_emulate_ctxt

This saves a lot of pointless casts x86_emulate_ctxt and decode_cache.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: rename decode_cache::eip to _eip

The name eip conflicts with a field of the same name in x86_emulate_ctxt,
which we plan to fold decode_cache into.

The name _eip is unfortunate, but what's really needed is a refactoring
here, not a better name.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: VMX: Silence warning on 32-bit hosts

a is unused now on CONFIG_X86_32.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Use opcode::execute for CLI/STI(FA/FB)

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Use opcode::execute for LOOP/JCXZ

LOOP/LOOPcc : E0-E2
JCXZ/JECXZ/JRCXZ : E3

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Clean up INT n/INTO/INT 3(CC/CD/CE)

Call emulate_int() directly to avoid spaghetti goto's.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Use opcode::execute for MOV(8C/8E)

Different functions for those which take segment register operands.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Use opcode::execute for RET(C3)

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Use opcode::execute for XCHG(86/87)

In addition, replace one "goto xchg" with an em_xchg() call.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Use opcode::execute for TEST(84/85, A8/A9)

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Use opcode::execute for some instructions

Move the following functions to the opcode tables:

  RET (Far return) : CB
  IRET             : CF
  JMP (Jump far)   : EA

  SYSCALL          : 0F 05
  CLTS             : 0F 06
  SYSENTER         : 0F 34
  SYSEXIT          : 0F 35

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Rename emulate_xxx() to em_xxx()

The next patch will change these to be called by opcode::execute.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Use the pointers ctxt and c consistently

We should use the local variables ctxt and c when the emulate_ctxt and
decode appears many times. At least, we need to be consistent about
how we use these in a function.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: Document KVM_IOEVENTFD

Document KVM_IOEVENTFD that can be used to receive
notifications of PIO/MMIO events without triggering
an exit.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Documentation

This patch includes a brief introduction to the nested vmx feature in the
Documentation/kvm directory. The document also includes a copy of the
vmcs12 structure, as requested by Avi Kivity.

[marcelo: move to Documentation/virtual/kvm]

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Miscellenous small corrections

Small corrections of KVM (spelling, etc.) not directly related to nested VMX.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Add VMX to list of supported cpuid features

If the "nested" module option is enabled, add the "VMX" CPU feature to the
list of CPU features KVM advertises with the KVM_GET_SUPPORTED_CPUID ioctl.

Qemu uses this ioctl, and intersects KVM's list with its own list of desired
cpu features (depending on the -cpu option given to qemu) to determine the
final list of features presented to the guest.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Additional TSC-offset handling

In the unlikely case that L1 does not capture MSR_IA32_TSC, L0 needs to
emulate this MSR write by L2 by modifying vmcs02.tsc_offset. We also need to
set vmcs12.tsc_offset, for this change to survive the next nested entry (see
prepare_vmcs02()).
Additionally, we also need to modify vmx_adjust_tsc_offset: The semantics
of this function is that the TSC of all guests on this vcpu, L1 and possibly
several L2s, need to be adjusted. To do this, we need to adjust vmcs01's
tsc_offset (this offset will also apply to each L2s we enter). We can't set
vmcs01 now, so we have to remember this adjustment and apply it when we
later exit to L1.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Further fixes for lazy FPU loading

KVM's "Lazy FPU loading" means that sometimes L0 needs to set CR0.TS, even
if a guest didn't set it. Moreover, L0 must also trap CR0.TS changes and
NM exceptions, even if we have a guest hypervisor (L1) who didn't want these
traps. And of course, conversely: If L1 wanted to trap these events, we
must let it, even if L0 is not interested in them.

This patch fixes some existing KVM code (in update_exception_bitmap(),
vmx_fpu_activate(), vmx_fpu_deactivate()) to do the correct merging of L0's
and L1's needs. Note that handle_cr() was already fixed in the above patch,
and that new code in introduced in previous patches already handles CR0
correctly (see prepare_vmcs02(), prepare_vmcs12(), and nested_vmx_vmexit()).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Handling of CR0 and CR4 modifying instructions

When L2 tries to modify CR0 or CR4 (with mov or clts), and modifies a bit
which L1 asked to shadow (via CR[04]_GUEST_HOST_MASK), we already do the right
thing: we let L1 handle the trap (see nested_vmx_exit_handled_cr() in a
previous patch).
When L2 modifies bits that L1 doesn't care about, we let it think (via
CR[04]_READ_SHADOW) that it did these modifications, while only changing
(in GUEST_CR[04]) the bits that L0 doesn't shadow.

This is needed for corect handling of CR0.TS for lazy FPU loading: L0 may
want to leave TS on, while pretending to allow the guest to change it.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Correct handling of idt vectoring info

This patch adds correct handling of IDT_VECTORING_INFO_FIELD for the nested
case.

When a guest exits while delivering an interrupt or exception, we get this
information in IDT_VECTORING_INFO_FIELD in the VMCS. When L2 exits to L1,
there's nothing we need to do, because L1 will see this field in vmcs12, and
handle it itself. However, when L2 exits and L0 handles the exit itself and
plans to return to L2, L0 must inject this event to L2.

In the normal non-nested case, the idt_vectoring_info case is discovered after
the exit, and the decision to inject (though not the injection itself) is made
at that point. However, in the nested case a decision of whether to return
to L2 or L1 also happens during the injection phase (see the previous
patches), so in the nested case we can only decide what to do about the
idt_vectoring_info right after the injection, i.e., in the beginning of
vmx_vcpu_run, which is the first time we know for sure if we're staying in
L2.

Therefore, when we exit L2 (is_guest_mode(vcpu)), we disable the regular
vmx_complete_interrupts() code which queues the idt_vectoring_info for
injection on next entry - because such injection would not be appropriate
if we will decide to exit to L1. Rather, we just save the idt_vectoring_info
and related fields in vmcs12 (which is a convenient place to save these
fields). On the next entry in vmx_vcpu_run (*after* the injection phase,
potentially exiting to L1 to inject an event requested by user space), if
we find ourselves in L1 we don't need to do anything with those values
we saved (as explained above). But if we find that we're in L2, or rather
*still* at L2 (it's not nested_run_pending, meaning that this is the first
round of L2 running after L1 having just launched it), we need to inject
the event saved in those fields - by writing the appropriate VMCS fields.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Correct handling of exception injection

Similar to the previous patch, but concerning injection of exceptions rather
than external interrupts.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Correct handling of interrupt injection

The code in this patch correctly emulates external-interrupt injection
while a nested guest L2 is running.

Because of this code's relative un-obviousness, I include here a longer-than-
usual justification for what it does - much longer than the code itself ;-)

To understand how to correctly emulate interrupt injection while L2 is
running, let's look first at what we need to emulate: How would things look
like if the extra L0 hypervisor layer is removed, and instead of L0 injecting
an interrupt, we had hardware delivering an interrupt?

Now we have L1 running on bare metal with a guest L2, and the hardware
generates an interrupt. Assuming that L1 set PIN_BASED_EXT_INTR_MASK to 1, and
VM_EXIT_ACK_INTR_ON_EXIT to 0 (we'll revisit these assumptions below), what
happens now is this: The processor exits from L2 to L1, with an external-
interrupt exit reason but without an interrupt vector. L1 runs, with
interrupts disabled, and it doesn't yet know what the interrupt was. Soon
after, it enables interrupts and only at that moment, it gets the interrupt
from the processor. when L1 is KVM, Linux handles this interrupt.

Now we need exactly the same thing to happen when that L1->L2 system runs
on top of L0, instead of real hardware. This is how we do this:

When L0 wants to inject an interrupt, it needs to exit from L2 to L1, with
external-interrupt exit reason (with an invalid interrupt vector), and run L1.
Just like in the bare metal case, it likely can't deliver the interrupt to
L1 now because L1 is running with interrupts disabled, in which case it turns
on the interrupt window when running L1 after the exit. L1 will soon enable
interrupts, and at that point L0 will gain control again and inject the
interrupt to L1.

Finally, there is an extra complication in the code: when nested_run_pending,
we cannot return to L1 now, and must launch L2. We need to remember the
interrupt we wanted to inject (and not clear it now), and do it on the
next exit.

The above explanation shows that the relative strangeness of the nested
interrupt injection code in this patch, and the extra interrupt-window
exit incurred, are in fact necessary for accurate emulation, and are not
just an unoptimized implementation.

Let's revisit now the two assumptions made above:

If L1 turns off PIN_BASED_EXT_INTR_MASK (no hypervisor that I know
does, by the way), things are simple: L0 may inject the interrupt directly
to the L2 guest - using the normal code path that injects to any guest.
We support this case in the code below.

If L1 turns on VM_EXIT_ACK_INTR_ON_EXIT, things look very different from the
description above: L1 expects to see an exit from L2 with the interrupt vector
already filled in the exit information, and does not expect to be interrupted
again with this interrupt. The current code does not (yet) support this case,
so we do not allow the VM_EXIT_ACK_INTR_ON_EXIT exit-control to be turned on
by L1.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Deciding if L0 or L1 should handle an L2 exit

This patch contains the logic of whether an L2 exit should be handled by L0
and then L2 should be resumed, or whether L1 should be run to handle this
exit (using the nested_vmx_vmexit() function of the previous patch).

The basic idea is to let L1 handle the exit only if it actually asked to
trap this sort of event. For example, when L2 exits on a change to CR0,
we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
bit which changed; If it did, we exit to L1. But if it didn't it means that
it is we (L0) that wished to trap this event, so we handle it ourselves.

The next two patches add additional logic of what to do when an interrupt or
exception is injected: Does L0 need to do it, should we exit to L1 to do it,
or should we resume L2 and keep the exception to be injected later.

We keep a new flag, "nested_run_pending", which can override the decision of
which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2
and therefore expects L2 to be run (and perhaps be injected with an event it
specified, etc.). Nested_run_pending is especially intended to avoid switching
to L1 in the injection decision-point described above.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: vmcs12 checks on nested entry

This patch adds a bunch of tests of the validity of the vmcs12 fields,
according to what the VMX spec and our implementation allows. If fields
we cannot (or don't want to) honor are discovered, an entry failure is
emulated.

According to the spec, there are two types of entry failures: If the problem
was in vmcs12's host state or control fields, the VMLAUNCH instruction simply
fails. But a problem is found in the guest state, the behavior is more
similar to that of an exit.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Exiting from L2 to L1

This patch implements nested_vmx_vmexit(), called when the nested L2 guest
exits and we want to run its L1 parent and let it handle this exit.

Note that this will not necessarily be called on every L2 exit. L0 may decide
to handle a particular exit on its own, without L1's involvement; In that
case, L0 will handle the exit, and resume running L2, without running L1 and
without calling nested_vmx_vmexit(). The logic for deciding whether to handle
a particular exit in L1 or in L0, i.e., whether to call nested_vmx_vmexit(),
will appear in a separate patch below.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: No need for handle_vmx_insn function any more

Before nested VMX support, the exit handler for a guest executing a VMX
instruction (vmclear, vmlaunch, vmptrld, vmptrst, vmread, vmread, vmresume,
vmwrite, vmon, vmoff), was handle_vmx_insn(). This handler simply threw a #UD
exception. Now that all these exit reasons are properly handled (and emulate
the respective VMX instruction), nothing calls this dummy handler and it can
be removed.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Implement VMLAUNCH and VMRESUME

Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
hypervisor to run its own guests.

This patch does not include some of the necessary validity checks on
vmcs12 fields before the entry. These will appear in a separate patch
below.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12

This patch contains code to prepare the VMCS which can be used to actually
run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information
in vmcs12 (the vmcs that L1 built for L2) and in vmcs01 (our desires for our
own guests).

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Move control field setup to functions

Move some of the control field setup to common functions. These functions will
also be needed for running L2 guests - L0's desires (expressed in these
functions) will be appropriately merged with L1's desires.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Move host-state field setup to a function

Move the setting of constant host-state fields (fields that do not change
throughout the life of the guest) from vmx_vcpu_setup to a new common function
vmx_set_constant_host_state(). This function will also be used to set the
host state when running L2 guests.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Implement VMREAD and VMWRITE

Implement the VMREAD and VMWRITE instructions. With these instructions, L1
can read and write to the VMCS it is holding. The values are read or written
to the fields of the vmcs12 structure introduced in a previous patch.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Implement VMPTRST

This patch implements the VMPTRST instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Implement VMPTRLD

This patch implements the VMPTRLD instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Implement VMCLEAR

This patch implements the VMCLEAR instruction.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Success/failure of VMX instructions.

VMX instructions specify success or failure by setting certain RFLAGS bits.
This patch contains common functions to do this, and they will be used in
the following patches which emulate the various VMX instructions.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Add VMCS fields to the vmcs12

In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
standard VMCS fields.

Later patches will enable L1 to read and write these fields using VMREAD/
VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing vmcs02,
a hardware VMCS for running L2.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Introduce vmcs02: VMCS used to run L2

We saw in a previous patch that L1 controls its L2 guest with a vcms12.
L0 needs to create a real VMCS for running L2. We call that "vmcs02".
A later patch will contain the code, prepare_vmcs02(), for filling the vmcs02
fields. This patch only contains code for allocating vmcs02.

In this version, prepare_vmcs02() sets *all* of vmcs02's fields each time we
enter from L1 to L2, so keeping just one vmcs02 for the vcpu is enough: It can
be reused even when L1 runs multiple L2 guests. However, in future versions
we'll probably want to add an optimization where vmcs02 fields that rarely
change will not be set each time. For that, we may want to keep around several
vmcs02s of L2 guests that have recently run, so that potentially we could run
these L2s again more quickly because less vmwrites to vmcs02 will be needed.

This patch adds to each vcpu a vmcs02 pool, vmx->nested.vmcs02_pool,
which remembers the vmcs02s last used to run up to VMCS02_POOL_SIZE L2s.
As explained above, in the current version we choose VMCS02_POOL_SIZE=1,
I.e., one vmcs02 is allocated (and loaded onto the processor), and it is
reused to enter any L2 guest. In the future, when prepare_vmcs02() is
optimized not to set all fields every time, VMCS02_POOL_SIZE should be
increased.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Decoding memory operands of VMX instructions

This patch includes a utility function for decoding pointer operands of VMX
instructions issued by L1 (a guest hypervisor)

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Implement reading and writing of VMX MSRs

When the guest can use VMX instructions (when the "nested" module option is
on), it should also be able to read and write VMX MSRs, e.g., to query about
VMX capabilities. This patch adds this support.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Introduce vmcs12: a VMCS structure for L1

An implementation of VMX needs to define a VMCS structure. This structure
is kept in guest memory, but is opaque to the guest (who can only read or
write it with VMX instructions).

This patch starts to define the VMCS structure which our nested VMX
implementation will present to L1. We call it "vmcs12", as it is the VMCS
that L1 keeps for its L2 guest. We will add more content to this structure
in later patches.

This patch also adds the notion (as required by the VMX spec) of L1's "current
VMCS", and finally includes utility functions for mapping the guest-allocated
VMCSs in host memory.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Allow setting the VMXE bit in CR4

This patch allows the guest to enable the VMXE bit in CR4, which is a
prerequisite to running VMXON.

Whether to allow setting the VMXE bit now depends on the architecture (svm
or vmx), so its checking has moved to kvm_x86_ops->set_cr4(). This function
now returns an int: If kvm_x86_ops->set_cr4() returns 1, __kvm_set_cr4()
will also return 1, and this will cause kvm_set_cr4() will throw a #GP.

Turning on the VMXE bit is allowed only when the nested VMX feature is
enabled, and turning it off is forbidden after a vmxon.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Implement VMXON and VMXOFF

This patch allows a guest to use the VMXON and VMXOFF instructions, and
emulates them accordingly. Basically this amounts to checking some
prerequisites, and then remembering whether the guest has enabled or disabled
VMX operation.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: nVMX: Add "nested" module option to kvm_intel

This patch adds to kvm_intel a module option "nested". This option controls
whether the guest can use VMX instructions, i.e., whether we allow nested
virtualization. A similar, but separate, option already exists for the
SVM module.

This option currently defaults to 0, meaning that nested VMX must be
explicitly enabled by giving nested=1. When nested VMX matures, the default
should probably be changed to enable nested VMX by default - just like
nested SVM is currently enabled by default.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Avoid clearing the whole decode_cache

During tracing the emulator, we noticed that init_emulate_ctxt()
sometimes took a bit longer time than we expected.

This patch is for mitigating the problem by some degree.

By looking into the function, we soon notice that it clears the whole
decode_cache whose size is about 2.5K bytes now. Furthermore, most of
the bytes are taken for the two read_cache arrays, which are used only
by a few instructions.

Considering the fact that we are not assuming the cache arrays have
been cleared when we store actual data, we do not need to clear the
arrays: 2K bytes elimination. In addition, we can avoid clearing the
fetch_cache and regs arrays.

This patch changes the initialization not to clear the arrays.

On our 64-bit host, init_emulate_ctxt() becomes 0.3 to 0.5us faster with
this patch applied.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: x86 emulator: Clean up init_emulate_ctxt()

Use a local pointer to the emulate_ctxt for simplicity. Then, arrange
the hard-to-read mode selection lines neatly.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: Clean up error handling during VCPU creation

So far kvm_arch_vcpu_setup is responsible for freeing the vcpu struct if
it fails. Move this confusing resonsibility back into the hands of
kvm_vm_ioctl_create_vcpu. Only kvm_arch_vcpu_setup of x86 is affected,
all other archs cannot fail.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: VMX: Keep list of loaded VMCSs, instead of vcpus

In VMX, before we bring down a CPU we must VMCLEAR all VMCSs loaded on it
because (at least in theory) the processor might not have written all of its
content back to memory. Since a patch from June 26, 2008, this is done using
a per-cpu "vcpus_on_cpu" linked list of vcpus loaded on each CPU.

The problem is that with nested VMX, we no longer have the concept of a
vcpu being loaded on a cpu: A vcpu has multiple VMCSs (one for L1, a pool for
L2s), and each of those may be have been last loaded on a different cpu.

So instead of linking the vcpus, we link the VMCSs, using a new structure
loaded_vmcs. This structure contains the VMCS, and the information pertaining
to its loading on a specific cpu (namely, the cpu number, and whether it
was already launched on this cpu once). In nested we will also use the same
structure to hold L2 VMCSs, and vmx->loaded_vmcs is a pointer to the
currently active VMCS.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Acked-by: Acked-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>

KVM: Sanitize cpuid

Instead of blacklisting known-unsupported cpuid leaves, whitelist known-
supported leaves. This is more conservative and prevents us from reporting
features we don't support. Also whitelist a few more leaves while at it.

Signed-off-by: Avi Kivity <avi@redhat.com>
Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: MMU: cleanup for dropping parent pte

Introduce drop_parent_pte to remove the rmap of parent pte and
clear parent pte

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: MMU: cleanup for kvm_mmu_page_unlink_children

Cleanup the same operation between kvm_mmu_page_unlink_children and
mmu_pte_write_zap_pte

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: MMU: remove the arithmetic of parent pte rmap

Parent pte rmap and page rmap are very similar, so use the same arithmetic
for them

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: MMU: abstract the operation of rmap

Abstract the operation of rmap to spte_list, then we can use it for the
reverse mapping of parent pte in the later patch

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: fix uninitialized warning

Fix:

warning: ‘cs_sel’ may be used uninitialized in this function
warning: ‘ss_sel’ may be used uninitialized in this function

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: use __copy_to_user/__clear_user to write guest page

Simply use __copy_to_user/__clear_user to write guest page since we have
already verified the user address when the memslot is set

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: MMU: optimize pte write path if don't have protected sp

Simply return from kvm_mmu_pte_write path if no shadow page is
write-protected, then we can avoid to walk all shadow pages and hold
mmu-lock

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: VMX: always_inline VMREADs

vmcs_readl() and friends are really short, but gcc thinks they are long because of
the out-of-line exception handlers. Mark them always_inline to clear the
misunderstanding.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: VMX: Move VMREAD cleanup to exception handler

We clean up a failed VMREAD by clearing the output register. Do
it in the exception handler instead of unconditionally. This is
worthwhile since there are more than a hundred call sites.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Stop passing ctxt->ops as arg of emul functions

Dereference it in the actual users.

This not only cleans up the emulator but also makes it easy to convert
the old emulation functions to the new em_xxx() form later.

Note: Remove some inline keywords to let the compiler decide inlining.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Stop passing ctxt->ops as arg of decode helpers

Dereference it in the actual users: only do_insn_fetch_byte().

This is consistent with the way __linearize() dereferences it.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: x86 emulator: Place insn_fetch helpers together

The two macros need special care to use:
Assume rc, ctxt, ops and done exist outside of them.
Can goto outside.

Considering the fact that these are used only in decode functions,
moving these right after do_insn_fetch() seems to be a right thing
to improve the readability.

We also rename do_fetch_insn_byte() to do_insn_fetch_byte() to be
consistent.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

KVM: Document KVM_GET_LAPIC, KVM_SET_LAPIC ioctl

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Merge branch 'gpio/merge' of git://git.secretlab.ca/git/linux-2.6

* 'gpio/merge' of git://git.secretlab.ca/git/linux-2.6:
gpio: tps65910: add missing breaks in tps65910_gpio_init

Documentation: fix cgroup blkio throttle filenames

All the blkio.throttle.* file names are incorrectly reported without
".throttle" in the documentation. Fix it.

Signed-off-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Documentation: update CodingStyle memory allocators

The list of available general purpose memory allocators in
Documentation/CodingStyle chapter 14 is incomplete. This patch adds
the missing vzalloc() to the list.

Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

MAINTAINERS: move kernel-doc patches location

Move location of quilt series for kernel-doc patches.

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6

* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6: (46 commits)
  [media] rc: call input_sync after scancode reports
  [media] imon: allow either proto on unknown 0xffdc
  [media] imon: auto-config ffdc 7e device
  [media] saa7134: fix raw IR timeout value
  [media] rc: fix ghost keypresses with certain hw
  [media] [staging] lirc_serial: allocate irq at init time
  [media] lirc_zilog: fix spinning rx thread
  [media] keymaps: fix table for pinnacle pctv hd devices
  [media] ite-cir: 8709 needs to use pnp resource 2
  [media] V4L: mx1-camera: fix uninitialized variable
  [media] omap_vout: Added check in reqbuf & mmap for buf_size allocation
  [media] OMAP_VOUT: Change hardcoded device node number to -1
  [media] OMAP_VOUTLIB: Fix wrong resizer calculation
  [media] uvcvideo: Disable the queue when failing to start
  [media] uvcvideo: Remove buffers from the queues when freeing
  [media] uvcvideo: Ignore entities for terminals with no supported format
  [media] v4l: Don't access media entity after is has been destroyed
  [media] media: omap3isp: fix a potential NULL deref
  [media] media: vb2: fix allocation failure check
  [media] media: vb2: reset queued_count value during queue reinitialization
  ...

Fix up trivial conflict in MAINTAINERS as per Mauro

FDPIC: Fix memory leak

The shdr4extnum variable isn't being freed in the cleanup process of
elf_fdpic_core_dump().

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

resource: ability to resize an allocated resource

Provides the ability to resize a resource that is already allocated.
This functionality is put in place to support reallocation needs of
pci resources.

Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs: fix lock initialization

locks_alloc_lock() assumed that the allocated struct file_lock is
already initialized to zero members. This is only true for the first
allocation of the structure, after reuse some of the members will have
random values.

This will for example result in passing random fl_start values to
userspace in fuse for FL_FLOCK locks, which is an information leak at
best.

Fix by reinitializing those members which may be non-zero after freeing.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

gpio: tps65910: add missing breaks in tps65910_gpio_init

Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>

Merge branch 'usb-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6

* 'usb-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6:
USB: fix regression occurring during device removal
USB: fsl_udc_core: fix build breakage when building for ARM arch

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6:
  mfd: Add Makefile and Kconfig Entries for tps65911 comparator
  mfd: Fix build error for tps65911-comparator.c
  Revert "mfd: Add omap-usbhs runtime PM support"
  input: pmic8xxx-pwrkey: Do not use mfd_get_data()
  input: pmic8xxx-keypad: Do not use mfd_get_data()

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
  ceph: fix sync and dio writes across stripe boundaries
  libceph: fix page calculation for non-page-aligned io
  ceph: fix page alignment corrections

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus:
hfsplus: Fix double iput of the same inode in hfsplus_fill_super()
hfsplus: add missing call to bio_put()

mfd: Add Makefile and Kconfig Entries for tps65911 comparator

Base on Mark's comment [1], I make the Kconfig entry invisible to users.
[1] https://lkml.org/lkml/2011/5/14/136

Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>

mfd: Fix build error for tps65911-comparator.c

Fix below build error:
CC drivers/mfd/tps65911-comparator.o
drivers/mfd/tps65911-comparator.c: In function 'tps65911_comparator_probe':
drivers/mfd/tps65911-comparator.c:131: error: 'struct tps65910_platform_data' has no member named 'vmbch_threshold'
drivers/mfd/tps65911-comparator.c:137: error: 'struct tps65910_platform_data' has no member named 'vmbch2_threshold'
make[2]: *** [drivers/mfd/tps65911-comparator.o] Error 1
make[1]: *** [drivers/mfd] Error 2
make: *** [drivers] Error 2

Signed-off-by: Axel Lin <axel.lin@gmail.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>

Revert "mfd: Add omap-usbhs runtime PM support"

This reverts commit 7e6502d577106fb5b202bbaac64c5f1b065e6daa.

Oops are produced during initialization of ehci and ohci
drivers. This is because the run time pm apis are used by
the driver but the corresponding hwmod structures and
initialization is not merged. hence revering back the
commit id 7e6502d577106fb5b202bbaac64c5f1b065e6daa

Signed-off-by: Keshava Munegowda <keshava_mgowda@ti.com>
Reported-by: Luciano Coelho <coelho@ti.com>
Acked-by: Felipe Balbi <balbi@ti.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>

input: pmic8xxx-pwrkey: Do not use mfd_get_data()

mfd_get_data() has been removed from the MFD API.

Cc: Anirudh Ghayal <aghayal@codeaurora.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>

input: pmic8xxx-keypad: Do not use mfd_get_data()

mfd_get_data() has been removed from the MFD API.

Cc: Anirudh Ghayal <aghayal@codeaurora.org>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>

Linux 3.0-rc6

Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-rc-fixes-2.6: (277 commits)
  [SCSI] isci: fix checkpatch errors
  isci: Device reset should request sas_phy_reset(phy, true)
  isci: pare back error messsages
  isci: cleanup silicon revision detection
  isci: merge scu_unsolicited_frame.h into unsolicited_frame_control.h
  isci: merge sata.[ch] into request.c
  isci: kill 'get/set' macros
  isci: retire scic_sds_ and scic_ prefixes
  isci: unify isci_host and scic_sds_controller
  isci: unify isci_remote_device and scic_sds_remote_device
  isci: unify isci_port and scic_sds_port
  isci: fix scic_sds_remote_device_terminate_requests
  isci: unify isci_phy and scic_sds_phy
  isci: unify isci_request and scic_sds_request
  isci: rename / clean up scic_sds_stp_request
  isci: preallocate requests
  isci: combine request flags
  isci: unify can_queue tracking on the tci_pool, uplevel tag assignment
  isci: Terminate dev requests on FIS err bit rx in NCQ
  isci: fix frame received locking
  ...

Merge branch 'at91/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-2.6-arm-soc

* 'at91/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-2.6-arm-soc:
  AT91: Change nand buswidth logic to match hardware default configuration
  at91: Use "pclk" as con_id on at91cap9 and at91rm9200
  at91: fix udc, ehci and mmc clock device name for cap9/9g45/9rl
  atmel_serial: fix internal port num
  at91: fix at91_set_serial_console: use platform device id

Merge branch 'fbdev-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/fbdev-3.x

* 'fbdev-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/fbdev-3.x:
  vesafb: fix memory leak
  fbdev: amba: Link fb device to its parent
  fsl-diu-fb: remove check for pixel clock ranges
  udlfb: Correct sub-optimal resolution selection.
  hecubafb: add module_put on error path in hecubafb_probe()
  sm501fb: fix section mismatch warning
  gx1fb: Fix section mismatch warnings
  fbdev: sh_mobile_meram: Correct pointer check for YCbCr chroma plane

RDMA: Check for NULL mode in .devnode methods

Commits 71c29bd5c235 ("IB/uverbs: Add devnode method to set path/mode")
and c3af0980ce01 ("IB: Add devnode methods to cm_class and umad_class")
added devnode methods that set the mode.

However, these methods don't check for a NULL mode, and so we get a
crash when unloading modules because devtmpfs_delete_node() calls
device_get_devnode() with mode == NULL.

Add the missing checks.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
[ Also fix cm.c. - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

AT91: Change nand buswidth logic to match hardware default configuration

The recently modified nand buswitth configuration is not aligned with
board reality: the double footprint on boards is always populated with 8bits
buswidth nand flashes.
So we have to consider that without particular configuration the 8bits
buswidth is selected by default.
Moreover, the previous logic was always using !board_have_nand_8bit(), we
change it to a simpler: board_have_nand_16bit().

Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Tested-by: Ludovic Desroches <ludovic.desroches@atmel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

vesafb: fix memory leak

When releasing framebuffer, free colourmap allocations.

Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>

Merge branch 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6

* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
PM / Runtime: Update documentation regarding driver removal
PM: Documentation: fix typo: pm_runtime_idle_sync() doesn't exist.

[SCSI] isci: fix checkpatch errors

Signed-off-by: James Bottomley <JBottomley@Parallels.com>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/djbw/isci

Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: set socket send and receive timeouts before attempting connect

Merge branch 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging

* 'hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
  hwmon: (k10temp) Update documentation for Fam12h
  hwmon-vid: Fix typo in VIA CPU name
  hwmon: (f71882fg) Add support for the F71869A
  hwmon: Use <> rather than () around my e-mail address
  hwmon: (emc6w201) Properly handle all errors

hwmon: (k10temp) Update documentation for Fam12h

Add some CPU series IDs and links to the Fam12h datasheets.

Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Jean Delvare <khali@linux-fr.org>

hwmon-vid: Fix typo in VIA CPU name

It's Nehemiah, not Nemiah.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Acked-by: Guenter Roeck <guenter.roeck@ericsson.com>

hwmon: (f71882fg) Add support for the F71869A

The F71869A is almost the same as the F71869F/E, except that it has
the normal number of temp and pwm zones for a F71882FG derived chip,
rather then the limited number of the F71869F/E.

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Tested-by: Max Baldwin <archerseven@gmail.com>
Acked-by: Guenter Roeck <guenter.roeck@ericsson.com>
Signed-off-by: Jean Delvare <khali@linux-fr.org>

hwmon: Use <> rather than () around my e-mail address

Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Jean Delvare <khali@linux-fr.org>

hwmon: (emc6w201) Properly handle all errors

Handle errors on 8-bit register reads and writes too. Also use likely
and unlikely to make the functions faster on success.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Acked-by: Guenter Roeck <guenter.roeck@ericsson.com>