Dan Carpenter [Wed, 19 Oct 2011 06:15:10 +0000 (09:15 +0300)]
KVM: make checks stricter in coalesced_mmio_in_range()
My testing version of Smatch complains that addr and len come from
the user and they can wrap. The path is:
-> kvm_vm_ioctl()
-> kvm_vm_ioctl_unregister_coalesced_mmio()
-> coalesced_mmio_in_range()
I don't know what the implications are of wrapping here, but we may
as well fix it, if only to silence the warning.
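In isolation, an overflow-safe check of this shape looks roughly as follows (a standalone sketch, not the actual KVM diff; names and types are illustrative):

	#include <stdint.h>

	/* Check that [addr, addr + len) lies fully inside
	 * [zone_addr, zone_addr + zone_size) without being fooled by
	 * wrap-around when computing addr + len. */
	static int mmio_in_range(uint64_t addr, uint64_t len,
				 uint64_t zone_addr, uint64_t zone_size)
	{
		if (len == 0)
			return 0;
		if (addr + len < addr)          /* addr + len wrapped around */
			return 0;
		if (addr < zone_addr)
			return 0;
		if (addr + len > zone_addr + zone_size)
			return 0;
		return 1;
	}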
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Jan Kiszka [Wed, 14 Sep 2011 07:58:32 +0000 (09:58 +0200)]
KVM: x86: Simplify kvm timer handler
The vcpu reference of a kvm_timer can't become NULL while the timer is
valid, so drop this redundant test. This also makes it pointless to
carry a separate __kvm_timer_fn, so fold it into kvm_timer_fn.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The current write-flooding detection does not work well. When we handle a page
write, if the last speculative spte has not been accessed, we treat the page as
write-flooded. However, we create speculative sptes on many paths, such as pte
prefetch and page sync, which means the last speculative spte may not point to
the written page, and the written page may still be accessed via other sptes.
So depending on the Accessed bit of the last speculative spte is not enough.
Instead of detecting whether the page was accessed, we can detect whether the
spte is accessed after it is written: if the spte is not accessed but is
written frequently, we treat the page as not being a page table, or as one that
has not been used for a long time.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Sometimes only the last byte of a pte is modified to update a status bit;
for example, clear_bit is used to clear the r/w bit in the Linux kernel, and
the 'andb' instruction is used in that function. In this case kvm_mmu_pte_write
treats the write as a misaligned access and zaps the shadow page table.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
If the emulation is caused by a #PF and the faulting instruction is not one
that writes page tables, the VM-EXIT was caused by shadow page protection, so
we can zap the shadow page and retry the instruction directly.
The idea is from Avi
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
KVM: x86: tag the instructions which are used to write page table
The idea is from Avi:
| tag instructions that are typically used to modify the page tables, and
| drop shadow if any other instruction is used.
| The list would include, I'd guess, and, or, bts, btc, mov, xchg, cmpxchg,
| and cmpxchg8b.
This patch tags those instructions; later on, the shadow page is dropped if it
is written by any other instruction.
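A rough sketch of the idea with a hypothetical helper (the real emulator marks entries in its opcode decode tables rather than switching on raw opcodes):

	#include <stdbool.h>

	/* Hypothetical helper: true for one-byte opcodes typically used to
	 * modify page-table entries; two-byte forms (bts, btc, cmpxchg,
	 * cmpxchg8b) would be tagged the same way in the decode tables. */
	static bool is_page_table_write_insn(unsigned char opcode)
	{
		switch (opcode) {
		case 0x09: /* or  r/m, r  */
		case 0x21: /* and r/m, r  */
		case 0x87: /* xchg r/m, r */
		case 0x89: /* mov r/m, r  */
			return true;
		default:   /* any other write drops the shadow page */
			return false;
		}
	}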
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write
kvm_mmu_pte_write is unsafe because it needs to allocate pte_list_desc objects
when sptes are prefetched. Unfortunately we cannot know how many sptes will be
prefetched on this path, which means we can run out of free pte_list_desc
objects in the cache and trigger the BUG_ON(). In addition, some paths do not
fill the cache at all, such as emulation of the INS instruction, which does not
trigger a page fault.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>
When L0 wishes to inject an interrupt while L2 is running, it emulates an exit
to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the original
nVMX patch 23, titled "Correct handling of interrupt injection".
Unfortunately, it is possible (though rare) that at this point there is valid
idt_vectoring_info in vmcs02. For example, L1 injected some interrupt to L2,
and when L2 tried to run this interrupt's handler, it got a page fault - so
it returns the original interrupt vector in idt_vectoring_info. The problem
is that if this is the case, we cannot exit to L1 with EXTERNAL_INTERRUPT
like we wished to, because the VMX spec guarantees that idt_vectoring_info
and exit_reason_external_interrupt can never happen together. This is not
just specified in the spec - a KVM L1 actually prints a kernel warning
"unexpected, valid vectoring info" if we violate this guarantee, and some
users noticed these warnings in L1's logs.
In order to better emulate a processor, which would never return the external
interrupt and the idt-vectoring-info together, we need to separate the two
injection steps: First, complete L1's injection into L2 (i.e., enter L2,
injecting to it the idt-vectoring-info); Second, after entry into L2 succeeds
and it exits back to L0, exit to L1 with the EXIT_REASON_EXTERNAL_INTERRUPT.
Most of this is already in the code - the only change we need is to remain
in L2 (and not exit to L1) in this case.
Note that the previous patch ensures (by using KVM_REQ_IMMEDIATE_EXIT) that
although we do enter L2 first, it will exit immediately after processing its
injection, allowing us to promptly inject to L1.
Note how we test vmcs12->idt_vectoring_info_field; This isn't really the
vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated),
but rather the place we save, at the end of vmx_vcpu_run, the vmcs02 value
of this field. This was explained in patch 25 ("Correct handling of idt
vectoring info") of the original nVMX patch series.
Thanks to Dave Allan and to Federico Simoncelli for reporting this bug,
to Abel Gordon for helping me figure out the solution, and to Avi Kivity
for helping to improve it.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT.
This bit requests that when next entering the guest, we should run it only
for as little as possible, and exit again.
We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1
to continue running so it can inject an event to it, we unfortunately cannot
just pretend to have run L2 for a little while - We must really launch L2,
otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2)
will be lost. So the existing code runs L2 in this case.
But L2 could potentially run for a long time until it exits, and the
injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us
to request that L2 will be entered, as necessary, but will exit as soon as
possible after entry.
Our implementation of this request uses smp_send_reschedule() to send a
self-IPI, with interrupts disabled. The interrupts remain disabled until the
guest is entered, and then, after the entry is complete (often including
processing an injection and jumping to the relevant handler), the physical
interrupt is noticed and causes an exit.
On recent Intel processors, we could have achieved the same goal by using
MTF instead of a self-IPI. Another technique worth considering in the future
is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to
slightly improve performance by avoiding the useless interrupt handler
which ends up being called when smp_send_reschedule() is used.
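Roughly how the request is consumed on the entry path (a simplified fragment in the style of vcpu_enter_guest(), not the exact diff):

	/* Simplified fragment: check and clear the new request bit. */
	bool req_immediate_exit = kvm_check_request(KVM_REQ_IMMEDIATE_EXIT, vcpu);

	local_irq_disable();                     /* IRQs stay off across entry */

	if (req_immediate_exit)
		smp_send_reschedule(vcpu->cpu);  /* pend a self-IPI */

	/* ... enter the guest; the pending IPI forces an exit as soon as
	 * the entry (including any event injection) has completed ... */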
Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Keith Packard [Tue, 27 Dec 2011 01:02:11 +0000 (17:02 -0800)]
drm/i915: Disable RC6 on Sandybridge by default
RC6 fails again.
> I found my system freeze mostly during starting up X and KDE. Sometimes it
> works for some minutes, sometimes it freezes immediately. When the freeze
> happens, everything is dead (even the reset button does not work, I need to
> power cycle).
> I disabled RC6, and my system runs wonderfully.
> The system is a Z68 Pro board with Sandybridge i5-2500K processor, 8
> GB of RAM and UEFI firmware.
Reported-by: Kai Krakow <hurikhan77@gmail.com> Signed-off-by: Keith Packard <keithp@keithp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Keith Packard [Tue, 27 Dec 2011 01:02:10 +0000 (17:02 -0800)]
drm/i915: Disable semaphores by default on SNB
Semaphores still cause problems on some machines:
> From Udo Steinberg:
>
> With Linux-3.2-rc6 I'm frequently seeing GPU hangs when large amounts of
> text scroll in an xterm, such as when extracting a tar archive. Such as this
> one (note the timestamps):
>
> I can reproduce it fairly easily with something
> as simple as:
>
> while true; do dmesg; done
This patch turns them off on SNB while leaving them on for IVB.
Reported-by: Udo Steinberg <udo@hypervisor.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Eugeni Dodonov <eugeni@dodonov.net> Signed-off-by: Keith Packard <keithp@keithp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Mon, 26 Dec 2011 21:17:00 +0000 (13:17 -0800)]
Merge branch 'kvm-updates/3.2' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/3.2' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: PPC: e500: include linux/export.h
KVM: PPC: fix kvmppc_start_thread() for CONFIG_SMP=N
KVM: PPC: protect use of kvmppc_h_pr
KVM: PPC: move compute_tlbie_rb to book3s_64 common header
KVM: Don't automatically expose the TSC deadline timer in cpuid
KVM: Device assignment permission checks
KVM: Remove ability to assign a device without iommu support
KVM: x86: Prevent starting PIT timers in the absence of irqchip support
Linus Torvalds [Mon, 26 Dec 2011 18:25:26 +0000 (10:25 -0800)]
vfs: fix handling of lock allocation failure in lease-break case
Bruce Fields notes that commit 778fc546f749 ("locks: fix tracking of
inprogress lease breaks") introduced a possible error pointer
dereference on failure to allocate memory. locks_conflict() will
dereference the passed-in new lease lock structure that may be an error pointer.
This means an open (without O_NONBLOCK set) on a file with a lease
applied (generally only done when Samba or nfsd (with v4) is running)
could crash if a kmalloc() fails.
So instead of playing games with IS_ERR() all over the place, just
check the allocation failure early. That makes the code more
straightforward, and avoids this possible bad pointer dereference.
Based-on-patch-by: J. Bruce Fields <bfields@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michael Neuling [Thu, 10 Nov 2011 16:03:20 +0000 (16:03 +0000)]
KVM: PPC: fix kvmppc_start_thread() for CONFIG_SMP=N
Currently kvmppc_start_thread() tries to wake other SMT threads via
xics_wake_cpu(). Unfortunately xics_wake_cpu only exists when
CONFIG_SMP=Y so when compiling with CONFIG_SMP=N we get:
arch/powerpc/kvm/built-in.o: In function `.kvmppc_start_thread':
book3s_hv.c:(.text+0xa1e0): undefined reference to `.xics_wake_cpu'
The following should be fine, since kvmppc_start_thread() shouldn't be
called to start non-zero threads when SMP=N because threads_per_core=1.
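The shape of such a fix, sketched (not necessarily the exact hunk):

	/* Only reference xics_wake_cpu() where it exists; with
	 * CONFIG_SMP=N there is one thread per core and nothing to wake. */
	#ifdef CONFIG_SMP
		xics_wake_cpu(cpu);
	#endif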
Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Alexander Graf <agraf@suse.de>
Jan Kiszka [Wed, 21 Dec 2011 11:28:29 +0000 (12:28 +0100)]
KVM: Don't automatically expose the TSC deadline timer in cpuid
Unlike all of the other cpuid bits, the TSC deadline timer bit is set
unconditionally, regardless of what userspace wants.
This is broken in several ways:
- if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC
deadline timer feature, a guest that uses the feature will break
- live migration to older host kernels that don't support the TSC deadline
timer will cause the feature to be pulled out from under the guest's feet,
breaking it
- guests that are broken wrt the feature will fail.
Fix by not enabling the feature automatically; instead report it to userspace.
Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee
will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not
KVM_GET_SUPPORTED_CPUID.
Fixes the Illumos guest kernel, which uses the TSC deadline timer feature.
[avi: add the KVM_CAP + documentation]
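From userspace the new capability is probed like any other, via KVM_CHECK_EXTENSION on the /dev/kvm fd (a minimal sketch):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	int main(void)
	{
		int kvm = open("/dev/kvm", O_RDWR);
		if (kvm < 0)
			return 1;

		/* > 0 means the host can emulate the TSC deadline timer; userspace
		 * then decides whether to expose the CPUID bit to the guest. */
		int cap = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_TSC_DEADLINE_TIMER);
		printf("TSC deadline timer capability: %d\n", cap);
		return 0;
	}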
Reported-by: Alexey Zaytsev <alexey.zaytsev@gmail.com> Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Alex Williamson [Wed, 21 Dec 2011 04:59:09 +0000 (21:59 -0700)]
KVM: Device assignment permission checks
Only allow KVM device assignment to attach to devices which:
- Are not bridges
- Have BAR resources (assume others are special devices)
- The user has permissions to use
Assigning a bridge is a configuration error, it's not supported, and
typically doesn't result in the behavior the user is expecting anyway.
Devices without BAR resources are typically chipset components that
also don't have host drivers. We don't want users to hold such devices
captive or cause system problems by fencing them off into an iommu
domain. We determine "permission to use" by testing whether the user
has access to the PCI sysfs resource files. By default a normal user
will not have access to these files, so it provides a good indication
that an administration agent has granted the user access to the device.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Yang Bai <hamo.by@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Alex Williamson [Wed, 21 Dec 2011 04:59:03 +0000 (21:59 -0700)]
KVM: Remove ability to assign a device without iommu support
This option has no users and it exposes a security hole in that it allows
devices to be assigned without iommu protection. Make
KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory option.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Prevent this by checking the irqchip mode before starting a timer. We
can't deny creating the PIT if the irqchips aren't set up yet as
current user land expects this order to work.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Linus Torvalds [Sat, 24 Dec 2011 21:34:44 +0000 (13:34 -0800)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
vmwgfx: fix incorrect VRAM size check in vmw_kms_fb_create()
drm/radeon/kms: bail on BTC parts if MC ucode is missing
Linus Torvalds [Fri, 23 Dec 2011 22:59:08 +0000 (14:59 -0800)]
Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
[media] omap3isp: Fix crash caused by subdevs now having a pointer to devnodes
Linus Torvalds [Fri, 23 Dec 2011 22:58:39 +0000 (14:58 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: call d_instantiate after all ops are setup
Btrfs: fix worker lock misuse in find_worker
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
netfilter: xt_connbytes: handle negation correctly
net: relax rcvbuf limits
rps: fix insufficient bounds checking in store_rps_dev_flow_table_cnt()
net: introduce DST_NOPEER dst flag
mqprio: Avoid panic if no options are provided
bridge: provide a mtu() method for fake_dst_ops
"! --connbytes 23:42" should match if the packet/byte count is not in range.
As there is no explicit "invert match" toggle in the match structure,
userspace swaps the from and to arguments
(i.e., as if "--connbytes 42:23" were given).
However, "what <= 23 && what >= 42" will always be false.
Change things so we use "||" in case "from" is larger than "to".
This change may look like it breaks backwards compatibility when "to" is 0.
However, older iptables binaries will refuse "connbytes 42:0",
and current releases treat it to mean "! --connbytes 0:42",
so we should be fine.
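The resulting match logic, reduced to a standalone sketch (variable names are illustrative, not the kernel's):

	#include <stdbool.h>
	#include <stdint.h>

	/* For "--connbytes lo:hi" userspace passes from=lo, to=hi.  For the
	 * negated form it swaps them, so from > to signals "match outside
	 * the original [to, from] range". */
	static bool connbytes_match(uint64_t what, uint64_t from, uint64_t to)
	{
		if (to >= from)                         /* normal range */
			return what >= from && what <= to;
		return what < to || what > from;        /* inverted range */
	}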
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Al Viro [Fri, 23 Dec 2011 12:58:13 +0000 (07:58 -0500)]
Btrfs: call d_instantiate after all ops are setup
This closes races where btrfs is calling d_instantiate too soon during
inode creation. All of the callers of btrfs_add_nondir are updated to
instantiate after the inode is fully setup in memory.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Chris Mason <chris.mason@oracle.com>
This is caused by the bridge netfilter's special dst_entry (fake_rtable), a
shared entry where attaching an inetpeer makes no sense.
Problem is present since commit 87c48fa3b46 (ipv6: make fragment
identifications less predictable)
Introduce DST_NOPEER dst flag and make sure ipv6_select_ident() and
__ip_select_ident() fallback to the 'no peer attached' handling.
Reported-by: Chris Boot <bootc@bootc.net> Tested-by: Chris Boot <bootc@bootc.net> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Graf [Thu, 22 Dec 2011 02:05:07 +0000 (02:05 +0000)]
mqprio: Avoid panic if no options are provided
Userspace may not provide TCA_OPTIONS; in fact, tc currently does not do so
if no arguments are specified on the command line.
Return -EINVAL instead of panicking.
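The shape of the fix, as an illustrative fragment (the real code parses a netlink attribute):

	#include <errno.h>

	/* Bail out when the TCA_OPTIONS attribute is missing instead of
	 * dereferencing a NULL pointer further down. */
	static int mqprio_init_sketch(const void *opt /* TCA_OPTIONS payload */)
	{
		if (!opt)
			return -EINVAL;
		/* ... parse the tc_mqprio_qopt options ... */
		return 0;
	}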
Signed-off-by: Thomas Graf <tgraf@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 21 Dec 2011 20:00:32 +0000 (20:00 +0000)]
bridge: provide a mtu() method for fake_dst_ops
Commit 618f9bc74a039da76 (net: Move mtu handling down to the protocol
depended handlers) forgot the bridge netfilter case, adding a NULL
dereference in ip_fragment().
Reported-by: Chris Boot <bootc@bootc.net> CC: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 22 Dec 2011 23:36:17 +0000 (15:36 -0800)]
Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md:
md/bitmap: It is OK to clear bits during recovery.
md: don't give up looking for spares on first failure-to-add
md/raid5: ensure correct assessment of drives during degraded reshape.
md/linear: fix hot-add of devices to linear arrays.
When writing to an array that is undergoing recovery (a spare is being
integrated into the array), the write will set bits in the bitmap, but they
will not be cleared when the write completes.
For bits covering areas that have not been recovered yet this is not a
problem, as the recovery will clear those bits. However, bits set in an
already-recovered region will stay set and never be cleared.
This doesn't risk data integrity. The only negatives are:
- next time there is a crash, more resyncing than necessary will
be done.
- the bitmap doesn't look clean, which is confusing.
While an array is recovering we don't want to update the
'events_cleared' setting in the bitmap but we do still want to clear
bits that have very recently been set - providing they were written to
the recovering device.
So split those two needs - which previously both depended on 'success' - and
always clear the bit if the write went to all devices.
NeilBrown [Thu, 22 Dec 2011 22:57:19 +0000 (09:57 +1100)]
md: don't give up looking for spares on first failure-to-add
Before performing a recovery we try to remove any spares that
might not be working, then add any that might have become relevant.
Currently we abort on the first spare that cannot be added.
This is a false optimisation.
It is conceivable that - depending on rules in the personality - a
subsequent spare might be accepted.
Also the loop does other things like count the available spares and
reset the 'recovery_offset' value.
If we abort early these might not happen properly.
So remove the early abort.
In particular, if you have an array that is undergoing recovery and which has
extra spares, then the recovery may not restart after a reboot, as the count
of 'spares' might end up as zero.
NeilBrown [Thu, 22 Dec 2011 22:57:00 +0000 (09:57 +1100)]
md/raid5: ensure correct assessment of drives during degraded reshape.
While reshaping a degraded array (as when reshaping a RAID0 by first
converting it to a degraded RAID4) we currently get confused about
which devices are in_sync. In most cases we get it right, but in the
region that is being reshaped we need to treat non-failed devices as
in-sync when we have the data but haven't actually written it out yet.
Reported-by: Adam Kwolek <adam.kwolek@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
NeilBrown [Thu, 22 Dec 2011 22:56:55 +0000 (09:56 +1100)]
md/linear: fix hot-add of devices to linear arrays.
commit d70ed2e4fafdbef0800e73942482bb075c21578b
broke hot-add to a linear array.
After that commit, metadata is not written to devices until they
have been fully integrated into the array as determined by
saved_raid_disk. That patch arranged to clear that field after
a recovery completed.
However for linear arrays, there is no recovery - the integration is
instantaneous. So we need to explicitly clear the saved_raid_disk
field.
David S. Miller [Thu, 22 Dec 2011 21:23:59 +0000 (13:23 -0800)]
sparc64: Fix MSIQ HV call ordering in pci_sun4v_msiq_build_irq().
This silently was working for many years and stopped working on
Niagara-T3 machines.
We need to set the MSIQ to VALID before we can set its state to IDLE.
On Niagara-T3, setting the state to IDLE first was causing HV_EINVAL
errors. The hypervisor documentation says, rather ambiguously, that
the MSIQ must be "initialized" before one can set the state.
I previously understood this to mean merely that a successful setconf()
operation has been performed on the MSIQ, which we have done at this
point. But it seems to also mean that it has been set VALID too.
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 22 Dec 2011 20:59:47 +0000 (12:59 -0800)]
Merge branch 'usb-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
* 'usb-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: Fix usb/isp1760 build on sparc
usb: gadget: epautoconf: do not change number of streams
usb: dwc3: core: fix cached revision on our structure
usb: musb: fix reset issue with full speed device
Stephen Rothwell [Thu, 22 Dec 2011 06:03:29 +0000 (17:03 +1100)]
ipv4: using prefetch requires including prefetch.h
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xi Wang [Wed, 21 Dec 2011 10:18:33 +0000 (05:18 -0500)]
vmwgfx: fix incorrect VRAM size check in vmw_kms_fb_create()
Commit e133e737 didn't correctly fix the integer overflow issue.
- unsigned int required_size;
+ u64 required_size;
...
required_size = mode_cmd->pitch * mode_cmd->height;
- if (unlikely(required_size > dev_priv->vram_size)) {
+ if (unlikely(required_size > (u64) dev_priv->vram_size)) {
Note that both pitch and height are u32. Their product is still u32 and
would overflow before being assigned to required_size. A correct way is
to convert pitch and height to u64 before the multiplication.
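The distinction in isolation (a standalone illustration):

	#include <stdint.h>

	uint64_t required_size_wrong(uint32_t pitch, uint32_t height)
	{
		return pitch * height;            /* 32-bit multiply, can wrap */
	}

	uint64_t required_size_right(uint32_t pitch, uint32_t height)
	{
		return (uint64_t)pitch * height;  /* 64-bit multiply, cannot wrap */
	}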
This patch calls the existing vmw_kms_validate_mode_vram() for
validation.
Signed-off-by: Xi Wang <xi.wang@gmail.com> Reviewed-and-tested-by: Thomas Hellstrom <thellstrom@vmware.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Alex Deucher [Wed, 21 Dec 2011 16:58:17 +0000 (11:58 -0500)]
drm/radeon/kms: bail on BTC parts if MC ucode is missing
We already do this for cayman, need to also do it for
BTC parts. The default memory and voltage setup is not
adequate for advanced operation. Continuing will
result in an unusable display.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@kernel.org Cc: Jean Delvare <khali@linux-fr.org> Signed-off-by: Dave Airlie <airlied@redhat.com>
Srivatsa S. Bhat [Wed, 21 Dec 2011 21:15:29 +0000 (02:45 +0530)]
VFS: Fix race between CPU hotplug and lglocks
Currently, the *_global_[un]lock_online() routines are not at all synchronized
with CPU hotplug. Soft-lockups detected as a consequence of this race were
reported earlier at https://lkml.org/lkml/2011/8/24/185. (Thanks to Cong Meng
for finding out that the root-cause of this issue is the race condition
between br_write_[un]lock() and CPU hotplug, which results in the lock states
getting messed up).
Fixing this race by just adding {get,put}_online_cpus() at appropriate places
in *_global_[un]lock_online() is not a good option, because, then suddenly
br_write_[un]lock() would become blocking, whereas they have been kept as
non-blocking all this time, and we would want to keep them that way.
So, overall, we want to ensure 3 things:
1. br_write_lock() and br_write_unlock() must remain as non-blocking.
2. The corresponding lock and unlock of the per-cpu spinlocks must not happen
for different sets of CPUs.
3. Either prevent any new CPU online operation in between this lock-unlock, or
ensure that the newly onlined CPU does not proceed with its corresponding
per-cpu spinlock unlocked.
To achieve all this:
(a) We introduce a new spinlock that is taken by the *_global_lock_online()
routine and released by the *_global_unlock_online() routine.
(b) We register a callback for CPU hotplug notifications, and this callback
takes the same spinlock as above.
(c) We maintain a bitmap which is close to the cpu_online_mask, and once it is
initialized in the lock_init() code, all future updates to it are done in
the callback, under the above spinlock.
(d) The above bitmap is used (instead of cpu_online_mask) while locking and
unlocking the per-cpu locks.
The callback takes the spinlock upon the CPU_UP_PREPARE event. So, if the
br_write_lock-unlock sequence is in progress, the callback keeps spinning,
thus preventing the CPU online operation till the lock-unlock sequence is
complete. This takes care of requirement (3).
The bitmap that we maintain remains unmodified throughout the lock-unlock
sequence, since all updates to it are managed by the callback, which takes
the same spinlock as the one taken by the lock code and released only by the
unlock routine. Combining this with (d) above, satisfies requirement (2).
Overall, since we use a spinlock (mentioned in (a)) to prevent CPU hotplug
operations from racing with br_write_lock-unlock, requirement (1) is also
taken care of.
By the way, it is to be noted that a CPU offline operation can actually run
in parallel with our lock-unlock sequence, because our callback doesn't react
to notifications earlier than CPU_DEAD (in order to maintain our bitmap
properly). And this means, since we use our own bitmap (which is stale, on
purpose) during the lock-unlock sequence, we could end up unlocking the
per-cpu lock of an offline CPU (because we had locked it earlier, when the
CPU was online), in order to satisfy requirement (2). But this is harmless,
though it looks a bit awkward.
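A heavily simplified sketch of scheme (a)-(d), assuming the 2011-era CPU notifier API; all names here are made up for illustration and the notifier registration is omitted:

	#include <linux/cpu.h>
	#include <linux/cpumask.h>
	#include <linux/notifier.h>
	#include <linux/percpu.h>
	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(hotplug_sync_lock);            /* (a)/(b) */
	static DECLARE_BITMAP(lock_cpus_bits, NR_CPUS);        /* (c) our own mask */
	static DEFINE_PER_CPU(spinlock_t, my_percpu_lock);     /* initialised elsewhere */

	void my_global_lock_online(void)
	{
		int i;

		spin_lock(&hotplug_sync_lock);                  /* holds off CPU_UP_PREPARE */
		for_each_cpu(i, to_cpumask(lock_cpus_bits))     /* (d) not cpu_online_mask */
			spin_lock(&per_cpu(my_percpu_lock, i));
	}

	void my_global_unlock_online(void)
	{
		int i;

		for_each_cpu(i, to_cpumask(lock_cpus_bits))
			spin_unlock(&per_cpu(my_percpu_lock, i));
		spin_unlock(&hotplug_sync_lock);
	}

	static int my_cpu_callback(struct notifier_block *nb,
				   unsigned long action, void *hcpu)
	{
		unsigned long cpu = (unsigned long)hcpu;

		switch (action & ~CPU_TASKS_FROZEN) {
		case CPU_UP_PREPARE:                            /* spins during lock-unlock */
			spin_lock(&hotplug_sync_lock);
			cpumask_set_cpu(cpu, to_cpumask(lock_cpus_bits));
			spin_unlock(&hotplug_sync_lock);
			break;
		case CPU_DEAD:                                  /* mask updated only here */
			spin_lock(&hotplug_sync_lock);
			cpumask_clear_cpu(cpu, to_cpumask(lock_cpus_bits));
			spin_unlock(&hotplug_sync_lock);
			break;
		}
		return NOTIFY_OK;
	}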
Debugged-by: Cong Meng <mc@linux.vnet.ibm.com> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
net: Add a flow_cache_flush_deferred function
ipv4: reintroduce route cache garbage collector
net: have ipconfig not wait if no dev is available
sctp: Do not account for sizeof(struct sk_buff) in estimated rwnd
asix: new device id
davinci-cpdma: fix locking issue in cpdma_chan_stop
sctp: fix incorrect overflow check on autoclose
r8169: fix Config2 MSIEnable bit setting.
llc: llc_cmsg_rcv was getting called after sk_eat_skb.
net: bpf_jit: fix an off-one bug in x86_64 cond jump target
iwlwifi: update SCD BC table for all SCD queues
Revert "Bluetooth: Revert: Fix L2CAP connection establishment"
Bluetooth: Clear RFCOMM session timer when disconnecting last channel
Bluetooth: Prevent uninitialized data access in L2CAP configuration
iwlwifi: allow to switch to HT40 if not associated
iwlwifi: tx_sync only on PAN context
mwifiex: avoid double list_del in command cancel path
ath9k: fix max phy rate at rate control init
nfc: signedness bug in __nci_request()
iwlwifi: do not set the sequence control bit is not needed
Linus Torvalds [Thu, 22 Dec 2011 02:29:05 +0000 (18:29 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: atmel/ac97c: using software reset instead hardware reset if not available
Linus Torvalds [Thu, 22 Dec 2011 02:28:52 +0000 (18:28 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6:
mfd: Include linux/io.h to jz4740-adc
mfd: Use request_threaded_irq for twl4030-irq instead of irq_set_chained_handler
mfd: Base interrupt for twl4030-irq must be one-shot
mfd: Handle tps65910 clear-mask correctly
mfd: add #ifdef CONFIG_DEBUG_FS guard for ab8500_debug_resources
mfd: Fix twl-core oops while calling twl_i2c_* for unbound driver
mfd: include linux/module.h for ab5500-debugfs
mfd: Update wm8994 active device checks for WM1811
mfd: Set tps6586x bits if new value is different from the old one
mfd: Set da903x bits if new value is different from the old one
mfd: Set adp5520 bits if new value is different from the old one
mfd: Add missed free_irq in da903x_remove
Dave Kleikamp [Wed, 21 Dec 2011 17:05:48 +0000 (11:05 -0600)]
vfs: __read_cache_page should use gfp argument rather than GFP_KERNEL
lockdep reports a deadlock in jfs because a special inode's rw semaphore
is taken recursively. The mapping's gfp mask is GFP_NOFS, but is not
used when __read_cache_page() calls add_to_page_cache_lru().
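The shape of the fix inside __read_cache_page(), sketched as a one-line fragment (assuming the function already receives a gfp argument):

	/* Honour the caller's gfp mask instead of hard-coding GFP_KERNEL,
	 * so a GFP_NOFS mapping cannot recurse into the filesystem. */
	err = add_to_page_cache_lru(page, mapping, index, gfp);  /* was GFP_KERNEL */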
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge branch 'for-greg' of git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb into usb-linus
* 'for-greg' of git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb:
usb: gadget: epautoconf: do not change number of streams
usb: dwc3: core: fix cached revision on our structure
usb: musb: fix reset issue with full speed device
usb/isp1760: Let OF bindings depend on general CONFIG_OF instead of PPC_OF.
To be able to use the driver on other OF-aware architectures, too.
And add the necessary OF-related #includes to fix a compilation error.
Signed-off-by: Joachim Foerster <joachim.foerster@missinglinkelectronics.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
enabled the build on all CONFIG_OF architectures, but it cannot do
this.
This driver depends upon CONFIG_OF_IRQ but not all CONFIG_OF platforms
support that infrastructure, in particular Sparc does not so the
build fails.
Please push a patch like the following to Linus so that this code only
gets built where it actually should.
--------------------
usb/isp1760: Add missing CONFIG_OF_IRQ dependency on OF code.
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Steffen Klassert [Wed, 21 Dec 2011 21:48:08 +0000 (16:48 -0500)]
net: Add a flow_cache_flush_deferred function
flow_cache_flush() might sleep but can be called from
atomic context via the xfrm garbage collector. So add
a flow_cache_flush_deferred() function and use it if
the xfrm garbage collector is invoked from within the
packet path.
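The deferred variant plausibly has this shape (a sketch, under the assumption that a workqueue is used to reach process context):

	#include <linux/workqueue.h>

	static void flow_cache_flush_task(struct work_struct *work)
	{
		flow_cache_flush();               /* may sleep: now in process context */
	}

	static DECLARE_WORK(flow_cache_flush_work, flow_cache_flush_task);

	void flow_cache_flush_deferred(void)
	{
		schedule_work(&flow_cache_flush_work);
	}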
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Timo Teräs <timo.teras@iki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 21 Dec 2011 20:47:16 +0000 (15:47 -0500)]
ipv4: reintroduce route cache garbage collector
Commit 2c8cec5c10b (ipv4: Cache learned PMTU information in inetpeer)
removed the IP route cache garbage collector a bit too soon, as this gc was
responsible for expired route cleanup, releasing their neighbour
references.
As pointed out by Robert Gladewitz, recent kernels can fill and exhaust
their neighbour cache.
Reintroduce the garbage collection, since we'll have to wait until our
neighbour lookups become refcount-less to stop depending on this stuff.
Reported-by: Robert Gladewitz <gladewitz@gmx.de> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 21 Dec 2011 02:39:37 +0000 (18:39 -0800)]
Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6:
mtd: plat_ram: call mtd_device_register only if partition data exists
mtd: pxa2xx-flash.c: It used to fall back to provided table.
mtd: gpmi: add missing include 'module.h'
mtd: ndfc: fix typo in structure dereference
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Paul Moore <paul@paul-moore.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: stable@kernel.org Signed-off-by: James Morris <jmorris@namei.org>
Linus Torvalds [Tue, 20 Dec 2011 19:42:38 +0000 (11:42 -0800)]
Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time/clocksource: Fix kernel-doc warnings
rtc: m41t80: Workaround broken alarm functionality
rtc: Expire alarms after the time is set.
Linus Torvalds [Tue, 20 Dec 2011 19:41:17 +0000 (11:41 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
Linus Torvalds [Tue, 20 Dec 2011 19:40:48 +0000 (11:40 -0800)]
Merge branch 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
Linus Torvalds [Tue, 20 Dec 2011 19:31:56 +0000 (11:31 -0800)]
Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Fix a regression in nfs_file_llseek()
NFSv4: Do not accept delegated opens when a delegation recall is in effect
NFSv4: Ensure correct locking when accessing the 'lock_states' list
NFSv4.1: Ensure that we handle _all_ SEQUENCE status bits.
NFSv4: Don't error if we handled it in nfs4_recovery_handle_error
SUNRPC: Ensure we always bump the backlog queue in xprt_free_slot
SUNRPC: Fix the execution time statistics in the face of RPC restarts
Linus Torvalds [Tue, 20 Dec 2011 19:31:44 +0000 (11:31 -0800)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
vmwgfx: Clip cliprects against screen boundaries in present and dirty
vmwgfx: Resend the cursor after legacy modeset
vmwgfx: Do better culling of presents
vmwgfx: Refactor kms code to use vmw_user_lookup_handle helper
vmwgfx: Add helper function to get surface or dmabuf
vmwgfx: Refactor cursor update
vmwgfx: Remove dmabuf check in present ioctl
vmwgfx: Use the revised fifo hw version register when present
Before waiting (predefined value 120s), check that at least one device
was successfully brought up. Otherwise (e.g. buggy bootloader
which does not set the MAC address) there is no point in waiting
for carrier.
Cc: Micha Nelissen <micha@neli.hopto.org> Cc: Holger Brunck <holger.brunck@keymile.com> Signed-off-by: Gerlando Falauto <gerlando.falauto@keymile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Graf [Mon, 19 Dec 2011 04:11:40 +0000 (04:11 +0000)]
sctp: Do not account for sizeof(struct sk_buff) in estimated rwnd
When checking whether a DATA chunk fits into the estimated rwnd a
full sizeof(struct sk_buff) is added to the needed chunk size. This
quickly exhausts the available rwnd space and leads to packets being
sent which are much below the PMTU limit. This can lead to much worse
performance.
The reason for this behaviour was to avoid putting too much memory
pressure on the receiver. The concept is not completely irrational
because a Linux receiver does in fact clone an skb for each DATA chunk
delivered. However, Linux also reserves half the available socket
buffer space for data structures therefore usage of it is already
accounted for.
When proposing to change this the last time it was noted that this
behaviour was introduced to solve a performance issue caused by rwnd
overusage in combination with small DATA chunks.
Trying to reproduce this I found that with the sk_buff overhead removed,
the performance would improve significantly unless socket buffer limits
are increased.
The following numbers have been gathered using a patched iperf
supporting SCTP over a live 1 Gbit ethernet network. The -l option
was used to limit DATA chunk sizes. The numbers listed are based on
the average of 3 test runs each. Default values have been used for
sk_(r|w)mem.
binary_sysctl() calls sysctl_getname(), which allocates from the names_cache
slab using __getname().
The matching function to free the name is __putname(), and not putname()
which should be used only to match getname() allocations.
This is because when auditing is enabled, putname() calls audit_putname
*instead* (not in addition) to __putname(). Then, if a syscall is in
progress, audit_putname does not release the name - instead, it expects
the name to get released when the syscall completes, but that will happen
only if audit_getname() was called previously, i.e. if the name was
allocated with getname() rather than the naked __getname(). So,
__getname() followed by putname() ends up leaking memory.
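The correct pairing, as an illustrative fragment:

	/* A name obtained with the "naked" __getname() must be freed with
	 * __putname(); putname() is reserved for getname() allocations,
	 * because with auditing enabled it hands the name to the audit
	 * code instead of freeing it. */
	name = __getname();
	if (!name)
		return -ENOMEM;
	/* ... use name ... */
	__putname(name);        /* not putname() */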
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Eric Paris <eparis@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
where the oom score computation was divided into several steps and is no
longer computed as one expression in unsigned long (rss, swapents, nr_pte
are unsigned long), while the result assigned to points (an int) is in the
range 1..1000. So there can be an int overflow while computing
  176 points *= 1000;
and points may end up with a negative value, meaning the oom score for a mem
hog task will be one:
  196 if (points <= 0)
  197         return 1;
For example:
[ 3366]     0  3366 35390480 24303939       5       0       0 oom01
Out of memory: Kill process 3366 (oom01) score 1 or sacrifice child
Here the oom01 process consumes more than 24303939(rss)*4096 ~= 92GB of
physical memory, but its oom score is one.
In this situation the mem hog task is skipped and the oom killer kills
another, most probably innocent, task with an oom score greater than one.
The points variable should be of type long instead of int to prevent the
int overflow.
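A userspace illustration of the arithmetic, using the rss value from the example above:

	#include <limits.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned long pages = 24303939UL;   /* rss from the oom report above */

		/* The value the kernel needs to hold after "points *= 1000". */
		long long needed = (long long)pages * 1000;
		printf("needed: %lld, fits in int: %s\n",
		       needed, needed <= INT_MAX ? "yes" : "no");   /* "no" -> overflow */

		/* With points declared long (64-bit on LP64 kernels) it fits. */
		long points = (long)pages * 1000;
		printf("long points: %ld\n", points);
		return 0;
	}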
Haogang Chen [Tue, 20 Dec 2011 01:11:56 +0000 (17:11 -0800)]
nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
There is a potential integer overflow in nilfs_ioctl_clean_segments().
When a large argv[n].v_nmembs is passed from the userspace, the subsequent
call to vmalloc() will allocate a buffer smaller than expected, which
leads to out-of-bounds accesses in nilfs_ioctl_move_blocks() and
nilfs_clean_segments().
The following check does not prevent the overflow because nsegs is also
controlled by the userspace and could be very large.
if (argv[n].v_nmembs > nsegs * nilfs->ns_blocks_per_segment)
goto out_free;
This patch clamps argv[n].v_nmembs to UINT_MAX / argv[n].v_size, and
returns -EINVAL on overflow.
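The guard in isolation (a standalone sketch of the reject logic):

	#include <limits.h>
	#include <stddef.h>
	#include <stdint.h>

	/* Reject v_nmembs values whose product with v_size would overflow
	 * the buffer-length computation (the kernel returns -EINVAL here). */
	static int vec_length(uint32_t v_nmembs, uint32_t v_size, size_t *len)
	{
		if (v_size != 0 && v_nmembs > UINT_MAX / v_size)
			return -1;
		*len = (size_t)v_nmembs * v_size;
		return 0;
	}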
David Rientjes [Tue, 20 Dec 2011 01:11:52 +0000 (17:11 -0800)]
cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty
nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a
new set of allowed cpuset nodes where the two nodemasks, as a result of
the remap, are now disjoint.
c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
nodes from changing for a thread. This causes any update to a set of
allowed nodes to stall until put_mems_allowed() is called.
This stall is unnecessary, however, if at least one node remains unchanged
in the update to the set of allowed nodes. This was addressed by 89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
node remains set"), but it's still possible that an empty nodemask may be
read from a mempolicy because the old nodemask may be remapped to the new
nodemask during rebind. To prevent this, only avoid the stall if there is
no mempolicy for the thread being changed.
This is a temporary solution until all reads from mempolicy nodemasks can
be guaranteed to not be empty without the get_mems_allowed()
synchronization.
Also moves the check for nodemask intersection inside task_lock() so that
tsk->mems_allowed cannot change. This ensures that nothing can set this
tsk's mems_allowed out from under us and also protects tsk->mempolicy.
Reported-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Axel Lin [Fri, 9 Dec 2011 03:27:55 +0000 (11:27 +0800)]
mfd: Include linux/io.h to jz4740-adc
Include linux/io.h to fix below build error:
CC drivers/mfd/jz4740-adc.o
drivers/mfd/jz4740-adc.c: In function 'jz4740_adc_irq_demux':
drivers/mfd/jz4740-adc.c:73: error: implicit declaration of function 'readb'
drivers/mfd/jz4740-adc.c: In function 'jz4740_adc_set_enabled':
drivers/mfd/jz4740-adc.c:110: error: implicit declaration of function 'writeb'
drivers/mfd/jz4740-adc.c: In function 'jz4740_adc_set_config':
drivers/mfd/jz4740-adc.c:146: error: implicit declaration of function 'readl'
drivers/mfd/jz4740-adc.c:151: error: implicit declaration of function 'writel'
drivers/mfd/jz4740-adc.c: In function 'jz4740_adc_probe':
drivers/mfd/jz4740-adc.c:249: error: implicit declaration of function 'ioremap_nocache'
drivers/mfd/jz4740-adc.c:249: warning: assignment makes pointer from integer without a cast
drivers/mfd/jz4740-adc.c:289: warning: passing argument 3 of 'mfd_add_devices' discards qualifiers from pointer target type
include/linux/mfd/core.h:93: note: expected 'struct mfd_cell *' but argument is of type 'const struct mfd_cell *'
drivers/mfd/jz4740-adc.c:299: error: implicit declaration of function 'iounmap'
make[2]: *** [drivers/mfd/jz4740-adc.o] Error 1
make[1]: *** [drivers/mfd] Error 2
make: *** [drivers] Error 2
Signed-off-by: Axel Lin <axel.lin@gmail.com> Acked-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
NeilBrown [Sat, 26 Nov 2011 20:17:41 +0000 (07:17 +1100)]
mfd: Use request_threaded_irq for twl4030-irq instead of irq_set_chained_handler
irq_set_chained_handler sets 'desc->handle_irq'.
However this irq is called by handle_nested_irq from handle_twl4030_pih,
and that uses action->thread_fn.
So the handler set with irq_set_chained_handler is never called.
So change to use request_threaded_irq instead - that sets the correct field.
Tested on GTA04 Phoenux.
Signed-off-by: NeilBrown <neilb@suse.de> Tested-by: Felipe Contreras <felipe.contreras@gmail.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
NeilBrown [Sat, 26 Nov 2011 20:17:41 +0000 (07:17 +1100)]
mfd: Base interrupt for twl4030-irq must be one-shot
As the interrupt source is only cleared by the threaded interrupt
service routine, we need to make the base interrupt IRQF_ONESHOT.
Without this, the first interrupt from the TWL4030 causes the CPU to
enter an infinite loop trying to handle the interrupt but never
clearing it.
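The request then looks roughly like this (handler name and cookie are placeholders, not the driver's real symbols):

	#include <linux/interrupt.h>

	/* No primary handler, only a threaded one: IRQF_ONESHOT keeps the
	 * line masked until the thread has run and cleared the TWL4030
	 * interrupt source, so the CPU cannot loop re-taking it. */
	ret = request_threaded_irq(irq, NULL, twl4030_pih_thread_fn,
				   IRQF_ONESHOT, "TWL4030-PIH", pih_data);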
Signed-off-by: NeilBrown <neilb@suse.de> Tested-by: Felipe Contreras <felipe.contreras@gmail.com> Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>