Suresh Siddha [Fri, 24 Aug 2012 21:13:02 +0000 (14:13 -0700)]
x86, fpu: use non-lazy fpu restore for processors supporting xsave
Fundamental model of the current Linux kernel is to lazily init and
restore FPU instead of restoring the task state during context switch.
This changes that fundamental lazy model to the non-lazy model for
the processors supporting xsave feature.
Reasons driving this model change are:
i. Newer processors support optimized state save/restore using xsaveopt and
xrstor by tracking the INIT state and MODIFIED state during context-switch.
This is faster than modifying the cr0.TS bit which has serializing semantics.
ii. Newer glibc versions use SSE for some of the optimized copy/clear routines.
With certain workloads (like boot, kernel-compilation etc), application
completes its work with in the first 5 task switches, thus taking upto 5 #DNA
traps with the kernel not getting a chance to apply the above mentioned
pre-load heuristic.
iii. Some xstate features (like AMD's LWP feature) don't honor the cr0.TS bit
and thus will not work correctly in the presence of lazy restore. Non-lazy
state restore is needed for enabling such features.
Some data on a two socket SNB system:
* Saved 20K DNA exceptions during boot on a two socket SNB system.
* Saved 50K DNA exceptions during kernel-compilation workload.
* Improved throughput of the AVX based checksumming function inside the
kernel by ~15% as xsave/xrstor is faster than the serializing clts/stts
pair.
Also now kernel_fpu_begin/end() relies on the patched
alternative instructions. So move check_fpu() which uses the
kernel_fpu_begin/end() after alternative_instructions().
Suresh Siddha [Fri, 24 Aug 2012 21:13:01 +0000 (14:13 -0700)]
lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models
Instead of using unlazy_fpu() check if user_has_fpu() and set/clear
the host TS bits so that the lguest works fine with both the
lazy/non-lazy FPU host models with minimal changes.
Suresh Siddha [Fri, 24 Aug 2012 21:13:00 +0000 (14:13 -0700)]
x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage
use kernel_fpu_begin/end() instead of unconditionally accessing cr0 and
saving/restoring just the few used xmm/ymm registers.
This has some advantages like:
* If the task's FPU state is already active, then kernel_fpu_begin()
will just save the user-state and avoiding the read/write of cr0.
In general, cr0 accesses are much slower.
* Manual save/restore of xmm/ymm registers will affect the 'modified' and
the 'init' optimizations brought in the by xsaveopt/xrstor
infrastructure.
* Foward compatibility with future vector register extensions will be a
problem if the xmm/ymm registers are manually saved and restored
(corrupting the extended state of those vector registers).
With this patch, there was no significant difference in the xor throughput
using AVX, measured during boot.
Suresh Siddha [Fri, 24 Aug 2012 21:12:59 +0000 (14:12 -0700)]
x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
kvm's guest fpu save/restore should be wrapped around
kernel_fpu_begin/end(). This will avoid for example taking a DNA
in kvm_load_guest_fpu() when it tries to load the fpu immediately
after doing unlazy_fpu() on the host side.
More importantly this will prevent the host process fpu from being
corrupted.
Suresh Siddha [Fri, 24 Aug 2012 21:12:58 +0000 (14:12 -0700)]
x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()
Few lines below we do drop_fpu() which is more safer. Remove the
unnecessary user_fpu_end() in save_xstate_sig(), which allows
the drop_fpu() to ignore any pending exceptions from the user-space
and drop the current fpu.
Suresh Siddha [Fri, 24 Aug 2012 21:12:57 +0000 (14:12 -0700)]
x86, fpu: drop_fpu() before restoring new state from sigframe
No need to save the state with unlazy_fpu(), that is about to get overwritten
by the state from the signal frame. Instead use drop_fpu() and continue
to restore the new state.
x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels
Currently for x86 and x86_32 binaries, fpstate in the user sigframe is copied
to/from the fpstate in the task struct.
And in the case of signal delivery for x86_64 binaries, if the fpstate is live
in the CPU registers, then the live state is copied directly to the user
sigframe. Otherwise fpstate in the task struct is copied to the user sigframe.
During restore, fpstate in the user sigframe is restored directly to the live
CPU registers.
Historically, different code paths led to different bugs. For example,
x86_64 code path was not preemption safe till recently. Also there is lot
of code duplication for support of new features like xsave etc.
Unify signal handling code paths for x86 and x86_64 kernels.
New strategy is as follows:
Signal delivery: Both for 32/64-bit frames, align the core math frame area to
64bytes as needed by xsave (this where the main fpu/extended state gets copied
to and excludes the legacy compatibility fsave header for the 32-bit [f]xsave
frames). If the state is live, copy the register state directly to the user
frame. If not live, copy the state in the thread struct to the user frame. And
for 32-bit [f]xsave frames, construct the fsave header separately before
the actual [f]xsave area.
Signal return: As the 32-bit frames with [f]xstate has an additional
'fsave' header, copy everything back from the user sigframe to the
fpstate in the task structure and reconstruct the fxstate from the 'fsave'
header (Also user passed pointers may not be correctly aligned for
any attempt to directly restore any partial state). At the next fpstate usage,
everything will be restored to the live CPU registers.
For all the 64-bit frames and the 32-bit fsave frame, restore the state from
the user sigframe directly to the live CPU registers. 64-bit signals always
restored the math frame directly, so we can expect the math frame pointer
to be correctly aligned. For 32-bit fsave frames, there are no alignment
requirements, so we can restore the state directly.
"lat_sig catch" microbenchmark numbers (for x86, x86_64, x86_32 binaries) are
with in the noise range with this change.
Use config_enabled() to cleanup the definitions of is_ia32/is_x32. Move
the function prototypes to the header file to cleanup ifdefs,
and move the x32_setup_rt_frame() code around.
1) NLA_PUT* --> nla_put_* conversion got one case wrong in
nfnetlink_log, fix from Patrick McHardy.
2) Missed error return check in ipw2100 driver, from Julia Lawall.
3) PMTU updates in ipv4 were setting the expiry time incorrectly, fix
from Eric Dumazet.
4) SFC driver erroneously reversed src and dst when reporting filters
via ethtool.
5) Memory leak in CAN protocol and wrong setting of IRQF_SHARED in
sja1000 can platform driver, from Alexey Khoroshilov and Sven
Schmitt.
6) Fix multicast traffic scaling regression in ipv4_dst_destroy, only
take the lock when we really need to. From Eric Dumazet.
7) Fix non-root process spoofing in netlink, from Pablo Neira Ayuso.
8) CWND reduction in TCP is done incorrectly during non-SACK recovery,
fix from Yuchung Cheng.
9) Revert netpoll change, and fix what was actually a driver specific
problem. From Amerigo Wang. This should cure bootup hangs with
netconsole some people reported.
10) Fix xen-netfront invoking __skb_fill_page_desc() with a NULL page
pointer. From Ian Campbell.
11) SIP NAT fix for expectiontation creation, from Pablo Neira Ayuso.
12) __ip_rt_update_pmtu() needs RCU locking, from Eric Dumazet.
13) Fix usbnet deadlock on resume, can't use GFP_KERNEL in this
situation. From Oliver Neukum.
14) The davinci ethernet driver triggers an OOPS on removal because it
frees an MDIO object before unregistering it. Fix from Bin Liu.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
net: qmi_wwan: add several new Gobi devices
fddi: 64 bit bug in smt_add_para()
net: ethernet: fix kernel OOPS when remove davinci_mdio module
net/xfrm/xfrm_state.c: fix error return code
net: ipv6: fix error return code
net: qmi_wwan: new device: Foxconn/Novatel E396
usbnet: fix deadlock in resume
cs89x0 : packet reception not working
netfilter: nf_conntrack: fix racy timer handling with reliable events
bnx2x: Correct the ndo_poll_controller call
bnx2x: Move netif_napi_add to the open call
ipv4: must use rcu protection while calling fib_lookup
bnx2x: fix 57840_MF pci id
net: ipv4: ipmr_expire_timer causes crash when removing net namespace
e1000e: DoS while TSO enabled caused by link partner with small MSS
l2tp: avoid to use synchronize_rcu in tunnel free function
gianfar: fix default tx vlan offload feature flag
netfilter: nf_nat_sip: fix incorrect handling of EBUSY for RTCP expectation
xen-netfront: use __pskb_pull_tail to ensure linear area is big enough on RX
netfilter: nfnetlink_log: fix error return code in init path
...
Gobi devices are composite, needing both the qcserial and
qmi_wwan drivers to support all functions. Re-syncing the
list of supported devices with qcserial.
Cc: Aleksander Morgado <aleksander@lanedo.com> Cc: Thomas Tuttle <ttuttle@chromium.org> Signed-off-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@tempietto.lan>
Dan Carpenter [Sat, 1 Sep 2012 09:57:40 +0000 (09:57 +0000)]
fddi: 64 bit bug in smt_add_para()
The intent was to set 4 bytes of data so that's why the sp_len is set
to 4 on the next line. The cast to u_long pointer clears 8 bytes
on 64 bit arches.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@tempietto.lan>
John Stultz [Fri, 31 Aug 2012 17:30:06 +0000 (13:30 -0400)]
time: Move ktime_t overflow checking into timespec_valid_strict
Andreas Bombe reported that the added ktime_t overflow checking added to
timespec_valid in commit 4e8b14526ca7 ("time: Improve sanity checking of
timekeeping inputs") was causing problems with X.org because it caused
timeouts larger then KTIME_T to be invalid.
Previously, these large timeouts would be clamped to KTIME_MAX and would
never expire, which is valid.
This patch splits the ktime_t overflow checking into a new
timespec_valid_strict function, and converts the timekeeping codes
internal checking to use this more strict function.
Reported-and-tested-by: Andreas Bombe <aeb@debian.org> Cc: Zhouping Liu <zliu@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'parisc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/parisc-2.6
Pull PARISC fixes from James Bottomley:
"This is a set of two bug fixes. One is the ATOMIC problem which is
now causing a compile failure in certain situations. The other is
mishandling of PER_LINUX32 which may also cause user visible effects.
Signed-off-by: James Bottomley <JBottomley@Parallels.com>"
* tag 'parisc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/parisc-2.6:
[PARISC] fix personality flag check in copy_thread()
[PARISC] Redefine ATOMIC_INIT and ATOMIC64_INIT to drop the casts
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Martin Schwidefsky:
"A couple of s390 bug fixes for 3.5-rc4"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/32: Don't clobber personality flags on exec
s390/smp: add missing smp_store_status() for !SMP
s390/dasd: fix ioctl return value
s390: Always use "long" for ssize_t to match size_t
Cc: Dan Williams <dcbw@redhat.com> Cc: Bjørn Mork <bjorn@mork.no> Cc: Ben Chan <benchan@google.com> Signed-off-by: Aleksander Morgado <aleksander@lanedo.com> Acked-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Neukum [Sun, 26 Aug 2012 20:41:38 +0000 (20:41 +0000)]
usbnet: fix deadlock in resume
A usbnet device can share a multifunction device
with a storage device. If the storage device is autoresumed
the usbnet devices also needs to be autoresumed. Allocating
memory with GFP_KERNEL can deadlock in this case.
netfilter: nf_conntrack: fix racy timer handling with reliable events
Existing code assumes that del_timer returns true for alive conntrack
entries. However, this is not true if reliable events are enabled.
In that case, del_timer may return true for entries that were
just inserted in the dying list. Note that packets / ctnetlink may
hold references to conntrack entries that were just inserted to such
list.
This patch fixes the issue by adding an independent timer for
event delivery. This increases the size of the ecache extension.
Still we can revisit this later and use variable size extensions
to allocate this area on demand.
Tested-by: Oliver Smith <olipro@8.c.9.b.0.7.4.0.1.0.0.2.ip6.arpa> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Merav Sicron [Mon, 27 Aug 2012 03:26:19 +0000 (03:26 +0000)]
bnx2x: Move netif_napi_add to the open call
Move netif_napi_add for all queues from the probe call to the open call, to
avoid the case that napi objects are added for queues that may eventually not
be initialized and activated. With the former behavior, the driver could crash
when netpoll was calling ndo_poll_controller.
Signed-off-by: Merav Sicron <meravs@broadcom.com> Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yuval Mintz [Sun, 26 Aug 2012 00:35:45 +0000 (00:35 +0000)]
bnx2x: fix 57840_MF pci id
Commit c3def943c7117d42caaed3478731ea7c3c87190e have added support for
new pci ids of the 57840 board, while failing to change the obsolete value
in 'pci_ids.h'.
This patch does so, allowing the probe of such devices.
Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: ipv4: ipmr_expire_timer causes crash when removing net namespace
When tearing down a net namespace, ipv4 mr_table structures are freed
without first deactivating their timers. This can result in a crash in
run_timer_softirq.
This patch mimics the corresponding behaviour in ipv6.
Locking and synchronization seem to be adequate.
We are about to kfree mrt, so existing code should already make sure that
no other references to mrt are pending or can be created by incoming traffic.
The functions invoked here do not cause new references to mrt or other
race conditions to be created.
Invoking del_timer_sync guarantees that ipmr_expire_timer is inactive.
Both ipmr_expire_process (whose completion we may have to wait in
del_timer_sync) and mroute_clean_tables internally use mfc_unres_lock
or other synchronizations when needed, and they both only modify mrt.
Tested in Linux 3.4.8.
Signed-off-by: Francesco Ruggeri <fruggeri@aristanetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Bruce Allan [Fri, 24 Aug 2012 20:38:11 +0000 (20:38 +0000)]
e1000e: DoS while TSO enabled caused by link partner with small MSS
With a low enough MSS on the link partner and TSO enabled locally, the
networking stack can periodically send a very large (e.g. 64KB) TCP
message for which the driver will attempt to use more Tx descriptors than
are available by default in the Tx ring. This is due to a workaround in
the code that imposes a limit of only 4 MSS-sized segments per descriptor
which appears to be a carry-over from the older e1000 driver and may be
applicable only to some older PCI or PCIx parts which are not supported in
e1000e. When the driver gets a message that is too large to fit across the
configured number of Tx descriptors, it stops the upper stack from queueing
any more and gets stuck in this state. After a timeout, the upper stack
assumes the adapter is hung and calls the driver to reset it.
Remove the unnecessary limitation of using up to only 4 MSS-sized segments
per Tx descriptor, and put in a hard failure test to catch when attempting
to check for message sizes larger than would fit in the whole Tx ring.
Refactor the remaining logic that limits the size of data per Tx descriptor
from a seemingly arbitrary 8KB to a limit based on the dynamic size of the
Tx packet buffer as described in the hardware specification.
Also, fix the logic in the check for space in the Tx ring for the next
largest possible packet after the current one has been successfully queued
for transmit, and use the appropriate defines for default ring sizes in
e1000_probe instead of magic values.
This issue goes back to the introduction of e1000e in 2.6.24 when it was
split off from e1000.
Reported-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: Bruce Allan <bruce.w.allan@intel.com> Cc: Stable <stable@vger.kernel.org> [2.6.24+] Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Claudiu Manoil [Thu, 23 Aug 2012 21:46:25 +0000 (21:46 +0000)]
gianfar: fix default tx vlan offload feature flag
Commit -
"b852b72 gianfar: fix bug caused by 87c288c6e9aa31720b72e2bc2d665e24e1653c3e"
disables by default (on mac init) the hw vlan tag insertion.
The "features" flags were not updated to reflect this, and
"ethtool -K" shows tx-vlan-offload to be "on" by default.
Cc: Sebastian Poehn <sebastian.poehn@belden.com> Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ian Campbell [Wed, 22 Aug 2012 00:26:47 +0000 (00:26 +0000)]
xen-netfront: use __pskb_pull_tail to ensure linear area is big enough on RX
I'm slightly concerned by the "only in exceptional circumstances"
comment on __pskb_pull_tail but the structure of an skb just created
by netfront shouldn't hit any of the especially slow cases.
This approach still does slightly more work than the old way, since if
we pull up the entire first frag we now have to shuffle everything
down where before we just received into the right place in the first
place.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Mel Gorman <mgorman@suse.de> Cc: xen-devel@lists.xensource.com Cc: netdev@vger.kernel.org Cc: linux-kernel@vger.kernel.org Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 30 Aug 2012 16:11:33 +0000 (09:11 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"A bunch of scattered fixes ati/intel/nouveau, couple of core ones,
nothing too shocking or different."
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm: Add EDID_QUIRK_FORCE_REDUCED_BLANKING for ASUS VW222S
gma500: Consider CRTC initially active.
drm/radeon: fix dig encoder selection on DCE61
drm/radeon: fix double free in radeon_gpu_reset
drm/radeon: force dma32 to fix regression rs4xx,rs6xx,rs740
drm/radeon: rework panel mode setup
drm/radeon/atom: powergating fixes for DCE6
drm/radeon/atom: rework DIG modesetting on DCE3+
drm/radeon: don't disable plls that are in use by other crtcs
drm/radeon: add proper checking of RESOLVE_BOX command for r600-r700
drm/radeon: initialize tracked CS state
drm/radeon: fix reading CB_COLORn_MASK from the CS
drm/nvc0/copy: check PUNITS to determine which copy engines are disabled
i915: Quirk no_lvds on Gigabyte GA-D525TUD ITX motherboard
drm/i915: Use the correct size of the GTT for placing the per-process entries
drm: Check for invalid cursor flags
drm: Initialize object type when using DRM_MODE() macro
drm/i915: fix color order for BGR formats on IVB
drm/i915: fix wrong order of parameters in port checking functions
Heiko Carstens [Tue, 28 Aug 2012 08:02:08 +0000 (10:02 +0200)]
s390/32: Don't clobber personality flags on exec
In native 32 bit mode the personality flags were not correctly inherited.
This is the s390 version of 59e4c3a2 "powerpc/32: Don't clobber personality
flags on exec".
Reported-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=17629 Signed-off-by: Paul Menzel <paulepanter@users.sourceforge.net> Cc: <dri-devel@lists.freedesktop.org> Cc: Adam Jackson <ajax@redhat.com> Cc: Ian Pilcher <arequipeno@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Dave Airlie <airlied@redhat.com>
Dave Airlie [Thu, 30 Aug 2012 00:35:34 +0000 (10:35 +1000)]
Merge branch 'drm-fixes-3.6' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
Alex writes:
Highlights:
- fix a gart regression on older IGP chips
- more MSAA fixes
- fix a double free in gpu reset code
- modesetting fixes
- trinity dig encoder fix.
* 'drm-fixes-3.6' of git://people.freedesktop.org/~agd5f/linux:
drm/radeon: fix dig encoder selection on DCE61
drm/radeon: fix double free in radeon_gpu_reset
drm/radeon: force dma32 to fix regression rs4xx,rs6xx,rs740
drm/radeon: rework panel mode setup
drm/radeon/atom: powergating fixes for DCE6
drm/radeon/atom: rework DIG modesetting on DCE3+
drm/radeon: don't disable plls that are in use by other crtcs
drm/radeon: add proper checking of RESOLVE_BOX command for r600-r700
drm/radeon: initialize tracked CS state
drm/radeon: fix reading CB_COLORn_MASK from the CS
Forest Bond [Mon, 13 Aug 2012 16:31:24 +0000 (16:31 +0000)]
gma500: Consider CRTC initially active.
[this one ideally should make 3.6 - it fixes the very annoying mode setting bug]
This causes the pipe to be forced off prior to initial mode set, which
roughly mirrors the behavior of the i915 driver. It fixes initial mode
setting on my Intel DN2800MT (Cedarview) board. Without it, mode
setting triggers an out-of-range error from the monitor for most modes,
but only on initial configuration (i.e. they can be configured
successfully from userspace after that).
Signed-off-by: Forest Bond <forest.bond@rapidrollout.com> Signed-off-by: Alan Cox <alan@linux.intel.com> Cc: stable <stable@vger.kernel.org> Signed-off-by: Dave Airlie <airlied@redhat.com>
This patch reverts the offending commit and fixes be_poll() by
avoid disabling BH there, this is okay because be_poll()
can be called either by poll_napi() which already disables
IRQ, or by net_rx_action() which already disables BH.
Reported-by: Andrew Morton <akpm@linux-foundation.org> Reported-by: Sylvain Munaut <s.munaut@whatever-company.com> Cc: Sylvain Munaut <s.munaut@whatever-company.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Miller <davem@davemloft.net> Cc: Sathya Perla <sathya.perla@emulex.com> Cc: Subbu Seetharaman <subbu.seetharaman@emulex.com> Cc: Ajit Khaparde <ajit.khaparde@emulex.com> Signed-off-by: Cong Wang <amwang@redhat.com> Tested-by: Sylvain Munaut <s.munaut@whatever-company.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 29 Aug 2012 18:36:22 +0000 (11:36 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"I've split out the big send/receive update from my last pull request
and now have just the fixes in my for-linus branch. The send/recv
branch will wander over to linux-next shortly though.
The largest patches in this pull are Josef's patches to fix DIO
locking problems and his patch to fix a crash during balance. They
are both well tested.
The rest are smaller fixes that we've had queued. The last rc came
out while I was hacking new and exciting ways to recover from a
misplaced rm -rf on my dev box, so these missed rc3."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
Btrfs: fix that repair code is spuriously executed for transid failures
Btrfs: fix ordered extent leak when failing to start a transaction
Btrfs: fix a dio write regression
Btrfs: fix deadlock with freeze and sync V2
Btrfs: revert checksum error statistic which can cause a BUG()
Btrfs: remove superblock writing after fatal error
Btrfs: allow delayed refs to be merged
Btrfs: fix enospc problems when deleting a subvol
Btrfs: fix wrong mtime and ctime when creating snapshots
Btrfs: fix race in run_clustered_refs
Btrfs: don't run __tree_mod_log_free_eb on leaves
Btrfs: increase the size of the free space cache
Btrfs: barrier before waitqueue_active
Btrfs: fix deadlock in wait_for_more_refs
btrfs: fix second lock in btrfs_delete_delayed_items()
Btrfs: don't allocate a seperate csums array for direct reads
Btrfs: do not strdup non existent strings
Btrfs: do not use missing devices when showing devname
Btrfs: fix that error value is changed by mistake
Btrfs: lock extents as we map them in DIO
...
David Rientjes [Wed, 29 Aug 2012 02:57:21 +0000 (19:57 -0700)]
mm, slab: lock the correct nodelist after reenabling irqs
cache_grow() can reenable irqs so the cpu (and node) can change, so ensure
that we take list_lock on the correct nodelist.
This fixes an issue with commit 072bb0aa5e06 ("mm: sl[au]b: add
knowledge of PFMEMALLOC reserve pages") where list_lock for the wrong
node was taken after growing the cache.
Reported-and-tested-by: Haggai Eran <haggaie@mellanox.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Christian König [Wed, 29 Aug 2012 11:24:15 +0000 (13:24 +0200)]
drm/radeon: fix double free in radeon_gpu_reset
radeon_ring_restore is freeing the memory for the saved
ring data. We need to remember that, otherwise we try to
restore the ring data again on the next try. Additional
to that it shouldn't try the reset infinitely if we have
saved ring data.
Signed-off-by: Christian König <deathsimple@vodafone.de> Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Jerome Glisse [Tue, 28 Aug 2012 20:50:22 +0000 (16:50 -0400)]
drm/radeon: force dma32 to fix regression rs4xx,rs6xx,rs740
It seems some of those IGP dislike non dma32 page despite what
documentation says. Fix regression since we allowed non dma32
pages. It seems it only affect some revision of those IGP chips
as we don't know which one just force dma32 for all of them.
Alex Deucher [Mon, 27 Aug 2012 21:48:18 +0000 (17:48 -0400)]
drm/radeon: rework panel mode setup
Adjust the panel mode setup to match the behavior
of the vbios. Rather than checking for specific
bridge chip ids, just check the eDP configuration register.
This saves extra aux transactions and works across
DP bridge chips without requiring additional per chip
id checking.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Fri, 24 Aug 2012 22:21:21 +0000 (18:21 -0400)]
drm/radeon/atom: powergating fixes for DCE6
Power gating is per crtc pair, but the powergating registers
should be called individually. The hw handles power up/down
properly. The pair is powered up if either crtc in the pair
is powered up and the pair is not powered down until both
crtcs in the pair are powered down. This simplifies
programming and should save additional power as the previous
code never actually power gated the crtc pair.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
Alex Deucher [Wed, 22 Aug 2012 13:54:56 +0000 (09:54 -0400)]
drm/radeon/atom: rework DIG modesetting on DCE3+
The ordering is important and the current drm code
wasn't cutting it for modern DIG encoders. We need
to have information about crtc before setting up
the encoders so I've shifted the ordering a bit.
Probably we'll need a full rework akin to danvet's
recent intel patchs. This patch fixes numerous
issues with DP bridge chips and makes link training
much more reliable.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
Marek Olšák [Fri, 24 Aug 2012 12:27:36 +0000 (14:27 +0200)]
drm/radeon: add proper checking of RESOLVE_BOX command for r600-r700
Checking of the second colorbuffer was skipped on r700, because
CB_TARGET_MASK was 0xf. With r600, CB_TARGET_MASK is changed to 0xff,
so we must set the number of samples of the second colorbuffer to 1 in order
to pass the CS checker.
The DRM version is bumped, because RESOLVE_BOX is always rejected without this
fix on r600.
Signed-off-by: Marek Olšák <maraeo@gmail.com> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Dave Airlie [Wed, 29 Aug 2012 10:09:23 +0000 (20:09 +1000)]
Merge branch 'drm-intel-fixes' of git://people.freedesktop.org/~danvet/drm-intel into drm-fixes
Daniel writes:
"Just a few smaller things:
- Fix up a pipe vs. plane confusion from a refactoring, fixes a regression
from 3.1 (Anhua Xu).
- Fix ivb sprite pixel formats (Vijay).
- Fixup ppgtt pde placement for machines where the Bios artifically limits
the availbale gtt space in the name of ... product differentiation
(Chris). This fixes an oops.
- Yet another no_lvds quirk entry."
* 'drm-intel-fixes' of git://people.freedesktop.org/~danvet/drm-intel:
i915: Quirk no_lvds on Gigabyte GA-D525TUD ITX motherboard
drm/i915: Use the correct size of the GTT for placing the per-process entries
drm/i915: fix color order for BGR formats on IVB
drm/i915: fix wrong order of parameters in port checking functions
Ben Skeggs [Mon, 27 Aug 2012 06:22:49 +0000 (16:22 +1000)]
drm/nvc0/copy: check PUNITS to determine which copy engines are disabled
On some Fermi chipsets (NVCE particularly) PCOPY1 doesn't exist. And if
what I've seen on Kepler is true of Fermi too, chipsets of the same type
can have different PCOPY units available.
This should fix a v3.5 regression reported by a number of people effecting
suspend/resume on NVC8/NVCE chipsets.
Cc: stable@vger.kernel.org [3.5] Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Stefan Behrens [Fri, 10 Aug 2012 14:58:21 +0000 (08:58 -0600)]
Btrfs: fix that repair code is spuriously executed for transid failures
If verify_parent_transid() fails for all mirrors, the current code
calls repair_io_failure() anyway which means:
- that the disk block is rewritten without repairing anything and
- that a kernel log message is printed which misleadingly claims
that a read error was corrected.
This is an example:
parent transid verify failed on 615015833600 wanted 110423 found 110424
parent transid verify failed on 615015833600 wanted 110423 found 110424
btrfs read error corrected: ino 1 off 615015833600 (dev /dev/...)
It is wrong to ignore the results from verify_parent_transid() and to
call repair_eb_io_failure() when the verification of the transids failed.
This commit fixes the issue.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
In dio write, we should unlock the section which we didn't do IO on in case that
we fall back to buffered write. But we need to not only unlock the section
but also cleanup reserved space for the section.
This bug was found while running xfstests 133, with this 133 no longer complains.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef Bacik [Fri, 24 Aug 2012 18:53:03 +0000 (12:53 -0600)]
Btrfs: fix deadlock with freeze and sync V2
We can deadlock with freeze right now because we unconditionally start a
transaction in our ->sync_fs() call. To fix this just check and see if we
have a running transaction to commit. This saves us from the deadlock
because at this point we'll have the umount sem for the sb so we're safe
from freezes coming in after we've done our check. With this patch the
freeze xfstests no longer deadlocks. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Stefan Behrens [Mon, 27 Aug 2012 14:30:03 +0000 (08:30 -0600)]
Btrfs: revert checksum error statistic which can cause a BUG()
Commit 442a4f6308e694e0fa6025708bd5e4e424bbf51c added btrfs device
statistic counters for detected IO and checksum errors to Linux 3.5.
The statistic part that counts checksum errors in
end_bio_extent_readpage() can cause a BUG() in a subfunction:
"kernel BUG at fs/btrfs/volumes.c:3762!"
That part is reverted with the current patch.
However, the counting of checksum errors in the scrub context remains
active, and the counting of detected IO errors (read, write or flush
errors) in all contexts remains active.
Cc: stable <stable@vger.kernel.org> # 3.5 Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Stefan Behrens [Wed, 1 Aug 2012 11:45:52 +0000 (05:45 -0600)]
Btrfs: remove superblock writing after fatal error
With commit acce952b0, btrfs was changed to flag the filesystem with
BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
error happened like a write I/O errors of all mirrors.
In such situations, on unmount, the superblock is written in
btrfs_error_commit_super(). This is done with the intention to be able
to evaluate the error flag on the next mount. A warning is printed
in this case during the next mount and the log tree is ignored.
The issue is that it is possible that the superblock points to a root
that was not written (due to write I/O errors).
The result is that the filesystem cannot be mounted. btrfsck also does
not start and all the other btrfs-progs tools fail to start as well.
However, mount -o recovery is working well and does the right things
to recover the filesystem (i.e., don't use the log root, clear the
free space cache and use the next mountable root that is stored in the
root backup array).
This patch removes the writing of the superblock when
BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error
flag in the mount function.
These lines can be used to reproduce the issue (using /dev/sdm):
SCRATCH_DEV=/dev/sdm
SCRATCH_MNT=/mnt
echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup create foo
ls -alLF /dev/mapper/foo
mkfs.btrfs /dev/mapper/foo
mount /dev/mapper/foo $SCRATCH_MNT
echo bar > $SCRATCH_MNT/foo
sync
echo 0 25165824 error | dmsetup reload foo
dmsetup resume foo
ls -alF $SCRATCH_MNT
touch $SCRATCH_MNT/1
ls -alF $SCRATCH_MNT
sleep 35
echo 0 25165824 linear $SCRATCH_DEV 0 | dmsetup reload foo
dmsetup resume foo
sleep 1
umount $SCRATCH_MNT
btrfsck /dev/mapper/foo
dmsetup remove foo
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Josef Bacik [Tue, 7 Aug 2012 20:00:32 +0000 (16:00 -0400)]
Btrfs: allow delayed refs to be merged
Daniel Blueman reported a bug with fio+balance on a ramdisk setup.
Basically what happens is the balance relocates a tree block which will drop
the implicit refs for all of its children and adds a full backref. Once the
block is relocated we have to add the implicit refs back, so when we cow the
block again we add the implicit refs for its children back. The problem
comes when the original drop ref doesn't get run before we add the implicit
refs back. The delayed ref stuff will specifically prefer ADD operations
over DROP to keep us from freeing up an extent that will have references to
it, so we try to add the implicit ref before it is actually removed and we
panic. This worked fine before because the add would have just canceled the
drop out and we would have been fine. But the backref walking work needs to
be able to freeze the delayed ref stuff in time so we have this ever
increasing sequence number that gets attached to all new delayed ref updates
which makes us not merge refs and we run into this issue.
So to fix this we need to merge delayed refs. So everytime we run a
clustered ref we need to try and merge all of its delayed refs. The backref
walking stuff locks the delayed ref head before processing, so if we have it
locked we are safe to merge any refs inside of the sequence number. If
there is no sequence number we can merge all refs. Doing this not only
fixes our bug but keeps the delayed ref code from adding and removing
useless refs and batching together multiple refs into one search instead of
one search per delayed ref, which will really help our commit times. I ran
this with Daniels test and 276 and I haven't seen any problems. Thanks,
Reported-by: Daniel J Blueman <daniel@quora.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Josef Bacik [Wed, 8 Aug 2012 16:12:59 +0000 (10:12 -0600)]
Btrfs: fix enospc problems when deleting a subvol
Subvol delete is a special kind of awful where we use the global reserve to
cover the ENOSPC requirements. The problem is once we're done removing
everything we do a btrfs_update_inode(), which by default will try to do the
delayed update stuff which will use it's own reserve. There will be no
space in this reserve and we'll return ENOSPC. So instead use
btrfs_update_inode_fallback() which will just fallback to updating the inode
item in the case of enospc. This is fine because the global reserve covers
the space requirements for this. With this patch I can now delete a subvol
on a problem image Dave Sterba sent me. Thanks,
Reported-by: David Sterba <dave@jikos.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I added a window where the delayed_ref's head->ref_mod code can diverge
from the sum of the remaining refs, because we release the head->mutex
in the middle. This leads to btrfs_lookup_extent_info returning wrong
numbers. This patch fixes this by adjusting the head's ref_mod with each
delayed ref we run.
Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Chris Mason [Tue, 7 Aug 2012 19:34:49 +0000 (15:34 -0400)]
Btrfs: don't run __tree_mod_log_free_eb on leaves
When we split a leaf, we may end up inserting a new root on top of that
leaf. The reflog code was incorrectly assuming the old root was always
a node. This makes sure we skip over leaves.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Josef Bacik [Mon, 6 Aug 2012 19:46:38 +0000 (13:46 -0600)]
Btrfs: increase the size of the free space cache
Arne was complaining about the space cache having mismatching generation
numbers when debugging a deadlock. This is because we can run out of space
in our preallocated range for our space cache if you have a pretty
fragmented amount of space in your pinned space. So just increase the
amount of space we preallocate for space cache so we can be sure to have
enough space. This will only really affect data ranges since their the only
chunks that end up larger than 256MB. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Josef Bacik [Wed, 1 Aug 2012 19:36:24 +0000 (15:36 -0400)]
Btrfs: barrier before waitqueue_active
We need a barrir before calling waitqueue_active otherwise we will miss
wakeups. So in places that do atomic_dec(); then atomic_read() use
atomic_dec_return() which imply a memory barrier (see memory-barriers.txt)
and then add an explicit memory barrier everywhere else that need them.
Thanks,
Arne Jansen [Mon, 6 Aug 2012 20:18:51 +0000 (14:18 -0600)]
Btrfs: fix deadlock in wait_for_more_refs
Commit a168650c introduced a waiting mechanism to prevent busy waiting in
btrfs_run_delayed_refs. This can deadlock with btrfs_run_ordered_operations,
where a tree_mod_seq is held while waiting for the io to complete, while
the end_io calls btrfs_run_delayed_refs.
This whole mechanism is unnecessary. If not enough runnable refs are
available to satisfy count, just return as count is more like a guideline
than a strict requirement.
In case we have to run all refs, commit transaction makes sure that no
other threads are working in the transaction anymore, so we just assert
here that no refs are blocked.
Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Josef Bacik [Fri, 3 Aug 2012 20:49:19 +0000 (16:49 -0400)]
Btrfs: don't allocate a seperate csums array for direct reads
We've been allocating a big array for csums instead of storing them in the
io_tree like we do for buffered reads because previously we were locking the
entire range, so we didn't have an extent state for each sector of the
range. But now that we do the range locking as we map the buffers we can
limit the mapping lenght to sectorsize and use the private part of the
io_tree for our csums. This allows us to avoid an extra memory allocation
for direct reads which could incur latency. Thanks,
Josef Bacik [Thu, 2 Aug 2012 14:23:59 +0000 (10:23 -0400)]
Btrfs: do not strdup non existent strings
When we close devices we add back empty devices for some reason that escapes
me. In the case of a missing dev we don't allocate an rcu_string for it's
name, so check to see if the device has a name and if it doesn't don't
bother strdup()'ing it. Thanks,
the box will panic trying to deref the name for the missing dev since it is
the lower numbered devid. So fix show_devname to not use missing devices.
Thanks,
Stefan Behrens [Wed, 1 Aug 2012 10:28:01 +0000 (04:28 -0600)]
Btrfs: fix that error value is changed by mistake
In iterate_inodes_from_logical() the error result from
extent_from_logical() is patched by mistake. Typically ENOENT is
patched to EINVAL because (-ENOENT & BTRFS_EXTENT_FLAG_TREE_BLOCK)
evaluates to true.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
This is because we would not return EIOCBQUEUED for short AIO reads, instead
we'd wait for the DIO to complete and then return the amount of data we
transferred, which would allow our stuff to unlock the remaning amount. But
with this change this no longer happens, so if we have a short AIO read (for
example if we try to read past EOF), we could leave the section from EOF to
the end of where we tried to read locked. Fixing this is tricky since there
is no clear way to know exactly how much data DIO truly submitted for IO, so
to make this less hard on ourselves and less combersome we need to lock the
extents as we try to map them, and then we unlock any areas we didn't
actually map. This makes us completely safe from deadlocks and reliance on
a particular behavior of the DIO code. This also lays the groundwork for
allowing us to use the normal csum storage method for reads which means we
can remove an allocation. Thanks,
Heiko Carstens [Mon, 27 Aug 2012 13:18:45 +0000 (15:18 +0200)]
s390/smp: add missing smp_store_status() for !SMP
Fix this compile error:
arch/s390/kernel/machine_kexec.c: In function ‘setup_regs’:
arch/s390/kernel/machine_kexec.c:63:3: error: implicit declaration
of function ‘smp_store_status’ [-Werror=implicit-function-declaration]
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>