Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A small collection of driver fixes/updates and a core fix for 3.6. It
contains:
- Bug fixes for mtip32xx, and support for new hardware (just addition
of IDs). They have been queued up for 3.7 for a few weeks as well.
- rate-limit a failing command error message in block core.
- A fix for an old cciss bug from Stephen.
- Prevent overflow of partition count from Alan."
* 'for-linus' of git://git.kernel.dk/linux-block:
cciss: fix handling of protocol error
blk: add an upper sanity check on partition adding
mtip32xx: fix user_buffer check in exec_drive_command
mtip32xx: Remove dead code
mtip32xx: Change printk to pr_xxxx
mtip32xx: Proper reporting of write protect status on big-endian
mtip32xx: Increase timeout for standby command
mtip32xx: Handle NCQ commands during the security locked state
mtip32xx: Add support for new devices
block: rate-limit the error message from failing commands
Merge tag 'sh-for-linus' of git://github.com/pmundt/linux-sh
Pull SuperH fixes from Paul Mundt.
* tag 'sh-for-linus' of git://github.com/pmundt/linux-sh:
sh: Fix up TIF_NOTIFY_RESUME sans TIF_SIGPENDING handling.
sh: pfc: Release spinlock in sh_pfc_gpio_request_enable() error path
sh: intc: Fix up multi-evt irq association.
Merge tag 'md-3.6-fixes' of git://neil.brown.name/md
Pull md fixes from NeilBrown:
"3 fixes for md in 3.6.
One reverts a recent patch which turns out to not be such a good idea.
Other two fix minor bugs with the new (since 3.3) 'replacement' code
and have been tagged for -stable."
* tag 'md-3.6-fixes' of git://neil.brown.name/md:
md: make sure metadata is updated when spares are activated or removed.
md/raid5: fix calculate of 'degraded' when a replacement becomes active.
Revert "md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE."
Merge branch 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue / powernow-k8 fix from Tejun Heo:
"This is the fix for the bug where cpufreq/powernow-k8 was tripping
BUG_ON() in try_to_wake_up_local() by migrating workqueue worker to a
different CPU.
https://bugzilla.kernel.org/show_bug.cgi?id=47301
As discussed, the fix is now two parts - one to reimplement
work_on_cpu() so that it doesn't create a new kthread each time and
the actual fix which makes powernow-k8 use work_on_cpu() instead of
performing manual migration.
While pretty late in the merge cycle, both changes are on the safer
side. Jiri and I verified two existing users of work_on_cpu() and
Duncan confirmed that the powernow-k8 fix survived about 18 hours of
testing."
* 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
cpufreq/powernow-k8: workqueue user shouldn't migrate the kworker to another CPU
workqueue: reimplement work_on_cpu() using system_wq
cpufreq/powernow-k8: workqueue user shouldn't migrate the kworker to another CPU
powernowk8_target() runs off a per-cpu work item and if the
cpufreq_policy->cpu is different from the current one, it migrates the
kworker to the target CPU by manipulating current->cpus_allowed. The
function migrates the kworker back to the original CPU but this is
still broken. Workqueue concurrency management requires the kworkers
to stay on the same CPU and powernowk8_target() ends up triggerring
BUG_ON(rq != this_rq()) in try_to_wake_up_local() if it contends on
fidvid_mutex and sleeps.
It is unclear why this bug is being reported now. Duncan says it
appeared to be a regression of 3.6-rc1 and couldn't reproduce it on
3.5. Bisection seemed to point to 63d95a91 "workqueue: use @pool
instead of @gcwq or @cpu where applicable" which is an non-functional
change. Given that the reproduce case sometimes took upto days to
trigger, it's easy to be misled while bisecting. Maybe something made
contention on fidvid_mutex more likely? I don't know.
This patch fixes the bug by using work_on_cpu() instead if @pol->cpu
isn't the same as the current one. The code assumes that
cpufreq_policy->cpu is kept online by the caller, which Rafael tells
me is the case.
stable: ed48ece27c ("workqueue: reimplement work_on_cpu() using
system_wq") should be applied before this; otherwise, the
behavior could be horrible.
Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Duncan <1i5t5.duncan@cox.net> Tested-by: Duncan <1i5t5.duncan@cox.net> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Andreas Herrmann <andreas.herrmann3@amd.com> Cc: stable@vger.kernel.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=47301
workqueue: reimplement work_on_cpu() using system_wq
The existing work_on_cpu() implementation is hugely inefficient. It
creates a new kthread, execute that single function and then let the
kthread die on each invocation.
Now that system_wq can handle concurrent executions, there's no
advantage of doing this. Reimplement work_on_cpu() using system_wq
which makes it simpler and way more efficient.
stable: While this isn't a fix in itself, it's needed to fix a
workqueue related bug in cpufreq/powernow-k8. AFAICS, this
shouldn't break other existing users.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jiri Kosina <jkosina@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Len Brown <lenb@kernel.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: stable@vger.kernel.org
Dan Carpenter [Wed, 5 Sep 2012 12:30:42 +0000 (15:30 +0300)]
x86, microcode, AMD: Fix use after free in free_cache()
list_for_each_entry_reverse() dereferences the iterator, but we already
freed it. I don't see a reason that this has to be done in reverse order
so change it to use list_for_each_entry_safe().
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
This patch updates the existing Intel IvyBridge (model 58)
support with proper PEBS event constraints. It cannot reuse
the same as SandyBridge because some events (0xd3) are
specific to IvyBridge.
Also there is no UOPS_DISPATCHED.THREAD on IVB, so do not
populate the PERF_COUNT_HW_STALLED_CYCLES_BACKEND mapping.
x86/debug: Dump family, model, stepping of the boot CPU
When acting on a user bug report, we find ourselves constantly
asking for /proc/cpuinfo in order to know the exact family,
model, stepping of the CPU in question.
Instead of having to ask this, add this to dmesg so that it is
visible and no ambiguities can ensue from looking at the
official name string of the CPU coming from CPUID and trying
to map it to f/m/s.
Merge tag 'ras_queue_for_3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/mce
Pull MCE changes from Borislav Petkov:
" Patch 1/2 which enables MCA by default because I still see bugreports
where CONFIG_X86_MCE is disabled and this is a bad idea so turning it on
by default makes sense to me. The second one is a trivial cleanup. "
md: make sure metadata is updated when spares are activated or removed.
It isn't always necessary to update the metadata when spares are
removed as the presence-or-not of a spare isn't really important to
the integrity of an array.
Also activating a spare doesn't always require updating the metadata
as the update on 'recovery-completed' is usually sufficient.
However the introduction of 'replacement' devices have made these
transitions sometimes more important. For example the 'Replacement'
flag isn't cleared until the original device is removed, so we need
to ensure a metadata update after that 'spare' is removed.
So set MD_CHANGE_DEVS whenever a spare is activated or removed, to
complement the current situation where it is set when a spare is added
or a device is failed (or a number of other less common situations).
This is suitable for -stable as out-of-data metadata could lead
to data corruption.
This is only relevant for 3.3 and later 9when 'replacement' as
introduced.
md/raid5: fix calculate of 'degraded' when a replacement becomes active.
When a replacement device becomes active, we mark the device that it
replaces as 'faulty' so that it can subsequently get removed.
However 'calc_degraded' only pays attention to the primary device, not
the replacement, so the array appears to become degraded, which is
wrong.
So teach 'calc_degraded' to consider any replacement if a primary
device is faulty.
This is suitable for -stable as an incorrect 'degraded' value can
confuse md and could lead to data corruption.
This is only relevant for 3.3 and later.
Cc: stable@vger.kernel.org Reported-by: Robin Hill <robin@robinhill.me.uk> Reported-by: John Drescher <drescherjm@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
While this patch seemed like a good idea and did help some workloads,
it hurts other workloads.
Large sequential O_DIRECT writes were faster,
Small random O_DIRECT writes were slower.
Other changes (batching RAID5 writes) have improved the sequential
writes using a different mechanism, so the net result of this patch
is definitely negative. So revert it.
Reported-by: Shaohua Li <shli@kernel.org> Tested-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
x86, fpu: remove cpu_has_xmm check in the fx_finit()
CPUs with FXSAVE but no XMM/MXCSR (Pentium II from Intel,
Crusoe/TM-3xxx/5xxx from Transmeta, and presumably some of the K6
generation from AMD) ever looked at the mxcsr field during
fxrstor/fxsave. So remove the cpu_has_xmm check in the fx_finit()
Add the "eagerfpu=auto" (that selects the default scheme in
enabling eagerfpu) which can override compiled-in boot parameters
like "eagerfpu=on/off" (that force enable/disable eagerfpu).
x86, fpu: decouple non-lazy/eager fpu restore from xsave
Decouple non-lazy/eager fpu restore policy from the existence of the xsave
feature. Introduce a synthetic CPUID flag to represent the eagerfpu
policy. "eagerfpu=on" boot paramter will enable the policy.
Suresh Siddha [Fri, 24 Aug 2012 21:13:02 +0000 (14:13 -0700)]
x86, fpu: use non-lazy fpu restore for processors supporting xsave
Fundamental model of the current Linux kernel is to lazily init and
restore FPU instead of restoring the task state during context switch.
This changes that fundamental lazy model to the non-lazy model for
the processors supporting xsave feature.
Reasons driving this model change are:
i. Newer processors support optimized state save/restore using xsaveopt and
xrstor by tracking the INIT state and MODIFIED state during context-switch.
This is faster than modifying the cr0.TS bit which has serializing semantics.
ii. Newer glibc versions use SSE for some of the optimized copy/clear routines.
With certain workloads (like boot, kernel-compilation etc), application
completes its work with in the first 5 task switches, thus taking upto 5 #DNA
traps with the kernel not getting a chance to apply the above mentioned
pre-load heuristic.
iii. Some xstate features (like AMD's LWP feature) don't honor the cr0.TS bit
and thus will not work correctly in the presence of lazy restore. Non-lazy
state restore is needed for enabling such features.
Some data on a two socket SNB system:
* Saved 20K DNA exceptions during boot on a two socket SNB system.
* Saved 50K DNA exceptions during kernel-compilation workload.
* Improved throughput of the AVX based checksumming function inside the
kernel by ~15% as xsave/xrstor is faster than the serializing clts/stts
pair.
Also now kernel_fpu_begin/end() relies on the patched
alternative instructions. So move check_fpu() which uses the
kernel_fpu_begin/end() after alternative_instructions().
Suresh Siddha [Fri, 24 Aug 2012 21:13:01 +0000 (14:13 -0700)]
lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models
Instead of using unlazy_fpu() check if user_has_fpu() and set/clear
the host TS bits so that the lguest works fine with both the
lazy/non-lazy FPU host models with minimal changes.
Suresh Siddha [Fri, 24 Aug 2012 21:13:00 +0000 (14:13 -0700)]
x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage
use kernel_fpu_begin/end() instead of unconditionally accessing cr0 and
saving/restoring just the few used xmm/ymm registers.
This has some advantages like:
* If the task's FPU state is already active, then kernel_fpu_begin()
will just save the user-state and avoiding the read/write of cr0.
In general, cr0 accesses are much slower.
* Manual save/restore of xmm/ymm registers will affect the 'modified' and
the 'init' optimizations brought in the by xsaveopt/xrstor
infrastructure.
* Foward compatibility with future vector register extensions will be a
problem if the xmm/ymm registers are manually saved and restored
(corrupting the extended state of those vector registers).
With this patch, there was no significant difference in the xor throughput
using AVX, measured during boot.
Suresh Siddha [Fri, 24 Aug 2012 21:12:59 +0000 (14:12 -0700)]
x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
kvm's guest fpu save/restore should be wrapped around
kernel_fpu_begin/end(). This will avoid for example taking a DNA
in kvm_load_guest_fpu() when it tries to load the fpu immediately
after doing unlazy_fpu() on the host side.
More importantly this will prevent the host process fpu from being
corrupted.
Suresh Siddha [Fri, 24 Aug 2012 21:12:58 +0000 (14:12 -0700)]
x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()
Few lines below we do drop_fpu() which is more safer. Remove the
unnecessary user_fpu_end() in save_xstate_sig(), which allows
the drop_fpu() to ignore any pending exceptions from the user-space
and drop the current fpu.
Suresh Siddha [Fri, 24 Aug 2012 21:12:57 +0000 (14:12 -0700)]
x86, fpu: drop_fpu() before restoring new state from sigframe
No need to save the state with unlazy_fpu(), that is about to get overwritten
by the state from the signal frame. Instead use drop_fpu() and continue
to restore the new state.
x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels
Currently for x86 and x86_32 binaries, fpstate in the user sigframe is copied
to/from the fpstate in the task struct.
And in the case of signal delivery for x86_64 binaries, if the fpstate is live
in the CPU registers, then the live state is copied directly to the user
sigframe. Otherwise fpstate in the task struct is copied to the user sigframe.
During restore, fpstate in the user sigframe is restored directly to the live
CPU registers.
Historically, different code paths led to different bugs. For example,
x86_64 code path was not preemption safe till recently. Also there is lot
of code duplication for support of new features like xsave etc.
Unify signal handling code paths for x86 and x86_64 kernels.
New strategy is as follows:
Signal delivery: Both for 32/64-bit frames, align the core math frame area to
64bytes as needed by xsave (this where the main fpu/extended state gets copied
to and excludes the legacy compatibility fsave header for the 32-bit [f]xsave
frames). If the state is live, copy the register state directly to the user
frame. If not live, copy the state in the thread struct to the user frame. And
for 32-bit [f]xsave frames, construct the fsave header separately before
the actual [f]xsave area.
Signal return: As the 32-bit frames with [f]xstate has an additional
'fsave' header, copy everything back from the user sigframe to the
fpstate in the task structure and reconstruct the fxstate from the 'fsave'
header (Also user passed pointers may not be correctly aligned for
any attempt to directly restore any partial state). At the next fpstate usage,
everything will be restored to the live CPU registers.
For all the 64-bit frames and the 32-bit fsave frame, restore the state from
the user sigframe directly to the live CPU registers. 64-bit signals always
restored the math frame directly, so we can expect the math frame pointer
to be correctly aligned. For 32-bit fsave frames, there are no alignment
requirements, so we can restore the state directly.
"lat_sig catch" microbenchmark numbers (for x86, x86_64, x86_32 binaries) are
with in the noise range with this change.
Use config_enabled() to cleanup the definitions of is_ia32/is_x32. Move
the function prototypes to the header file to cleanup ifdefs,
and move the x32_setup_rt_frame() code around.
vfs: dcache: use DCACHE_DENTRY_KILLED instead of DCACHE_DISCONNECTED in d_kill()
IBM reported a soft lockup after applying the fix for the rename_lock
deadlock. Commit c83ce989cb5f ("VFS: Fix the nfs sillyrename regression
in kernel 2.6.38") was found to be the culprit.
The nfs sillyrename fix used DCACHE_DISCONNECTED to indicate that the
dentry was killed. This flag can be set on non-killed dentries too,
which results in infinite retries when trying to traverse the dentry
tree.
This patch introduces a separate flag: DCACHE_DENTRY_KILLED, which is
only set in d_kill() and makes try_to_ascend() test only this flag.
IBM reported successful test results with this patch.
If a command completes with a status of CMD_PROTOCOL_ERR, this
information should be conveyed to the SCSI mid layer, not dropped
on the floor. Unlike a similar bug in the hpsa driver, this bug
only affects tape drives and CD and DVD ROM drives in the cciss
driver, and to induce it, you have to disconnect (or damage) a
cable, so it is not a very likely scenario (which would explain
why the bug has gone undetected for the last 10 years.)
Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
Al Viro [Tue, 18 Sep 2012 08:04:37 +0000 (17:04 +0900)]
sh: Fix up TIF_NOTIFY_RESUME sans TIF_SIGPENDING handling.
As Al notes, we missed a TIF_NOTIFY_RESUME check which caused any
handlers without TIF_SIGPENDING also set to skip the notification:
Looks like while it is in the relevant masks *and* checked in
do_notify_resume() both on 32bit and 64bit variants since commit ab99c733ae73cce31f2a2434f7099564e5a73d95 ("sh: Make syscall tracer
use tracehook notifiers, add TIF_NOTIFY_RESUME.") they are
actually *not* reached without simulataneous SIGPENDING, since
the actual glue in the callers had not been updated back then and
still checks for _TIF_SIGPENDING alone when deciding whether to
hit do_notify_resume() or not.
Reported-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
sh: pfc: Release spinlock in sh_pfc_gpio_request_enable() error path
The sh_pfc_gpio_request_enable() function acquires a spinlock but fails
to release it before returning if the requested mux type is not
supported. Fix this.
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Merge branch 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull another workqueue fix from Tejun Heo:
"Unfortunately, yet another late fix. This too is discovered and fixed
by Lai. This bug was introduced during this merge window by commit 25511a477657 ("workqueue: reimplement CPU online rebinding to handle
idle workers") which started using WORKER_REBIND flag for idle rebind
too.
The bug is relatively easy to trigger if the CPU rapidly goes through
off, on and then off (and stay off). The fix is on the safer side.
This hasn't been on linux-next yet but I'm pushing early so that it
can get more exposure before v3.6 release."
* 'for-3.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn()
Lai Jiangshan [Mon, 17 Sep 2012 22:42:31 +0000 (15:42 -0700)]
workqueue: always clear WORKER_REBIND in busy_worker_rebind_fn()
busy_worker_rebind_fn() didn't clear WORKER_REBIND if rebinding failed
(CPU is down again). This used to be okay because the flag wasn't
used for anything else.
However, after 25511a477 "workqueue: reimplement CPU online rebinding
to handle idle workers", WORKER_REBIND is also used to command idle
workers to rebind. If not cleared, the worker may confuse the next
CPU_UP cycle by having REBIND spuriously set or oops / get stuck by
prematurely calling idle_worker_rebind().
There was no reason to keep WORKER_REBIND on failure in the first
place - WORKER_UNBOUND is guaranteed to be set in such cases
preventing incorrectly activating concurrency management. Always
clear WORKER_REBIND.
tj: Updated comment and description.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Merge fixes from Andrew Morton:
"13 patches. 12 are fixes and one is a little preparatory thing for
Andi."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (13 commits)
memory hotplug: fix section info double registration bug
mm/page_alloc: fix the page address of higher page's buddy calculation
drivers/rtc/rtc-twl.c: ensure all interrupts are disabled during probe
compiler.h: add __visible
pid-namespace: limit value of ns_last_pid to (0, max_pid)
include/net/sock.h: squelch compiler warning in sk_rmem_schedule()
slub: consider pfmemalloc_match() in get_partial_node()
slab: fix starting index for finding another object
slab: do ClearSlabPfmemalloc() for all pages of slab
nbd: clear waiting_queue on shutdown
MAINTAINERS: fix TXT maintainer list and source repo path
mm/ia64: fix a memory block size bug
memory hotplug: reset pgdat->kswapd to NULL if creating kernel thread fails
memory hotplug: fix section info double registration bug
There may be a bug when registering section info. For example, on my
Itanium platform, the pfn range of node0 includes the other nodes, so
other nodes' section info will be double registered, and memmap's page
count will equal to 3.
Li Haifeng [Mon, 17 Sep 2012 21:09:21 +0000 (14:09 -0700)]
mm/page_alloc: fix the page address of higher page's buddy calculation
The heuristic method for buddy has been introduced since commit 43506fad21ca ("mm/page_alloc.c: simplify calculation of combined index
of adjacent buddy lists"). But the page address of higher page's buddy
was wrongly calculated, which will lead page_is_buddy to fail for ever.
IOW, the heuristic method would be disabled with the wrong page address
of higher page's buddy.
Calculating the page address of higher page's buddy should be based
higher_page with the offset between index of higher page and index of
higher page's buddy.
Signed-off-by: Haifeng Li <omycle@gmail.com> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: KyongHo Cho <pullip.cho@samsung.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: <stable@vger.kernel.org> [2.6.38+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kevin Hilman [Mon, 17 Sep 2012 21:09:17 +0000 (14:09 -0700)]
drivers/rtc/rtc-twl.c: ensure all interrupts are disabled during probe
On some platforms, bootloaders are known to do some interesting RTC
programming. Without going into the obscurities as to why this may be
the case, suffice it to say the the driver should not make any
assumptions about the state of the RTC when the driver loads. In
particular, the driver probe should be sure that all interrupts are
disabled until otherwise programmed.
This was discovered when finding bursty I2C traffic every second on
Overo platforms. This I2C overhead was keeping the SoC from hitting
deep power states. The cause was found to be the RTC firing every
second on the I2C-connected TWL PMIC.
Special thanks to Felipe Balbi for suggesting to look for a rogue driver
as the source of the I2C traffic rather than the I2C driver itself.
Special thanks to Steve Sakoman for helping track down the source of the
continuous RTC interrups on the Overo boards.
Signed-off-by: Kevin Hilman <khilman@ti.com> Cc: Felipe Balbi <balbi@ti.com> Tested-by: Steve Sakoman <steve@sakoman.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Tested-by: Shubhrajyoti Datta <omaplinuxkernel@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
gcc 4.6+ has support for a externally_visible attribute that prevents the
optimizer from optimizing unused symbols away. Add a __visible macro to
use it with that compiler version or later.
This is used (at least) by the "Link Time Optimization" patchset.
Chuck Lever [Mon, 17 Sep 2012 21:09:11 +0000 (14:09 -0700)]
include/net/sock.h: squelch compiler warning in sk_rmem_schedule()
This warning:
In file included from linux/include/linux/tcp.h:227:0,
from linux/include/linux/ipv6.h:221,
from linux/include/net/ipv6.h:16,
from linux/include/linux/sunrpc/clnt.h:26,
from linux/net/sunrpc/stats.c:22:
linux/include/net/sock.h: In function `sk_rmem_schedule':
linux/nfs-2.6/include/net/sock.h:1339:13: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
is seen with gcc (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2) using the
-Wextra option.
Commit c76562b6709f ("netvm: prevent a stream-specific deadlock")
accidentally replaced the "size" parameter of sk_rmem_schedule() with an
unsigned int. This changes the semantics of the comparison in the
return statement.
In sk_wmem_schedule we have syntactically the same comparison, but
"size" is a signed integer. In addition, __sk_mem_schedule() takes a
signed integer for its "size" parameter, so there is an implicit type
conversion in sk_rmem_schedule() anyway.
Revert the "size" parameter back to a signed integer so that the
semantics of the expressions in both sk_[rw]mem_schedule() are exactly
the same.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Joonsoo Kim <js1304@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Mon, 17 Sep 2012 21:09:09 +0000 (14:09 -0700)]
slub: consider pfmemalloc_match() in get_partial_node()
get_partial() is currently not checking pfmemalloc_match() meaning that
it is possible for pfmemalloc pages to leak to non-pfmemalloc users.
This is a problem in the following situation. Assume that there is a
request from normal allocation and there are no objects in the per-cpu
cache and no node-partial slab.
In this case, slab_alloc enters the slow path and new_slab_objects() is
called which may return a PFMEMALLOC page. As the current user is not
allowed to access PFMEMALLOC page, deactivate_slab() is called
([5091b74a: mm: slub: optimise the SLUB fast path to avoid pfmemalloc
checks]) and returns an object from PFMEMALLOC page.
Next time, when we get another request from normal allocation,
slab_alloc() enters the slow-path and calls new_slab_objects(). In
new_slab_objects(), we call get_partial() and get a partial slab which
was just deactivated but is a pfmemalloc page. We extract one object
from it and re-deactivate.
"deactivate -> re-get in get_partial -> re-deactivate" occures repeatedly.
As a result, access to PFMEMALLOC page is not properly restricted and it
can cause a performance degradation due to frequent deactivation.
deactivation frequently.
This patch changes get_partial_node() to take pfmemalloc_match() into
account and prevents the "deactivate -> re-get in get_partial()
scenario. Instead, new_slab() is called.
Signed-off-by: Joonsoo Kim <js1304@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Mon, 17 Sep 2012 21:09:06 +0000 (14:09 -0700)]
slab: fix starting index for finding another object
In array cache, there is a object at index 0, check it.
Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
slab: do ClearSlabPfmemalloc() for all pages of slab
Right now, we call ClearSlabPfmemalloc() for first page of slab when we
clear SlabPfmemalloc flag. This is fine for most swap-over-network use
cases as it is expected that order-0 pages are in use. Unfortunately it
is possible that that __ac_put_obj() checks SlabPfmemalloc on a tail
page and while this is harmless, it is sloppy. This patch ensures that
the head page is always used.
This problem was originally identified by Joonsoo Kim.
[js1304@gmail.com: Original implementation and problem identification] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Clements [Mon, 17 Sep 2012 21:09:02 +0000 (14:09 -0700)]
nbd: clear waiting_queue on shutdown
Fix a serious but uncommon bug in nbd which occurs when there is heavy
I/O going to the nbd device while, at the same time, a failure (server,
network) or manual disconnect of the nbd connection occurs.
There is a small window between the time that the nbd_thread is stopped
and the socket is shutdown where requests can continue to be queued to
nbd's internal waiting_queue. When this happens, those requests are
never completed or freed.
The fix is to clear the waiting_queue on shutdown of the nbd device, in
the same way that the nbd request queue (queue_head) is already being
cleared.
Signed-off-by: Paul Clements <paul.clements@steeleye.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gang Wei [Mon, 17 Sep 2012 21:08:59 +0000 (14:08 -0700)]
MAINTAINERS: fix TXT maintainer list and source repo path
Signed-off-by: Gang Wei <gang.wei@intel.com> Cc: Richard L Maliszewski <richard.l.maliszewski@intel.com> Cc: Gang Wei <gang.wei@intel.com> Cc: Shane Wang <shane.wang@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Because MIN_MEMORY_BLOCK_SIZE is int type and length of 32bits,
so MIN_MEMORY_BLOCK_SIZE(1 << 32) will will equal to 0.
Actually when SECTION_SIZE_BITS >= 31, MIN_MEMORY_BLOCK_SIZE will be wrong.
This will cause wrong system memory infomation in sysfs.
I think it should be:
And "echo offline > memory0/state" will cause following call trace:
kernel BUG at mm/memory_hotplug.c:885!
sh[6455]: bugcheck! 0 [1]
Pid: 6455, CPU 0, comm: sh
psr : 0000101008526030 ifs : 8000000000000fa4 ip : [<a0000001008c40f0>] Not tainted (3.6.0-rc1)
ip is at offline_pages+0x210/0xee0
Call Trace:
show_stack+0x80/0xa0
show_regs+0x640/0x920
die+0x190/0x2c0
die_if_kernel+0x50/0x80
ia64_bad_break+0x3d0/0x6e0
ia64_native_leave_kernel+0x0/0x270
offline_pages+0x210/0xee0
alloc_pages_current+0x180/0x2a0
Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: "Luck, Tony" <tony.luck@intel.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memory hotplug: reset pgdat->kswapd to NULL if creating kernel thread fails
If kthread_run() fails, pgdat->kswapd contains errno. When we stop this
thread, we only check whether pgdat->kswapd is NULL and access it. If
it contains errno, it will cause page fault. Reset pgdat->kswapd to
NULL when creating kernel thread fails can avoid this problem.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull InfiniBand/RDMA fixes from Roland Dreier:
- A couple more IPoIB fixes for regressions introduced by path database
conversion
- Minor other fixes to low-level drivers (cxgb4, mlx4, qib, ocrdma)
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
IB/qib: Fix failure of compliance test C14-024#06_LocalPortNum
RDMA/ocrdma: Fix CQE expansion of unsignaled WQE
mlx4_core: Fix integer overflows so 8TBs of memory registration works
IPoIB: Fix AB-BA deadlock when deleting neighbours
IPoIB: Fix memory leak in the neigh table deletion flow
RDMA/cxgb4: Move dereference below NULL test
MCA is the basic support for hardware error logging and reporting, and
it is majorly unwise to run without it so enable machine check software
support by default on x86.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Acked-by: Tony Luck <tony.luck@intel.com>
fs/proc: fix potential unregister_sysctl_table hang
The unregister_sysctl_table() function hangs if all references to its
ctl_table_header structure are not dropped.
This can happen sometimes because of a leak in proc_sys_lookup():
proc_sys_lookup() gets a reference to the table via lookup_entry(), but
it does not release it when a subsequent call to sysctl_follow_link()
fails.
This patch fixes this leak by making sure the reference is always
dropped on return.
See also commit 076c3eed2c31 ("sysctl: Rewrite proc_sys_lookup
introducing find_entry and lookup_entry") which reorganized this code in
3.4.
Tested in Linux 3.4.4.
Signed-off-by: Francesco Ruggeri <fruggeri@aristanetworks.com> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Namhyung Kim [Thu, 13 Sep 2012 04:14:30 +0000 (13:14 +0900)]
perf report: Add missing perf_hpp__init for pipe-mode
The perf_hpp__init() function was only called from setup_browser() so
that the pipe-mode missed the initialization thus didn't respond to
related options. Fix it.
Reported-by: Robert Richter <robert.richter@amd.com> Tested-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-tip-commits@vger.kernel.org Link: http://lkml.kernel.org/r/87txv28spl.fsf_-_@sejong.aot.lge.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf symbols: Filter samples with unresolved symbol when "--symbols" option is used
Report/top commands support to only handle specific symbols with
"--symbols" option, but current code will keep those samples whose
symbol can't be resolved, which should actually be filtered.
If we run following commands:
$perf record -a tree
$perf report --symbols intel_idle -n
the output will be:
Without the patch:
==================
46.27% 156 sshd [unknown]
26.05% 48 swapper [kernel.kallsyms]
17.26% 38 tree libc-2.12.1.so
7.69% 17 tree tree
2.73% 6 tree ld-2.12.1.so
With the patch:
===============
100.00% 48 swapper [kernel.kallsyms]
Signed-off-by: Feng Tang <feng.tang@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Ahern <dsahern@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1347007349-3102-2-git-send-email-feng.tang@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Fri, 14 Sep 2012 08:35:28 +0000 (17:35 +0900)]
perf report: Enable integrated annotation only if possible
The integrated annotation feature is supported only in TUI mode. Also
it should be enabled with 'symbol' sort key otherwise resulting hist
entry doesn't need to have same symbol as of a sample so that it can
fail on hist_entry__inc_addr_samples with -ERANGE.
You can easily see it when start perf report TUI without symbol* sort
key. This patch fixes the problem.
Matthew Garrett [Fri, 27 Jul 2012 21:20:49 +0000 (17:20 -0400)]
x86, EFI: Calculate the EFI framebuffer size instead of trusting the firmware
Seth Forshee reported that his system was reporting that the EFI framebuffer
stretched from 0x90010000-0xb0010000 despite the GPU's BAR only covering
0x90000000-0x9ffffff. It's safer to calculate this value from the pixel
stride and screen height (values we already depend on) rather than face
potential problems with resource allocation later on.
Signed-off-by: Matthew Garrett <mjg@redhat.com> Tested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Matthew Garrett [Fri, 27 Jul 2012 16:58:53 +0000 (12:58 -0400)]
efifb: Skip DMI checks if the bootloader knows what it's doing
The majority of the DMI checks in efifb are for cases where the bootloader
has provided invalid information. However, on some machines the overrides
may do more harm than good due to configuration differences between machines
with the same machine identifier. It turns out that it's possible for the
bootloader to get the correct information on GOP-based systems, but we
can't guarantee that the kernel's being booted with one that's been updated
to do so. Add support for a capabilities flag that can be set by the
bootloader, and skip the DMI checks in that case. Additionally, set this
flag in the UEFI stub code.
Signed-off-by: Matthew Garrett <mjg@redhat.com> Acked-by: Peter Jones <pjones@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Matthew Garrett [Thu, 26 Jul 2012 22:00:00 +0000 (18:00 -0400)]
efi: Build EFI stub with EFI-appropriate options
We can't assume the presence of the red zone while we're still in a boot
services environment, so we should build with -fno-red-zone to avoid
problems. Change the size of wchar at the same time to make string handling
simpler.
Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Matthew Garrett [Thu, 26 Jul 2012 22:00:27 +0000 (18:00 -0400)]
X86: Improve GOP detection in the EFI boot stub
We currently use the PCI IO protocol as a proxy for a functional GOP. This
is less than ideal, since some platforms will put the GOP on output devices
rather than the GPU itself. Move to using the conout protocol. This is not
guaranteed per-spec, but is part of the consplitter implementation that
causes this problem in the first place and so should be reliable.
Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Matt Fleming <matt.fleming@intel.com>