Tejun Heo [Tue, 4 Aug 2009 05:30:08 +0000 (14:30 +0900)]
ahci: add workaround for on-board 5723s on some gigabyte boards
Some gigabytes have on-board SIMG5723s connected to JMB ahcis. These
are used to implement hardware raid. Unfortunately some firmware
revisions on these 5723s don't bring the link down when all the
downstream ports are unoccupied while not responding to reset protocol
which makes libata think that there's device attached to the port but
is not responding and retry. This results in painfully wrong boot
detection time for these ports when they're empty.
This patch quirks those boards such that ahci gives up after the
initial timeout. Combined with parallel probing, this gives quick
enough probing and also is safe because SIMG5723 will respond to the
first try if any of the downstream ports is occupied.
Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Marc Bowes <marcbowes@gmail.com> Reported-by: Nicolas Mailhot <Nicolas.Mailhot@LaPoste.net> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Shane Huang [Wed, 5 Aug 2009 02:10:41 +0000 (10:10 +0800)]
ahci: Soften up the dmesg on SB600 PMP softreset failure recovery
Too strong words led to spurious bug reports: Novell bugzilla #527748,
RedHat bugzilla #468800. This patch is used to soften up the dmesg on
SB600 PMP softreset failure recovery, so as to remove the scariness and
concern from community.
Reported-by: pgnet Dev <pgnet.dev@gmail.com> Signed-off-by: Shane Huang <shane.huang@amd.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
By default the kernel honors the HPA (host protected area) of hard
drives. Using libata's ignore_hpa module option it's possible to
change this behaviour.
Document usage and options of libata.ignore_hpa in
Documentation/kernel-parameters.txt.
Signed-off-by: Michael Prokop <mika@grml.org> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Tony Vroon [Wed, 5 Aug 2009 23:50:09 +0000 (00:50 +0100)]
sata_nv: MSI support, disabled by default
At least the nVidia MCP55 controller quite happily supports MSI.
This adds an option to use it. It is disabled by default.
As per feedback by Robert Hancock, it will honour the user
request as the kernel will not enable MSI where the controller
or the specific system configuration do not support it.
Signed-off-by: Tony Vroon <tony@linx.net> Cc: Robert Hancock <hancockrwd@gmail.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Tejun Heo [Thu, 6 Aug 2009 16:59:15 +0000 (01:59 +0900)]
libata: OCZ Vertex can't do HPA
OCZ Vertex SSD can't do HPA and not in a usual way. It reports HPA,
allows unlocking but then fails all IOs which fall in the unlocked
area. Quirk it so that HPA unlocking is not used for the device.
Tejun Heo [Fri, 7 Aug 2009 02:15:20 +0000 (11:15 +0900)]
pata_at91: fix resource release
Julias Lawall discovered that pata_at91 wasn't freeing a memory region
allocated with kzalloc() on init failure paths. Upon review,
pata_at91 also seems to be doing unnecessary explicit resource
releases for managed resources too. Convert memory allocation to
managed one and drop unnecessary explicit resource releases.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Julia Lawall <julia@diku.dk> Cc: Sergey Matyukevich <geomatsi@gmail.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
xfs: fix spin_is_locked assert on uni-processor builds
Without SMP or preemption spin_is_locked always returns false,
so we can't do an assert with it. Instead use assert_spin_locked,
which does the right thing on all builds.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reported-by: Johannes Engel <jcnengel@googlemail.com> Tested-by: Johannes Engel <jcnengel@googlemail.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>
Ramon tested XFS with a modified version of fsfuzzer and hit a NULL
pointer dereference in __xfs_get_blocks due to the RT device target
pointer being NULL.
To fix this reject inode with the realtime bit set on a a filesystem
without an RT subvolume during inode read.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Felix Blyakher <felixb@sgi.com> Reported-by: Ramon de Carvalho Valle <ramon@risesecurity.org> Tested-by: Ramon de Carvalho Valle <ramon@risesecurity.org> Signed-off-by: Felix Blyakher <felixb@sgi.com>
Eric Sandeen [Mon, 20 Jul 2009 15:52:15 +0000 (10:52 -0500)]
use XFS_CORRUPTION_ERROR in xfs_btree_check_sblock
In Red Hat Bug 512552
- Can't write to XFS mount during raid5 resync
a user ran into corruption while resyncing a raid, and we failed
a consistency test, but didn't get much more info; it'd be nice
to call XFS_CORRUPTION_ERROR here so we can see the buffer
contents.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs: switch to NOFS allocation under i_lock in xfs_attr_rmtval_get
xfs_attr_rmtval_get is always called with i_lock held, but i_lock is taken
in reclaim context so all allocations under it must avoid recursions into
the filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs: switch to NOFS allocation under i_lock in xfs_readlink_bmap
xfs_readlink_bmap is called with i_lock held, but i_lock is taken in
reclaim context so all allocations under it must avoid recursions into
the filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs: switch to NOFS allocation under i_lock in xfs_attr_rmtval_set
xfs_attr_rmtval_set is always called with i_lock held, and i_lock is taken
in reclaim context so all allocations under it must avoid recursions into
the filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs: switch to NOFS allocation under i_lock in xfs_buf_associate_memory
xfs_buf_associate_memory is used for setting up the spare buffer for the
log wrap case in xlog_sync which can happen under i_lock when called from
xfs_fsync. The i_lock mutex is taken in reclaim context so all allocations
under it must avoid recursions into the filesystem. There are a couple
more uses of xfs_buf_associate_memory in the log recovery code that are
also affected by this, but I'd rather keep the code simple than passing on
a gfp_mask argument. Longer term we should just stop requiring the memoery
allocation in xlog_sync by some smaller rework of the buffer layer.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs: switch to NOFS allocation under i_lock in xfs_dir_cilookup_result
xfs_dir_cilookup_result is always called with i_lock held, but i_lock is taken
in reclaim context so all allocations under it must avoid recursions into the
filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs: switch to NOFS allocation under i_lock in xfs_da_state_alloc
xfs_da_state_alloc is always called with i_lock held, but i_lock is taken in
reclaim context so all allocations under it must avoid recursions into the
filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs: switch to NOFS allocation under i_lock in xfs_getbmap
xfs_getbmap allocates memory with i_lock held, but i_lock is taken in
reclaim context so all allocations under it must avoid recursions into
the filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs: avoid memory allocation under m_peraglock in growfs code
Allocate the memory for the larger m_perag array before taking the
per-AG lock as the per-AG lock can be taken under the i_lock which
can be taken from reclaim context.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Felix Blyakher <felixb@sgi.com> Signed-off-by: Felix Blyakher <felixb@sgi.com>
Fenghua Yu [Tue, 11 Aug 2009 21:52:11 +0000 (14:52 -0700)]
arch/ia64/Makefile: Remove -mtune=merced in IA64 kernel build
Between GCC version 3.4.0 and 4.3.3 (including 3.4.0 and 4.3.3), -mtune=merced
is implemented in GCC. Starting from 4.4.0, -mtune=merced is deprecated.
Even implemented in versions between 3.4.0 and 4.3.3, the -mtune=merced
feature has been broken in some of the versions. For example, GCC 4.1.2 reports
interanl tuning function errors during kernel building with -mtune=merced. Or
GCC Bugzilla 16130 reports another -mtune=merced issue on GCC 3.4.1.
So I would remove the -mtune=merced from IA64 kernel build. Without this option,
kernel on Merced will remain the same except losing an unstable and out-of-date
performance tunning feature.
Since GCC version 3.4.0, -mtune=mckinley has been implemented. The
-mtune=mckinley option functions the same as mtune=itanium2. And mtune=itanium2
is the default option. So we don't need to add mtune=mckinley either since its
been the default option in any GCC version which implements this option.
Johannes Weiner [Tue, 11 Aug 2009 21:52:10 +0000 (14:52 -0700)]
ia64: boolean __test_and_clear_bit
__test_and_clear_bit() returns a bitfield with the tested-for bit set.
Make it consistent with the other bitops - of ia64 but also every
other architecture - and return a boolean value.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Fenghua Yu <fenghua.yu@intel.com>
This dma_ops->dma_supported is first called in platform_dma_init() during kernel
boot. Then dma_ops->dma_supported will be called recursively in
iommu_dma_supported.
Kernel can not boot because kernel can not get out of iommu_dma_supported until
it runs out of stack memory.
Kevin Winchester [Mon, 10 Aug 2009 22:56:45 +0000 (19:56 -0300)]
x86: Clear incorrectly forced X86_FEATURE_LAHF_LM flag
Due to an erratum with certain AMD Athlon 64 processors, the
BIOS may need to force enable the LAHF_LM capability.
Unfortunately, in at least one case, the BIOS does this even
for processors that do not support the functionality.
Add a specific check that will clear the feature bit for
processors known not to support the LAHF/SAHF instructions.
Ingo Molnar [Tue, 11 Aug 2009 08:47:36 +0000 (10:47 +0200)]
perf_counter, x86: Fix lapic printk message
Instead of this garbled bootup on UP Pentium-M systems:
[ 0.015048] Performance Counters:
[ 0.016004] no Local APIC, try rebooting with lapicno PMU driver, software counters only.
Print:
[ 0.015050] Performance Counters:
[ 0.016004] no APIC, boot with the "lapic" boot parameter to force-enable it.
[ 0.017003] no PMU driver, software counters only.
Cf: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dmitry Torokhov [Mon, 10 Aug 2009 04:44:49 +0000 (21:44 -0700)]
x86, mce: therm_throt - change when we print messages
My Latitude d630 seems to be handling thermal events in SMI by
lowering the max frequency of the CPU till it cools down but
still leaks the "everything is normal" events.
This spams the console and with high priority printks.
Adjust therm_throt driver to only print messages about the fact
that temperatire returned back to normal when leaving the
throttling state.
Also lower the severity of "back to normal" message from
KERN_CRIT to KERN_INFO.
Signed-off-by: Dmitry Torokhov <dtor@mail.ru> Acked-by: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <20090810051513.0558F526EC9@mailhub.coreip.homeip.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Takashi Iwai [Tue, 11 Aug 2009 06:45:11 +0000 (08:45 +0200)]
ALSA: hda - Don't override ADC definitions for ALC codecs
ALC269 and ALC861-VD parsers override the ADC definitions
unconditionally without checking the spec definition. This causes
the problem when any inconsistent ADC is set up in the device quirk
(like ALC272 with digital-mic).
This patch avoids the overriding by adding the proper checks.
Magnus Damm [Mon, 10 Aug 2009 21:41:18 +0000 (23:41 +0200)]
PM / Driver Core: Kill dev_pm_ops platform warning for now
Commit 783ea7d4eeefe895f2731fe73ac951e94418927b
(Driver Core: Rework platform suspend/resume, print warning)
added a warning message printed for platform drivers that use the
legacy PM callbacks rather than struct dev_pm_ops. Unfortunately,
this resulted in some confusion and made some people try to convert
drivers by replacing the old callbacks with struct dev_pm_ops in
automatic way, which generally is not a good idea.
Remove the platform device runtime dev_pm_ops warning for now,
because it's annoying to users and it's not really necessary right
now.
[rjw: Modified the changelog to be more informative.]
Signed-off-by: Magnus Damm <damm@igel.co.jp> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Linus Torvalds [Mon, 10 Aug 2009 20:21:19 +0000 (13:21 -0700)]
pty: fix data loss when stopped (^S/^Q)
Commit d945cb9cc ("pty: Rework the pty layer to use the normal buffering
logic") dropped the test for 'tty->stopped' in pty_write_room(), which
then causes the n_tty line discipline thing to not throttle the data
properly when the tty is stopped.
So instead of pausing the write due to the tty being stopped, the ldisc
layer would go ahead and push it down to the pty. The pty write()
routine would then refuse to take the data (because it _did_ check
'stopped'), and the data wouldn't actually be written.
This whole stopped test should eventually be moved into the tty ldisc
layer rather than have low-level tty drivers care about these things,
but right now the fix is to just re-instate the missing pty 'stopped'
handling.
Reported-and-tested-by: Artur Skawina <art.08.09@gmail.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Thu, 6 Aug 2009 21:29:34 +0000 (23:29 +0200)]
ocfs2: Fix possible deadlock when extending quota file
In OCFS2, allocator locks rank above transaction start. Thus we
cannot extend quota file from inside a transaction less we could
deadlock.
We solve the problem by starting transaction not already in
ocfs2_acquire_dquot() but only in ocfs2_local_read_dquot() and
ocfs2_global_read_dquot() and we allocate blocks to quota files before starting
the transaction. In case we crash, quota files will just have a few blocks
more but that's no problem since we just use them next time we extend the
quota file.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Joel Becker <joel.becker@oracle.com>
Linus Torvalds [Mon, 10 Aug 2009 18:48:51 +0000 (11:48 -0700)]
Merge branch 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits)
perf_counter: Zero dead bytes from ftrace raw samples size alignment
perf_counter: Subtract the buffer size field from the event record size
perf_counter: Require CAP_SYS_ADMIN for raw tracepoint data
perf_counter: Correct PERF_SAMPLE_RAW output
perf tools: callchain: Fix bad rounding of minimum rate
perf_counter tools: Fix libbfd detection for systems with libz dependency
perf: "Longum est iter per praecepta, breve et efficax per exempla"
perf_counter: Fix a race on perf_counter_ctx
perf_counter: Fix tracepoint sampling to be part of generic sampling
perf_counter: Work around gcc warning by initializing tracepoint record unconditionally
perf tools: callchain: Fix sum of percentages to be 100% by displaying amount of ignored chains in fractal mode
perf tools: callchain: Fix 'perf report' display to be callchain by default
perf tools: callchain: Fix spurious 'perf report' warnings: ignore empty callchains
perf record: Fix the -A UI for empty or non-existent perf.data
perf util: Fix do_read() to fail on EOF instead of busy-looping
perf list: Fix the output to not include tracepoints without an id
perf_counter/powerpc: Fix oops on cpus without perf_counter hardware support
perf stat: Fix tool option consistency: rename -S/--scale to -c/--scale
perf report: Add debug help for the finding of symbol bugs - show the symtab origin (DSO, build-id, kernel, etc)
perf report: Fix per task mult-counter stat reporting
...
Darren Hart [Fri, 7 Aug 2009 22:20:48 +0000 (15:20 -0700)]
futex: Fix handling of bad requeue syscall pairing
If futex_requeue(requeue_pi=1) finds a futex_q that was created by a call
other the futex_wait_requeue_pi(), the q.rt_waiter may be null. If so,
this will result in an oops from the following call graph:
We currently WARN_ON() if this is detected, clearly this is inadequate.
If we detect a mispairing in futex_requeue(), bail out, seding -EINVAL to
user-space.
V2: Fix parenthesis warnings.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: John Kacur <jkacur@redhat.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@linux.vnet.ibm.com>
LKML-Reference: <4A7CA8C0.7010809@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Linus Torvalds [Mon, 10 Aug 2009 18:00:37 +0000 (11:00 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
PCI hotplug: SGI hotplug: do not use hotplug_slot_attr
PCI hotplug: SGI hotplug: fix build failure
Wei Chong Tan reported a fast-PIT-calibration corner-case:
| pit_expect_msb() is vulnerable to SMI disturbance corner case
| in some platforms which causes /proc/cpuinfo to show wrong
| CPU MHz value when quick_pit_calibrate() jumps to success
| section.
I think that the real issue isn't even an SMI - but the fact
that in the very last iteration of the loop, there's no
serializing instruction _after_ the last 'rdtsc'. So even in
the absense of SMI's, we do have a situation where the cycle
counter was read without proper serialization.
The last check should be done outside the outer loop, since
_inside_ the outer loop, we'll be testing that the PIT has
the right MSB value has the right value in the next iteration.
So only the _last_ iteration is special, because that's the one
that will not check the PIT MSB value any more, and because the
final 'get_cycles()' isn't serialized.
In other words:
- I'd like to move the PIT MSB check to after the last
iteration, rather than in every iteration
- I think we should comment on the fact that it's also a
serializing instruction and so 'fences in' the TSC read.
Linus Torvalds [Mon, 10 Aug 2009 16:00:47 +0000 (09:00 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
mm_for_maps: take ->cred_guard_mutex to fix the race with exec
mm_for_maps: shift down_read(mmap_sem) to the caller
mm_for_maps: simplify, use ptrace_may_access()
Figo.zhang [Sat, 8 Aug 2009 13:01:22 +0000 (21:01 +0800)]
mempool.c: clean up type-casting
clean up type-casting twice. "size_t" is typedef as "unsigned long" in
64-bit system, and "unsigned int" in 32-bit system, and the intermediate
cast to 'long' is pointless.
perf_counter: Zero dead bytes from ftrace raw samples size alignment
After aligning the ftrace raw samples, there are dead bytes storing
random data from the stack. We don't want to leak these to userspace,
then zero these out.
perf_counter: Subtract the buffer size field from the event record size
We compute the perf raw sample size by aligning the raw ftrace
event size plus the buffer size field itself. We do that
instead of aligning only the perf raw sample size, so that we
might economize some in some cases.
But this buffer size field is not stored in the perf raw
sample, we must then substract its size from the buffer once we
computed the alignment unless we may get a useless u32 field in
the buffer.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <20090810141129.GA5124@nowhere> Signed-off-by: Ingo Molnar <mingo@elte.hu>
futex: Fix compat_futex to be same as futex for REQUEUE_PI
Need to add the REQUEUE_PI checks to the compat_sys_futex API
as well to ensure 32 bit requeue's work fine on a 64 bit
system. Patch is against latest tip
Peter Staubach [Mon, 10 Aug 2009 12:54:16 +0000 (08:54 -0400)]
NFS: read-modify-write page updating
Hi.
I have a proposal for possibly resolving this issue.
I believe that this situation occurs due to the way that the
Linux NFS client handles writes which modify partial pages.
The Linux NFS client handles partial page modifications by
allocating a page from the page cache, copying the data from
the user level into the page, and then keeping track of the
offset and length of the modified portions of the page. The
page is not marked as up to date because there are portions
of the page which do not contain valid file contents.
When a read call comes in for a portion of the page, the
contents of the page must be read in the from the server.
However, since the page may already contain some modified
data, that modified data must be written to the server
before the file contents can be read back in the from server.
And, since the writing and reading can not be done atomically,
the data must be written and committed to stable storage on
the server for safety purposes. This means either a
FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT.
This has been discussed at length previously.
This algorithm could be described as modify-write-read. It
is most efficient when the application only updates pages
and does not read them.
My proposed solution is to add a heuristic to decide whether
to do this modify-write-read algorithm or switch to a read-
modify-write algorithm when initially allocating the page
in the write system call path. The heuristic uses the modes
that the file was opened with, the offset in the page to
read from, and the size of the region to read.
If the file was opened for reading in addition to writing
and the page would not be filled completely with data from
the user level, then read in the old contents of the page
and mark it as Uptodate before copying in the new data. If
the page would be completely filled with data from the user
level, then there would be no reason to read in the old
contents because they would just be copied over.
This would optimize for applications which randomly access
and update portions of files. The linkage editor for the
C compiler is an example of such a thing.
I tested the attached patch by using rpmbuild to build the
current Fedora rawhide kernel. The kernel without the
patch generated about 269,500 WRITE requests. The modified
kernel containing the patch generated about 261,000 WRITE
requests. Thus, about 8,500 fewer WRITE requests were
generated. I suspect that many of these additional
WRITE requests were probably FILE_SYNC requests to WRITE
a single page, but I didn't test this theory.
The difference between this patch and the previous one was
to remove the unneeded PageDirty() test. I then retested to
ensure that the resulting system continued to behave as
desired.
Thanx...
ps
Signed-off-by: Peter Staubach <staubach@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Peter Zijlstra [Mon, 10 Aug 2009 11:33:05 +0000 (12:33 +0100)]
locking, sched: Give waitqueue spinlocks their own lockdep classes
Give waitqueue spinlocks their own lockdep classes when they
are initialised from init_waitqueue_head(). This means that
struct wait_queue::func functions can operate other waitqueues.
This is used by CacheFiles to catch the page from a backing fs
being unlocked and to wake up another thread to take a copy of
it.
mm_for_maps: take ->cred_guard_mutex to fix the race with exec
The problem is minor, but without ->cred_guard_mutex held we can race
with exec() and get the new ->mm but check old creds.
Now we do not need to re-check task->mm after ptrace_may_access(), it
can't be changed to the new mm under us.
Strictly speaking, this also fixes another very minor problem. Unless
security check fails or the task exits mm_for_maps() should never
return NULL, the caller should get either old or new ->mm.
mm_for_maps: shift down_read(mmap_sem) to the caller
mm_for_maps() takes ->mmap_sem after security checks, this looks
strange and obfuscates the locking rules. Move this lock to its
single caller, m_start().
Oleg Nesterov [Tue, 23 Jun 2009 19:25:32 +0000 (21:25 +0200)]
mm_for_maps: simplify, use ptrace_may_access()
It would be nice to kill __ptrace_may_access(). It requires task_lock(),
but this lock is only needed to read mm->flags in the middle.
Convert mm_for_maps() to use ptrace_may_access(), this also simplifies
the code a little bit.
Also, we do not need to take ->mmap_sem in advance. In fact I think
mm_for_maps() should not play with ->mmap_sem at all, the caller should
take this lock.
With or without this patch, without ->cred_guard_mutex held we can race
with exec() and get the new ->mm but check old creds.
Peter Zijlstra [Mon, 10 Aug 2009 09:16:52 +0000 (11:16 +0200)]
perf_counter: Correct PERF_SAMPLE_RAW output
PERF_SAMPLE_* output switches should unconditionally output the
correct format, as they are the only way to unambiguously parse
the PERF_EVENT_SAMPLE data.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org>
LKML-Reference: <1249896447.17467.74.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Darren Hart [Sun, 9 Aug 2009 22:34:39 +0000 (15:34 -0700)]
futex: Update futex_q lock_ptr on requeue proxy lock
futex_requeue() can acquire the lock on behalf of a waiter
early on or during the requeue loop if it is uncontended or in
the event of a lock steal or owner died. On wakeup, the waiter
(in futex_wait_requeue_pi()) cleans up the pi_state owner using
the lock_ptr to protect against concurrent access to the
pi_state. The pi_state is hung off futex_q's on the requeue
target futex hash bucket so the lock_ptr needs to be updated
accordingly.
The problem manifested by triggering the WARN_ON in
lookup_pi_state() about the pid != pi_state->owner->pid. With
this patch, the pi_state is properly guarded against concurrent
access via the requeue target hb lock.
The astute reviewer may notice that there is a window of time
between when futex_requeue() unlocks the hb locks and when
futex_wait_requeue_pi() will acquire hb2->lock. During this
time the pi_state and uval are not in sync with the underlying
rtmutex owner (but the uval does indicate there are waiters, so
no atomic changes will occur in userspace). However, this is
not a problem. Should a contending thread enter
lookup_pi_state() and acquire hb2->lock before the ownership is
fixed up, it will find the pi_state hung off a waiter's
(possibly the pending owner's) futex_q and block on the
rtmutex. Once futex_wait_requeue_pi() fixes up the owner, it
will also move the pi_state from the old owner's
task->pi_state_list to its own.
v3: Fix plist lock name for application to mainline (rather
than -rt) Compile tested against tip/v2.6.31-rc5.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@linux.vnet.ibm.com>
LKML-Reference: <4A7F4EFF.6090903@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
powerpc/dma: pci_set_dma_mask() shouldn't fail if mask fits in RAM
On an iMac G5, the b43 driver is failing to initialise because trying to
set the dma mask to 30-bit fails. Even though there's only 512MiB of RAM
in the machine anyway:
https://bugzilla.redhat.com/show_bug.cgi?id=514787
We should probably let it succeed if the available RAM in the system
doesn't exceed the requested limit.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
introduced the possibility of an A-B/B-A deadlock between
bd_mutex and reconfig_mutex.
__blkdev_get holds bd_mutex while calling md_open which takes
reconfig_mutex,
do_md_run is always called with reconfig_mutex held, and it now
takes bd_mutex in the call the revalidate_disk.
This potential deadlock was not caught by lockdep due to the
use of mutex_lock_interruptible_nexted which was introduced
by
commit d63a5a74dee87883fda6b7d170244acaac5b05e8
do avoid a warning of an impossible deadlock.
It is quite possible to split reconfig_mutex in to two locks.
One protects the array data structures while it is being
reconfigured, the other ensures that an array is never even partially
open while it is being deactivated.
In particular, the second lock prevents an open from completing
between the time when do_md_stop checks if there are any active opens,
and the time when the array is either set read-only, or when ->pers is
set to NULL. So we can be certain that no IO is in flight as the
array is being destroyed.
So create a new lock, open_mutex, just to ensure exclusion between
'open' and 'stop'.
This avoids the deadlock and also avoids the lockdep warning mentioned
in commit d63a5a74d
Linus Torvalds [Sun, 9 Aug 2009 21:58:21 +0000 (14:58 -0700)]
Merge branch 'kvm-updates/2.6.31' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.31' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Avoid redelivery of edge interrupt before next edge
KVM: MMU: limit rmap chain length
KVM: ia64: fix build failures due to ia64/unsigned long mismatches
KVM: Make KVM_HPAGES_PER_HPAGE unsigned long to avoid build error on powerpc
KVM: fix ack not being delivered when msi present
KVM: s390: fix wait_queue handling
KVM: VMX: Fix locking imbalance on emulation failure
KVM: VMX: Fix locking order in handle_invalid_guest_state
KVM: MMU: handle n_free_mmu_pages > n_alloc_mmu_pages in kvm_mmu_change_mmu_pages
KVM: SVM: force new asid on vcpu migration
KVM: x86: verify MTRR/PAT validity
KVM: PIT: fix kpit_elapsed division by zero
KVM: Fix KVM_GET_MSR_INDEX_LIST
Linus Torvalds [Sun, 9 Aug 2009 21:57:41 +0000 (14:57 -0700)]
Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
posix_cpu_timers_exit_group(): Do not use thread_group_cputimer()
Linus Torvalds [Sun, 9 Aug 2009 21:57:26 +0000 (14:57 -0700)]
Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf_counter: Fix/complete ftrace event records sampling
perf_counter, ftrace: Fix perf_counter integration
tracing/filters: Always free pred on filter_add_subsystem_pred() failure
tracing/filters: Don't use pred on alloc failure
ring-buffer: Fix memleak in ring_buffer_free()
tracing: Fix recordmcount.pl to handle sections with only weak functions
ring-buffer: Fix advance of reader in rb_buffer_peek()
tracing: do not use functions starting with .L in recordmcount.pl
ring-buffer: do not disable ring buffer on oops_in_progress
ring-buffer: fix check of try_to_discard result
Linus Torvalds [Sun, 9 Aug 2009 21:57:09 +0000 (14:57 -0700)]
Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix buffer overflow in efi_init()
x86: Add quirk to make Apple MacBookPro5,1 use reboot=pci
x86: Fix MSI-X initialization by using online_mask for x2apic target_cpus
x86: Fix VMI && stack protector
Linus Torvalds [Sun, 9 Aug 2009 21:56:51 +0000 (14:56 -0700)]
Merge branch 'core-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
lockdep: Fix typos in documentation
lockdep: Fix file mode of lock_stat
rtmutex: Avoid deadlock in rt_mutex_start_proxy_lock()
Trond Myklebust [Sun, 9 Aug 2009 19:14:29 +0000 (15:14 -0400)]
SUNRPC: Allow the cache_detail to specify alternative upcall mechanisms
For events that are rare, such as referral DNS lookups, it makes limited
sense to have a daemon constantly listening for upcalls on a channel. An
alternative in those cases might simply be to run the app that fills the
cache using call_usermodehelper_exec() and friends.
The following patch allows the cache_detail to specify alternative upcall
mechanisms for these particular cases.
Trond Myklebust [Sun, 9 Aug 2009 19:14:28 +0000 (15:14 -0400)]
SUNRPC: Remove the global temporary write buffer in net/sunrpc/cache.c
While we do want to protect against multiple concurrent readers and writers
on each upcall/downcall pipe, we don't want to limit concurrent reading and
writing to separate caches.
This patch therefore replaces the static buffer 'write_buf', which can only
be used by one writer at a time, with use of the page cache as the
temporary buffer for downcalls. We still fall back to using the the old
global buffer if the downcall is larger than PAGE_CACHE_SIZE, since this is
apparently needed by the SPKM security context initialisation.
It then replaces the use of the global 'queue_io_mutex' with the
inode->i_mutex in cache_read() and cache_write().
Trond Myklebust [Sun, 9 Aug 2009 19:14:27 +0000 (15:14 -0400)]
SUNRPC: Ensure we initialise the cache_detail before creating procfs files
Also ensure that we destroy those files before we destroy the cache_detail.
Otherwise, user processes might attempt to write into uninitialised caches.
Trond Myklebust [Sun, 9 Aug 2009 19:14:26 +0000 (15:14 -0400)]
SUNRPC: One more clean up for rpc_create_client_dir()
In order to allow rpc_pipefs to create directories with different types of
subtrees, it is useful to allow the caller to customise the subtree filling
process.
In order to do so, we separate out the parts which are specific to making
an RPC client directory, and put them in a separate helper, then we convert
the process of filling the directory contents into a callback.
Trond Myklebust [Sun, 9 Aug 2009 19:14:25 +0000 (15:14 -0400)]
SUNRPC: clean up rpc_setup_pipedir()
There is still a little wart or two there: Since we've already got a
vfsmount, we might as well pass that in to rpc_create_client_dir.
Another point is that if we open code __rpc_lookup_path() here, then we can
avoid looking up the entire parent directory path over and over again: it
doesn't change.
Also get rid of rpc_clnt->cl_pathname, since it has no users...