Paul E. McKenney [Sat, 14 Nov 2009 03:51:39 +0000 (19:51 -0800)]
rcu: Eliminate __rcu_pending() false positives
Now that there are both ->gpnum and ->completed fields in the
rcu_node structure, __rcu_pending() should check rdp->gpnum and
rdp->completed against rnp->gpnum and rnp->completed, respectively,
instead of the prior comparison against the rcu_state fields
rsp->gpnum and rsp->completed.
Given the old comparison, __rcu_pending() could return 1, resulting
in a needless raise_softirq(RCU_SOFTIRQ). This useless work would
happen if RCU responded to a scheduling-clock interrupt after the
rcu_state fields had been updated, but before the rcu_node fields
had been updated.
Changing the comparison from the rcu_state fields to the rcu_node
fields prevents this useless work from happening.
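
A minimal user-space sketch of the comparison change described above. The three structs are simplified models that carry only the two counters under discussion; they are not the kernel's real layouts, and the function names are hypothetical.

```c
/* Simplified models: only the two counters discussed above. */
struct rcu_state_model { long gpnum; long completed; };
struct rcu_node_model  { long gpnum; long completed; };
struct rcu_data_model  { long gpnum; long completed; };

/* Old-style check: compare the per-CPU view against the global state. */
static int pending_vs_state(const struct rcu_data_model *rdp,
			    const struct rcu_state_model *rsp)
{
	return rdp->completed != rsp->completed || rdp->gpnum != rsp->gpnum;
}

/* New-style check: compare against the leaf rcu_node copy instead. */
static int pending_vs_node(const struct rcu_data_model *rdp,
			   const struct rcu_node_model *rnp)
{
	return rdp->completed != rnp->completed || rdp->gpnum != rnp->gpnum;
}
```

In the window where the rcu_state fields have advanced but the rcu_node fields have not, the first check reports pending work and the second does not.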
Paul E. McKenney [Sat, 14 Nov 2009 03:51:38 +0000 (19:51 -0800)]
rcu: Further cleanups of use of lastcomp
Now that a copy of the rsp->completed field is available in all
rcu_node structures, make full use of it. It is still
legitimate to access rsp->completed while holding the root
rcu_node structure's lock, however.
Also, tighten up force_quiescent_state()'s checks for end of
current grace period.
Paul E. McKenney [Fri, 13 Nov 2009 06:35:04 +0000 (22:35 -0800)]
rcu: Simplify association of forced quiescent states with grace periods
The force_quiescent_state() function also took a snapshot
of the ->completed field, which was as obnoxious as it was in
rcu_sched_qs() and friends. So snapshot ->gpnum-1.
Also, since the dyntick_record_completed() and
dyntick_recall_completed() functions are now simple assignments
that are independent of CONFIG_NO_HZ, and since their names are
now misleading, get rid of them.
Paul E. McKenney [Fri, 13 Nov 2009 06:35:03 +0000 (22:35 -0800)]
rcu: Accelerate callback processing on CPUs not detecting GP end
An earlier fix for a race resulted in a situation where the CPUs
other than the CPU that detected the end of the grace period would
not process their callbacks until the next grace period started.
This means that these other CPUs would unnecessarily demand that an
extra grace period be started.
This patch eliminates this extra grace period and speeds callback
processing by propagating rsp->completed to the rcu_node structures
in the case where the CPU detecting the end of the grace period
sees no reason to start a new grace period.
Paul E. McKenney [Tue, 10 Nov 2009 21:37:22 +0000 (13:37 -0800)]
rcu: Simplify association of quiescent states with grace periods
The rdp->passed_quiesc_completed fields are used to properly
associate the recorded quiescent state with a grace period. It
is OK to wrongly associate a given quiescent state with a
preceding grace period, but it is fatal to associate a given
quiescent state with a grace period that begins after the
quiescent state occurred. Grace periods are numbered, and the
following fields track them:
o ->gpnum is the number of the grace period currently in
progress, or the number of the last grace period to
complete if no grace period is currently in progress.
o ->completed is the number of the last grace period to
have completed.
These two fields are equal if there is no grace period in
progress, otherwise ->gpnum is one greater than ->completed.
But the rdp->passed_quiesc_completed field is compared against
->completed, and if equal, the quiescent state is presumed to
count against the current grace period.
The earlier code copied rdp->completed to
rdp->passed_quiesc_completed, which has been made to work, but
is error-prone. In contrast, copying one less than rdp->gpnum
is guaranteed safe, because rdp->gpnum is not incremented until
after the start of the corresponding grace period. At the end of
the grace period, when ->completed has incremented, then any
quiescent states recorded previously will be discarded.
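
The snapshot rule can be sketched as a small user-space model. The struct and function names below are illustrative assumptions, not the kernel's definitions; only the ->gpnum, ->completed and passed_quiesc_completed names come from the text above.

```c
/* Simplified per-CPU model; not the kernel's struct rcu_data. */
struct rcu_data_model {
	long gpnum;                   /* grace period this CPU knows about */
	long completed;               /* last grace period known to be done */
	int  passed_quiesc;           /* has a quiescent state been recorded? */
	long passed_quiesc_completed; /* snapshot taken when it was recorded */
};

/* Record a quiescent state, snapshotting one less than ->gpnum. */
static void record_qs(struct rcu_data_model *rdp)
{
	rdp->passed_quiesc = 1;
	rdp->passed_quiesc_completed = rdp->gpnum - 1;
}

/*
 * The recorded state counts against the current grace period only if
 * the snapshot still equals the current ->completed value.
 */
static int qs_counts_for_current_gp(const struct rcu_data_model *rdp,
				    long completed_now)
{
	return rdp->passed_quiesc &&
	       rdp->passed_quiesc_completed == completed_now;
}
```

With a grace period in progress (gpnum one greater than completed) the snapshot matches ->completed and the quiescent state counts. If no grace period was in progress when the state was recorded, the snapshot is one behind ->completed, so the state cannot be credited to a grace period that starts later.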
Impose a clear locking design on the note_new_gpnum()
function's use of the ->gpnum counter. This is done by updating
rdp->gpnum only from the corresponding leaf rcu_node structure's
rnp->gpnum field, and even then only under the protection of
that same rcu_node structure's ->lock field. Performance and
scalability are maintained using a form of double-checked
locking, and excessive spinning is avoided by use of the
spin_trylock() function. The use of spin_trylock() is safe
because CPUs that fail to acquire this lock will try
again later. The hierarchical nature of the rcu_node data
structure limits contention (which could be limited further if
need be using the RCU_FANOUT kernel parameter).
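
The double-checked locking pattern with a trylock can be sketched in user space as follows. This is a model under stated assumptions: a C11 atomic_flag stands in for the rcu_node ->lock, the struct and function names are hypothetical, and the unlocked read is only safe here because the example is single-threaded (a concurrent version would need an atomic or otherwise protected read).

```c
#include <stdatomic.h>

struct node_model {
	atomic_flag lock;  /* stand-in for the rcu_node ->lock */
	long gpnum;        /* protected by lock */
};

struct cpu_model {
	long gpnum;        /* this CPU's private copy */
};

/* Returns 1 if the CPU's copy was updated, 0 otherwise. */
static int note_new_gpnum_model(struct node_model *rnp, struct cpu_model *rdp)
{
	int updated = 0;

	/* First check, without the lock: cheap exit if nothing changed. */
	if (rdp->gpnum == rnp->gpnum)
		return 0;

	/* Trylock: on contention give up instead of spinning. */
	if (atomic_flag_test_and_set_explicit(&rnp->lock, memory_order_acquire))
		return 0;  /* the caller simply tries again later */

	/* Second check, now under the lock. */
	if (rdp->gpnum != rnp->gpnum) {
		rdp->gpnum = rnp->gpnum;
		updated = 1;
	}
	atomic_flag_clear_explicit(&rnp->lock, memory_order_release);
	return updated;
}
```

A failed trylock leaves the CPU's copy stale, which is harmless because the next invocation repeats both checks.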
Without this patch, obscure but quite possible races could
result in a quiescent state that occurred during one grace
period being accounted to the following grace period, causing
this following grace period to end prematurely. Not good!
rcu: Fix synchronization for rcu_process_gp_end() uses of ->completed counter
Impose a clear locking design on the rcu_process_gp_end()
function's use of the ->completed counter. This is done by
creating a ->completed field in the rcu_node structure, which
can safely be accessed under the protection of that structure's
lock. Performance and scalability are maintained by using a
form of double-checked locking, so that rcu_process_gp_end()
only acquires the leaf rcu_node structure's ->lock if a grace
period has recently ended.
This fix reduces rcutorture failure rate by at least two orders
of magnitude under heavy stress with force_quiescent_state()
being invoked artificially often. Without this fix,
unsynchronized access to the ->completed field can cause
rcu_process_gp_end() to advance callbacks whose grace period has
not yet expired. (Bad idea!)
rcu: Prepare for synchronization fixes: clean up for non-NO_HZ handling of ->completed counter
Impose a clear locking design on non-NO_HZ handling of the
->completed counter. This increases the distance between the
RCU and the CPU-hotplug mechanisms.
Currently, rcu_irq_exit() is invoked only for CONFIG_NO_HZ,
while rcu_irq_enter() is invoked unconditionally. This patch
moves rcu_irq_exit() out from under CONFIG_NO_HZ so that the
calls are balanced.
This patch has no effect on the behavior of the kernel because
both rcu_irq_enter() and rcu_irq_exit() are empty for
!CONFIG_NO_HZ, but the code is easier to understand if the calls
are obviously balanced in all cases.
Paul E. McKenney [Wed, 28 Oct 2009 15:14:49 +0000 (08:14 -0700)]
rcu: Fix long-grace-period race between forcing and initialization
Very long RCU read-side critical sections (50 milliseconds or
so) can cause a race between force_quiescent_state() and
rcu_start_gp() as follows on kernel builds with multi-level
rcu_node hierarchies:
1. CPU 0 calls force_quiescent_state(), sees that there is a
grace period in progress, and acquires ->fqslock.
2. CPU 1 detects the end of the grace period, and so
cpu_quiet_msk_finish() sets rsp->completed to rsp->gpnum.
This operation is carried out under the root rnp->lock,
but CPU 0 has not yet acquired that lock. Note that
rsp->signaled is still RCU_SAVE_DYNTICK from the last
grace period.
3. CPU 1 calls rcu_start_gp(), but no one wants a new grace
period, so it drops the root rnp->lock and returns.
4. CPU 0 acquires the root rnp->lock and picks up rsp->completed
and rsp->signaled, then drops rnp->lock. It then enters the
RCU_SAVE_DYNTICK leg of the switch statement.
5. CPU 2 invokes call_rcu(), and now needs a new grace period.
It calls rcu_start_gp(), which acquires the root rnp->lock, sets
rsp->signaled to RCU_GP_INIT (too bad that CPU 0 is already in
the RCU_SAVE_DYNTICK leg of the switch statement!) and starts
initializing the rcu_node hierarchy. If there are multiple
levels to the hierarchy, it will drop the root rnp->lock and
initialize the lower levels of the hierarchy.
6. CPU 0 notes that rsp->completed has not changed, which permits
both CPU 2 and CPU 0 to try updating rsp->signaled concurrently. If CPU 0's
update prevails, later calls to force_quiescent_state() can
count old quiescent states against the new grace period, which
can in turn result in premature ending of grace periods.
Not good.
This patch adds an RCU_GP_IDLE state for rsp->signaled that is
set initially at boot time and any time a grace period ends.
This prevents CPU 0 from getting into the workings of
force_quiescent_state() in step 4. Additional locking and
checks prevent the concurrent update of rsp->signaled in step 6.
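
The effect of the new idle state can be sketched with a small state model. RCU_GP_IDLE, RCU_GP_INIT and RCU_SAVE_DYNTICK are named in the text above; RCU_FORCE_QS, the struct, and both helper functions are assumptions made for illustration, and the additional locking mentioned above is not modeled.

```c
enum signaled_model {
	RCU_GP_IDLE,       /* no grace period in progress */
	RCU_GP_INIT,       /* grace period being initialized */
	RCU_SAVE_DYNTICK,  /* dyntick state needs to be scanned */
	RCU_FORCE_QS       /* assumed name: quiescent states need forcing */
};

struct rcu_state_model {
	long gpnum;
	long completed;
	enum signaled_model signaled;
};

/* End of grace period: mark it complete and go idle. */
static void gp_end_model(struct rcu_state_model *rsp)
{
	rsp->completed = rsp->gpnum;
	rsp->signaled = RCU_GP_IDLE;
}

/* Returns 1 if forcing would proceed past its initial checks. */
static int force_qs_would_proceed(const struct rcu_state_model *rsp)
{
	if (rsp->signaled == RCU_GP_IDLE || rsp->signaled == RCU_GP_INIT)
		return 0;  /* nothing to force, or initialization in progress */
	return 1;
}
```

Because the end of a grace period resets the state to idle, a stale RCU_SAVE_DYNTICK value can no longer be picked up as in step 4.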
Thomas Gleixner [Mon, 2 Nov 2009 12:01:56 +0000 (13:01 +0100)]
uids: Prevent tear down race
Ingo triggered the following warning:
WARNING: at lib/debugobjects.c:255 debug_print_object+0x42/0x50()
Hardware name: System Product Name
ODEBUG: init active object type: timer_list
Modules linked in:
Pid: 2619, comm: dmesg Tainted: G W 2.6.32-rc5-tip+ #5298
Call Trace:
[<81035443>] warn_slowpath_common+0x6a/0x81
[<8120e483>] ? debug_print_object+0x42/0x50
[<81035498>] warn_slowpath_fmt+0x29/0x2c
[<8120e483>] debug_print_object+0x42/0x50
[<8120ec2a>] __debug_object_init+0x279/0x2d7
[<8120ecb3>] debug_object_init+0x13/0x18
[<810409d2>] init_timer_key+0x17/0x6f
[<81041526>] free_uid+0x50/0x6c
[<8104ed2d>] put_cred_rcu+0x61/0x72
[<81067fac>] rcu_do_batch+0x70/0x121
debugobjects warns about an enqueued timer being initialized. If
CONFIG_USER_SCHED=y the user management code uses delayed work to
remove the user from the hash table and tear down the sysfs objects.
free_uid is called from RCU and initializes/schedules delayed work if
the usage count of the user_struct is 0. The init/schedule happens
outside of the uidhash_lock protected region which allows a concurrent
caller of find_user() to reference the about to be destroyed
user_struct w/o preventing the work from being scheduled. If the next
free_uid call happens before the work timer expired then the active
timer is initialized and the work scheduled again.
The race was introduced in commit 5cb350ba (sched: group scheduling,
sysfs tunables) and made more prominent by commit 3959214f (sched:
delayed cleanup of user_struct).
Move the init/schedule_delayed_work inside of the uidhash_lock
protected region to prevent the race.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: Paul E. McKenney <paulmck@us.ibm.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: stable@kernel.org
Thomas Gleixner [Wed, 28 Oct 2009 19:26:48 +0000 (20:26 +0100)]
futex: Fix spurious wakeup for requeue_pi really
The requeue_pi path doesn't use unqueue_me() (and the racy lock_ptr ==
NULL test) nor does it use the wake_list of futex_wake(), which were
the reasons for commit 41890f2 (futex: Handle spurious wake up).
See the debugging discussion on LKML, Message-ID: <4AD4080C.20703@us.ibm.com>
The changes in that fix to the wait_requeue_pi path were considered to
be a likely unnecessary, but harmless safety net. But it turns out that
due to the fact that for unknown $@#!*( reasons EWOULDBLOCK is defined
as EAGAIN we built an endless loop in the code path which correctly
returns EWOULDBLOCK.
Spurious wakeups in wait_requeue_pi code path are unlikely so we do
the easy solution and return EWOULDBLOCK^WEAGAIN to user space and let
it deal with the spurious wakeup.
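
The aliasing that caused the loop can be checked from user space. The two helpers below are hypothetical and only illustrate why a "retry on EAGAIN" rule cannot tell the two names apart on Linux, where they expand to the same number.

```c
#include <errno.h>

/* Nonzero when the two error names share one value (true on Linux). */
static int ewouldblock_is_eagain(void)
{
	return EWOULDBLOCK == EAGAIN;
}

/* Hypothetical retry policy: retry whenever the error is EAGAIN. */
static int should_retry(int err)
{
	return err == EAGAIN;
}
```

Any code path that returns EWOULDBLOCK to signal "give up" is therefore treated by such a policy as "try again".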
Cc: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: John Stultz <johnstul@linux.vnet.ibm.com> Cc: Dinakar Guniguntala <dino@in.ibm.com>
LKML-Reference: <4AE23C74.1090502@us.ibm.com> Cc: stable@kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Paul E. McKenney [Mon, 26 Oct 2009 20:57:44 +0000 (13:57 -0700)]
rcu: Fix TINY_RCU #elif condition
Some compilers are happy with "#elif CONFIG_RCU_TINY", while
others strongly prefer "#elif defined(CONFIG_RCU_TINY)". Change
to the latter to make more compilers happy.
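
A short illustration of the difference, using made-up option names with a _DEMO suffix. A bare "#elif NAME" asks the preprocessor to evaluate NAME as an integer expression, so it depends on how NAME is defined, whereas "defined(NAME)" only tests whether it exists.

```c
/*
 * Demo only: pretend the build selected the tiny flavor. The macro is
 * defined with an empty value, which a bare "#elif NAME" cannot evaluate.
 */
#define CONFIG_RCU_TINY_DEMO

static const char *rcu_flavor(void)
{
#if defined(CONFIG_RCU_TREE_DEMO)
	return "tree";
#elif defined(CONFIG_RCU_TINY_DEMO)
	return "tiny";
#else
	return "unknown";
#endif
}
```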
Peter Zijlstra [Mon, 26 Oct 2009 17:24:31 +0000 (10:24 -0700)]
rcu: Simplify creating of lockdep class for root rcu_node
Use lockdep_set_class() to simplify the code and to avoid any
additional overhead in the !LOCKDEP case. Also move the
definition of rcu_root_class into kernel/rcutree.c, as suggested
by Lai Jiangshan.
Paul E. McKenney [Mon, 26 Oct 2009 02:03:51 +0000 (19:03 -0700)]
rcu: Add synchronize_srcu_expedited()
This patch creates a synchronize_srcu_expedited() that uses
synchronize_sched_expedited() where synchronize_srcu()
uses synchronize_sched(). The synchronize_srcu() and
synchronize_srcu_expedited() functions become one-liners that
pass synchronize_sched() or synchronize_sched_expedited(),
respectively, to a new __synchronize_srcu() function.
While in the file, move the EXPORT_SYMBOL_GPL()s to immediately
follow the corresponding functions.
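
The one-liner structure can be sketched with function pointers. All names ending in _model or _stub are hypothetical stand-ins; the stubs only count calls, and the number of times the real common function invokes the primitive is not modeled.

```c
static int normal_calls;
static int expedited_calls;

/* Stand-ins for synchronize_sched() and synchronize_sched_expedited(). */
static void sync_sched_stub(void)           { normal_calls++; }
static void sync_sched_expedited_stub(void) { expedited_calls++; }

struct srcu_model { int unused; };

/* Common body: waits using whichever primitive the caller passed in. */
static void synchronize_srcu_common_model(struct srcu_model *sp,
					  void (*sync_func)(void))
{
	(void)sp;     /* the model keeps no per-domain state */
	sync_func();
}

/* The two public entry points become one-liners. */
static void synchronize_srcu_model(struct srcu_model *sp)
{
	synchronize_srcu_common_model(sp, sync_sched_stub);
}

static void synchronize_srcu_expedited_model(struct srcu_model *sp)
{
	synchronize_srcu_common_model(sp, sync_sched_expedited_stub);
}
```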
Paul E. McKenney [Mon, 26 Oct 2009 02:03:50 +0000 (19:03 -0700)]
rcu: "Tiny RCU", The Bloatwatch Edition
This patch provides a version of RCU designed for !SMP, giving a
small-footprint RCU implementation. In particular, the
implementation of synchronize_rcu() is extremely lightweight and
high performance. It passes rcutorture testing in each of the
four relevant configurations (combinations of NO_HZ and PREEMPT)
on x86. This saves about 1K bytes compared to old Classic RCU
(which is no longer in mainline), and more than three kilobytes
compared to Hierarchical RCU (updated to 2.6.30):
CONFIG_TREE_RCU:
   text    data     bss     dec  filename
    183       4       0     187  kernel/rcupdate.o
   2783     520      36    3339  kernel/rcutree.o
                           3526  Total (vs 4565 for v7)
CONFIG_TREE_PREEMPT_RCU:
   text    data     bss     dec  filename
    263       4       0     267  kernel/rcupdate.o
   4594     776      52    5422  kernel/rcutree.o
                           5689  Total (6155 for v7)
CONFIG_TINY_RCU:
   text    data     bss     dec  filename
     96       4       0     100  kernel/rcupdate.o
    734      24       0     758  kernel/rcutiny.o
                            858  Total (vs 848 for v7)
The above is for x86. Your mileage may vary on other platforms.
Further compression is possible, but is being procrastinated.
Changes from v7 (http://lkml.org/lkml/2009/10/9/388)
o Apply Lai Jiangshan's review comments (aside from
might_sleep() in synchronize_sched(), which is covered by SMP builds).
o Fix up expedited primitives.
Changes from v6 (http://lkml.org/lkml/2009/9/23/293).
o Forward ported to put it into the 2.6.33 stream.
o Added lockdep support.
o Make lightweight rcu_barrier.
Changes from v5 (http://lkml.org/lkml/2009/6/23/12).
o Ported to latest pre-2.6.32 merge window kernel.
- Renamed rcu_qsctr_inc() to rcu_sched_qs().
- Renamed rcu_bh_qsctr_inc() to rcu_bh_qs().
- Provided trivial rcu_cpu_notify().
- Provided trivial exit_rcu().
- Provided trivial rcu_needs_cpu().
- Fixed up the rcu_*_enter/exit() functions in linux/hardirq.h.
o Removed the dependence on EMBEDDED, with a view to making
TINY_RCU default for !SMP at some time in the future.
o Added (trivial) support for expedited grace periods.
Changes from v4 (http://lkml.org/lkml/2009/5/2/91) include:
o Squeeze the size down a bit further by removing the
->completed field from struct rcu_ctrlblk.
o This permits synchronize_rcu() to become the empty function.
Previous concerns about rcutorture were unfounded, as
rcutorture correctly handles a constant value from
rcu_batches_completed() and rcu_batches_completed_bh().
Changes from v3 (http://lkml.org/lkml/2009/3/29/221) include:
o Changed rcu_batches_completed(), rcu_batches_completed_bh()
rcu_enter_nohz(), rcu_exit_nohz(), rcu_nmi_enter(), and
rcu_nmi_exit(), to be static inlines, as suggested by David
Howells. Doing this saves about 100 bytes from rcutiny.o.
(The numbers between v3 and this v4 of the patch are not directly
comparable, since they are against different versions of Linux.)
Changes from v2 (http://lkml.org/lkml/2009/2/3/333) include:
o Fix whitespace issues.
o Change short-circuit "||" operator to instead be "+" in order
to fix performance bug noted by "kraai" on LWN.
(http://lwn.net/Articles/324348/)
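
The difference between the two operators can be shown with a counting helper. The names below are hypothetical; the point is only that "||" skips its right-hand call once the left-hand call returns nonzero, while "+" evaluates both operands (in an unspecified order).

```c
static int calls;

/* Hypothetical helper that has a side effect and returns a status. */
static int helper_model(int ret)
{
	calls++;
	return ret;
}

/* Short-circuit: the second call is skipped when the first is nonzero. */
static int both_with_or(void)
{
	return helper_model(1) || helper_model(0);
}

/* Addition: both calls always run. */
static int both_with_plus(void)
{
	return helper_model(1) + helper_model(0);
}
```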
Changes from v1 (http://lkml.org/lkml/2009/1/13/440) include:
o This version depends on EMBEDDED as well as !SMP, as suggested
by Ingo.
o Updated rcu_needs_cpu() to unconditionally return zero,
permitting the CPU to enter dynticks-idle mode at any time.
This works because callbacks can be invoked upon entry to
dynticks-idle mode.
o Paul is now OK with this being included, based on a poll at
the Kernel Miniconf at linux.conf.au, where about ten people said
that they cared about saving 900 bytes on single-CPU systems.
Paul E. McKenney [Thu, 15 Oct 2009 16:26:14 +0000 (09:26 -0700)]
rcu: Fix TREE_PREEMPT_RCU CPU_HOTPLUG bad-luck hang
If the following sequence of events occurs, then
TREE_PREEMPT_RCU will hang waiting for a grace period to
complete, eventually OOMing the system:
o A TREE_PREEMPT_RCU build of the kernel is booted on a system
with more than 64 physical CPUs present (32 on a 32-bit system).
Alternatively, a TREE_PREEMPT_RCU build of the kernel is booted
with RCU_FANOUT set to a sufficiently small value that the
physical CPUs populate two or more leaf rcu_node structures.
o A task is preempted in an RCU read-side critical section
while running on a CPU corresponding to a given leaf rcu_node
structure.
o All CPUs corresponding to this same leaf rcu_node structure
record quiescent states for the current grace period.
o All of these same CPUs go offline (hence the need for enough
physical CPUs to populate more than one leaf rcu_node structure).
This causes the preempted task to be moved to the root rcu_node
structure.
At this point, there is nothing left to cause the quiescent
state to be propagated up the rcu_node tree, so the current
grace period never completes.
The simplest fix, especially after considering the deadlock
possibilities, is to detect this situation when the last CPU is
offlined, and to set that CPU's ->qsmask bit in its leaf
rcu_node structure. This will cause the next invocation of
force_quiescent_state() to end the grace period.
Without this fix, this hang can be triggered in an hour or so on
some machines with rcutorture and random CPU onlining/offlining.
With this fix, these same machines pass a full 10 hours of this
sort of abuse.
Paul E. McKenney [Wed, 14 Oct 2009 17:15:56 +0000 (10:15 -0700)]
rcu: Stopgap fix for synchronize_rcu_expedited() for TREE_PREEMPT_RCU
For the short term, map synchronize_rcu_expedited() to
synchronize_rcu() for TREE_PREEMPT_RCU and to
synchronize_sched_expedited() for TREE_RCU.
Longer term, there needs to be a real expedited grace period for
TREE_PREEMPT_RCU, but candidate patches to date are considerably
more complex and intrusive.
Paul E. McKenney [Wed, 14 Oct 2009 17:15:55 +0000 (10:15 -0700)]
rcu: Prevent RCU IPI storms in presence of high call_rcu() load
As the number of callbacks on a given CPU rises, invoke
force_quiescent_state() only every blimit number of callbacks
(defaults to 10,000), and even then only if no other CPU has
invoked force_quiescent_state() in the meantime.
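
The throttling rule can be sketched as a user-space model. The struct layouts, field names, and the FQS_THRESHOLD_DEMO constant are assumptions chosen for illustration (the constant uses the 10,000 figure quoted above); this is not the kernel's code.

```c
#define FQS_THRESHOLD_DEMO 10000  /* assumed: callbacks between checks */

struct rcu_state_model {
	unsigned long n_force_qs;       /* total forcing attempts, all CPUs */
};

struct rcu_data_model {
	long qlen;                      /* callbacks queued on this CPU */
	long qlen_last_fqs_check;       /* qlen at the last threshold check */
	unsigned long n_force_qs_snap;  /* n_force_qs seen at that check */
};

static void force_qs_model(struct rcu_state_model *rsp)
{
	rsp->n_force_qs++;
}

/* Queue one callback; returns 1 if this enqueue forced quiescent states. */
static int enqueue_callback_model(struct rcu_state_model *rsp,
				  struct rcu_data_model *rdp)
{
	int forced = 0;

	if (++rdp->qlen > rdp->qlen_last_fqs_check + FQS_THRESHOLD_DEMO) {
		/* Force only if nobody else has done so since the last check. */
		if (rsp->n_force_qs == rdp->n_force_qs_snap) {
			force_qs_model(rsp);
			forced = 1;
		}
		rdp->n_force_qs_snap = rsp->n_force_qs;
		rdp->qlen_last_fqs_check = rdp->qlen;
	}
	return forced;
}
```

The snapshot comparison is what suppresses redundant forcing: if another CPU forced quiescent states in the meantime, the counter differs and this CPU only refreshes its snapshot.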
This should fix the performance regression reported by Nick.
Darren Hart [Wed, 14 Oct 2009 17:12:39 +0000 (10:12 -0700)]
futex: Check for NULL keys in match_futex
If userspace tries to perform a requeue_pi on a non-requeue_pi waiter,
it will find the futex_q->requeue_pi_key to be NULL and OOPS.
Check for NULL in match_futex() instead of doing explicit NULL pointer
checks on all call sites. While match_futex(NULL, NULL) returning
false is a little odd, it's still correct as we expect valid key
references.
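
A minimal sketch of a NULL-tolerant key comparison. The key struct here is a simplified stand-in with assumed fields, not the kernel's union futex_key, and the function name is hypothetical.

```c
#include <stddef.h>

/* Simplified stand-in for a futex key. */
struct futex_key_model {
	unsigned long word;
	void *ptr;
	int offset;
};

/* Returns nonzero only when both keys are non-NULL and identical. */
static int match_futex_model(const struct futex_key_model *key1,
			     const struct futex_key_model *key2)
{
	return key1 && key2 &&
	       key1->word == key2->word &&
	       key1->ptr == key2->ptr &&
	       key1->offset == key2->offset;
}
```

Putting the NULL test inside the comparison means callers need no checks of their own, at the cost of two NULL keys comparing as unequal.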
Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Dinakar Guniguntala <dino@in.ibm.com> CC: John Stultz <johnstul@us.ibm.com> Cc: stable@kernel.org
LKML-Reference: <4AD60687.10306@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 13 Oct 2009 18:40:43 +0000 (20:40 +0200)]
futex: Handle spurious wake up
The futex code does not handle spurious wake up in futex_wait and
futex_wait_requeue_pi.
The code assumes that any wake up which was not caused by futex_wake /
requeue or by a timeout was caused by a signal wake up and returns one
of the syscall restart error codes.
In case of a spurious wake up the signal delivery code which deals
with the restart error codes is not invoked and we return that error
code to user space. That causes applications which actually check the
return codes to fail. Blaise reported that on preempt-rt a python test
program ran into an exception trap. -rt exposed that due to a built-in
spurious wake up accelerator :)
Solve this by checking signal_pending(current) in the wake up path and
handle the spurious wake up case w/o returning to user space.
Reported-by: Blaise Gassend <blaise@willowgarage.com> Debugged-by: Darren Hart <dvhltc@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@kernel.org
LKML-Reference: <new-submission>
Robert Richter [Fri, 9 Oct 2009 01:17:44 +0000 (03:17 +0200)]
oprofile: warn on freeing event buffer too early
A race shouldn't happen since all workqueues or handlers are canceled
or flushed before the event buffer is freed. A warning is triggered
now if the buffer is freed too early.
Also, this patch adds some comments about event buffer protection,
reworks some code and adds code to clear buffer_pos during alloc and
free of the event buffer.
Cc: David Rientjes <rientjes@google.com> Cc: Stephane Eranian <eranian@google.com> Signed-off-by: Robert Richter <robert.richter@amd.com>
David Rientjes [Wed, 9 Sep 2009 13:02:33 +0000 (15:02 +0200)]
oprofile: fix race condition in event_buffer free
Looking at the 2.6.31-rc9 code, it appears there is a race condition
in the event_buffer cleanup code path (shutdown). This could lead to
kernel panic as some CPUs may be operating on the event buffer AFTER
it has been freed. The attached patch solves the problem and makes
sure CPUs check if the buffer is not NULL before they access it as
some may have been spinning on the mutex while the buffer was being
freed.
The race may happen if the buffer is freed during pending reads. But
it is not clear why there are races in add_event_entry() since all
workqueues or handlers are canceled or flushed before the event buffer
is freed.
Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Robert Richter <robert.richter@amd.com>
Peter Zijlstra [Fri, 9 Oct 2009 08:12:41 +0000 (10:12 +0200)]
lockdep: Use cpu_clock() for lockstat
Some tracepoint magic (TRACE_EVENT(lock_acquired)) relies on
the fact that lock hold times are positive and uses div64 on
that. That triggered a build warning on MIPS, and probably
causes bad output in certain circumstances as well.
Make it truly positive.
Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1254818502.21044.112.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Linus Torvalds [Thu, 8 Oct 2009 19:22:45 +0000 (12:22 -0700)]
Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
pata_atp867x: add Power Management support
pata_atp867x: PIO support fixes
pata_atp867x: clarifications in timings calculations and cable detection
pata_atp867x: fix it to not claim MWDMA support
libata: fix incorrect link online check during probe
ahci: filter FPDMA non-zero offset enable for Aspire 3810T
libata: make gtf_filter per-dev
libata: implement more acpi filtering options
libata: cosmetic updates
ahci: display all AHCI 1.3 HBA capability flags (v2)
pata_ali: trivial fix of a very frequent spelling mistake
ahci: disable 64bit DMA by default on SB600s
Linus Torvalds [Thu, 8 Oct 2009 19:16:35 +0000 (12:16 -0700)]
Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
futex: fix requeue_pi key imbalance
futex: Fix typo in FUTEX_WAIT/WAKE_BITSET_PRIVATE definitions
rcu: Place root rcu_node structure in separate lockdep class
rcu: Make hot-unplugged CPU relinquish its own RCU callbacks
rcu: Move rcu_barrier() to rcutree
futex: Move exit_pi_state() call to release_mm()
futex: Nullify robust lists after cleanup
futex: Fix locking imbalance
panic: Fix panic message visibility by calling bust_spinlocks(0) before dying
rcu: Replace the rcu_barrier enum with pointer to call_rcu*() function
rcu: Clean up code based on review feedback from Josh Triplett, part 4
rcu: Clean up code based on review feedback from Josh Triplett, part 3
rcu: Fix rcu_lock_map build failure on CONFIG_PROVE_LOCKING=y
rcu: Clean up code to address Ingo's checkpatch feedback
rcu: Clean up code based on review feedback from Josh Triplett, part 2
rcu: Clean up code based on review feedback from Josh Triplett
Linus Torvalds [Thu, 8 Oct 2009 19:07:24 +0000 (12:07 -0700)]
Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Set correct normal_prio and prio values in sched_fork()
Linus Torvalds [Thu, 8 Oct 2009 19:06:09 +0000 (12:06 -0700)]
Merge branch 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: user local buffer variable for trace branch tracer
tracing: fix warning on kernel/trace/trace_branch.c andtrace_hw_branches.c
ftrace: check for failure for all conversions
tracing: correct module boundaries for ftrace_release
tracing: fix transposed numbers of lock_depth and preempt_count
trace: Fix missing assignment in trace_ctxwake_*
tracing: Use free_percpu instead of kfree
tracing: Check total refcount before releasing bufs in profile_enable failure
Linus Torvalds [Thu, 8 Oct 2009 19:05:50 +0000 (12:05 -0700)]
Merge branch 'sparc-perf-events-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'sparc-perf-events-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
mm, perf_event: Make vmalloc_user() align base kernel virtual address to SHMLBA
perf_event: Provide vmalloc() based mmap() backing
Linus Torvalds [Thu, 8 Oct 2009 19:05:00 +0000 (12:05 -0700)]
Merge branch 'perf-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf_events: Make ABI definitions available to userspace
perf tools: elf_sym__is_function() should accept "zero" sized functions
tracing/syscalls: Use long for syscall ret format and field definitions
perf trace: Update eval_flag() flags array to match interrupt.h
perf trace: Remove unused code in builtin-trace.c
perf: Propagate term signal to child
Linus Torvalds [Thu, 8 Oct 2009 19:04:04 +0000 (12:04 -0700)]
Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, timers: Check for pending timers after (device) interrupts
NOHZ: update idle state also when NOHZ is inactive
Linus Torvalds [Thu, 8 Oct 2009 19:03:21 +0000 (12:03 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
ALSA: ice1724: increase SPDIF and independent stereo buffer sizes
ALSA: opl3: circular locking in the snd_opl3_note_on() and snd_opl3_note_off()
ALSA: ICE1712/24 - Change the Multi Track Peak control (level meters) from MIXER to PCM type
ALSA: hda - Fix yet another auto-mic bug in ALC268
ASoC: WM8350 capture PGA mutes are inverted
ASoC: Remove absent SYNC and TDM DAI format options from i.MX SSI
sound: via82xx: move DXS volume controls to PCM interface
ALSA: hda - Don't pick up invalid HP pins in alc_subsystem_id()
ALSA: hda - Add a workaround for ASUS A7K
ALSA: hda - Fix invalid initializations for ALC861 auto mode
ASoC: wm8940: Fix check on error code form snd_soc_codec_set_cache_io
ASoC: Fix SND_SOC_DAPM_LINE handling
Linus Torvalds [Thu, 8 Oct 2009 19:02:06 +0000 (12:02 -0700)]
Merge branch 'drm-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6
* 'drm-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6: (24 commits)
drm/radeon/kms: fix vline register for second head.
drm/r600: avoid assigning vb twice in blit code
drm/radeon: use list_for_each_entry instead of list_for_each
drm/radeon/kms: Fix AGP support for R600/RV770 family (v2)
drm/radeon/kms: Fallback to non AGP when acceleration fails to initialize (v2)
drm/radeon/kms: Fix RS600/RV515/R520/RS690 IRQ
drm/radeon: Fix setting of bits
drm/ttm: fix refcounting in ttm global code.
drm/fb: add more correct 8/16/24/32 bpp fb support.
drm/fb: add setcmap and fix 8-bit support.
drm/radeon/kms: respect single crtc cards, only create one crtc. (v2)
drm: Delete the DRM_DEBUG_KMS in drm_mode_cursor_ioctl
drm/radeon/kms: add support for "Surround View"
drm/radeon/kms: Fix irq handling on AVIVO hw
drm/radeon/kms: R600/RV770 remove dead code and print message for wrong BIOS
drm/radeon/kms: Fix R600/RV770 disable acceleration path
drm/radeon/kms: Fix R600/RV770 startup path & reset
drm/radeon/kms: Fix R600 write back buffer
drm/radeon/kms: Remove old init path as no hw use it anymore
drm/radeon/kms: Convert RS600 to new init path
...
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (40 commits)
ethoc: limit the number of buffers to 128
ethoc: use system memory as buffer
ethoc: align received packet to make IP header at word boundary
ethoc: fix buffer address mapping
ethoc: fix typo to compute number of tx descriptors
au1000_eth: Duplicate test of RX_OVERLEN bit in update_rx_stats()
netxen: Fix Unlikely(x) > y
pasemi_mac: ethtool get settings fix
add maintainer for network drop monitor kernel service
tg3: Fix phylib locking strategy
rndis_host: support ETHTOOL_GPERMADDR
ipv4: arp_notify address list bug
gigaset: add kerneldoc comments
gigaset: correct debugging output selection
gigaset: improve error recovery
gigaset: fix device ERROR response handling
gigaset: announce if built with debugging
gigaset: handle isoc frame errors more gracefully
gigaset: linearize skb
gigaset: fix reject/hangup handling
...
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide-2.6:
Revert "Revert "ide: try to use PIO Mode 0 during probe if possible""
sis5513: fix PIO setup for ATAPI devices
x86, timers: Check for pending timers after (device) interrupts
Now that range timers and deferred timers are common, I found a
problem with these using the "perf timechart" tool. Frans Pop also
reported high scheduler latencies via LatencyTop, when using
iwlagn.
It turns out that on x86, these two 'opportunistic' timers only get
checked when another "real" timer happens. These opportunistic
timers have the objective to save power by hitchhiking on other
wakeups, as to avoid CPU wakeups by themselves as much as possible.
The change in this patch runs this check not only at timer
interrupts, but at all (device) interrupts. The effect is that:
1) the deferred timers/range timers get delayed less
2) the range timers cause fewer wakeups by themselves because
the percentage of hitchhiking on existing wakeup events goes up.
I've verified the working of the patch using "perf timechart"; the
originally exposed bug is gone with this patch. Frans also reported
success - the latencies are now down in the expected ~10 msec
range.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Tested-by: Frans Pop <elendil@planet.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091008064041.67219b13@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
David Miller [Mon, 21 Sep 2009 19:22:34 +0000 (12:22 -0700)]
mm, perf_event: Make vmalloc_user() align base kernel virtual address to SHMLBA
When a vmalloc'd area is mmap'd into userspace, some kind of
co-ordination is necessary for this to work on platforms with cpu
D-caches which can have aliases.
Otherwise kernel side writes won't be seen properly in userspace
and vice versa.
If the kernel side mapping and the user side one have the same
alignment, modulo SHMLBA, this can work as long as VM_SHARED is
set on the VMA, and for all current users this is true. VM_SHARED
will force SHMLBA alignment of the user side mmap on platforms where
D-cache aliasing matters.
The bulk of this patch is just making it so that a specific
alignment can be passed down into __get_vm_area_node(). All
existing callers pass in '1' which preserves existing behavior.
vmalloc_user() gives SHMLBA for the alignment.
As a side effect this should get the video media drivers and other
vmalloc_user() users into more working shape on such systems.
Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <200909211922.n8LJMYjw029425@imap1.linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
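The alignment condition in the vmalloc_user() commit above can be sketched as follows. This is a minimal illustration, assuming SHMLBA is a power of two; `EXAMPLE_SHMLBA`, `align_up()` and `same_cache_colour()` are hypothetical names, and the real SHMLBA value is architecture-specific.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed example value only; the real SHMLBA depends on the architecture. */
#define EXAMPLE_SHMLBA 0x4000UL

/* Round addr up to the next multiple of align (align must be a power of two). */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/*
 * Two mappings of the same memory land on the same D-cache lines when
 * their virtual addresses are equal modulo SHMLBA.
 */
static int same_cache_colour(uintptr_t kaddr, uintptr_t uaddr, uintptr_t shmlba)
{
	assert(shmlba != 0 && (shmlba & (shmlba - 1)) == 0);
	return ((kaddr ^ uaddr) & (shmlba - 1)) == 0;
}
```

If the kernel base address is aligned with `align_up(base, EXAMPLE_SHMLBA)` and the user mapping is also SHMLBA-aligned, `same_cache_colour()` holds for the two base addresses and therefore for every matching offset within the area.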
Linus Torvalds [Thu, 8 Oct 2009 14:40:19 +0000 (07:40 -0700)]
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kyle/parisc-2.6
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kyle/parisc-2.6:
agp: parisc-agp.c - use correct page_mask function
parisc: Fix linker script breakage.
parisc: convert to asm-generic/hardirq.h
parisc: Make THREAD_SIZE available to assembly files and linker scripts.
parisc: correct use of SHF_ALLOC
parisc: rename parisc's vmalloc_start to parisc_vmalloc_start
parisc: add me to Maintainers
parisc: includecheck fix: signal.c
parisc: HAVE_ARCH_TRACEHOOK
parisc: add skeleton syscall.h
parisc: stop using task->ptrace for {single,block}step flags
parisc: split syscall_trace into two halves
parisc: add missing TI_TASK macro in syscall.S
parisc: tracehook_signal_handler
parisc: tracehook_report_syscall
David Vrabel [Wed, 7 Oct 2009 23:32:33 +0000 (16:32 -0700)]
mmc: sdio: don't require CISTPL_VERS_1 to contain 4 strings
The PC Card 8.0 specification (vol. 4, section 3.2.10) says the
TPLLV1_INFO field of the CISTPL_VERS_1 tuple must contain 4 strings. Some
cards don't have all 4 so just parse as many as we can.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: David Vrabel <david.vrabel@csr.com> Tested-by: Jonathan Cameron <jic23@cam.ac.uk> Tested-by: Bing Zhao <bzhao@marvell.com> Cc: Roel Kluin <roel.kluin@gmail.com> Cc: <linux-mmc@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
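The "parse as many as we can" approach from the CISTPL_VERS_1 commit above might look roughly like this. It is a sketch, not the driver's code: `parse_info_strings()` and `MAX_INFO_STRINGS` are hypothetical names, and the assumption is that the info field is a run of NUL-terminated strings optionally closed by a 0xff end marker.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INFO_STRINGS 4

/*
 * Collect up to MAX_INFO_STRINGS NUL-terminated strings from buf.
 * Parsing stops at a 0xff end marker or at the end of the buffer.
 * Returns the number of complete strings found; fewer than four is
 * accepted rather than treated as an error.
 */
static size_t parse_info_strings(const unsigned char *buf, size_t size,
				 const char *out[MAX_INFO_STRINGS])
{
	size_t count = 0;
	size_t start = 0;

	assert(buf != NULL && out != NULL);
	for (size_t i = 0; i < size && count < MAX_INFO_STRINGS; i++) {
		if (buf[i] == 0xff)
			break;
		if (buf[i] == 0) {
			out[count++] = (const char *)&buf[start];
			start = i + 1;
		}
	}
	return count;
}
```

A trailing fragment without a terminating NUL is dropped in this sketch, so every returned pointer refers to a properly terminated string.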
Paul Menage [Wed, 7 Oct 2009 23:32:26 +0000 (16:32 -0700)]
cgroups: update documentation of cgroups tasks and procs files
Update documentation of cgroups tasks and procs files
Document the cgroup.procs file.
Clarify the semantics of the cgroup.procs and tasks files. Although the
current cgroup.procs interface returns a sorted and uniqified list of
pids, potential future performance enhancements could result in those
properties being removed - explicitly document this aspect of the API.
There are no existing users of cgroup.procs, so compatibility isn't an
issue. There are users of the "tasks" file, but none that would appear to
break in the event of the sorted property being broken. The standard
"libcpuset" explicitly sorts the results of reading from the tasks file,
and "libcg" and other users don't appear to care about ordering.
Signed-off-by: Paul Menage <menage@google.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
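Since the cgroups commit above documents that sorting and uniqueness of the pid list are not guaranteed, a consumer that needs those properties should establish them itself, as libcpuset is said to do for the tasks file. A minimal sketch, with `sort_unique_pids()` as a hypothetical helper operating on pids that have already been read into an array:

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>

/* qsort() comparator for pid_t values, written to avoid overflow. */
static int cmp_pid(const void *a, const void *b)
{
	pid_t x = *(const pid_t *)a;
	pid_t y = *(const pid_t *)b;

	return (x > y) - (x < y);
}

/* Sort pids in place and drop duplicates; returns the new length. */
static size_t sort_unique_pids(pid_t *pids, size_t n)
{
	size_t last = 0;

	assert(pids != NULL || n == 0);
	if (n == 0)
		return 0;
	qsort(pids, n, sizeof(*pids), cmp_pid);
	for (size_t i = 1; i < n; i++)
		if (pids[i] != pids[last])
			pids[++last] = pids[i];
	return last + 1;
}
```

Doing this on the reader side keeps the caller correct whether or not the kernel happens to return a sorted, duplicate-free list.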
Hugh Dickins [Wed, 7 Oct 2009 23:32:22 +0000 (16:32 -0700)]
ksm: more on default values
Adjust the max_kernel_pages default to a quarter of totalram_pages,
instead of nr_free_buffer_pages() / 4: the KSM pages themselves come from
highmem, and even on a 16GB PAE machine, 4GB of KSM pages would only be
pinning 32MB of lowmem with their rmap_items, so no need for the more
obscure calculation (nor for its own special init function).
There is no way for the user to switch KSM on if CONFIG_SYSFS is not
enabled, so in that case default run to KSM_RUN_MERGE.
Update KSM Documentation and Kconfig to reflect the new defaults.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
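The figures in the KSM commit above can be checked with a short calculation. The page size (4 KiB) and the rmap_item size (32 bytes) used here are assumptions chosen for illustration; they are not stated in the commit message, though they do reproduce its 32MB figure. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for illustration: 4 KiB pages and a 32-byte rmap_item. */
#define EXAMPLE_PAGE_SIZE      4096ULL
#define EXAMPLE_RMAP_ITEM_SIZE 32ULL

/* Lowmem pinned by the rmap_items that track ksm_bytes worth of KSM pages. */
static uint64_t rmap_lowmem_bytes(uint64_t ksm_bytes)
{
	assert(ksm_bytes % EXAMPLE_PAGE_SIZE == 0);
	return (ksm_bytes / EXAMPLE_PAGE_SIZE) * EXAMPLE_RMAP_ITEM_SIZE;
}

/* New default described in the commit: a quarter of totalram_pages. */
static uint64_t default_max_kernel_pages(uint64_t totalram_pages)
{
	return totalram_pages / 4;
}
```

Under these assumptions, 4GB of KSM pages is 1048576 pages, and 1048576 * 32 bytes is 32MB of rmap_items, matching the commit message. A 16GB machine has 4194304 pages, so the default limit is 1048576 pages, i.e. 4GB.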
Robert Hancock [Thu, 8 Oct 2009 02:19:21 +0000 (20:19 -0600)]
ALSA: ice1724: increase SPDIF and independent stereo buffer sizes
Increase the default and maximum PCM buffer preallocation size for ice1724's
SPDIF and independent stereo pair outputs to 256K, which is the hardware's
maximum supported size. This allows a reduction in interrupt rate and
potentially power usage when an application is not latency-critical.
Signed-off-by: Robert Hancock <hancockrwd@gmail.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Pavel Hofman [Tue, 6 Oct 2009 14:04:11 +0000 (16:04 +0200)]
ALSA: ICE1712/24 - Change the Multi Track Peak control (level meters) from MIXER to PCM type
* PLEASE NOTE - this change requires the corresponding update of
envy24control for ice1712 - kind of an ABI change.
* The "Multi Track Peak" control is a read-only level-meter indicator.
* The control is VERY confusing to most users since it is currently displayed
in regular mixers. E.g. alsamixer ignores its read-only status
and allows changing the levels with keys, which makes no sense.
Signed-off-by: Pavel Hofman <pavel.hofman@ivitera.com> Acked-by: Jaroslav Kysela <perex@perex.cz> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Steven Rostedt [Thu, 8 Oct 2009 01:53:41 +0000 (21:53 -0400)]
tracing: use local buffer variable for trace branch tracer
Just using the tr->buffer for the API to trace_buffer_lock_reserve
is not good enough. This is because the tr->buffer may change, and we
do not want to commit with a different buffer than the one we reserved from.
This patch uses a local variable to hold the buffer that is used for
both the reserve and the commit.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
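The pattern described in the branch tracer commit above is to read the buffer pointer once and use that snapshot for both operations. The following is a userspace sketch with stand-in types, not the tracing code: the structures and the `buffer_reserve()`, `buffer_commit()` and `record_event()` helpers are hypothetical, and the `swap_to` argument merely simulates tr->buffer changing between reserve and commit.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types; the counters only record which buffer was touched. */
struct ring_buffer { int reserved; int committed; };
struct trace_array { struct ring_buffer *buffer; };

static void buffer_reserve(struct ring_buffer *b) { b->reserved++; }
static void buffer_commit(struct ring_buffer *b)  { b->committed++; }

/*
 * Record one event. If swap_to is non-NULL, tr->buffer is switched to it
 * between reserve and commit to simulate a concurrent change.
 */
static void record_event(struct trace_array *tr, struct ring_buffer *swap_to)
{
	struct ring_buffer *buffer = tr->buffer;  /* snapshot taken once */

	assert(buffer != NULL);
	buffer_reserve(buffer);
	if (swap_to)
		tr->buffer = swap_to;
	buffer_commit(buffer);  /* same buffer as the reserve, not tr->buffer */
}
```

Had the commit step used `tr->buffer` directly, the simulated swap would leave one buffer with a reservation that is never committed and the other with a commit that was never reserved.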
Jerome Glisse [Tue, 6 Oct 2009 17:04:30 +0000 (19:04 +0200)]
drm/radeon/kms: Fix AGP support for R600/RV770 family (v2)
For AGP to work, unmapped access must cover both VRAM and AGP, as
AGP is treated like VRAM by the GPU (i.e. as a physical address).
This patch properly sets up the virtual memory system aperture
to cover AGP if AGP is enabled. It seems that there is memory
corruption after resume when using AGP (RV770 seems unaffected,
though). Version 2 just fixes a merge issue with the updated AGP
fallback patch.
Signed-off-by: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Jerome Glisse [Tue, 6 Oct 2009 17:04:29 +0000 (19:04 +0200)]
drm/radeon/kms: Fallback to non AGP when acceleration fails to initialize (v2)
When GPU acceleration is not working with AGP, try to fall back to a
non-AGP GART (either PCI or PCIE GART). This should make KMS failure on
AGP less painful. We still need to find out what is wrong when AGP
fails, but at least users have a much better chance of getting a working
configuration with acceleration. This patch also cleans up the R600/RV770
fallback path so that it uses the same code as the other ASICs. Version 2
factors out the AGP-disabling logic to avoid code duplication and bugs.
Signed-off-by: Jerome Glisse <jglisse@redhat.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Steven Rostedt [Wed, 7 Oct 2009 20:57:56 +0000 (16:57 -0400)]
ftrace: check for failure for all conversions
Due to legacy code from back when the dynamic tracer used a daemon,
only core kernel code was checking for failures. This is no longer
the case. We must check for failures any time we perform text modifications.
Cc: stable@kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
tracing: correct module boundaries for ftrace_release
When the module is about to unload, we release its call records.
The ftrace_release function was given wrong values representing
the module core boundaries, and thus did not release its call records.
Also make the ftrace_release function module-specific.
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <1254934835-363-3-git-send-email-jolsa@redhat.com> Cc: stable@kernel.org Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Darren Hart [Wed, 7 Oct 2009 18:46:54 +0000 (11:46 -0700)]
futex: fix requeue_pi key imbalance
If futex_wait_requeue_pi() wakes prior to requeue, we drop the
reference to the source futex_key twice, once in
handle_early_requeue_pi_wakeup() and once on our way out.
Remove the drop from the handle_early_requeue_pi_wakeup() and keep
the get/drops together in futex_wait_requeue_pi().
Reported-by: Helge Bahmann <hcb@chaoticmind.net> Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Helge Bahmann <hcb@chaoticmind.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> Cc: stable-2.6.31 <stable@kernel.org>
LKML-Reference: <4ACCE21E.5030805@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
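The imbalance fixed by the futex commit above is a double drop of one reference. A simplified sketch of the corrected shape, where the get and the drop live in the same function and the early-wakeup helper no longer drops anything; `struct key`, `get_key()`, `put_key()`, `handle_early_wakeup()` and `wait_requeue()` are hypothetical stand-ins, not the futex code itself.

```c
#include <assert.h>

/* Stand-in for a reference-counted futex key. */
struct key { int refs; };

static void get_key(struct key *k) { k->refs++; }

static void put_key(struct key *k)
{
	assert(k->refs > 0);  /* a double drop would trip this */
	k->refs--;
}

/* Early-wakeup handling: reports the condition but leaves the reference alone. */
static int handle_early_wakeup(struct key *k)
{
	(void)k;
	return -1;
}

/* The get and the matching drop are kept together in one function. */
static int wait_requeue(struct key *src, int woke_early)
{
	int ret = 0;

	get_key(src);
	if (woke_early)
		ret = handle_early_wakeup(src);
	put_key(src);  /* the only drop, on the way out */
	return ret;
}
```

Keeping the pair in one place makes it easy to see that the reference count returns to its starting value on both the normal and the early-wakeup path.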
When injecting DRAM ECC errors (F3xBC_x8), EccVector[15:0] is a bitmask
of the bits into which errors should be injected when written, and holds
the payload of the 16-bit DRAM word when read.
Add /sysfs members to show the DRAM ECC section/word/vector.
Fail wrong injection values entered over /sysfs instead of truncating
them.
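Rejecting out-of-range injection values instead of truncating them can be sketched as a range check ahead of the store. This is an illustration only: `store_inject_vector()` is a hypothetical helper, and the 16-bit limit is taken from the EccVector[15:0] field width mentioned above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Validate a user-supplied injection vector. Values wider than 16 bits
 * are rejected with -EINVAL rather than silently masked down.
 */
static int store_inject_vector(unsigned long value, unsigned int *vector)
{
	assert(vector != NULL);
	if (value > 0xFFFFUL)
		return -EINVAL;
	*vector = (unsigned int)value;
	return 0;
}
```

With truncation, writing 0x1FFFF would quietly become 0xFFFF; with the check, the write fails and the previously stored value is left untouched.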
On Fam10h and above, F1x[1, 0][7C:40] are DRAM Base/Limit registers
which specify the destination node of a DRAM address. Those address
boundaries are being extracted into ->dram_base[] and ->dram_limit[].
Correct the extraction masks to match the respective address bits.
Different processor families support a different number of chip selects.
Handle this in a family-dependent way with the proper values assigned at
init time (see amd64_set_dct_base_and_mask).
Remove the _DCSM_COUNT defines since they're used in only one place and
originate from public documentation.
CC: Keith Mannthey <kmannth@us.ibm.com> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
pvt->dram_IntlvEn stores the 3 "Interleave Enable" bits already
right-shifted by 8, so the check in find_mc_by_sys_addr(), which shifts
the values left by 8 bits, is wrong.
amd64_edac: fix DRAM base and limit address extraction
K8 DRAM base and limit addresses from F1x40 + 8*i and F1x44 + 8*i, where
i in (0..7), are both bits 39-24, and therefore the shifting should be
done by 24 and not by 8.
Takashi Iwai [Wed, 7 Oct 2009 13:12:27 +0000 (15:12 +0200)]
ALSA: hda - Fix yet another auto-mic bug in ALC268
Since patch_alc268() doesn't call set_capture_mixer() (due to its h/w
design different from other siblings), it needs to call fixup_automic_adc()
explicitly to set up the auto-mic routing. Otherwise the indices for
int/ext mics aren't set properly.
The root cause of the reported system hangs was the (now fixed) sis5513 bug
and not the "ide: try to use PIO Mode 0 during probe if possible" change
(commit 6029336426a2b43e4bc6f4a84be8789a047d139e), so the revert was
incorrect (it simply replaced one regression with another).
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Clear prefetch setting before potentially (re-)enabling it in
config_drive_art_rwp() so the transition of the device type on
the port from ATA to ATAPI (i.e. during warm-plug operation)
is handled correctly.
This is a really old bug (it probably goes back to very early
days of the driver) but it was only affecting warm-plug operation
until the recent "ide: try to use PIO Mode 0 during probe if
possible" change (commit 6029336426a2b43e4bc6f4a84be8789a047d139e).
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Tested-by: David Fries <david@fries.net> Signed-off-by: David S. Miller <davem@davemloft.net>