Tejun Heo [Tue, 21 Aug 2012 20:18:24 +0000 (13:18 -0700)]
workqueue: deprecate __cancel_delayed_work()
Now that cancel_delayed_work() can be safely called from IRQ handlers,
there's no reason to use __cancel_delayed_work(). Use
cancel_delayed_work() instead of __cancel_delayed_work() and mark the
latter deprecated.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <axboe@kernel.dk> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Roland Dreier <roland@kernel.org> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Tejun Heo [Tue, 21 Aug 2012 20:18:24 +0000 (13:18 -0700)]
workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
cancel_delayed_work() can't be called from IRQ handlers due to its use
of del_timer_sync() and can't cancel work items which are already
transferred from timer to worklist.
Also, unlike other flush and cancel functions, a canceled delayed_work
would still point to the last associated cpu_workqueue. If the
workqueue is destroyed afterwards and the work item is re-used on a
different workqueue, the queueing code can oops trying to dereference
already freed cpu_workqueue.
This patch reimplements cancel_delayed_work() using
try_to_grab_pending() and set_work_cpu_and_clear_pending(). This
allows the function to be called from IRQ handlers and makes its
behavior consistent with other flush / cancel functions.
Tejun Heo [Tue, 21 Aug 2012 20:18:24 +0000 (13:18 -0700)]
workqueue: use mod_delayed_work() instead of __cancel + queue
Now that mod_delayed_work() is safe to call from IRQ handlers,
__cancel_delayed_work() followed by queue_delayed_work() can be
replaced with mod_delayed_work().
Most conversions are straight-forward except for the following.
* net/core/link_watch.c: linkwatch_schedule_work() was doing a quite
elaborate dancing around its delayed_work. Collapse it such that
linkwatch_work is queued for immediate execution if LW_URGENT and
existing timer is kept otherwise.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Tejun Heo [Tue, 21 Aug 2012 20:18:24 +0000 (13:18 -0700)]
workqueue: use irqsafe timer for delayed_work
Up to now, for delayed_works, try_to_grab_pending() couldn't be used
from IRQ handlers because IRQs may happen while
delayed_work_timer_fn() is in progress leading to indefinite -EAGAIN.
This patch makes delayed_work use the new TIMER_IRQSAFE flag for
delayed_work->timer. This makes try_to_grab_pending() and thus
mod_delayed_work_on() safe to call from IRQ handlers.
Tejun Heo [Tue, 21 Aug 2012 20:18:23 +0000 (13:18 -0700)]
workqueue: clean up delayed_work initializers and add missing one
Reimplement delayed_work initializers using new timer initializers
which take timer flags. This reduces code duplications and will ease
further initializer changes. This patch also adds a missing
initializer - INIT_DEFERRABLE_WORK_ONSTACK().
Tejun Heo [Wed, 8 Aug 2012 18:10:28 +0000 (11:10 -0700)]
timer: Implement TIMER_IRQSAFE
Timer internals are protected with irq-safe locks but timer execution
isn't, so a timer being dequeued for execution and its execution
aren't atomic against IRQs. This makes it impossible to wait for its
completion from IRQ handlers and difficult to shoot down a timer from
IRQ handlers.
This issue caused some issues for delayed_work interface. Because
there's no way to reliably shoot down delayed_work->timer from IRQ
handlers, __cancel_delayed_work() can't share the logic to steal the
target delayed_work with cancel_delayed_work_sync(), and can only
steal delayed_works which are on queued on timer. Similarly, the
pending mod_delayed_work() can't be used from IRQ handlers.
This patch adds a new timer flag TIMER_IRQSAFE, which makes the timer
to be executed without enabling IRQ after dequeueing such that its
dequeueing and execution are atomic against IRQ handlers.
This makes it safe to wait for the timer's completion from IRQ
handlers, for example, using del_timer_sync(). It can never be
executing on the local CPU and if executing on other CPUs it won't be
interrupted until done.
This will enable simplifying delayed_work cancel/mod interface.
Tejun Heo [Wed, 8 Aug 2012 18:10:26 +0000 (11:10 -0700)]
timer: Relocate declarations of init_timer_on_stack_key()
init_timer_on_stack_key() is used by init macro definitions. Move
init_timer_on_stack_key() and destroy_timer_on_stack() declarations
above init macro defs. This will make the next init cleanup patch
easier to read.
Tejun Heo [Mon, 20 Aug 2012 21:51:24 +0000 (14:51 -0700)]
workqueue: deprecate system_nrt[_freezable]_wq
system_nrt[_freezable]_wq are now spurious. Mark them deprecated and
convert all users to system[_freezable]_wq.
If you're cc'd and wondering what's going on: Now all workqueues are
non-reentrant, so there's no reason to use system_nrt[_freezable]_wq.
Please use system[_freezable]_wq instead.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-By: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: David Airlie <airlied@linux.ie> Cc: Jiri Kosina <jkosina@suse.cz> Cc: "David S. Miller" <davem@davemloft.net> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com>
Tejun Heo [Mon, 20 Aug 2012 21:51:24 +0000 (14:51 -0700)]
workqueue: deprecate flush[_delayed]_work_sync()
flush[_delayed]_work_sync() are now spurious. Mark them deprecated
and convert all users to flush[_delayed]_work().
If you're cc'd and wondering what's going on: Now all workqueues are
non-reentrant and the regular flushes guarantee that the work item is
not pending or running on any CPU on return, so there's no reason to
use the sync flushes at all and they're going away.
This patch doesn't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mattia Dongili <malattia@linux.it> Cc: Kent Yoder <key@linux.vnet.ibm.com> Cc: David Airlie <airlied@linux.ie> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: Bryan Wu <bryan.wu@canonical.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: Florian Tobias Schandinat <FlorianSchandinat@gmx.de> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: linux-wireless@vger.kernel.org Cc: Anton Vorontsov <cbou@mail.ru> Cc: Sangbeom Kim <sbkim73@samsung.com> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Petr Vandrovec <petr@vandrovec.name> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Avi Kivity <avi@redhat.com>
Tejun Heo [Mon, 20 Aug 2012 21:51:23 +0000 (14:51 -0700)]
workqueue: gut system_nrt[_freezable]_wq()
Now that all workqueues are non-reentrant, system[_freezable]_wq() are
equivalent to system_nrt[_freezable]_wq(). Replace the latter with
wrappers around system[_freezable]_wq(). The wrapping goes through
inline functions so that __deprecated can be added easily.
Tejun Heo [Mon, 20 Aug 2012 21:51:23 +0000 (14:51 -0700)]
workqueue: gut flush[_delayed]_work_sync()
Now that all workqueues are non-reentrant, flush[_delayed]_work_sync()
are equivalent to flush[_delayed]_work(). Drop the separate
implementation and make them thin wrappers around
flush[_delayed]_work().
* start_flush_work() no longer takes @wait_executing as the only left
user - flush_work() - always sets it to %true.
* __cancel_work_timer() uses flush_work() instead of wait_on_work().
Tejun Heo [Mon, 20 Aug 2012 21:51:23 +0000 (14:51 -0700)]
workqueue: make all workqueues non-reentrant
By default, each per-cpu part of a bound workqueue operates separately
and a work item may be executing concurrently on different CPUs. The
behavior avoids some cross-cpu traffic but leads to subtle weirdities
and not-so-subtle contortions in the API.
* There's no sane usefulness in allowing a single work item to be
executed concurrently on multiple CPUs. People just get the
behavior unintentionally and get surprised after learning about it.
Most either explicitly synchronize or use non-reentrant/ordered
workqueue but this is error-prone.
* flush_work() can't wait for multiple instances of the same work item
on different CPUs. If a work item is executing on cpu0 and then
queued on cpu1, flush_work() can only wait for the one on cpu1.
Unfortunately, work items can easily cross CPU boundaries
unintentionally when the queueing thread gets migrated. This means
that if multiple queuers compete, flush_work() can't even guarantee
that the instance queued right before it is finished before
returning.
* flush_work_sync() was added to work around some of the deficiencies
of flush_work(). In addition to the usual flushing, it ensures that
all currently executing instances are finished before returning.
This operation is expensive as it has to walk all CPUs and at the
same time fails to address competing queuer case.
Incorrectly using flush_work() when flush_work_sync() is necessary
is an easy error to make and can lead to bugs which are difficult to
reproduce.
* Similar problems exist for flush_delayed_work[_sync]().
Other than the cross-cpu access concern, there's no benefit in
allowing parallel execution and it's plain silly to have this level of
contortion for workqueue which is widely used from core code to
extremely obscure drivers.
This patch makes all workqueues non-reentrant. If a work item is
executing on a different CPU when queueing is requested, it is always
queued to that CPU. This guarantees that any given work item can be
executing on one CPU at maximum and if a work item is queued and
executing, both are on the same CPU.
The only behavior change which may affect workqueue users negatively
is that non-reentrancy overrides the affinity specified by
queue_work_on(). On a reentrant workqueue, the affinity specified by
queue_work_on() is always followed. Now, if the work item is
executing on one of the CPUs, the work item will be queued there
regardless of the requested affinity. I've reviewed all workqueue
users which request explicit affinity, and, fortunately, none seems to
be crazy enough to exploit parallel execution of the same work item.
This adds an additional busy_hash lookup if the work item was
previously queued on a different CPU. This shouldn't be noticeable
under any sane workload. Work item queueing isn't a very
high-frequency operation and they don't jump across CPUs all the time.
In a micro benchmark to exaggerate this difference - measuring the
time it takes for two work items to repeatedly jump between two CPUs a
number (10M) of times with busy_hash table densely populated, the
difference was around 3%.
While the overhead is measureable, it is only visible in pathological
cases and the difference isn't huge. This change brings much needed
sanity to workqueue and makes its behavior consistent with timer. I
think this is the right tradeoff to make.
This enables significant simplification of workqueue API.
Simplification patches will follow.
Joonsoo Kim [Wed, 15 Aug 2012 14:25:41 +0000 (23:25 +0900)]
workqueue: use system_highpri_wq for unbind_work
To speed cpu down processing up, use system_highpri_wq.
As scheduling priority of workers on it is higher than system_wq and
it is not contended by other normal works on this cpu, work on it
is processed faster than system_wq.
tj: CPU up/downs care quite a bit about latency these days. This
shouldn't hurt anything and makes sense.
Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Joonsoo Kim [Wed, 15 Aug 2012 14:25:40 +0000 (23:25 +0900)]
workqueue: use system_highpri_wq for highpri workers in rebind_workers()
In rebind_workers(), we do inserting a work to rebind to cpu for busy workers.
Currently, in this case, we use only system_wq. This makes a possible
error situation as there is mismatch between cwq->pool and worker->pool.
To prevent this, we should use system_highpri_wq for highpri worker
to match theses. This implements it.
tj: Rephrased comment a bit.
Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Joonsoo Kim [Wed, 15 Aug 2012 14:25:39 +0000 (23:25 +0900)]
workqueue: introduce system_highpri_wq
Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduce separate worker pool
for HIGHPRI. When we handle busyworkers for gcwq, it can be normal worker
or highpri worker. But, we don't consider this difference in rebind_workers(),
we use just system_wq for highpri worker. It makes mismatch between
cwq->pool and worker->pool.
It doesn't make error in current implementation, but possible in the future.
Now, we introduce system_highpri_wq to use proper cwq for highpri workers
in rebind_workers(). Following patch fix this issue properly.
tj: Even apart from rebinding, having system_highpri_wq generally
makes sense.
Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Joonsoo Kim [Wed, 15 Aug 2012 14:25:38 +0000 (23:25 +0900)]
workqueue: change value of lcpu in __queue_delayed_work_on()
We assign cpu id into work struct's data field in __queue_delayed_work_on().
In current implementation, when work is come in first time,
current running cpu id is assigned.
If we do __queue_delayed_work_on() with CPU A on CPU B,
__queue_work() invoked in delayed_work_timer_fn() go into
the following sub-optimal path in case of WQ_NON_REENTRANT.
Joonsoo Kim [Wed, 15 Aug 2012 14:25:37 +0000 (23:25 +0900)]
workqueue: correct req_cpu in trace_workqueue_queue_work()
When we do tracing workqueue_queue_work(), it records requested cpu.
But, if !(@wq->flag & WQ_UNBOUND) and @cpu is WORK_CPU_UNBOUND,
requested cpu is changed as local cpu.
In case of @wq->flag & WQ_UNBOUND, above change is not occured,
therefore it is reasonable to correct it.
Use temporary local variable for storing requested cpu.
Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Joonsoo Kim [Wed, 15 Aug 2012 14:25:36 +0000 (23:25 +0900)]
workqueue: use enum value to set array size of pools in gcwq
Commit 3270476a6c0ce322354df8679652f060d66526dc ('workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool') introduce separate worker_pool
for HIGHPRI. Although there is NR_WORKER_POOLS enum value which represent
size of pools, definition of worker_pool in gcwq doesn't use it.
Using it makes code robust and prevent future mistakes.
So change code to use this enum value.
Signed-off-by: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Tue, 14 Aug 2012 00:08:19 +0000 (17:08 -0700)]
workqueue: add missing wmb() in clear_work_data()
Any operation which clears PENDING should be preceded by a wmb to
guarantee that the next PENDING owner sees all the changes made before
PENDING release.
There are only two places where PENDING is cleared -
set_work_cpu_and_clear_pending() and clear_work_data(). The caller of
the former already does smp_wmb() but the latter doesn't have any.
Move the wmb above set_work_cpu_and_clear_pending() into it and add
one to clear_work_data().
There hasn't been any report related to this issue, and, given how
clear_work_data() is used, it is extremely unlikely to have caused any
actual problems on any architecture.
Tejun Heo [Wed, 8 Aug 2012 16:38:42 +0000 (09:38 -0700)]
workqueue: fix CPU binding of flush_delayed_work[_sync]()
delayed_work encodes the workqueue to use and the last CPU in
delayed_work->work.data while it's on timer. The target CPU is
implicitly recorded as the CPU the timer is queued on and
delayed_work_timer_fn() queues delayed_work->work to the CPU it is
running on.
Unfortunately, this leaves flush_delayed_work[_sync]() no way to find
out which CPU the delayed_work was queued for when they try to
re-queue after killing the timer. Currently, it chooses the local CPU
flush is running on. This can unexpectedly move a delayed_work queued
on a specific CPU to another CPU and lead to subtle errors.
There isn't much point in trying to save several bytes in struct
delayed_work, which is already close to a hundred bytes on 64bit with
all debug options turned off. This patch adds delayed_work->cpu to
remember the CPU it's queued for.
Note that if the timer is migrated during CPU down, the work item
could be queued to the downed global_cwq after this change. As a
detached global_cwq behaves like an unbound one, this doesn't change
much for the delayed_work.
Tejun Heo [Fri, 3 Aug 2012 17:30:47 +0000 (10:30 -0700)]
workqueue: use mod_delayed_work() instead of cancel + queue
Convert delayed_work users doing cancel_delayed_work() followed by
queue_delayed_work() to mod_delayed_work().
Most conversions are straight-forward. Ones worth mentioning are,
* drivers/edac/edac_mc.c: edac_mc_workq_setup() converted to always
use mod_delayed_work() and cancel loop in
edac_mc_reset_delay_period() is dropped.
* drivers/platform/x86/thinkpad_acpi.c: No need to remember whether
watchdog is active or not. @fan_watchdog_active and related code
dropped.
* drivers/power/charger-manager.c: Seemingly a lot of
delayed_work_pending() abuse going on here.
[delayed_]work_pending() are unsynchronized and racy when used like
this. I converted one instance in fullbatt_handler(). Please
conver the rest so that it invokes workqueue APIs for the intended
target state rather than trying to game work item pending state
transitions. e.g. if timer should be modified - call
mod_delayed_work(), canceled - call cancel_delayed_work[_sync]().
* drivers/thermal/thermal_sys.c: thermal_zone_device_set_polling()
simplified. Note that round_jiffies() calls in this function are
meaningless. round_jiffies() work on absolute jiffies not delta
delay used by delayed_work.
v2: Tomi pointed out that __cancel_delayed_work() users can't be
safely converted to mod_delayed_work(). They could be calling it
from irq context and if that happens while delayed_work_timer_fn()
is running, it could deadlock. __cancel_delayed_work() users are
dropped.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Acked-by: Anton Vorontsov <cbouatmailru@gmail.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Tomi Valkeinen <tomi.valkeinen@ti.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Doug Thompson <dougthompson@xmission.com> Cc: David Airlie <airlied@linux.ie> Cc: Roland Dreier <roland@kernel.org> Cc: "John W. Linville" <linville@tuxdriver.com> Cc: Zhang Rui <rui.zhang@intel.com> Cc: Len Brown <len.brown@intel.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Johannes Berg <johannes@sipsolutions.net>
Tejun Heo [Fri, 3 Aug 2012 17:30:47 +0000 (10:30 -0700)]
workqueue: implement mod_delayed_work[_on]()
Workqueue was lacking a mechanism to modify the timeout of an already
pending delayed_work. delayed_work users have been working around
this using several methods - using an explicit timer + work item,
messing directly with delayed_work->timer, and canceling before
re-queueing, all of which are error-prone and/or ugly.
This patch implements mod_delayed_work[_on]() which behaves similarly
to mod_timer() - if the delayed_work is idle, it's queued with the
given delay; otherwise, its timeout is modified to the new value.
Zero @delay guarantees immediate execution.
v2: Updated to reflect try_to_grab_pending() changes. Now safe to be
called from bh context.
Tejun Heo [Fri, 3 Aug 2012 17:30:46 +0000 (10:30 -0700)]
workqueue: mark a work item being canceled as such
There can be two reasons try_to_grab_pending() can fail with -EAGAIN.
One is when someone else is queueing or deqeueing the work item. With
the previous patches, it is guaranteed that PENDING and queued state
will soon agree making it safe to busy-retry in this case.
The other is if multiple __cancel_work_timer() invocations are racing
one another. __cancel_work_timer() grabs PENDING and then waits for
running instances of the target work item on all CPUs while holding
PENDING and !queued. try_to_grab_pending() invoked from another task
will keep returning -EAGAIN while the current owner is waiting.
Not distinguishing the two cases is okay because __cancel_work_timer()
is the only user of try_to_grab_pending() and it invokes
wait_on_work() whenever grabbing fails. For the first case, busy
looping should be fine but wait_on_work() doesn't cause any critical
problem. For the latter case, the new contender usually waits for the
same condition as the current owner, so no unnecessarily extended
busy-looping happens. Combined, these make __cancel_work_timer()
technically correct even without irq protection while grabbing PENDING
or distinguishing the two different cases.
While the current code is technically correct, not distinguishing the
two cases makes it difficult to use try_to_grab_pending() for other
purposes than canceling because it's impossible to tell whether it's
safe to busy-retry grabbing.
This patch adds a mechanism to mark a work item being canceled.
try_to_grab_pending() now disables irq on success and returns -EAGAIN
to indicate that grabbing failed but PENDING and queued states are
gonna agree soon and it's safe to busy-loop. It returns -ENOENT if
the work item is being canceled and it may stay PENDING && !queued for
arbitrary amount of time.
__cancel_work_timer() is modified to mark the work canceling with
WORK_OFFQ_CANCELING after grabbing PENDING, thus making
try_to_grab_pending() fail with -ENOENT instead of -EAGAIN. Also, it
invokes wait_on_work() iff grabbing failed with -ENOENT. This isn't
necessary for correctness but makes it consistent with other future
users of try_to_grab_pending().
v2: try_to_grab_pending() was testing preempt_count() to ensure that
the caller has disabled preemption. This triggers spuriously if
!CONFIG_PREEMPT_COUNT. Use preemptible() instead. Reported by
Fengguang Wu.
v3: Updated so that try_to_grab_pending() disables irq on success
rather than requiring preemption disabled by the caller. This
makes busy-looping easier and will allow try_to_grap_pending() to
be used from bh/irq contexts.
Tejun Heo [Fri, 3 Aug 2012 17:30:46 +0000 (10:30 -0700)]
workqueue: introduce WORK_OFFQ_FLAG_*
Low WORK_STRUCT_FLAG_BITS bits of work_struct->data contain
WORK_STRUCT_FLAG_* and flush color. If the work item is queued, the
rest point to the cpu_workqueue with WORK_STRUCT_CWQ set; otherwise,
WORK_STRUCT_CWQ is clear and the bits contain the last CPU number -
either a real CPU number or one of WORK_CPU_*.
Scheduled addition of mod_delayed_work[_on]() requires an additional
flag, which is used only while a work item is off queue. There are
more than enough bits to represent off-queue CPU number on both 32 and
64bits. This patch introduces WORK_OFFQ_FLAG_* which occupy the lower
part of the @work->data high bits while off queue. This patch doesn't
define any actual OFFQ flag yet.
Off-queue CPU number is now shifted by WORK_OFFQ_CPU_SHIFT, which adds
the number of bits used by OFFQ flags to WORK_STRUCT_FLAG_SHIFT, to
make room for OFFQ flags.
To avoid shift width warning with large WORK_OFFQ_FLAG_BITS, ulong
cast is added to WORK_STRUCT_NO_CPU and, just in case, BUILD_BUG_ON()
to check that there are enough bits to accomodate off-queue CPU number
is added.
This patch doesn't make any functional difference.
Tejun Heo [Fri, 3 Aug 2012 17:30:46 +0000 (10:30 -0700)]
workqueue: move try_to_grab_pending() upwards
try_to_grab_pending() will be used by to-be-implemented
mod_delayed_work[_on](). Move try_to_grab_pending() and related
functions above queueing functions.
Tejun Heo [Fri, 3 Aug 2012 17:30:46 +0000 (10:30 -0700)]
workqueue: fix zero @delay handling of queue_delayed_work_on()
If @delay is zero and the dealyed_work is idle, queue_delayed_work()
queues it for immediate execution; however, queue_delayed_work_on()
lacks this logic and always goes through timer regardless of @delay.
This patch moves 0 @delay handling logic from queue_delayed_work() to
queue_delayed_work_on() so that both functions behave the same.
* queue_delayed_work() calls queue_delayed_work_on() with -1 @cpu
which is interpreted as the local CPU.
* flush_delayed_work[_sync]() were using raw_smp_processor_id().
* __queue_work() interprets %WORK_CPU_UNBOUND as local CPU if the
target workqueue is bound one but nobody uses this.
This patch converts all functions to uniformly use %WORK_CPU_UNBOUND
to indicate local CPU and use the local binding feature of
__queue_work(). unlikely() is dropped from %WORK_CPU_UNBOUND handling
in __queue_work().
Tejun Heo [Fri, 3 Aug 2012 17:30:45 +0000 (10:30 -0700)]
workqueue: set delayed_work->timer function on initialization
delayed_work->timer.function is currently initialized during
queue_delayed_work_on(). Export delayed_work_timer_fn() and set
delayed_work timer function during delayed_work initialization
together with other fields.
This ensures the timer function is always valid on an initialized
delayed_work. This is to help mod_delayed_work() implementation.
To detect delayed_work users which diddle with the internal timer,
trigger WARN if timer function doesn't match on queue.
Tejun Heo [Fri, 3 Aug 2012 17:30:45 +0000 (10:30 -0700)]
workqueue: disable irq while manipulating PENDING
Queueing operations use WORK_STRUCT_PENDING_BIT to synchronize access
to the target work item. They first try to claim the bit and proceed
with queueing only after that succeeds and there's a window between
PENDING being set and the actual queueing where the task can be
interrupted or preempted.
There's also a similar window in process_one_work() when clearing
PENDING. A work item is dequeued, gcwq->lock is released and then
PENDING is cleared and the worker might get interrupted or preempted
between releasing gcwq->lock and clearing PENDING.
cancel[_delayed]_work_sync() tries to claim or steal PENDING. The
function assumes that a work item with PENDING is either queued or in
the process of being [de]queued. In the latter case, it busy-loops
until either the work item loses PENDING or is queued. If canceling
coincides with the above described interrupts or preemptions, the
canceling task will busy-loop while the queueing or executing task is
preempted.
This patch keeps irq disabled across claiming PENDING and actual
queueing and moves PENDING clearing in process_one_work() inside
gcwq->lock so that busy looping from PENDING && !queued doesn't wait
for interrupted/preempted tasks. Note that, in process_one_work(),
setting last CPU and clearing PENDING got merged into single
operation.
This removes possible long busy-loops and will allow using
try_to_grab_pending() from bh and irq contexts.
v2: __queue_work() was testing preempt_count() to ensure that the
caller has disabled preemption. This triggers spuriously if
!CONFIG_PREEMPT_COUNT. Use preemptible() instead. Reported by
Fengguang Wu.
v3: Disable irq instead of preemption. IRQ will be disabled while
grabbing gcwq->lock later anyway and this allows using
try_to_grab_pending() from bh and irq contexts.
Tejun Heo [Fri, 3 Aug 2012 17:30:45 +0000 (10:30 -0700)]
workqueue: add missing smp_wmb() in process_one_work()
WORK_STRUCT_PENDING is used to claim ownership of a work item and
process_one_work() releases it before starting execution. When
someone else grabs PENDING, all pre-release updates to the work item
should be visible and all updates made by the new owner should happen
afterwards.
Grabbing PENDING uses test_and_set_bit() and thus has a full barrier;
however, clearing doesn't have a matching wmb. Given the preceding
spin_unlock and use of clear_bit, I don't believe this can be a
problem on an actual machine and there hasn't been any related report
but it still is theretically possible for clear_pending to permeate
upwards and happen before work->entry update.
Add an explicit smp_wmb() before work_clear_pending().
Tejun Heo [Fri, 3 Aug 2012 17:30:44 +0000 (10:30 -0700)]
workqueue: make queueing functions return bool
All queueing functions return 1 on success, 0 if the work item was
already pending. Update them to return bool instead. This signifies
better that they don't return 0 / -errno.
This is cleanup and doesn't cause any functional difference.
While at it, fix comment opening for schedule_work_on().
Tejun Heo [Fri, 3 Aug 2012 17:30:44 +0000 (10:30 -0700)]
workqueue: reorder queueing functions so that _on() variants are on top
Currently, queue/schedule[_delayed]_work_on() are located below the
counterpart without the _on postifx even though the latter is usually
implemented using the former. Swap them.
This is cleanup and doesn't cause any functional difference.
Linus Torvalds [Thu, 2 Aug 2012 18:52:39 +0000 (11:52 -0700)]
Merge branch 'for-linus-3.6' of git://dev.laptop.org/users/dilinger/linux-olpc
Pull OLPC platform updates from Andres Salomon:
"These move the OLPC Embedded Controller driver out of
arch/x86/platform and into drivers/platform/olpc.
OLPC machines are now ARM-based (which means lots of x86 and ARM
changes), but are typically pretty self-contained.. so it makes more
sense to go through a separate OLPC tree after getting the appropriate
review/ACKs."
* 'for-linus-3.6' of git://dev.laptop.org/users/dilinger/linux-olpc:
x86: OLPC: move s/r-related EC cmds to EC driver
Platform: OLPC: move global variables into priv struct
Platform: OLPC: move debugfs support from x86 EC driver
x86: OLPC: switch over to using new EC driver on x86
Platform: OLPC: add a suspended flag to the EC driver
Platform: OLPC: turn EC driver into a platform_driver
Platform: OLPC: allow EC cmd to be overridden, and create a workqueue to call it
drivers: OLPC: update various drivers to include olpc-ec.h
Platform: OLPC: add a stub to drivers/platform/ for the OLPC EC driver
Linus Torvalds [Thu, 2 Aug 2012 18:50:24 +0000 (11:50 -0700)]
Merge tag 'dt2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull arm-soc Marvell Orion device-tree updates from Olof Johansson:
"This contains a set of device-tree conversions for Marvell Orion
platforms that were staged early but took a few tries to get the
branch into a format where it was suitable for us to pick up.
Given that most people working on these platforms are hobbyists with
limited time, we were a bit more flexible with merging it even though
it came in late."
* tag 'dt2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (21 commits)
ARM: Kirkwood: Replace mrvl with marvell
ARM: Kirkwood: Describe GoFlex Net LEDs and SATA in DT.
ARM: Kirkwood: Describe Dreamplug LEDs in DT.
ARM: Kirkwood: Describe iConnects LEDs in DT.
ARM: Kirkwood: Describe iConnects temperature sensor in DT.
ARM: Kirkwood: Describe IB62x0 LEDs in DT.
ARM: Kirkwood: Describe IB62x0 gpio-keys in DT.
ARM: Kirkwood: Describe DNS32? gpio-keys in DT.
ARM: Kirkwood: Move common portions into a kirkwood-dnskw.dtsi
ARM: Kirkwood: Replace DNS-320/DNS-325 leds with dt bindings
ARM: Kirkwood: Describe DNS325 temperature sensor in DT.
ARM: Kirkwood: Use DT to configure SATA device.
ARM: kirkwood: use devicetree for SPI on dreamplug
ARM: kirkwood: Add LS-XHL and LS-CHLv2 support
ARM: Kirkwood: Initial DTS support for Kirkwood GoFlex Net
ARM: Kirkwood: Add basic device tree support for QNAP TS219.
ATA: sata_mv: Add device tree support
ARM: Orion: DTify the watchdog timer.
ARM: Orion: Add arch support needed for I2C via DT.
ARM: kirkwood: use devicetree for orion-spi
...
Linus Torvalds [Thu, 2 Aug 2012 18:48:54 +0000 (11:48 -0700)]
Merge tag 'pm2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull arm-soc cpuidle enablement for OMAP from Olof Johansson:
"Coupled cpuidle was meant to merge for 3.5 through Len Brown's tree,
but didn't go in because the pull request ended up rejected. So it
just got merged, and we got this staged branch that enables the
coupled cpuidle code on OMAP.
With a stable git workflow from the other maintainer we could have
staged this earlier, but that wasn't the case so we have had to merge
it late.
The alternative is to hold it off until 3.7 but given that the code is
well-isolated to OMAP and they are eager to see it go in, I didn't
push back hard in that direction."
* tag 'pm2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: OMAP4: CPUidle: Open broadcast clock-event device.
ARM: OMAP4: CPUidle: add synchronization for coupled idle states
ARM: OMAP4: CPUidle: Use coupled cpuidle states to implement SMP cpuidle.
ARM: OMAP: timer: allow gp timer clock-event to be used on both cpus
Linus Torvalds [Thu, 2 Aug 2012 18:48:20 +0000 (11:48 -0700)]
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"A few fixes for merge window fallout, and a bugfix for timer resume on
PRIMA2."
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: mmp: add missing irqs.h
arm: mvebu: fix typo in .dtsi comment for Armada XP SoCs
ARM: PRIMA2: delete redundant codes to restore LATCHED when timer resumes
ARM: mxc: Include missing irqs.h header
Linus Torvalds [Thu, 2 Aug 2012 18:45:42 +0000 (11:45 -0700)]
Merge tag 'sh-for-linus' of git://github.com/pmundt/linux-sh
Pull SuperH fixes from Paul Mundt.
* tag 'sh-for-linus' of git://github.com/pmundt/linux-sh: (24 commits)
sh: explicitly include sh_dma.h in setup-sh7722.c
sh: ecovec: care CN5 VBUS if USB host mode
sh: sh7724: fixup renesas_usbhs clock settings
sh: intc: initial irqdomain support.
sh: pfc: Fix up init ordering mess.
serial: sh-sci: fix compilation breakage, when DMA is enabled
dmaengine: shdma: restore partial transfer calculation
sh: modify the sh_dmae_slave_config for RSPI in setup-sh7757
sh: Fix up recursive fault in oops with unset TTB.
sh: pfc: Build fix for pinctrl_remove_gpio_range() changes.
sh: select the fixed regulator driver on several boards
sh: ecovec: switch MMC power control to regulators
sh: add fixed voltage regulators to se7724
sh: add fixed voltage regulators to sdk7786
sh: add fixed voltage regulators to rsk
sh: add fixed voltage regulators to migor
sh: add fixed voltage regulators to kfr2r09
sh: add fixed voltage regulators to ap325rxa
sh: add fixed voltage regulators to sh7757lcr
sh: add fixed voltage regulators to sh2007
...
Linus Torvalds [Thu, 2 Aug 2012 18:34:40 +0000 (11:34 -0700)]
Merge tag 'md-3.6' of git://neil.brown.name/md
Pull additional md update from NeilBrown:
"This contains a few patches that depend on plugging changes in the
block layer so needed to wait for those.
It also contains a Kconfig fix for the new RAID10 support in dm-raid."
* tag 'md-3.6' of git://neil.brown.name/md:
md/dm-raid: DM_RAID should select MD_RAID10
md/raid1: submit IO from originating thread instead of md thread.
raid5: raid5d handle stripe in batch way
raid5: make_request use batch stripe release
Linus Torvalds [Thu, 2 Aug 2012 17:57:31 +0000 (10:57 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull two ceph fixes from Sage Weil:
"The first patch fixes up the old crufty open intent code to use the
atomic_open stuff properly, and the second fixes a possible null deref
and memory leak with the crypto keys."
Linus Torvalds [Thu, 2 Aug 2012 17:56:34 +0000 (10:56 -0700)]
Merge tag 'ecryptfs-3.6-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
Pull ecryptfs fixes from Tyler Hicks:
- Fixes a bug when the lower filesystem mount options include 'acl',
but the eCryptfs mount options do not
- Cleanups in the messaging code
- Better handling of empty files in the lower filesystem to improve
usability. Failed file creations are now cleaned up and empty lower
files are converted into eCryptfs during open().
- The write-through cache changes are being reverted due to bugs that
are not easy to fix. Stability outweighs the performance
enhancements here.
- Improvement to the mount code to catch unsupported ciphers specified
in the mount options
* tag 'ecryptfs-3.6-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
eCryptfs: check for eCryptfs cipher support at mount
eCryptfs: Revert to a writethrough cache model
eCryptfs: Initialize empty lower files when opening them
eCryptfs: Unlink lower inode when ecryptfs_create() fails
eCryptfs: Make all miscdev functions use daemon ptr in file private_data
eCryptfs: Remove unused messaging declarations and function
eCryptfs: Copy up POSIX ACL and read-only flags from lower mount
Linus Torvalds [Thu, 2 Aug 2012 17:54:11 +0000 (10:54 -0700)]
Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS update from Steve French:
"Adds SMB2 rmdir/mkdir capability to the SMB2/SMB2.1 support in cifs.
I am holding up a few more days on merging the remainder of the
SMB2/SMB2.1 enablement although it is nearing review completion, in
order to address some review comments from Jeff Layton on a few of the
subsequent SMB2 patches, and also to debug an unrelated cifs problem
that Pavel discovered."
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Add SMB2 support for rmdir
CIFS: Move rmdir code to ops struct
CIFS: Add SMB2 support for mkdir operation
CIFS: Separate protocol specific part from mkdir
CIFS: Simplify cifs_mkdir call
Linus Torvalds [Thu, 2 Aug 2012 17:37:03 +0000 (10:37 -0700)]
mm: remove node_start_pfn checking in new WARN_ON for now
Borislav Petkov reports that the new warning added in commit 88fdf75d1bb5 ("mm: warn if pg_data_t isn't initialized with zero")
triggers for him, and it is the node_start_pfn field that has already
been initialized once.
and (with the warning replaced by debug output), Borislav sees
On node 0 totalpages: 4193848
DMA zone: 64 pages used for memmap
DMA zone: 6 pages reserved
DMA zone: 3890 pages, LIFO batch:0
DMA32 zone: 16320 pages used for memmap
DMA32 zone: 798464 pages, LIFO batch:31
Normal zone: 52736 pages used for memmap
Normal zone: 3322368 pages, LIFO batch:31
free_area_init_node: pgdat->node_start_pfn: 4423680 <----
On node 1 totalpages: 4194304
Normal zone: 65536 pages used for memmap
Normal zone: 4128768 pages, LIFO batch:31
free_area_init_node: pgdat->node_start_pfn: 8617984 <----
On node 2 totalpages: 4194304
Normal zone: 65536 pages used for memmap
Normal zone: 4128768 pages, LIFO batch:31
free_area_init_node: pgdat->node_start_pfn: 12812288 <----
On node 3 totalpages: 4194304
Normal zone: 65536 pages used for memmap
Normal zone: 4128768 pages, LIFO batch:31
so remove the bogus warning for now to avoid annoying people. Minchan
Kim is looking at it.
Reported-by: Borislav Petkov <bp@amd64.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Barry Song [Mon, 30 Jul 2012 05:29:30 +0000 (13:29 +0800)]
ARM: PRIMA2: delete redundant codes to restore LATCHED when timer resumes
The only way to write LATCHED registers to write LATCH_BIT to LATCH register,
that will latch COUNTER into LATCHED.e.g.
writel_relaxed(SIRFSOC_TIMER_LATCH_BIT, sirfsoc_timer_base +
SIRFSOC_TIMER_LATCH);
Writing values to LATCHED registers directly is useless at all.
Signed-off-by: Barry Song <Baohua.Song@csr.com> Signed-off-by: Olof Johansson <olof@lixom.net>
Sage Weil [Tue, 31 Jul 2012 18:27:36 +0000 (11:27 -0700)]
ceph: simplify+fix atomic_open
The initial ->atomic_open op was carried over from the old intent code,
which was incomplete and didn't really work. Replace it with a fresh
method. In particular:
* always attempt to do an atomic open+lookup, both for the create case
and for lookups of existing files.
* fix symlink handling by returning 1 to the VFS so that we can follow
the link to its destination. This fixes a longstanding ceph bug (#2392).
Linus Torvalds [Wed, 1 Aug 2012 23:47:15 +0000 (16:47 -0700)]
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS updates from Ralf Baechle:
"The lion share of this pull request are fixes for clk-related breakage
caused by other changes during this merge window. For some platforms
the fix was as simple as selecting HAVE_CLK, for others like the
Loongson 2 significant restructuring was required.
The remainder are changes required to get the Lantiq code to work
again."
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: Loongson 2: Sort out clock managment.
MIPS: Loongson 1: more clk support and add select HAVE_CLK
MIPS: txx9: Fix redefinition of clk_* by adding select HAVE_CLK
MIPS: BCM63xx: Fix redefinition of clk_* by adding select HAVE_CLK
MIPS: AR7: Fix redefinition of clk_* by adding select HAVE_CLK
MIPS: Lantiq: Platform specific CLK fixup
MIPS: Lantiq: Add device_tree_init function
MIPS: Lantiq: Fix interface clock and PCI control register offset
Linus Torvalds [Wed, 1 Aug 2012 23:45:02 +0000 (16:45 -0700)]
Merge branch 'for-linus-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml
Pull UML fixes from Richard Weinberger:
"This patch set contains mostly fixes and cleanups. The UML tty driver
uses now tty_port and is no longer broken like hell :-)"
* 'for-linus-3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: Add arch/x86/um to MAINTAINERS
um: pass siginfo to guest process
um: fix ubd_file_size for read-only files
um: pull interrupt_end() into userspace()
um: split syscall_trace(), pass pt_regs to it
um: switch UPT_SET_RETURN_VALUE and regs_return_value to pt_regs
um: set BLK_CGROUP=y in defconfig
um: remove count_lock
um: fully use tty_port
um: Remove dead code
um: remove line_ioctl()
TTY: um/line, use tty from tty_port
TTY: um/line, add tty_port
Linus Torvalds [Wed, 1 Aug 2012 23:41:07 +0000 (16:41 -0700)]
Merge branch 'dmaengine' of git://git.linaro.org/people/rmk/linux-arm
Pull ARM DMA engine updates from Russell King:
"This looks scary at first glance, but what it is is:
- a rework of the sa11x0 DMA engine driver merged during the previous
cycle, to extract a common set of helper functions for DMA engine
implementations.
- conversion of amba-pl08x.c to use these helper functions.
- addition of OMAP DMA engine driver (using these helper functions),
and conversion of some of the OMAP DMA users to use DMA engine.
Nothing in the helper functions is ARM specific, so I hope that other
implementations can consolidate some of their code by making use of
these helpers.
This has been sitting in linux-next most of the merge cycle, and has
been tested by several OMAP folk. I've tested it on sa11x0 platforms,
and given it my best shot on my broken platforms which have the
amba-pl08x controller.
The last point is the addition to feature-removal-schedule.txt, which
will have a merge conflict. Between myself and TI, we're planning to
remove the old TI DMA implementation next year."
Fix up trivial add/add conflicts in Documentation/feature-removal-schedule.txt
and drivers/dma/{Kconfig,Makefile}
* 'dmaengine' of git://git.linaro.org/people/rmk/linux-arm: (53 commits)
ARM: 7481/1: OMAP2+: omap2plus_defconfig: enable OMAP DMA engine
ARM: 7464/1: mmc: omap_hsmmc: ensure probe returns error if DMA channel request fails
Add feature removal of old OMAP private DMA implementation
mtd: omap2: remove private DMA API implementation
mtd: omap2: add DMA engine support
spi: omap2-mcspi: remove private DMA API implementation
spi: omap2-mcspi: add DMA engine support
ARM: omap: remove mmc platform data dma_mask and initialization
mmc: omap: remove private DMA API implementation
mmc: omap: add DMA engine support
mmc: omap_hsmmc: remove private DMA API implementation
mmc: omap_hsmmc: add DMA engine support
dmaengine: omap: add support for cyclic DMA
dmaengine: omap: add support for setting fi
dmaengine: omap: add support for returning residue in tx_state method
dmaengine: add OMAP DMA engine driver
dmaengine: sa11x0-dma: add cyclic DMA support
dmaengine: sa11x0-dma: fix DMA residue support
dmaengine: PL08x: ensure all descriptors are freed when channel is released
dmaengine: PL08x: get rid of write only pool_ctr and free_txd locking
...
Linus Torvalds [Wed, 1 Aug 2012 23:35:37 +0000 (16:35 -0700)]
Merge branch 'audit' of git://git.linaro.org/people/rmk/linux-arm
Pull ARM audit/signal updates from Russell King:
"ARM audit/signal handling updates from Al and Will. This improves on
the work Viro did last merge window, and sorts out some of the issues
found with that work."
* 'audit' of git://git.linaro.org/people/rmk/linux-arm:
ARM: 7475/1: sys_trace: allow all syscall arguments to be updated via ptrace
ARM: 7474/1: get rid of TIF_SYSCALL_RESTARTSYS
ARM: 7473/1: deal with handlerless restarts without leaving the kernel
ARM: 7472/1: pull all work_pending logics into C function
ARM: 7471/1: Revert "7442/1: Revert "remove unused restart trampoline""
ARM: 7470/1: Revert "7443/1: Revert "new way of handling ERESTART_RESTARTBLOCK""
Linus Torvalds [Wed, 1 Aug 2012 23:30:45 +0000 (16:30 -0700)]
Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-arm
Pull ARM fixes from Russell King:
"This fixes various issues found during July"
* 'fixes' of git://git.linaro.org/people/rmk/linux-arm:
ARM: 7479/1: mm: avoid NULL dereference when flushing gate_vma with VIVT caches
ARM: Fix undefined instruction exception handling
ARM: 7480/1: only call smp_send_stop() on SMP
ARM: 7478/1: errata: extend workaround for erratum #720789
ARM: 7477/1: vfp: Always save VFP state in vfp_pm_suspend on UP
ARM: 7476/1: vfp: only clear vfp state for current cpu in vfp_pm_suspend
ARM: 7468/1: ftrace: Trace function entry before updating index
ARM: 7467/1: mutex: use generic xchg-based implementation for ARMv6+
ARM: 7466/1: disable interrupt before spinning endlessly
ARM: 7465/1: Handle >4GB memory sizes in device tree and mem=size@start option
Martin Pärtel [Wed, 1 Aug 2012 22:49:17 +0000 (00:49 +0200)]
um: pass siginfo to guest process
UML guest processes now get correct siginfo_t for SIGTRAP, SIGFPE,
SIGILL and SIGBUS. Specifically, si_addr and si_code are now correct
where previously they were si_addr = NULL and si_code = 128.
Signed-off-by: Martin Pärtel <martin.partel@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at>
NeilBrown [Wed, 1 Aug 2012 22:33:20 +0000 (08:33 +1000)]
md/raid1: submit IO from originating thread instead of md thread.
queuing writes to the md thread means that all requests go through the
one processor which may not be able to keep up with very high request
rates.
So use the plugging infrastructure to submit all requests on unplug.
If a 'schedule' is needed, we fall back on the old approach of handing
the requests to the thread for it to handle.
Shaohua Li [Wed, 1 Aug 2012 22:33:00 +0000 (08:33 +1000)]
raid5: make_request use batch stripe release
make_request() does stripe release for every stripe and the stripe usually has
count 1, which makes previous release_stripe() optimization not work. In my
test, this release_stripe() becomes the heaviest pleace to take
conf->device_lock after previous patches applied.
Below patch makes stripe release batch. All the stripes will be released in
unplug. The STRIPE_ON_UNPLUG_LIST bit is to protect concurrent access stripe
lru.
Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de>
Linus Torvalds [Wed, 1 Aug 2012 17:45:12 +0000 (10:45 -0700)]
Merge tag 'fbdev-updates-for-3.6' of git://github.com/schandinat/linux-2.6
Pull fbdev updates from Florian Tobias Schandinat:
- large updates for OMAP
- support for LCD3 overlay manager (omap5)
- omapdss output cleanup
- removal of passive matrix LCD support as there are no drivers for
such panels for DSS or DSS2 and nobody complained (cleanup)
- large updates for SH Mobile
- overlay support
- separating MERAM (cache) from framebuffer driver
- some updates for Exynos and da8xx-fb
- various other small patches
* tag 'fbdev-updates-for-3.6' of git://github.com/schandinat/linux-2.6: (78 commits)
da8xx-fb: fix compile issue due to missing include
fbdev: Make pixel_to_pat() failure mode more friendly
da8xx-fb: do not turn ON LCD backlight unless LCDC is enabled
fbdev: sh_mobile_lcdc: Fix vertical panning step
video: exynos mipi dsi: Fix mipi dsi regulators handling issue
video: da8xx-fb: do clock reset of revision 2 LCDC before enabling
arm: da850: configure LCDC fifo threshold
video: da8xx-fb: configure FIFO threshold to reduce underflow errors
video: da8xx-fb: fix flicker due to 1 frame delay in updated frame
video: da8xx-fb rev2: fix disabling of palette completion interrupt
da8xx-fb: add missing FB_BLANK operations
video: exynos_dp: use usleep_range instead of delay
video: exynos_dp: check the only INTERLANE_ALIGN_DONE bit during Link Training
fb: epson1355fb: Fix section mismatch
video: exynos_dp: fix wrong DPCD address during Link Training
video/smscufx: fix line counting in fb_write
aty128fb: Fix coding style issues
fbdev: sh_mobile_lcdc: Fix pan offset computation in YUV mode
fbdev: sh_mobile_lcdc: Fix overlay registers update during pan operation
fbdev: sh_mobile_lcdc: Support horizontal panning
...
Linus Torvalds [Wed, 1 Aug 2012 17:42:26 +0000 (10:42 -0700)]
Merge tag 'sound-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of small fixes that have been found recently. Most of
the commits are regression fixes in HD-audio and some other random
drivers."
* tag 'sound-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: snd-usb: fix clock source validity index
ALSA: hda - Fix mute-LED GPIO initialization for IDT codecs
ALSA: hda - Add descriptions for missing IDT 92HD83x models
ALSA: hda - Fix polarity of mute LED on HP Mini 210
ALSA: es1688 - freeup resources on init failure
ALSA: hda - Workaround for silent output on VAIO Z with ALC889
ALSA: hda - Fix WARNING from HDMI/DP parser
ALSA: hda - Detach from converter at closing in patch_hdmi.c
ALSA: hda - Fix mute-LED GPIO setup for HP Mini 210
ALSA: mpu401: Fix missing initialization of irq field
ALSA: hda - Fix invalid D3 of headphone DAC on VT202x codecs
Linus Torvalds [Wed, 1 Aug 2012 17:26:23 +0000 (10:26 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second vfs pile from Al Viro:
"The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
deadlock reproduced by xfstests 068), symlink and hardlink restriction
patches, plus assorted cleanups and fixes.
Note that another fsfreeze deadlock (emergency thaw one) is *not*
dealt with - the series by Fernando conflicts a lot with Jan's, breaks
userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
for massive vfsmount leak; this is going to be handled next cycle.
There probably will be another pull request, but that stuff won't be
in it."
Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
delousing target_core_file a bit
Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
fs: Remove old freezing mechanism
ext2: Implement freezing
btrfs: Convert to new freezing mechanism
nilfs2: Convert to new freezing mechanism
ntfs: Convert to new freezing mechanism
fuse: Convert to new freezing mechanism
gfs2: Convert to new freezing mechanism
ocfs2: Convert to new freezing mechanism
xfs: Convert to new freezing code
ext4: Convert to new freezing mechanism
fs: Protect write paths by sb_start_write - sb_end_write
fs: Skip atime update on frozen filesystem
fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
fs: Improve filesystem freezing handling
switch the protection of percpu_counter list to spinlock
nfsd: Push mnt_want_write() outside of i_mutex
btrfs: Push mnt_want_write() outside of i_mutex
fat: Push mnt_want_write() outside of i_mutex
...
Linus Torvalds [Wed, 1 Aug 2012 16:06:47 +0000 (09:06 -0700)]
Merge branch 'for-3.6/drivers' of git://git.kernel.dk/linux-block
Pull block driver changes from Jens Axboe:
- Making the plugging support for drivers a bit more sane from Neil.
This supersedes the plugging change from Shaohua as well.
- The usual round of drbd updates.
- Using a tail add instead of a head add in the request completion for
ndb, making us find the most completed request more quickly.
- A few floppy changes, getting rid of a duplicated flag and also
running the floppy init async (since it takes forever in boot terms)
from Andi.
* 'for-3.6/drivers' of git://git.kernel.dk/linux-block:
floppy: remove duplicated flag FD_RAW_NEED_DISK
blk: pass from_schedule to non-request unplug functions.
block: stack unplug
blk: centralize non-request unplug handling.
md: remove plug_cnt feature of plugging.
block/nbd: micro-optimization in nbd request completion
drbd: announce FLUSH/FUA capability to upper layers
drbd: fix max_bio_size to be unsigned
drbd: flush drbd work queue before invalidate/invalidate remote
drbd: fix potential access after free
drbd: call local-io-error handler early
drbd: do not reset rs_pending_cnt too early
drbd: reset congestion information before reporting it in /proc/drbd
drbd: report congestion if we are waiting for some userland callback
drbd: differentiate between normal and forced detach
drbd: cleanup, remove two unused global flags
floppy: Run floppy initialization asynchronous
Linus Torvalds [Wed, 1 Aug 2012 16:02:41 +0000 (09:02 -0700)]
Merge branch 'for-3.6/core' of git://git.kernel.dk/linux-block
Pull core block IO bits from Jens Axboe:
"The most complicated part if this is the request allocation rework by
Tejun, which has been queued up for a long time and has been in
for-next ditto as well.
There are a few commits from yesterday and today, mostly trivial and
obvious fixes. So I'm pretty confident that it is sound. It's also
smaller than usual."
* 'for-3.6/core' of git://git.kernel.dk/linux-block:
block: remove dead func declaration
block: add partition resize function to blkpg ioctl
block: uninitialized ioc->nr_tasks triggers WARN_ON
block: do not artificially constrain max_sectors for stacking drivers
blkcg: implement per-blkg request allocation
block: prepare for multiple request_lists
block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv
blkcg: inline bio_blkcg() and friends
block: allocate io_context upfront
block: refactor get_request[_wait]()
block: drop custom queue draining used by scsi_transport_{iscsi|fc}
mempool: add @gfp_mask to mempool_create_node()
blkcg: make root blkcg allocation use %GFP_KERNEL
blkcg: __blkg_lookup_create() doesn't need radix preload
Linus Torvalds [Wed, 1 Aug 2012 16:02:01 +0000 (09:02 -0700)]
Merge branch 'for-next' of git://neil.brown.name/md
Pull md updates from NeilBrown.
* 'for-next' of git://neil.brown.name/md:
DM RAID: Add support for MD RAID10
md/RAID1: Add missing case for attempting to repair known bad blocks.
md/raid5: For odirect-write performance, do not set STRIPE_PREREAD_ACTIVE.
md/raid1: don't abort a resync on the first badblock.
md: remove duplicated test on ->openers when calling do_md_stop()
raid5: Add R5_ReadNoMerge flag which prevent bio from merging at block layer
md/raid1: prevent merging too large request
md/raid1: read balance chooses idlest disk for SSD
md/raid1: make sequential read detection per disk based
MD RAID10: Export md_raid10_congested
MD: Move macros from raid1*.h to raid1*.c
MD RAID1: rename mirror_info structure
MD RAID10: rename mirror_info structure
MD RAID10: Fix compiler warning.
raid5: add a per-stripe lock
raid5: remove unnecessary bitmap write optimization
raid5: lockless access raid5 overrided bi_phys_segments
raid5: reduce chance release_stripe() taking device_lock
J. Bruce Fields [Wed, 1 Aug 2012 11:56:16 +0000 (07:56 -0400)]
locks: remove unused lm_release_private
In commit 3b6e2723f32d ("locks: prevent side-effects of
locks_release_private before file_lock is initialized") we removed the
last user of lm_release_private without removing the field itself.
Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yoichi Yuasa [Wed, 1 Aug 2012 06:42:16 +0000 (15:42 +0900)]
MIPS: Loongson 1: more clk support and add select HAVE_CLK
This fixes a redefinition of clk_*:
arch/mips/loongson1/common/clock.c:23:13: error: redefinition of 'clk_get'
include/linux/clk.h:281:27: note: previous definition of 'clk_get' was here
arch/mips/loongson1/common/clock.c:41:15: error: redefinition of 'clk_get_rate'
include/linux/clk.h:302:29: note: previous definition of 'clk_get_rate' was here
make[3]: *** [arch/mips/loongson1/common/clock.o] Error 1
Yoichi Yuasa [Wed, 1 Aug 2012 06:41:06 +0000 (15:41 +0900)]
MIPS: txx9: Fix redefinition of clk_* by adding select HAVE_CLK
arch/mips/txx9/generic/setup.c:87:13: error: redefinition of 'clk_get'
include/linux/clk.h:281:27: note: previous definition of 'clk_get' was here
arch/mips/txx9/generic/setup.c:97:5: error: redefinition of 'clk_enable'
include/linux/clk.h:295:19: note: previous definition of 'clk_enable' was here
arch/mips/txx9/generic/setup.c:103:6: error: redefinition of 'clk_disable'
include/linux/clk.h:300:20: note: previous definition of 'clk_disable' was here
arch/mips/txx9/generic/setup.c:108:15: error: redefinition of 'clk_get_rate'
include/linux/clk.h:302:29: note: previous definition of 'clk_get_rate' was here
arch/mips/txx9/generic/setup.c:114:6: error: redefinition of 'clk_put'
include/linux/clk.h:291:20: note: previous definition of 'clk_put' was here
make[3]: *** [arch/mips/txx9/generic/setup.o] Error 1
Yoichi Yuasa [Wed, 1 Aug 2012 06:39:52 +0000 (15:39 +0900)]
MIPS: BCM63xx: Fix redefinition of clk_* by adding select HAVE_CLK
arch/mips/bcm63xx/clk.c:249:5: error: redefinition of 'clk_enable'
include/linux/clk.h:295:19: note: previous definition of 'clk_enable' was here
arch/mips/bcm63xx/clk.c:259:6: error: redefinition of 'clk_disable'
include/linux/clk.h:300:20: note: previous definition of 'clk_disable' was here
arch/mips/bcm63xx/clk.c:268:15: error: redefinition of 'clk_get_rate'
include/linux/clk.h:302:29: note: previous definition of 'clk_get_rate' was here
arch/mips/bcm63xx/clk.c:275:13: error: redefinition of 'clk_get'
include/linux/clk.h:281:27: note: previous definition of 'clk_get' was here
arch/mips/bcm63xx/clk.c:302:6: error: redefinition of 'clk_put'
include/linux/clk.h:291:20: note: previous definition of 'clk_put' was here
make[2]: *** [arch/mips/bcm63xx/clk.o] Error 1
Yoichi Yuasa [Wed, 1 Aug 2012 06:38:00 +0000 (15:38 +0900)]
MIPS: AR7: Fix redefinition of clk_* by adding select HAVE_CLK
arch/mips/ar7/clock.c:420:5: error: redefinition of 'clk_enable'
include/linux/clk.h:295:19: note: previous definition of 'clk_enable' was here
arch/mips/ar7/clock.c:426:6: error: redefinition of 'clk_disable'
include/linux/clk.h:300:20: note: previous definition of 'clk_disable' was here
arch/mips/ar7/clock.c:431:15: error: redefinition of 'clk_get_rate'
include/linux/clk.h:302:29: note: previous definition of 'clk_get_rate' was here
arch/mips/ar7/clock.c:437:13: error: redefinition of 'clk_get'
include/linux/clk.h:281:27: note: previous definition of 'clk_get' was here
arch/mips/ar7/clock.c:454:6: error: redefinition of 'clk_put'
include/linux/clk.h:291:20: note: previous definition of 'clk_put' was here
make[2]: *** [arch/mips/ar7/clock.o] Error 1
John Crispin [Sun, 22 Jul 2012 06:55:57 +0000 (08:55 +0200)]
MIPS: Lantiq: Fix interface clock and PCI control register offset
The XRX200 based SoC have a different register offset for the interface
clock and PCI control registers. This patch detects the SoC and sets the
register offset at runtime. This make PCI work on the VR9 SoC.
Al Viro [Wed, 1 Aug 2012 09:07:04 +0000 (13:07 +0400)]
delousing target_core_file a bit
* set_fs(KERNEL_DS) + getname() is probably the weirdest implementation
of strdup() I've seen. Especially since they don't to copy it at all...
* filp_open() never returns NULL; it's ERR_PTR(-E...) on failure.
* file->f_dentry is never going to be NULL, TYVM.
* match_strdup() + snprintf() + kfree() is a bloody weird way to spell
match_strlcpy().
Yuanhan Liu [Wed, 1 Aug 2012 10:25:54 +0000 (12:25 +0200)]
block: remove dead func declaration
__generic_unplug_device() function is removed with commit 7eaceaccab5f40bbfda044629a6298616aeaed50, which forgot to
remove the declaration at meantime. Here remove it.
Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Vivek Goyal [Wed, 1 Aug 2012 10:24:18 +0000 (12:24 +0200)]
block: add partition resize function to blkpg ioctl
Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that
allows altering the size of an existing partition, even if it is currently
in use.
This patch converts hd_struct->nr_sects into sequence counter because
One might extend a partition while IO is happening to it and update of
nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This
can lead to issues like reading inconsistent size of a partition. Sequence
counter have been used so that readers don't have to take bdev mutex lock
as we call sector_in_part() very frequently.
Now all the access to hd_struct->nr_sects should happen using sequence
counter read/update helper functions part_nr_sects_read/part_nr_sects_write.
There is one exception though, set_capacity()/get_capacity(). I think
theoritically race should exist there too but this patch does not
modify set_capacity()/get_capacity() due to sheer number of call sites
and I am afraid that change might break something. I have left that as a
TODO item. We can handle it later if need be. This patch does not introduce
any new races as such w.r.t set_capacity()/get_capacity().
v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip.
Reproducing is easy, I can hit it on a KVM system with a very basic
config (x86_64 make defconfig + enable the drivers needed). To hit it,
just install dump (on debian/ubuntu, not sure what the package might be
called on Fedora), and:
dump -o -f /tmp/foo /
You'll see the warning in dmesg once it forks off the I/O process and
starts dumping filesystem contents.
Currently ioc->nr_tasks is used to decide two things - whether an ioc
is done issuing IOs and whether it's shared by multiple tasks. This
patch separate out the first into ioc->active_ref, which is acquired
and released using {get|put}_io_context_active() respectively.
This will be used to associate bio's with a given task. This patch
doesn't introduce any visible behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
It seems like the init of ioc->nr_tasks was removed in that patch,
so it starts out at 0 instead of 1.
Tejun, is the right thing here to add back the init, or should something else
be done?
The below patch removes the warning, but I haven't done any more extensive
testing on it.
Mike Snitzer [Wed, 1 Aug 2012 08:44:28 +0000 (10:44 +0200)]
block: do not artificially constrain max_sectors for stacking drivers
blk_set_stacking_limits is intended to allow stacking drivers to build
up the limits of the stacked device based on the underlying devices'
limits. But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024)
doesn't allow the stacking driver to inherit a max_sectors larger than
1024 -- due to blk_stack_limits' use of min_not_zero.
It is now clear that this artificial limit is getting in the way so
change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows
stacking drivers like dm-multipath to inherit 'max_sectors' from the
underlying paths).
Daniel Mack [Wed, 1 Aug 2012 08:16:53 +0000 (10:16 +0200)]
ALSA: snd-usb: fix clock source validity index
uac_clock_source_is_valid() uses the control selector value to access
the bmControls bitmap of the clock source unit. This is wrong, as
control selector values start from 1, while the bitmap uses all
available bits.
In other words, "Clock Validity Control" is stored in D3..2, not D5..4
of the clock selector unit's bmControls.
Signed-off-by: Daniel Mack <zonque@gmail.com> Reported-by: Andreas Koch <andreas@akdesigninc.com> Cc: stable@kernel.org Signed-off-by: Takashi Iwai <tiwai@suse.de>