workqueue: mark init_workqueues() as early_initcall()
Mark init_workqueues() as early_initcall() and thus it will be initialized
before smp bringup. init_workqueues() registers for the hotcpu notifier
and thus it should cope with the processors that are brought online after
the workqueues are initialized.
x86 smp bringup code uses workqueues and uses a workaround for the
cold boot process (as the workqueues are initialized post smp_init()).
Marking init_workqueues() as early_initcall() will pave the way for
cleaning up this code.
Tejun Heo [Sun, 1 Aug 2010 09:50:12 +0000 (11:50 +0200)]
workqueue: explain for_each_*cwq_cpu() iterators
for_each_*cwq_cpu() are similar to regular CPU iterators except that
it also considers the pseudo CPU number used for unbound workqueues.
Explain them.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org>
Commit 8b8edefa (fscache: convert object to use workqueue instead of
slow-work) made fscache_exit() call unregister_sysctl_table()
unconditionally breaking build when sysctl is disabled. Fix it by
putting it inside CONFIG_SYSCTL.
Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: David Howells <dhowells@redhat.com>
Workqueue can now handle high concurrency. Convert gfs to use
workqueue instead of slow-work.
* Steven pointed out that recovery path might be run from allocation
path and thus requires forward progress guarantee without memory
allocation. Create and use gfs_recovery_wq with rescuer. Please
note that forward progress wasn't guaranteed with slow-work.
* Updated to use non-reentrant workqueue.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Workqueue can now handle high concurrency. Convert drm_crtc_helper to
use system_nrt_wq instead of slow-work. The conversion is mostly
straight forward. One difference is that drm_helper_hpd_irq_event()
no longer blocks and can be called from any context.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Airlie <airlied@linux.ie> Cc: dri-devel@lists.freedesktop.org
Workqueue can now handle high concurrency. Use system_nrt_wq
instead of slow-work.
* Updated is_valid_oplock_break() to not call cifs_oplock_break_put()
as advised by Steve French. It might cause deadlock. Instead,
reference is increased after queueing succeeded and
cifs_oplock_break() briefly grabs GlobalSMBSeslock before putting
the cfile to make sure it doesn't put before the matching get is
finished.
* Anton Blanchard reported that cifs conversion was using now gone
system_single_wq. Use system_nrt_wq which provides non-reentrance
guarantee which is enough and much better.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Steve French <sfrench@samba.org> Cc: Anton Blanchard <anton@samba.org>
fscache: convert operation to use workqueue instead of slow-work
Make fscache operation to use only workqueue instead of combination of
workqueue and slow-work. FSCACHE_OP_SLOW is dropped and
FSCACHE_OP_FAST is renamed to FSCACHE_OP_ASYNC and uses newly added
fscache_op_wq workqueue to execute op->processor().
fscache_operation_init_slow() is dropped and fscache_operation_init()
now takes @processor argument directly.
* Unbound workqueue is used.
* fscache_retrieval_work() is no longer necessary as OP_ASYNC now does
the equivalent thing.
* sysctl fscache.operation_max_active added to control concurrency.
The default value is nr_cpus clamped between 2 and
WQ_UNBOUND_MAX_ACTIVE.
* debugfs support is dropped for now. Tracing API based debug
facility is planned to be added.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Howells <dhowells@redhat.com>
fscache: convert object to use workqueue instead of slow-work
Make fscache object state transition callbacks use workqueue instead
of slow-work. New dedicated unbound CPU workqueue fscache_object_wq
is created. get/put callbacks are renamed and modified to take
@object and called directly from the enqueue wrapper and the work
function. While at it, make all open coded instances of get/put to
use fscache_get/put_object().
* Unbound workqueue is used.
* work_busy() output is printed instead of slow-work flags in object
debugging outputs. They mean basically the same thing bit-for-bit.
* sysctl fscache.object_max_active added to control concurrency. The
default value is nr_cpus clamped between 4 and
WQ_UNBOUND_MAX_ACTIVE.
* slow_work_sleep_till_thread_needed() is replaced with fscache
private implementation fscache_object_sleep_till_congested() which
waits on fscache_object_wq congestion.
* debugfs support is dropped for now. Tracing API based debug
facility is planned to be added.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Howells <dhowells@redhat.com>
workqueue: fix how cpu number is stored in work->data
Once a work starts execution, its data contains the cpu number it was
on instead of pointing to cwq. This is added by commit 7a22ad75
(workqueue: carry cpu number in work data once execution starts) to
reliably determine the work was last on even if the workqueue itself
was destroyed inbetween.
Whether data points to a cwq or contains a cpu number was
distinguished by comparing the value against PAGE_OFFSET. The
assumption was that a cpu number should be below PAGE_OFFSET while a
pointer to cwq should be above it. However, on architectures which
use separate address spaces for user and kernel spaces, this doesn't
hold as PAGE_OFFSET is zero.
Fix it by using an explicit flag, WORK_STRUCT_CWQ, to mark what the
data field contains. If the flag is set, it's pointing to a cwq;
otherwise, it contains a cpu number.
Reported on s390 and microblaze during linux-next testing.
Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Sachin Sant <sachinp@in.ibm.com> Reported-by: Michal Simek <michal.simek@petalogix.com> Reported-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Tested-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Tested-by: Michal Simek <monstr@monstr.eu>
All cpumasks are assumed to have cpu 0 permanently set on UP, so it
can't be used to signify whether there's something to be done for the
CPU. workqueue was using cpumask to track which CPU requested rescuer
assistance and this led rescuer thread to think there always are
pending mayday requests on UP, which resulted in infinite busy loops.
This patch fixes the problem by introducing mayday_mask_t and
associated helpers which wrap cpumask on SMP and emulates its behavior
using bitops and unsigned long on UP.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Rusty Russell <rusty@rustcorp.com.au>
Commit f3421797 (workqueue: implement unbound workqueue) incorrectly
tested CONFIG_SMP as part of a C expression in alloc/free_cwqs(). As
CONFIG_SMP is not defined in UP, this breaks build. Fix it by using
Found during linux-next build test.
Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
workqueue: remove WQ_SINGLE_CPU and use WQ_UNBOUND instead
WQ_SINGLE_CPU combined with @max_active of 1 is used to achieve full
ordering among works queued to a workqueue. The same can be achieved
using WQ_UNBOUND as unbound workqueues always use the gcwq for
WORK_CPU_UNBOUND. As @max_active is always one and benefits from cpu
locality isn't accessible anyway, serving them with unbound workqueues
should be fine.
Drop WQ_SINGLE_CPU support and use WQ_UNBOUND instead. Note that most
single thread workqueue users will be converted to use multithread or
non-reentrant instead and only the ones which require strict ordering
will keep using WQ_UNBOUND + @max_active of 1.
This patch implements unbound workqueue which can be specified with
WQ_UNBOUND flag on creation. An unbound workqueue has the following
properties.
* It uses a dedicated gcwq with a pseudo CPU number WORK_CPU_UNBOUND.
This gcwq is always online and disassociated.
* Workers are not bound to any CPU and not concurrency managed. Works
are dispatched to workers as soon as possible and the only applied
limitation is @max_active. IOW, all unbound workqeueues are
implicitly high priority.
Unbound workqueues can be used as simple execution context provider.
Contexts unbound to any cpu are served as soon as possible.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: David Howells <dhowells@redhat.com>
In preparation of WQ_UNBOUND addition, make the following changes.
* Add WORK_CPU_* constants for pseudo cpu id numbers used (currently
only WORK_CPU_NONE) and use them instead of NR_CPUS. This is to
allow another pseudo cpu id for unbound cpu.
* Reorder WQ_* flags.
* Make workqueue_struct->cpu_wq a union which contains a percpu
pointer, regular pointer and an unsigned long value and use
kzalloc/kfree() in UP allocation path. This will be used to
implement unbound workqueues which will use only one cwq on SMPs.
* Move alloc_cwqs() allocation after initialization of wq fields, so
that alloc_cwqs() has access to wq->flags.
* Trivial relocation of wq local variables in freeze functions.
libata: take advantage of cmwq and remove concurrency limitations
libata has two concurrency related limitations.
a. ata_wq which is used for polling PIO has single thread per CPU. If
there are multiple devices doing polling PIO on the same CPU, they
can't be executed simultaneously.
b. ata_aux_wq which is used for SCSI probing has single thread. In
cases where SCSI probing is stalled for extended period of time
which is possible for ATAPI devices, this will stall all probing.
#a is solved by increasing maximum concurrency of ata_wq. Please note
that polling PIO might be used under allocation path and thus needs to
be served by a separate wq with a rescuer.
#b is solved by using the default wq instead and achieving exclusion
via per-port mutex.
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jeff Garzik <jgarzik@pobox.com>
workqueue: fix worker management invocation without pending works
When there's no pending work to do, worker_thread() goes back to sleep
after waking up without checking whether worker management is
necessary. This means that idle worker exit requests can be ignored
if the gcwq stays empty.
Fix it by making worker_thread() always check whether worker
management is necessary before going to sleep.
workqueue: fix race condition in flush_workqueue()
When one flusher is cascading to the next flusher, it first sets
wq->first_flusher to the next one and sets up the next flush cycle.
If there's nothing to do for the next cycle, it clears
wq->flush_flusher and proceeds to the one after that.
If the woken up flusher checks wq->first_flusher before it gets
cleared, it will incorrectly assume the role of the first flusher,
which triggers BUG_ON() sanity check.
Fix it by checking wq->first_flusher again after grabbing the mutex.
workqueue: use worker_set/clr_flags() only from worker itself
worker_set/clr_flags() assume that if none of NOT_RUNNING flags is set
the worker must be contributing to nr_running which is only true if
the worker is actually running.
As when called from self, it is guaranteed that the worker is running,
those functions can be safely used from the worker itself and they
aren't necessary from other places anyway. Make the following changes
to fix the bug.
* Make worker_set/clr_flags() whine if not called from self.
* Convert all places which called those functions from other tasks to
manipulate flags directly.
* Make trustee_thread() directly clear nr_running after setting
WORKER_ROGUE on all workers. This is the only place where
nr_running manipulation is necessary outside of workers themselves.
* While at it, add sanity check for nr_running in worker_enter_idle().
Tejun Heo [Tue, 29 Jun 2010 08:07:15 +0000 (10:07 +0200)]
workqueue: implement cpu intensive workqueue
This patch implements cpu intensive workqueue which can be specified
with WQ_CPU_INTENSIVE flag on creation. Works queued to a cpu
intensive workqueue don't participate in concurrency management. IOW,
it doesn't contribute to gcwq->nr_running and thus doesn't delay
excution of other works.
Note that although cpu intensive works won't delay other works, they
can be delayed by other works. Combine with WQ_HIGHPRI to avoid being
delayed by other works too.
As the name suggests this is useful when using workqueue for cpu
intensive works. Workers executing cpu intensive works are not
considered for workqueue concurrency management and left for the
scheduler to manage.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org>
Tejun Heo [Tue, 29 Jun 2010 08:07:14 +0000 (10:07 +0200)]
workqueue: implement high priority workqueue
This patch implements high priority workqueue which can be specified
with WQ_HIGHPRI flag on creation. A high priority workqueue has the
following properties.
* A work queued to it is queued at the head of the worklist of the
respective gcwq after other highpri works, while normal works are
always appended at the end.
* As long as there are highpri works on gcwq->worklist,
[__]need_more_worker() remains %true and process_one_work() wakes up
another worker before it start executing a work.
The above two properties guarantee that works queued to high priority
workqueues are dispatched to workers and start execution as soon as
possible regardless of the state of other works.
Tejun Heo [Tue, 29 Jun 2010 08:07:14 +0000 (10:07 +0200)]
workqueue: implement several utility APIs
Implement the following utility APIs.
workqueue_set_max_active() : adjust max_active of a wq
workqueue_congested() : test whether a wq is contested
work_cpu() : determine the last / current cpu of a work
work_busy() : query whether a work is busy
* Anton Blanchard fixed missing ret initialization in work_busy().
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Anton Blanchard <anton@samba.org>
Tejun Heo [Tue, 29 Jun 2010 08:07:14 +0000 (10:07 +0200)]
workqueue: s/__create_workqueue()/alloc_workqueue()/, and add system workqueues
This patch makes changes to make new workqueue features available to
its users.
* Now that workqueue is more featureful, there should be a public
workqueue creation function which takes paramters to control them.
Rename __create_workqueue() to alloc_workqueue() and make 0
max_active mean WQ_DFL_ACTIVE. In the long run, all
create_workqueue_*() will be converted over to alloc_workqueue().
* To further unify access interface, rename keventd_wq to system_wq
and export it.
* Add system_long_wq and system_nrt_wq. The former is to host long
running works separately (so that flush_scheduled_work() dosen't
take so long) and the latter guarantees any queued work item is
never executed in parallel by multiple CPUs. These will be used by
future patches to update workqueue users.
Tejun Heo [Tue, 29 Jun 2010 08:07:14 +0000 (10:07 +0200)]
workqueue: increase max_active of keventd and kill current_is_keventd()
Define WQ_MAX_ACTIVE and create keventd with max_active set to half of
it which means that keventd now can process upto WQ_MAX_ACTIVE / 2 - 1
works concurrently. Unless some combination can result in dependency
loop longer than max_active, deadlock won't happen and thus it's
unnecessary to check whether current_is_keventd() before trying to
schedule a work. Kill current_is_keventd().
(Lockdep annotations are broken. We need lock_map_acquire_read_norecurse())
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com>
Tejun Heo [Tue, 29 Jun 2010 08:07:14 +0000 (10:07 +0200)]
workqueue: implement concurrency managed dynamic worker pool
Instead of creating a worker for each cwq and putting it into the
shared pool, manage per-cpu workers dynamically.
Works aren't supposed to be cpu cycle hogs and maintaining just enough
concurrency to prevent work processing from stalling due to lack of
processing context is optimal. gcwq keeps the number of concurrent
active workers to minimum but no less. As long as there's one or more
running workers on the cpu, no new worker is scheduled so that works
can be processed in batch as much as possible but when the last
running worker blocks, gcwq immediately schedules new worker so that
the cpu doesn't sit idle while there are works to be processed.
gcwq always keeps at least single idle worker around. When a new
worker is necessary and the worker is the last idle one, the worker
assumes the role of "manager" and manages the worker pool -
ie. creates another worker. Forward-progress is guaranteed by having
dedicated rescue workers for workqueues which may be necessary while
creating a new worker. When the manager is having problem creating a
new worker, mayday timer activates and rescue workers are summoned to
the cpu and execute works which might be necessary to create new
workers.
Trustee is expanded to serve the role of manager while a CPU is being
taken down and stays down. As no new works are supposed to be queued
on a dead cpu, it just needs to drain all the existing ones. Trustee
continues to try to create new workers and summon rescuers as long as
there are pending works. If the CPU is brought back up while the
trustee is still trying to drain the gcwq from the previous offlining,
the trustee will kill all idles ones and tell workers which are still
busy to rebind to the cpu, and pass control over to gcwq which assumes
the manager role as necessary.
Concurrency managed worker pool reduces the number of workers
drastically. Only workers which are necessary to keep the processing
going are created and kept. Also, it reduces cache footprint by
avoiding unnecessarily switching contexts between different workers.
Please note that this patch does not increase max_active of any
workqueue. All workqueues can still only process one work per cpu.
Tejun Heo [Tue, 29 Jun 2010 08:07:13 +0000 (10:07 +0200)]
workqueue: implement worker_{set|clr}_flags()
Implement worker_{set|clr}_flags() to manipulate worker flags. These
are currently simple wrappers but logics to track the current worker
state and the current level of concurrency will be added.
Tejun Heo [Tue, 29 Jun 2010 08:07:13 +0000 (10:07 +0200)]
workqueue: use shared worklist and pool all workers per cpu
Use gcwq->worklist instead of cwq->worklist and break the strict
association between a cwq and its worker. All works queued on a cpu
are queued on gcwq->worklist and processed by any available worker on
the gcwq.
As there no longer is strict association between a cwq and its worker,
whether a work is executing can now only be determined by calling
[__]find_worker_executing_work().
After this change, the only association between a cwq and its worker
is that a cwq puts a worker into shared worker pool on creation and
kills it on destruction. As all workqueues are still limited to
max_active of one, this means that there are always at least as many
workers as active works and thus there's no danger for deadlock.
The break of strong association between cwqs and workers requires
somewhat clumsy changes to current_is_keventd() and
destroy_workqueue(). Dynamic worker pool management will remove both
clumsy changes. current_is_keventd() won't be necessary at all as the
only reason it exists is to avoid queueing a work from a work which
will be allowed just fine. The clumsy part of destroy_workqueue() is
added because a worker can only be destroyed while idle and there's no
guarantee a worker is idle when its wq is going down. With dynamic
pool management, workers are not associated with workqueues at all and
only idle ones will be submitted to destroy_workqueue() so the code
won't be necessary anymore.
Tejun Heo [Tue, 29 Jun 2010 08:07:13 +0000 (10:07 +0200)]
workqueue: implement WQ_NON_REENTRANT
With gcwq managing all the workers and work->data pointing to the last
gcwq it was on, non-reentrance can be easily implemented by checking
whether the work is still running on the previous gcwq on queueing.
Implement it.
Tejun Heo [Tue, 29 Jun 2010 08:07:13 +0000 (10:07 +0200)]
workqueue: carry cpu number in work data once execution starts
To implement non-reentrant workqueue, the last gcwq a work was
executed on must be reliably obtainable as long as the work structure
is valid even if the previous workqueue has been destroyed.
To achieve this, work->data will be overloaded to carry the last cpu
number once execution starts so that the previous gcwq can be located
reliably. This means that cwq can't be obtained from work after
execution starts but only gcwq.
Implement set_work_{cwq|cpu}(), get_work_[g]cwq() and
clear_work_data() to set work data to the cpu number when starting
execution, access the overloaded work data and clear it after
cancellation.
queue_delayed_work_on() is updated to preserve the last cpu while
in-flight in timer and other callers which depended on getting cwq
from work after execution starts are converted to depend on gcwq
instead.
* Anton Blanchard fixed compile error on powerpc due to missing
linux/threads.h include.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Anton Blanchard <anton@samba.org>
Tejun Heo [Tue, 29 Jun 2010 08:07:13 +0000 (10:07 +0200)]
workqueue: add find_worker_executing_work() and track current_cwq
Now that all the workers are tracked by gcwq, we can find which worker
is executing a work from gcwq. Implement find_worker_executing_work()
and make worker track its current_cwq so that we can find things the
other way around. This will be used to implement non-reentrant wqs.
Tejun Heo [Tue, 29 Jun 2010 08:07:13 +0000 (10:07 +0200)]
workqueue: make single thread workqueue shared worker pool friendly
Reimplement st (single thread) workqueue so that it's friendly to
shared worker pool. It was originally implemented by confining st
workqueues to use cwq of a fixed cpu and always having a worker for
the cpu. This implementation isn't very friendly to shared worker
pool and suboptimal in that it ends up crossing cpu boundaries often.
Reimplement st workqueue using dynamic single cpu binding and
cwq->limit. WQ_SINGLE_THREAD is replaced with WQ_SINGLE_CPU. In a
single cpu workqueue, at most single cwq is bound to the wq at any
given time. Arbitration is done using atomic accesses to
wq->single_cpu when queueing a work. Once bound, the binding stays
till the workqueue is drained.
Note that the binding is never broken while a workqueue is frozen.
This is because idle cwqs may have works waiting in delayed_works
queue while frozen. On thaw, the cwq is restarted if there are any
delayed works or unbound otherwise.
When combined with max_active limit of 1, single cpu workqueue has
exactly the same execution properties as the original single thread
workqueue while allowing sharing of per-cpu workers.
Tejun Heo [Tue, 29 Jun 2010 08:07:12 +0000 (10:07 +0200)]
workqueue: reimplement CPU hotplugging support using trustee
Reimplement CPU hotplugging support using trustee thread. On CPU
down, a trustee thread is created and each step of CPU down is
executed by the trustee and workqueue_cpu_callback() simply drives and
waits for trustee state transitions.
CPU down operation no longer waits for works to be drained but trustee
sticks around till all pending works have been completed. If CPU is
brought back up while works are still draining,
workqueue_cpu_callback() tells trustee to step down and tell workers
to rebind to the cpu.
As it's difficult to tell whether cwqs are empty if it's freezing or
frozen, trustee doesn't consider draining to be complete while a gcwq
is freezing or frozen (tracked by new GCWQ_FREEZING flag). Also,
workers which get unbound from their cpu are marked with WORKER_ROGUE.
Trustee based implementation doesn't bring any new feature at this
point but it will be used to manage worker pool when dynamic shared
worker pool is implemented.
Tejun Heo [Tue, 29 Jun 2010 08:07:12 +0000 (10:07 +0200)]
workqueue: implement worker states
Implement worker states. After created, a worker is STARTED. While a
worker isn't processing a work, it's IDLE and chained on
gcwq->idle_list. While processing a work, a worker is BUSY and
chained on gcwq->busy_hash. Also, gcwq now counts the number of all
workers and idle ones.
worker_thread() is restructured to reflect state transitions.
cwq->more_work is removed and waking up a worker makes it check for
events. A worker is killed by setting DIE flag while it's IDLE and
waking it up.
This gives gcwq better visibility of what's going on and allows it to
find out whether a work is executing quickly which is necessary to
have multiple workers processing the same cwq.
Tejun Heo [Tue, 29 Jun 2010 08:07:12 +0000 (10:07 +0200)]
workqueue: introduce global cwq and unify cwq locks
There is one gcwq (global cwq) per each cpu and all cwqs on an cpu
point to it. A gcwq contains a lock to be used by all cwqs on the cpu
and an ida to give IDs to workers belonging to the cpu.
This patch introduces gcwq, moves worker_ida into gcwq and make all
cwqs on the same cpu use the cpu's gcwq->lock instead of separate
locks. gcwq->ida is now protected by gcwq->lock too.
Tejun Heo [Tue, 29 Jun 2010 08:07:12 +0000 (10:07 +0200)]
workqueue: reimplement workqueue freeze using max_active
Currently, workqueue freezing is implemented by marking the worker
freezeable and calling try_to_freeze() from dispatch loop.
Reimplement it using cwq->limit so that the workqueue is frozen
instead of the worker.
* workqueue_struct->saved_max_active is added which stores the
specified max_active on initialization.
* On freeze, all cwq->max_active's are quenched to zero. Freezing is
complete when nr_active on all cwqs reach zero.
* On thaw, all cwq->max_active's are restored to wq->saved_max_active
and the worklist is repopulated.
This new implementation allows having single shared pool of workers
per cpu.
Tejun Heo [Tue, 29 Jun 2010 08:07:12 +0000 (10:07 +0200)]
workqueue: implement per-cwq active work limit
Add cwq->nr_active, cwq->max_active and cwq->delayed_work. nr_active
counts the number of active works per cwq. A work is active if it's
flushable (colored) and is on cwq's worklist. If nr_active reaches
max_active, new works are queued on cwq->delayed_work and activated
later as works on the cwq complete and decrement nr_active.
cwq->max_active can be specified via the new @max_active parameter to
__create_workqueue() and is set to 1 for all workqueues for now. As
each cwq has only single worker now, this double queueing doesn't
cause any behavior difference visible to its users.
This will be used to reimplement freeze/thaw and implement shared
worker pool.
Tejun Heo [Tue, 29 Jun 2010 08:07:12 +0000 (10:07 +0200)]
workqueue: reimplement work flushing using linked works
A work is linked to the next one by having WORK_STRUCT_LINKED bit set
and these links can be chained. When a linked work is dispatched to a
worker, all linked works are dispatched to the worker's newly added
->scheduled queue and processed back-to-back.
Currently, as there's only single worker per cwq, having linked works
doesn't make any visible behavior difference. This change is to
prepare for multiple shared workers per cpu.
Tejun Heo [Tue, 29 Jun 2010 08:07:11 +0000 (10:07 +0200)]
workqueue: introduce worker
Separate out worker thread related information to struct worker from
struct cpu_workqueue_struct and implement helper functions to deal
with the new struct worker. The only change which is visible outside
is that now workqueue worker are all named "kworker/CPUID:WORKERID"
where WORKERID is allocated from per-cpu ida.
This is in preparation of concurrency managed workqueue where shared
multiple workers would be available per cpu.
Tejun Heo [Tue, 29 Jun 2010 08:07:11 +0000 (10:07 +0200)]
workqueue: reimplement workqueue flushing using color coded works
Reimplement workqueue flushing using color coded works. wq has the
current work color which is painted on the works being issued via
cwqs. Flushing a workqueue is achieved by advancing the current work
colors of cwqs and waiting for all the works which have any of the
previous colors to drain.
Currently there are 16 possible colors, one is reserved for no color
and 15 colors are useable allowing 14 concurrent flushes. When color
space gets full, flush attempts are batched up and processed together
when color frees up, so even with many concurrent flushers, the new
implementation won't build up huge queue of flushers which has to be
processed one after another.
Only works which are queued via __queue_work() are colored. Works
which are directly put on queue using insert_work() use NO_COLOR and
don't participate in workqueue flushing. Currently only works used
for work-specific flush fall in this category.
This new implementation leaves only cleanup_workqueue_thread() as the
user of flush_cpu_workqueue(). Just make its users use
flush_workqueue() and kthread_stop() directly and kill
cleanup_workqueue_thread(). As workqueue flushing doesn't use barrier
request anymore, the comment describing the complex synchronization
around it in cleanup_workqueue_thread() is removed together with the
function.
This new implementation is to allow having and sharing multiple
workers per cpu.
Please note that one more bit is reserved for a future work flag by
this patch. This is to avoid shifting bits and updating comments
later.
Tejun Heo [Tue, 29 Jun 2010 08:07:11 +0000 (10:07 +0200)]
workqueue: update cwq alignement
work->data field is used for two purposes. It points to cwq it's
queued on and the lower bits are used for flags. Currently, two bits
are reserved which is always safe as 4 byte alignment is guaranteed on
every architecture. However, future changes will need more flag bits.
On SMP, the percpu allocator is capable of honoring larger alignment
(there are other users which depend on it) and larger alignment works
just fine. On UP, percpu allocator is a thin wrapper around
kzalloc/kfree() and don't honor alignment request.
This patch introduces WORK_STRUCT_FLAG_BITS and implements
alloc/free_cwqs() which guarantees max(1 << WORK_STRUCT_FLAG_BITS,
__alignof__(unsigned long long) alignment both on SMP and UP. On SMP,
simply wrapping percpu allocator is enough. On UP, extra space is
allocated so that cwq can be aligned and the original pointer can be
stored after it which is used in the free path.
* Alignment problem on UP is reported by Michal Simek.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Frederic Weisbecker <fweisbec@gmail.com> Reported-by: Michal Simek <michal.simek@petalogix.com>
Tejun Heo [Tue, 29 Jun 2010 08:07:11 +0000 (10:07 +0200)]
workqueue: kill cpu_populated_map
Worker management is about to be overhauled. Simplify things by
removing cpu_populated_map, creating workers for all possible cpus and
making single threaded workqueues behave more like multi threaded
ones.
After this patch, all cwqs are always initialized, all workqueues are
linked on the workqueues list and workers for all possibles cpus
always exist. This also makes CPU hotplug support simpler - checking
->cpus_allowed before processing works in worker_thread() and flushing
cwqs on CPU_POST_DEAD are enough.
While at it, make get_cwq() always return the cwq for the specified
cpu, add target_cwq() for cases where single thread distinction is
necessary and drop all direct usage of per_cpu_ptr() on wq->cpu_wq.
Tejun Heo [Tue, 29 Jun 2010 08:07:10 +0000 (10:07 +0200)]
workqueue: define masks for work flags and conditionalize STATIC flags
Work flags are about to see more traditional mask handling. Define
WORK_STRUCT_*_BIT as the bit position constant and redefine
WORK_STRUCT_* as bit masks. Also, make WORK_STRUCT_STATIC_* flags
conditional
While at it, re-define these constants as enums and use
WORK_STRUCT_STATIC instead of hard-coding 2 in
WORK_DATA_STATIC_INIT().
Tejun Heo [Tue, 29 Jun 2010 08:07:10 +0000 (10:07 +0200)]
workqueue: merge feature parameters into flags
Currently, __create_workqueue_key() takes @singlethread and
@freezeable paramters and store them separately in workqueue_struct.
Merge them into a single flags parameter and field and use
WQ_FREEZEABLE and WQ_SINGLE_THREAD.
Tejun Heo [Tue, 29 Jun 2010 08:07:10 +0000 (10:07 +0200)]
workqueue: misc/cosmetic updates
Make the following updates in preparation of concurrency managed
workqueue. None of these changes causes any visible behavior
difference.
* Add comments and adjust indentations to data structures and several
functions.
* Rename wq_per_cpu() to get_cwq() and swap the position of two
parameters for consistency. Convert a direct per_cpu_ptr() access
to wq->cpu_wq to get_cwq().
* Add work_static() and Update set_wq_data() such that it sets the
flags part to WORK_STRUCT_PENDING | WORK_STRUCT_STATIC if static |
@extra_flags.
* Move santiy check on work->entry emptiness from queue_work_on() to
__queue_work() which all queueing paths share.
* Make __queue_work() take @cpu and @wq instead of @cwq.
* Restructure flush_work() and __create_workqueue_key() to make them
easier to modify.
Tejun Heo [Tue, 29 Jun 2010 08:07:09 +0000 (10:07 +0200)]
acpi: use queue_work_on() instead of binding workqueue worker to cpu0
ACPI works need to be executed on cpu0 and acpi/osl.c achieves this by
creating singlethread workqueue and then binding it to cpu0 from a
work which is quite unorthodox. Make it create regular workqueues and
use queue_work_on() instead. This is in preparation of concurrency
managed workqueue and the extra workers won't be a problem after it's
implemented.
Tejun Heo [Tue, 29 Jun 2010 08:07:09 +0000 (10:07 +0200)]
kthread: implement kthread_data()
Implement kthread_data() which takes @task pointing to a kthread and
returns @data specified when creating the kthread. The caller is
responsible for ensuring the validity of @task when calling this
function.
Tejun Heo [Tue, 29 Jun 2010 08:07:09 +0000 (10:07 +0200)]
ivtv: use kthread_worker instead of workqueue
Upcoming workqueue updates will no longer guarantee fixed workqueue to
worker kthread association, so giving RT priority to the irq worker
won't work. Use kthread_worker which guarantees specific kthread
association instead. This also makes setting the priority cleaner.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andy Walls <awalls@md.metrocast.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: ivtv-devel@ivtvdriver.org Cc: linux-media@vger.kernel.org
Tejun Heo [Tue, 29 Jun 2010 08:07:09 +0000 (10:07 +0200)]
kthread: implement kthread_worker
Implement simple work processor for kthread. This is to ease using
kthread. Single thread workqueue used to be used for things like this
but workqueue won't guarantee fixed kthread association anymore to
enable worker sharing.
This can be used in cases where specific kthread association is
necessary, for example, when it should have RT priority or be assigned
to certain cgroup.
Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
wimax/i2400m: fix missing endian correction read in fw loader
net8139: fix a race at the end of NAPI
pktgen: Fix accuracy of inter-packet delay.
pkt_sched: gen_estimator: add a new lock
net: deliver skbs on inactive slaves to exact matches
ipv6: fix ICMP6_MIB_OUTERRORS
r8169: fix mdio_read and update mdio_write according to hw specs
gianfar: Revive the driver for eTSEC devices (disable timestamping)
caif: fix a couple range checks
phylib: Add support for the LXT973 phy.
net: Print num_rx_queues imbalance warning only when there are allocated queues
Linus Torvalds [Fri, 11 Jun 2010 21:18:47 +0000 (14:18 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: The file argument for fsync() is never null
Btrfs: handle ERR_PTR from posix_acl_from_xattr()
Btrfs: avoid BUG when dropping root and reference in same transaction
Btrfs: prohibit a operation of changing acl's mask when noacl mount option used
Btrfs: should add a permission check for setfacl
Btrfs: btrfs_lookup_dir_item() can return ERR_PTR
Btrfs: btrfs_read_fs_root_no_name() returns ERR_PTRs
Btrfs: unwind after btrfs_start_transaction() errors
Btrfs: btrfs_iget() returns ERR_PTR
Btrfs: handle kzalloc() failure in open_ctree()
Btrfs: handle error returns from btrfs_lookup_dir_item()
Btrfs: Fix BUG_ON for fs converted from extN
Btrfs: Fix null dereference in relocation.c
Btrfs: fix remap_file_pages error
Btrfs: uninitialized data is check_path_shared()
Btrfs: fix fallocate regression
Btrfs: fix loop device on top of btrfs
Linus Torvalds [Fri, 11 Jun 2010 21:15:44 +0000 (14:15 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
PCI: clear bridge resource range if BIOS assigned bad one
PCI: hotplug/cpqphp, fix NULL dereference
Revert "PCI: create function symlinks in /sys/bus/pci/slots/N/"
PCI: change resource collision messages from KERN_ERR to KERN_INFO
Yinghai Lu [Thu, 3 Jun 2010 20:43:03 +0000 (13:43 -0700)]
PCI: clear bridge resource range if BIOS assigned bad one
Yannick found that video does not work with 2.6.34. The cause of this
bug was that the BIOS had assigned the wrong range to the PCI bridge
above the video device. Before 2.6.34 the kernel would have shrunk
the size of the bridge window, but since d65245c PCI: don't shrink bridge resources
the kernel will avoid shrinking BIOS ranges.
So zero out the old range if we fail to claim it at boot time; this will
cause us to allocate a new range at startup, restoring the 2.6.34
behavior.
Jiri Slaby [Wed, 9 Jun 2010 20:31:13 +0000 (22:31 +0200)]
PCI: hotplug/cpqphp, fix NULL dereference
There are devices out there which are PCI Hot-plug controllers with
compaq PCI IDs, but are not bridges, hence have pdev->subordinate
NULL. But cpqphp expects the pointer to be non-NULL.
Add a check to the probe function to avoid oopses like:
BUG: unable to handle kernel NULL pointer dereference at 00000050
IP: [<f82e3c41>] cpqhpc_probe+0x951/0x1120 [cpqphp]
*pdpt = 0000000033779001 *pde = 0000000000000000
...
Bjorn Helgaas [Thu, 3 Jun 2010 19:47:18 +0000 (13:47 -0600)]
PCI: change resource collision messages from KERN_ERR to KERN_INFO
We can often deal with PCI resource issues by moving devices around. In
that case, there's no point in alarming the user with messages like these.
There are many bug reports where the message itself is the only problem,
e.g., https://bugs.launchpad.net/ubuntu/+source/linux/+bug/413419 .
Dan Carpenter [Sat, 29 May 2010 09:49:07 +0000 (09:49 +0000)]
Btrfs: The file argument for fsync() is never null
The "file" argument for fsync is never null so we can remove this check.
What drew my attention here is that 7ea8085910e: "drop unused dentry
argument to ->fsync" introduced an unconditional dereference at the
start of the function and that generated a smatch warning.
Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Sage Weil [Mon, 17 May 2010 17:15:27 +0000 (17:15 +0000)]
Btrfs: avoid BUG when dropping root and reference in same transaction
If btrfs_ioctl_snap_destroy() deletes a snapshot but finishes
with end_transaction(), the cleaner kthread may come in and
drop the root in the same transaction. If that's the case, the
root's refs still == 1 in the tree when btrfs_del_root() deletes
the item, because commit_fs_roots() hasn't updated it yet (that
happens during the commit).
This wasn't a problem before only because
btrfs_ioctl_snap_destroy() would commit the transaction before dropping
the dentry reference, so the dead root wouldn't get queued up until
after the fs root item was updated in the btree.
Since it is not an error to drop the root reference and the root in the
same transaction, just drop the BUG_ON() in btrfs_del_root().
Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Shi Weihua [Tue, 18 May 2010 00:51:54 +0000 (00:51 +0000)]
Btrfs: prohibit a operation of changing acl's mask when noacl mount option used
when used Posix File System Test Suite(pjd-fstest) to test btrfs,
some cases about setfacl failed when noacl mount option used.
I simplified used commands in pjd-fstest, and the following steps
can reproduce it.
------------------------
# cd btrfs-part/
# mkdir aaa
# setfacl -m m::rw aaa <- successed, but not expected by pjd-fstest.
------------------------
I checked ext3, a warning message occured, like as:
setfacl: aaa/: Operation not supported
Certainly, it's expected by pjd-fstest.
So, i compared acl.c of btrfs and ext3. Based on that, a patch created.
Fortunately, it works.
Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Shi Weihua [Tue, 18 May 2010 00:50:32 +0000 (00:50 +0000)]
Btrfs: should add a permission check for setfacl
On btrfs, do the following
------------------
# su user1
# cd btrfs-part/
# touch aaa
# getfacl aaa
# file: aaa
# owner: user1
# group: user1
user::rw-
group::rw-
other::r--
# su user2
# cd btrfs-part/
# setfacl -m u::rwx aaa
# getfacl aaa
# file: aaa
# owner: user1
# group: user1
user::rwx <- successed to setfacl
group::rw-
other::r--
------------------
but we should prohibit it that user2 changing user1's acl.
In fact, on ext3 and other fs, a message occurs:
setfacl: aaa: Operation not permitted
This patch fixed it. Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_read_fs_root_no_name() returns ERR_PTRs on error so I added a
check for that. It's not clear to me if it can also return NULL
pointers or not so I left the original NULL pointer check as is.
Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Dan Carpenter [Sat, 29 May 2010 09:46:47 +0000 (09:46 +0000)]
Btrfs: unwind after btrfs_start_transaction() errors
This was added by a22285a6a3: "Btrfs: Integrate metadata reservation
with start_transaction". If we goto out here then we skip all the
unwinding and there are locks still held etc.
Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Linus Torvalds [Fri, 11 Jun 2010 16:55:50 +0000 (09:55 -0700)]
Merge branch 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6
* 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild-2.6:
kbuild: Create output directory in Makefile.modbuiltin
kbuild: Generate modules.builtin in make modules
Linus Torvalds [Fri, 11 Jun 2010 16:55:21 +0000 (09:55 -0700)]
Merge branch 'urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/pcmcia-2.6
* 'urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/pcmcia-2.6:
pcmcia: avoid validate_cis failure on CIS override
pcmcia: dev_node removal bugfix
pcmcia: yenta_socket.c Remove extra #ifdef CONFIG_YENTA_TI
pcmcia: only keep saved I365_CSCINT flag if there is no PCI irq
Linus Torvalds [Fri, 11 Jun 2010 16:52:23 +0000 (09:52 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: try to send partial cap release on cap message on missing inode
ceph: release cap on import if we don't have the inode
ceph: fix misleading/incorrect debug message
ceph: fix atomic64_t initialization on ia64
ceph: fix lease revocation when seq doesn't match
ceph: fix f_namelen reported by statfs
ceph: fix memory leak in statfs
ceph: fix d_subdirs ordering problem
Dan Carpenter [Tue, 1 Jun 2010 08:23:11 +0000 (08:23 +0000)]
Btrfs: uninitialized data is check_path_shared()
refs can be used with uninitialized data if btrfs_lookup_extent_info()
fails on the first pass through the loop. In the original code if that
happens then check_path_shared() probably returns 1, this patch
changes it to return 1 for safety.
Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef Bacik [Mon, 7 Jun 2010 18:26:37 +0000 (18:26 +0000)]
Btrfs: fix fallocate regression
Seems that when btrfs_fallocate was converted to use the new ENOSPC stuff we
dropped passing the mode to the function that actually does the preallocation.
This breaks anybody who wants to use FALLOC_FL_KEEP_SIZE. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Daniel Turull [Fri, 11 Jun 2010 06:08:11 +0000 (23:08 -0700)]
pktgen: Fix accuracy of inter-packet delay.
This patch correct a bug in the delay of pktgen.
It makes sure the inter-packet interval is accurate.
Signed-off-by: Daniel Turull <daniel.turull@gmail.com> Signed-off-by: Robert Olsson <robert.olsson@its.uu.se> Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Thu, 3 Jun 2010 09:30:11 +0000 (09:30 +0000)]
net: deliver skbs on inactive slaves to exact matches
Currently, the accelerated receive path for VLAN's will
drop packets if the real device is an inactive slave and
is not one of the special pkts tested for in
skb_bond_should_drop(). This behavior is different then
the non-accelerated path and for pkts over a bonded vlan.
For example,
vlanx -> bond0 -> ethx
will be dropped in the vlan path and not delivered to any
packet handlers at all. However,
bond0 -> vlanx -> ethx
and
bond0 -> ethx
will be delivered to handlers that match the exact dev,
because the VLAN path checks the real_dev which is not a
slave and netif_recv_skb() doesn't drop frames but only
delivers them to exact matches.
This patch adds a sk_buff flag which is used for tagging
skbs that would previously been dropped and allows the
skb to continue to skb_netif_recv(). Here we add
logic to check for the deliver_no_wcard flag and if it
is set only deliver to handlers that match exactly. This
makes both paths above consistent and gives pkt handlers
a way to identify skbs that come from inactive slaves.
Without this patch in some configurations skbs will be
delivered to handlers with exact matches and in others
be dropped out right in the vlan path.
I have tested the following 4 configurations in failover modes
and load balancing modes.
# bond0 -> ethx
# vlanx -> bond0 -> ethx
# bond0 -> vlanx -> ethx
# bond0 -> ethx
|
vlanx -> --
Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sage Weil [Wed, 9 Jun 2010 23:52:04 +0000 (16:52 -0700)]
ceph: try to send partial cap release on cap message on missing inode
If we have enough memory to allocate a new cap release message, do so, so
that we can send a partial release message immediately. This keeps us from
making the MDS wait when the cap release it needs is in a partially full
release message.
If we fail because of ENOMEM, oh well, they'll just have to wait a bit
longer.
Sage Weil [Wed, 9 Jun 2010 23:47:10 +0000 (16:47 -0700)]
ceph: release cap on import if we don't have the inode
If we get an IMPORT that give us a cap, but we don't have the inode, queue
a release (and try to send it immediately) so that the MDS doesn't get
stuck waiting for us.
Catalin Marinas [Thu, 10 Jun 2010 16:02:12 +0000 (17:02 +0100)]
sata_sil24: Use memory barriers before issuing commands
The data in the cmd_block buffers may reach the main memory after the
writel() to the device ports. This patch introduces two calls to wmb()
to ensure the relative ordering.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Colin Tuckley <colin.tuckley@arm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jeff Garzik <jeff@garzik.org> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Dan Carpenter [Wed, 9 Jun 2010 12:01:54 +0000 (14:01 +0200)]
sata_sil24: memset() overflow
cb->atapi.cdb is an array of 16 u8 elements. The call too memset()
would set the first part of the sge array to zero as well. It's not
a packed struct.
This one has been around for five years. I found it with Smatch. I
think the reason no one has seen it before is because we normally call
sil24_fill_sg() and that overwrites sge with proper information?
Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Linus Torvalds [Thu, 10 Jun 2010 17:53:14 +0000 (10:53 -0700)]
Merge branch 'kvm-updates/2.6.35' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.35' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: read apic->irr with ioapic lock held
KVM: ia64: Add missing spin_unlock in kvm_arch_hardware_enable()
KVM: Fix order passed to iommu_unmap
KVM: MMU: Remove user access when allowing kernel access to gpte.w=0 page
KVM: MMU: invalidate and flush on spte small->large page size change
KVM: SVM: Implement workaround for Erratum 383
KVM: SVM: Handle MCEs early in the vmexit process
KVM: powerpc: fix init/exit annotation