Steven Rostedt [Fri, 14 May 2010 14:19:13 +0000 (10:19 -0400)]
tracing: Comment the use of event_mutex with trace event flags
The flags variable is protected by the event_mutex when modifying,
but the event_mutex is not held when reading the variable.
This is because the reads occur in critical sections where
taking a mutex (or even a spinlock) is not wanted.
But the code for the two flags that exist (enable and filter_active)
is written so that the reads do not need a lock.
The enable flag is used just to know if the event is enabled or not
and its use is always under the event_mutex. Whether or not the event
is actually enabled is really determined by the tracepoint being
registered. The flag is just a way to let the code know if the tracepoint
is registered.
The filter_active is different. It is read without the lock. If it
is set, then the event probes jump to the filter code. There can be a
slight mismatch between filters available and filter_active. If the flag is
set but no filters are available, the code safely jumps to a filter nop.
If the flag is not set and the filters are available, then the filters
are skipped. This is acceptable since filters are usually set before
tracing starts, or they are set by humans, who would not notice the slight
delay that this causes.
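As a rough illustration of this lockless read, the probe-side check looks
something like the following sketch (the helper and field names are quoted
from memory of this era's code and should be treated as assumptions):

    int filter_check_discard(struct ftrace_event_call *call, void *rec,
                             struct ring_buffer *buffer,
                             struct ring_buffer_event *event)
    {
            /* filter_active is read without event_mutex held; a stale
             * value only filters (or passes) an event slightly late. */
            if (unlikely(call->filter_active) &&
                !filter_match_preds(call->filter, rec)) {
                    ring_buffer_discard_commit(buffer, event);
                    return 1;       /* discard this event */
            }
            return 0;
    }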
v2: Fixed typo: "cacheing" -> "caching"
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Li Zefan [Mon, 10 May 2010 03:23:00 +0000 (11:23 +0800)]
tracing: Fix function declarations if !CONFIG_STACKTRACE
ftrace_trace_stack() and ftrace_trace_userstack() take a
struct ring_buffer argument, not a struct trace_array.
Commit e77405ad ("tracing: pass around ring buffer instead of tracer")
made this change.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4BE77C14.5010806@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Steven Rostedt [Fri, 23 Apr 2010 15:12:36 +0000 (11:12 -0400)]
tracing: Combine event filter_active and enable into single flags field
The filter_active and enable fields each use an int (4 bytes) to
store a single flag. We can save 4 bytes per event by combining the
two into a single integer.
The modification of both the enable and filter fields is done
under the event_mutex, so it is still safe to combine the two.
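A minimal sketch of the combined field; the bit names here are
assumptions, not necessarily the ones used by the patch:

    enum {
            TRACE_EVENT_FL_ENABLED_BIT,
            TRACE_EVENT_FL_FILTERED_BIT,
    };

    enum {
            TRACE_EVENT_FL_ENABLED  = (1 << TRACE_EVENT_FL_ENABLED_BIT),
            TRACE_EVENT_FL_FILTERED = (1 << TRACE_EVENT_FL_FILTERED_BIT),
    };

    struct ftrace_event_call {
            /* ... */
            unsigned int    flags;  /* replaces the two ints, saving 4 bytes */
    };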
Note: Although Mathieu gave his Acked-by, he would like it documented
that the reads of flags are not protected by the mutex. The way the
code works, these reads will not break anything, but will have a
residual effect. Since this behavior is the same even before this
patch, describing this situation is left to another patch, as this
patch does not change the behavior, but just brought it to Mathieu's
attention.
v2: Updated the event trace self test for this change.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Steven Rostedt [Fri, 23 Apr 2010 14:38:03 +0000 (10:38 -0400)]
tracing: Remove duplicate id information in event structure
Now that the trace_event structure is embedded in the ftrace_event_call
structure, there is no need for the ftrace_event_call id field.
The id field is the same as the trace_event type field.
Removing the id and re-arranging the structure brings down the tracepoint
footprint by another 5K.
Steven Rostedt [Fri, 23 Apr 2010 14:00:22 +0000 (10:00 -0400)]
tracing: Move print functions into event class
Currently, every event has its own trace_event structure. This is
fine since the structure is needed anyway. But the print function
structure (trace_event_functions) is now separate. Since the output
of the trace event is done by the class (with the exception of events
defined by DEFINE_EVENT_PRINT), it makes sense to have the class
define the print functions that all events in the class can use.
This matters most for the syscall events, since all syscall events
use the same class. The savings here is another 30K.
To accomplish this, and to let the class know what event is being
printed, the event structure is embedded in the ftrace_event_call
structure. This should not be an issue since the event structure
was created for each event anyway.
Steven Rostedt [Thu, 22 Apr 2010 22:46:14 +0000 (18:46 -0400)]
tracing: Allow events to share their print functions
Multiple events may use the same method to print their data.
Instead of having every event carry a pointer to its print functions,
the trace_event structure now points to a trace_event_functions structure
that holds the way to print out the event.
The event itself is now passed to the print function to let the print
function know what kind of event it should print.
This opens the door to consolidating the way several events print
their output.
Steven Rostedt [Thu, 22 Apr 2010 15:46:44 +0000 (11:46 -0400)]
tracing: Move raw_init from events to class
The raw_init function pointer in the event is used to initialize
various kinds of events. The type of initialization needed usually
depends on the class of the event.
Two events with the same class will always have the same initialization
function, so it makes sense to move this to the class structure.
Perhaps even making a special system structure would work since
the initialization is the same for all events within a system.
But since there's no system structure (yet), this will just move it
to the class.
The text grew very slightly, but this is a one-time, constant growth
that comes from changing the C files that call the init code.
The bigger savings is the data which will be saved the more events share
a class.
Steven Rostedt [Thu, 22 Apr 2010 14:35:55 +0000 (10:35 -0400)]
tracing: Move fields from event to class structure
Move the defined fields from the event to the class structure.
Since the fields of the event are defined by the class they belong
to, it makes sense to have the class hold the information instead
of the individual events. The events of the same class would just
hold duplicate information.
After this change the size of the kernel dropped another 3K:
Although the text increased, this was mainly due to the C files
having to adapt to the change. This is a constant increase;
new tracepoints will not increase the text further. But the big drop is
in the data size (as well as needed allocations to hold the fields).
This will give even more savings as more tracepoints are created.
Note, if just TRACE_EVENT()s are used and not DECLARE_EVENT_CLASS()
with several DEFINE_EVENT()s, then the savings will be lost. But
we are pushing developers to consolidate events with DEFINE_EVENT()
so this should not be an issue.
Kprobes define a unique class for every new event, but they are
dynamic, so this should not be an issue.
The syscalls however have a single class but the fields for the individual
events are different. The syscalls use metadata to define the
fields. I moved the fields list from the event to the metadata and
added a "get_fields()" function to the class. This function is used
to find the fields. For normal events and kprobes, get_fields() just
returns a pointer to the fields list_head in the class. For syscall
events, it returns the fields list_head in the metadata for the event.
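A sketch of how such a dispatch could look; the helper name is an
assumption for illustration:

    static struct list_head *
    trace_get_fields(struct ftrace_event_call *event_call)
    {
            /* normal events and kprobes: fields live in the class */
            if (!event_call->class->get_fields)
                    return &event_call->class->fields;
            /* syscall events: fields live in the per-event metadata */
            return event_call->class->get_fields(event_call);
    }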
v2: Fixed the syscall fields. The syscall metadata needs a list
of fields for both enter and exit.
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Steven Rostedt [Wed, 21 Apr 2010 16:27:06 +0000 (12:27 -0400)]
tracing: Remove per event trace registering
This patch removes the register functions of TRACE_EVENT() to enable
and disable tracepoints. The registering of an event is now done
directly in the trace_events.c file. The tracepoint_probe_register()
is now called directly.
The prototypes are no longer type checked, but this should not be
an issue since the tracepoints are created automatically by the
macros. If a prototype is incorrect in the TRACE_EVENT() macro, then
other macros will catch it.
The trace_event_class structure now holds the probes to be called
by the callbacks. This removes the need for each event to have
a separate pointer to the probe.
To handle kprobes and syscalls, since they register probes in a
different manner, a "reg" field is added to the ftrace_event_class
structure. If the "reg" field is assigned, then it will be called for
enabling and disabling of the probe for either ftrace or perf. To let
the reg function know what is happening, a new enum (trace_reg) is
created that has the type of control that is needed.
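The control enum could look roughly like this sketch; the perf entries
are plausible given the ftrace/perf split described above, but are
assumptions:

    enum trace_reg {
            TRACE_REG_REGISTER,
            TRACE_REG_UNREGISTER,
            TRACE_REG_PERF_REGISTER,
            TRACE_REG_PERF_UNREGISTER,
    };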
With this new rework, the 82 kernel events and 618 syscall events
have their footprint dramatically lowered:
The size went from 6863829 to 6819176, a total savings of 44K.
With tracepoints being continuously added, it is critical that the
footprint be kept minimal.
v5: Added #ifdef CONFIG_PERF_EVENTS around a reference to perf
specific structure in trace_events.c.
v4: Fixed trace self tests to check probe because regfunc no longer
exists.
v3: Updated to handle void *data in beginning of probe parameters.
Also added the tracepoint: check_trace_callback_type_##call().
v2: Changed the callback probes to pass void * and typecast the
value within the function.
The same callback can also be registered to the same tracepoint as long
as the data registered is different. Note, the data must also be used
to unregister the callback:
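For example, with a hypothetical tracepoint "foo":

    register_trace_foo(my_probe, &data_a);
    register_trace_foo(my_probe, &data_b); /* same callback, different data */
    /* ... */
    unregister_trace_foo(my_probe, &data_a); /* data selects the registration */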
Again, this patch also increases the size of the kernel, but
lays the ground work for decreasing it.
v5: Fixed net/core/drop_monitor.c to handle these updates.
v4: Moved the DECLARE_TRACE() and DECLARE_TRACE_NOARGS() out of the
#ifdef CONFIG_TRACEPOINTS, since the two are the same in both
cases. The __DECLARE_TRACE() is what changes.
Thanks to Frederic Weisbecker for pointing this out.
v3: Made all register_* functions require data to be passed and
all callbacks to take a void * parameter as its first argument.
This makes the calling functions comply with C standards.
Also added more comments to the modifications of DECLARE_TRACE().
v2: Made the DECLARE_TRACE() have the ability to pass arguments
and added a new DECLARE_TRACE_NOARGS() for tracepoints that
do not need any arguments.
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
This check is meant to be used by tracepoint users who cast their
callbacks directly to (void *) for registration, thus bypassing the
register_trace_##name and unregister_trace_##name checks.
It makes it possible to ensure that the callback type matches the
function type at the call site, without generating any code.
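Usage is a single statement that compiles away entirely; the tracepoint
name "foo" and the callback are hypothetical:

    /* fails to compile if my_probe does not match foo's prototype */
    check_trace_callback_type_foo(my_probe);
    tracepoint_probe_register("foo", (void *)my_probe, &my_data);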
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20100430165959.GA25605@Krystal>
CC: Ingo Molnar <mingo@elte.hu>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Peter Zijlstra <peterz@infradead.org>
CC: Arnaldo Carvalho de Melo <acme@redhat.com>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
CC: Li Zefan <lizf@cn.fujitsu.com>
CC: Christoph Hellwig <hch@lst.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Steven Rostedt [Tue, 20 Apr 2010 14:47:33 +0000 (10:47 -0400)]
tracing: Create class struct for events
This patch creates an ftrace_event_class struct that event structs point to.
This class struct will be made to hold information to modify the
events. Currently the class struct only holds the events system name.
This patch slightly increases the size, but it lays the groundwork
for other changes that will make the footprint of tracepoints smaller.
With 82 standard tracepoints, and 618 system call tracepoints
(two tracepoints per syscall: enter and exit):
Changli Gao [Fri, 7 May 2010 06:33:26 +0000 (14:33 +0800)]
sched, wait: Use wrapper functions
epoll should not touch flags in wait_queue_t. This patch introduces a new
function, __add_wait_queue_exclusive(), for users who use the wait queue as
a LIFO queue.
__add_wait_queue_tail_exclusive() is introduced as a replacement for
add_wait_queue_exclusive_locked(). remove_wait_queue_locked() is removed,
as it duplicates __remove_wait_queue(), is disliked by users, and has
fewer users.
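The new wrapper is essentially a one-liner around the existing helper;
this sketch follows the usual kernel pattern and should be close to the
actual code:

    static inline void __add_wait_queue_exclusive(wait_queue_head_t *q,
                                                  wait_queue_t *wait)
    {
            /* mark exclusive so wake-ups stop at this waiter (LIFO use) */
            wait->flags |= WQ_FLAG_EXCLUSIVE;
            __add_wait_queue(q, wait);
    }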
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: <containers@lists.linux-foundation.org>
LKML-Reference: <1273214006-2979-1-git-send-email-xiaosuo@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
ondemand: Make the iowait-is-busy time a sysfs tunable
Pavel Machek pointed out that not all CPUs have an efficient
idle at high frequency. Specifically, older Intel and various
AMD CPUs would see higher power usage when copying files from
USB.
Mike Chan pointed out that the same is true for various ARM
chips as well.
Thomas Renninger suggested making this a sysfs tunable with a
reasonable default.
This patch adds a sysfs tunable for the new behavior, and uses
a very simple function to determine a reasonable default,
depending on the CPU vendor/type.
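A sketch of what such a vendor/type-based default could look like; the
exact family/model cutoffs are assumptions:

    static int should_io_be_busy(void)
    {
    #if defined(CONFIG_X86)
            /* assume recent Intel cores idle efficiently at high freq */
            if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
                boot_cpu_data.x86 == 6 && boot_cpu_data.x86_model >= 15)
                    return 1;
    #endif
            return 0;
    }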
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
LKML-Reference: <20100509082651.46914d04@infradead.org>
[ minor tidyup ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
ondemand: Solve a big performance issue by counting IOWAIT time as busy
The ondemand cpufreq governor uses CPU busy time (i.e. non-idle
time) as a measure for scaling the CPU frequency up or down.
If the CPU is busy, the CPU frequency scales up, if it's idle,
the CPU frequency scales down. Effectively, it uses the CPU busy
time as proxy variable for the more nebulous "how critical is
performance right now" question.
This algorithm falls flat on its face with workloads that alternate
between being disk bound and CPU bound, such as the ever popular
"git grep", but also things like program startup and maildir-using
email clients... much to the chagrin of Andrew Morton.
This patch changes the ondemand algorithm to count iowait time
as busy, not idle, time. As shown in the breakdown cases above,
iowait is often performance critical, and by counting iowait,
the proxy variable becomes a more accurate representation of the
"how critical is performance" question.
The problem and fix are both verified with the "perf timechart"
tool.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Jones <davej@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100509082606.3d9f00d0@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that the only user of ts->idle_lastupdate is
update_ts_time_stats(), the entire field can be eliminated.
In update_ts_time_stats(), idle_lastupdate is first set to
"now", and a few lines later, the only user is an if() statement
that assigns a variable either to "now" or to
ts->idle_lastupdate, which has the value of "now" at that point.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
LKML-Reference: <20100509082439.2fab0b4f@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
sched: Fold updating of the last_update_time_info into update_ts_time_stats()
This patch folds the updating of the last_update_time into the
update_ts_time_stats() function, and updates the callers.
This allows for further cleanups that are done in the next
patch.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
LKML-Reference: <20100509082403.60072967@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
sched: Update the idle statistics in get_cpu_idle_time_us()
Right now, get_cpu_idle_time_us() only reports the idle
statistics up to the point the CPU last entered idle, not what is
valid right now.
This patch adds an update of the idle statistics to
get_cpu_idle_time_us(), so that calling this function always
returns statistics that are accurate at the point of the call.
This includes resetting the start of the idle time for
accounting purposes to avoid double accounting.
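The resulting flow is roughly the following sketch, using the helper
introduced earlier in this series (exact signatures assumed):

    u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
    {
            struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);

            /* fold the in-progress idle period into the stats and
             * restart the accounting period to avoid double counting */
            update_ts_time_stats(ts, ktime_get(), last_update_time);

            return ktime_to_us(ts->idle_sleeptime);
    }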
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
LKML-Reference: <20100509082323.2d2f1945@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
sched: Introduce a function to update the idle statistics
Currently, two places update the idle statistics (and more to
come later in this series).
This patch creates a helper function for updating these
statistics.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
LKML-Reference: <20100509082245.163e67ed@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The exported function get_cpu_idle_time_us() has no comment
describing it; add a kerneldoc comment.
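The comment follows the standard kerneldoc layout, along these lines
(wording approximated):

    /**
     * get_cpu_idle_time_us - get the total idle time of a cpu
     * @cpu: CPU number to query
     * @last_update_time: variable to store the update time in
     *
     * Return the cumulative idle time (since boot) for a given cpu,
     * in microseconds.
     */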
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: davej@redhat.com
LKML-Reference: <20100509082208.7cb721f0@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tejun Heo [Sat, 8 May 2010 14:20:53 +0000 (16:20 +0200)]
cpu_stop: add dummy implementation for UP
When !CONFIG_SMP, cpu_stop functions weren't defined at all, which
could lead to build failures if UP code uses the cpu_stop facility. Add
a dummy cpu_stop implementation for UP. The waiting variants execute
the work function directly with preempt disabled and
stop_one_cpu_nowait() schedules a workqueue work.
Makefile and ifdefs around stop_machine implementation are updated to
accommodate the CONFIG_SMP && !CONFIG_STOP_MACHINE case.
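For the waiting variant, the UP stub can simply run the function inline;
a sketch consistent with the description above:

    static inline int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn,
                                   void *arg)
    {
            int ret = -ENOENT;

            preempt_disable();
            /* on UP there is only one cpu that can be "stopped" */
            if (cpu == smp_processor_id())
                    ret = fn(arg);
            preempt_enable();
            return ret;
    }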
sched: correctly place paranoia memory barriers in synchronize_sched_expedited()
The memory barriers must be in the SMP case, not in the !SMP case.
Also add a barrier after the atomic_inc() in order to ensure that
other CPUs see post-synchronize_sched_expedited() actions as following
the expedited grace period.
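The intended ordering is roughly this sketch; the counter name comes from
the surrounding patches and the exact placement is illustrative:

    smp_mb();   /* SMP case: order prior accesses before the expedited GP */
    atomic_inc(&synchronize_sched_expedited_count);
    smp_mb__after_atomic_inc(); /* post-GP actions seen as following the GP */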
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Thu, 6 May 2010 16:49:21 +0000 (18:49 +0200)]
sched: kill paranoia check in synchronize_sched_expedited()
The paranoid check which verifies that the cpu_stop callback is
actually called on all online cpus is completely superfluous. It's
guaranteed by the cpu_stop facility, and if it didn't work as advertised
other things would go horribly wrong and trying to recover using
synchronize_sched() wouldn't be very meaningful.
Kill the paranoid check. Removal of this feature is done as a
separate step so that it can serve as a bisection point if something
actually goes wrong.
Tejun Heo [Thu, 6 May 2010 16:49:21 +0000 (18:49 +0200)]
sched: replace migration_thread with cpu_stop
Currently migration_thread is serving three purposes - migration
pusher, context to execute active_load_balance() and forced context
switcher for expedited RCU synchronize_sched. All three roles are
hardcoded into migration_thread() and determining which job is
scheduled is slightly messy.
This patch kills migration_thread and replaces all three uses with
cpu_stop. The three different roles of migration_thread() are
split into three separate cpu_stop callbacks -
migration_cpu_stop(), active_load_balance_cpu_stop() and
synchronize_sched_expedited_cpu_stop() - and each use case now simply
asks cpu_stop to execute the callback as necessary.
synchronize_sched_expedited() was implemented with private
preallocated resources and custom multi-cpu queueing and waiting
logic, both of which are provided by cpu_stop.
synchronize_sched_expedited_count is made atomic and all other shared
resources along with the mutex are dropped.
synchronize_sched_expedited() also implemented a check to detect cases
where not all the callbacks got executed on their assigned cpus, falling
back to synchronize_sched(). If called with cpu hotplug blocked,
cpu_stop already guarantees that and the condition cannot happen;
otherwise, stop_machine() would break. However, this patch preserves
the paranoid check using a cpumask to record on which cpus the stopper
ran so that it can serve as a bisection point if something actually
goes wrong there.
Because the internal execution state is no longer visible,
rcu_expedited_torture_stats() is removed.
This patch also renames the cpu_stop threads from "stopper/%d" to
"migration/%d". The names of these threads ultimately don't matter
and there's no reason to make unnecessary userland visible changes.
With this patch applied, stop_machine() and sched now share the same
resources. stop_machine() is faster without wasting any resources and
sched migration users are much cleaner.
Tejun Heo [Thu, 6 May 2010 16:49:20 +0000 (18:49 +0200)]
stop_machine: reimplement using cpu_stop
Reimplement stop_machine using cpu_stop. As cpu stoppers are
guaranteed to be available for all online cpus,
stop_machine_create/destroy() are no longer necessary and removed.
With resource management and synchronization handled by cpu_stop, the
new implementation is much simpler. Asking the cpu_stop to execute
the stop_cpu() state machine on all online cpus with cpu hotplug
disabled is enough.
stop_machine itself doesn't need to manage any global resources
anymore, so all per-instance information is rolled into struct
stop_machine_data and the mutex and all static data variables are
removed.
The previous implementation created and destroyed RT workqueues as
necessary which made stop_machine() calls highly expensive on very
large machines. According to Dimitri Sivanich, preventing the dynamic
creation/destruction makes booting more than twice as fast on very
large machines. cpu_stop resources are preallocated for all online
cpus and should have the same effect.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Tejun Heo [Thu, 6 May 2010 16:49:20 +0000 (18:49 +0200)]
cpu_stop: implement stop_cpu[s]()
Implement a simplistic per-cpu maximum priority cpu monopolization
mechanism. A non-sleeping callback can be scheduled to run on one or
multiple cpus with maximum priority, monopolizing those cpus. This is
primarily to replace and unify RT workqueue usage in stop_machine and
scheduler migration_thread which currently is serving multiple
purposes.
Four functions are provided - stop_one_cpu(), stop_one_cpu_nowait(),
stop_cpus() and try_stop_cpus().
This is to allow clean sharing of resources among stop_cpu and all the
migration thread users. One stopper thread per cpu is created which
is currently named "stopper/CPU". This will eventually replace the
migration thread and take on its name.
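A sketch of the API as described; the exact signatures are assumptions
based on the naming used here:

    typedef int (*cpu_stop_fn_t)(void *arg);

    /* run fn(arg) on one cpu at maximum priority and wait for it */
    int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);

    /* as above, but return immediately; work_buf must stay valid */
    void stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
                             struct cpu_stop_work *work_buf);

    /* run fn(arg) on every cpu in @cpumask; try_ variant fails if busy */
    int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
    int try_stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);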
* This facility was originally named cpuhog and lived in separate
files, but Peter Zijlstra nacked the name, so it was renamed to
cpu_stop and moved into stop_machine.c.
* Better reporting of preemption leak as per Peter's suggestion.
Steven Rostedt [Wed, 5 May 2010 14:52:31 +0000 (10:52 -0400)]
tracing: Fix tracepoint.h DECLARE_TRACE() to allow more than one header
When more than one header is included under CREATE_TRACE_POINTS
the DECLARE_TRACE() macro is not defined back to its original meaning
and the second include will fail to initialize the TRACE_EVENT()
and DECLARE_TRACE() correctly.
To fix this the tracepoint.h file moves the define of DECLARE_TRACE()
out of the #ifdef _LINUX_TRACEPOINT_H protection (just like the
define of the TRACE_EVENT()). This way the define_trace.h will undef
the DECLARE_TRACE() at the end and allow new headers to start
from scratch.
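With the fix, the usual single-C-file pattern works even with several
trace headers included; the header names below are just examples:

    #define CREATE_TRACE_POINTS
    #include <trace/events/sched.h>
    #include <trace/events/irq.h>  /* second header now initializes correctly */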
This patch also requires fixing include/trace/events/napi.h.
It currently uses DECLARE_TRACE() and should be converted to the
TRACE_EVENT() format, but I'll leave that change to the authors of
that file.
Since the napi.h file depends on using CREATE_TRACE_POINTS and does
not define its own DEFINE_TRACE(), it must use the define_trace.h
method instead.
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Steven Rostedt [Tue, 4 May 2010 15:24:01 +0000 (11:24 -0400)]
tracing: Convert nop macros to static inlines
The ftrace.h file contains several functions as macros when the
functions are disabled due to config options. This patch converts
most of them to static inlines.
There are two exceptions:
register_ftrace_function() and unregister_ftrace_function()
This is because their parameter "ops" must not be evaluated since
code using the function is allowed to #ifdef out the creation of
the parameter.
This also fixes an error caused by recent changes:
kernel/trace/trace_irqsoff.c: In function 'start_irqsoff_tracer':
kernel/trace/trace_irqsoff.c:571: error: expected expression before 'do'
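This is the classic failure mode of statement-like macros; the conversion
looks like the following sketch (the function name is a stand-in):

    /* before: cannot be used where an expression is expected */
    #define ftrace_foo()    do { } while (0)

    /* after: type-checked, usable in expression context, compiles away */
    static inline void ftrace_foo(void) { }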
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
kgdb: don't needlessly skip PAGE_USER test for Fsl booke
The bypassing of this test is a leftover from 2.4 vintage
kernels, and is no longer appropriate, or even used by KGDB.
Currently KGDB uses probe_kernel_write() for all access to
memory via the KGDB core, so the bypass can simply be deleted.
This fixes CVE-2010-1446.
CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Kumar Gala <galak@kernel.crashing.org>
Signed-off-by: Wufei <fei.wu@windriver.com>
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
exofs: Fix "add bdi backing to mount session" fall out
fs: fs/super.c needs to include backing-dev.h for !CONFIG_BLOCK
* master.kernel.org:/home/rmk/linux-2.6-arm:
ARM: 6061/1: PL061 GPIO: Bug fix - setting gpio for HIGH_LEVEL interrupt is not working.
ARM: 5957/1: ARM: RealView SD/MMC Card detection and write-protect using GPIOLIB
ARM: 6030/1: KS8695: enable console
ARM: 6060/1: PL061 GPIO: Setting gpio val after changing direction to OUT.
ARM: 6059/1: PL061 GPIO: Changing *_irq_chip_data with *_irq_data for real irqs.
ARM: 6023/1: update bcmring_defconfig to latest version and fix build error
ARM: fix build error in arch/arm/kernel/process.c
Dave Chinner [Wed, 28 Apr 2010 23:55:50 +0000 (09:55 +1000)]
xfs: add a shrinker to background inode reclaim
On low memory boxes, or those with highmem, the kernel can OOM before
the background inode reclaim via xfssyncd kicks in. Add a shrinker to
run inode reclaim so that it is expedited when memory is low.
This is more complex than it needs to be because the VM folk don't
want a context added to the shrinker infrastructure. Hence we need
to add a global list of XFS mount structures so the shrinker can
traverse them.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
fs: fs/super.c needs to include backing-dev.h for !CONFIG_BLOCK
When CONFIG_BLOCK is set, fs/super.c ends up getting backing-dev.h included.
But for !CONFIG_BLOCK, it isn't so lucky. The proper thing to do is
include <linux/backing-dev.h> directly from the file it's used from,
so do that.
Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
nfs: fix memory leak in nfs_get_sb with CONFIG_NFS_V4
nfs: fix some issues in nfs41_proc_reclaim_complete()
NFS: Ensure that nfs_wb_page() waits for Pg_writeback to clear
NFS: Fix an unstable write data integrity race
nfs: testing for null instead of ERR_PTR()
NFS: rsize and wsize settings ignored on v4 mounts
NFSv4: Don't attempt an atomic open if the file is a mountpoint
SUNRPC: Fix a bug in rpcauth_prune_expired
exofs: Fix "add bdi backing to mount session" fall out
Commit b3d0ab7e60d1865bb6f6a79a77aaba22f2543236 ("exofs: add bdi backing
to mount session") has a bug in the placement of the bdi member at
struct exofs_sb_info. The layout member must be kept last.
Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-tip:
x86: Disable large pages on CPUs with Atom erratum AAE44
x86-64: Clear a 64-bit FS/GS base on fork if selector is nonzero
x86, mrst: Conditionally register cpu hotplug notifier for apbt
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
x86/PCI: compute Address Space length rather than using _LEN
x86/PCI: never allocate PCI MMIO resources below BIOS_END
Al Viro [Thu, 29 Apr 2010 02:10:43 +0000 (03:10 +0100)]
nfs d_revalidate() is too trigger-happy with d_drop()
If a dentry found stale happens to be the root of a disconnected tree, we
can't d_drop() it; its d_hash is actually part of s_anon and d_drop()
would simply hide it from shrink_dcache_for_umount(), leading to
all sorts of fun, including busy inodes on umount and oopsen after
that.
The bug has been there since at least 2006 (commit c636eb already has it),
so it's definitely -stable fodder.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Colin Tuckley [Wed, 24 Feb 2010 14:23:10 +0000 (15:23 +0100)]
ARM: 5957/1: ARM: RealView SD/MMC Card detection and write-protect using GPIOLIB
The switch to using GPIOLIB broke the sd/mmc card detection on the
RealView development boards if GPIO_PL061 was not selected.
This patch selects GPIO_PL061 if GPIOLIB is selected.
The sense of the return value from mmc_status has also changed
and is corrected.
Signed-off-by: Colin Tuckley <colin.tuckley@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
sfc: Change falcon_probe_board() to fail for unsupported boards
sfc: Always close net device at the end of a disabling reset
sfc: Wait at most 10ms for the MC to finish reading out MAC statistics
sctp: Fix oops when sending queued ASCONF chunks
sctp: fix calculation of the INIT/INIT-ACK chunk length
sctp: per_cpu variables should be in bh_disabled section
sctp: fix potential reference of a freed pointer
sctp: avoid irq lock inversion while call sk->sk_data_ready()
Revert "tcp: bind() fix when many ports are bound"
net/usb: add sierra_net.c driver
cdc_ether: fix autosuspend for mbm devices
bluetooth: handle l2cap_create_connless_pdu() errors
gianfar: Wait for both RX and TX to stop
ipheth: potential null dereferences on error path
smc91c92_cs: spin_unlock_irqrestore before calling smc_interrupt()
drivers/usb/net/kaweth.c: add device "Allied Telesyn AT-USB10 USB Ethernet Adapter"
bnx2: Update version to 2.0.9.
bnx2: Prevent "scheduling while atomic" warning with cnic, bonding and vlan.
bnx2: Fix lost MSI-X problem on 5709 NICs.
cxgb3: Wait longer for control packets on initialization
...
Ben Hutchings [Wed, 28 Apr 2010 09:01:50 +0000 (09:01 +0000)]
sfc: Change falcon_probe_board() to fail for unsupported boards
The driver needs specific PHY and board support code for each SFC4000
board; there is no point trying to continue if it is missing.
Currently unsupported boards can trigger an 'oops'.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Cc: stable@kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
When we finish processing an ASCONF_ACK chunk, we try to send
the next queued ASCONF. This action runs the sctp state
machine recursively, and it's not prepared to do so.
sctp: fix calculation of the INIT/INIT-ACK chunk length
When calculating the INIT/INIT-ACK chunk length, we should account
not only for the length of the parameters, but also for their zero
padding, such as with the AUTH HMACS parameter and the CHUNKS
parameter. Without the zero padding length we may get an oops.
Reported-by: George Cheimonidis <gchimon@gmail.com>
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When sctp attempts to update an association, it removes any
addresses that were not in the updated INITs. However, the loop
may attempt to reference a transport by address after removing it.
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp: avoid irq lock inversion while call sk->sk_data_ready()
sk->sk_data_ready() of sctp socket can be called from both BH and non-BH
contexts, but the default sk->sk_data_ready(), sock_def_readable(),
cannot be used in this case. Therefore, we have to make a new function,
sctp_data_ready(), which takes the socket's callback lock with BHs disabled.
=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.33-rc6 #129
---------------------------------------------------------
sctp_darn/1517 just changed the state of lock:
(clock-AF_INET){++.?..}, at: [<c06aab60>] sock_def_readable+0x20/0x80
but this lock took another, SOFTIRQ-unsafe lock in the past:
(slock-AF_INET){+.-...}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
1 lock held by sctp_darn/1517:
#0: (sk_lock-AF_INET){+.+.+.}, at: [<cdfe363d>] sctp_sendmsg+0x23d/0xc00 [sctp]
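A sketch of the replacement callback, modeled on sock_def_readable();
the exact wake-up calls are assumptions:

    static void sctp_data_ready(struct sock *sk, int len)
    {
            /* the _bh lock makes this safe from BH and non-BH context */
            read_lock_bh(&sk->sk_callback_lock);
            if (sk_has_sleeper(sk))
                    wake_up_interruptible_sync_poll(sk->sk_sleep,
                                    POLLIN | POLLRDNORM | POLLRDBAND);
            sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
            read_unlock_bh(&sk->sk_callback_lock);
    }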
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
nfs: fix memory leak in nfs_get_sb with CONFIG_NFS_V4
With CONFIG_NFS_V4 and data version 4, nfs_get_sb will allocate memory for
export_path in nfs4_validate_text_mount_data, so we need to free it
afterwards. The leak shows up in the following kmemleak report:
x86/PCI: compute Address Space length rather than using _LEN
ACPI _CRS Address Space Descriptors have _MIN, _MAX, and _LEN. Linux has
been computing Address Spaces as [_MIN to _MIN + _LEN - 1]. Based on the
tests in the bug reports below, Windows apparently uses [_MIN to _MAX].
Per spec (ACPI 4.0, Table 6-40), for _CRS fixed-size, fixed location
descriptors, "_LEN must be (_MAX - _MIN + 1)", and when that's true, it
doesn't matter which way we compute the end. But of course, there are
BIOSes that don't follow this rule, and we're better off if Linux handles
those exceptions the same way as Windows.
This patch makes Linux use [_MIN to _MAX], as Windows seems to do. This
effectively reverts d558b483d5 and 03db42adfe and replaces them with
simpler code.
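In terms of the resource setup, the change amounts to the following
sketch; the struct and field names follow ACPI conventions and are
assumptions:

    res->start = addr.minimum;
    /* old: res->end = addr.minimum + addr.address_length - 1; */
    res->end = addr.maximum;        /* new: match Windows' interpretation */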
Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
coda: move backing-dev.h kernel include inside __KERNEL__
mtd: ensure that bdi entries are properly initialized and registered
Move mtd_bdi_*mappable to mtdcore.c
btrfs: convert to using bdi_setup_and_register()
Catch filesystems lacking s_bdi
drbd: Terminate a connection early if sending the protocol fails
drbd: fix memory leak
Fix JFFS2 sync silent failure
smbfs: add bdi backing to mount session
ncpfs: add bdi backing to mount session
exofs: add bdi backing to mount session
ecryptfs: add bdi backing to mount session
coda: add bdi backing to mount session
cifs: add bdi backing to mount session
afs: add bdi backing to mount session.
9p: add bdi backing to mount session
bdi: add helper function for doing init and register of a bdi for a file system
block: ensure jiffies wrap is handled correctly in blk_rq_timed_out_timer
powerpc/pseries: Flush lazy kernel mappings after unplug operations
This ensures that the translations for unmapped IO mappings or
unmapped memory are properly removed from the MMU hash table
before such an unplug. Without this, the hypervisor refuses the
unplug operations due to those resources still being mapped by
the partition.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Anton Blanchard [Wed, 7 Apr 2010 15:33:44 +0000 (15:33 +0000)]
powerpc/numa: Add form 1 NUMA affinity
Firmware changed the way it represents memory and cpu affinity on POWER7.
Unfortunately the old method now caps the topology to work around issues
with legacy operating systems. For Linux to get the correct topology we
need to use the new form 1 affinity information.
We set the form 1 field in the client architecture, and if we see "1" in
the ibm,associativity-form property, firmware supports form 1 affinity and
we should look at the first field in the ibm,associativity-reference-points
array. If not, we use the second field as we always have.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Elina Pasheva [Wed, 28 Apr 2010 01:06:41 +0000 (18:06 -0700)]
net/usb: add sierra_net.c driver
Re-submitted based on comments from netdev community.
Summary of the changes:
1. Improved error handling.
2. Added the missing timeout arguments to usb_control_msg().
The following is a new Linux driver which exposes certain models of Sierra
Wireless modems to the operating system as Network Interface Cards (NICs).
This driver requires a version of the sierra.c driver which supports
blacklisting to work properly. The blacklist in sierra.c rejects the interfaces
claimed by sierra_net.c. Likewise, the sierra_net.c driver only accepts
(i.e. whitelists) the interface(s) used for USB-to-WWAN traffic.
The version of sierra.c which supports blacklisting is
available from the sierra wireless knowledge base page for older kernels. It is
also available in the Linux kernel starting from version 2.6.31.
This driver works with all Sierra Wireless devices configured with
PID=68A3, such as the USB305 and USB306, provided the corresponding
firmware version is I2.0 (for USB305) or M3.0 (for USB306) or later.
This driver will not work with firmware versions earlier than those
shown above. In that case the driver will issue an error message
indicating incompatibility and will not serve the device's
USB-to-WWAN interface.
Sierra_net.c sits atop a pre-existing Linux driver called usbnet.c.
A series of hook functions are provided in sierra_net.c which are called by
usbnet.c in response to a particular condition such as receipt or transmission
of a data packet. As such, usbnet.c does most of the work of making
a modem appear to the system as a network device and for properly exchanging
traffic between the USB subsystem and the Network card interface.
Sierra_net.c is concerned with managing the data exchanged between the
USB-to-WWAN interface and the upper layers of the operating system.
Signed-off-by: Elina Pasheva <epasheva@sierrawireless.com>
Signed-off-by: Rory Filer <rfiler@sierrawireless.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Steven Rostedt [Wed, 28 Apr 2010 01:04:24 +0000 (21:04 -0400)]
tracing: Fix sleep time function profiling
When sleep_time is off the function profiler ignores the time that a task
is scheduled out. When the task is scheduled out a timestamp is taken.
When the task is scheduled back in, the timestamp is compared to the
current time and the saved calltimes are adjusted accordingly.
But when stopping the function profiler, the sched switch hook that
does this adjustment was stopped before shutting down the tracer.
This allowed some tasks to not get their timestamps set when they
scheduled out. When the function profiler started again, this would
skew the times of the scheduler functions.
This patch moves the stopping of the sched switch to after the function
profiler is stopped. It also ignores zero set calltimes, which may
happen on start up.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Torgny Johansson [Wed, 28 Apr 2010 00:07:40 +0000 (17:07 -0700)]
cdc_ether: fix autosuspend for mbm devices
Autosuspend works until you bring the wwan interface up; after that,
the device no longer enters autosuspend.
The following patch fixes the problem by setting the .manage_power
field in the mbm_info struct to the same as in the cdc_info struct
(cdc_manager_power).
Signed-off-by: Torgny Johansson <torgny.johansson@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Fleming [Tue, 27 Apr 2010 23:43:31 +0000 (16:43 -0700)]
gianfar: Wait for both RX and TX to stop
When gracefully stopping the controller, the driver was continuing if
*either* RX or TX had stopped. We need to wait for both, or the
controller could get into an invalid state.
Signed-off-by: Andy Fleming <afleming@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
keys: don't need to use RCU in keyring_read() as semaphore is held
David Howells [Tue, 27 Apr 2010 20:13:08 +0000 (13:13 -0700)]
keys: the request_key() syscall should link an existing key to the dest keyring
The request_key() system call and request_key_and_link() should make a
link from an existing key to the destination keyring (if supplied), not
just from a new key to the destination keyring.
This can be tested by:
ring=`keyctl newring fred @s`
keyctl request2 user debug:a a
keyctl request user debug:a $ring
keyctl list $ring
If it says:
keyring is empty
then it didn't work. If it shows something like:
1 key in keyring: 1070462727: --alswrv 0 0 user: debug:a
then it did.
The request_key() system call is meant to recursively search all your keyrings for
the key you desire, and, optionally, if it doesn't exist, call out to userspace
to create one for you.
If request_key() finds or creates a key, it should, optionally, create a link
to that key from the destination keyring specified.
Therefore, if, after a successful call to request_key() with a destination
keyring specified, you see the destination keyring empty, the code didn't work
correctly.
If you see the found key in the keyring, then it did - which is what the patch
is required for.
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Marc Zyngier [Tue, 27 Apr 2010 20:13:07 +0000 (13:13 -0700)]
gpio: fix pca953x set_type 'scheduling while atomic' bug
Bill Gatliff reported the following bug when using the irq_chip facility
of the pca953x driver on a PPC platform:
BUG: scheduling while atomic: insmod/1530/0x00000002
He traced it back to an i2c transaction in pca953x_irq_set_type(), which
can be called with interrupts disabled (from __setup_irq()). As the i2c
controller can sleep while sending a message, this qualifies as a bad
idea.
This patch moves the i2c transaction to pca953x_irq_bus_sync_unlock(),
where it is actually safe to send an i2c message.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Marc Zyngier <maz@misterjones.org>
Reported-by: Bill Gatliff <bgat@billgatliff.com>
Cc: Eric Miao <eric.y.miao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Huewe [Tue, 27 Apr 2010 20:13:04 +0000 (13:13 -0700)]
arch/avr32: fix build failure caused by wrong prototype
This patch fixes a build failure introduced by 1d8393171 ("avr32: use
generic ptrace_resume code") which had the static keyword as a leftover.
arch/avr32/kernel/ptrace.c:32: error: static declaration of `user_enable_single_step' follows non-static declaration
include/linux/ptrace.h:268: error: previous declaration of `user_enable_single_step' was here
David Howells [Tue, 27 Apr 2010 21:05:11 +0000 (14:05 -0700)]
keys: don't need to use RCU in keyring_read() as semaphore is held
keyring_read() doesn't need to use rcu_dereference() to access the keyring
payload as the caller holds the key semaphore to prevent modifications
from happening whilst the data is read out.
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: James Morris <jmorris@namei.org>
NFS: Ensure that nfs_wb_page() waits for Pg_writeback to clear
Neil Brown reports that he is seeing the BUG_ON(ret == 0) trigger in
nfs_page_async_flush. According to the trace in
https://bugzilla.novell.com/show_bug.cgi?id=599628
the problem appears to be due to nfs_wb_page() not waiting for the
PG_writeback flag to clear.
Chase Douglas [Mon, 26 Apr 2010 18:02:05 +0000 (14:02 -0400)]
tracing: Show sample std dev in function profiling
When combined with function graph tracing the ftrace function profiler
also prints the average run time of functions. While this gives us some
good information, it doesn't tell us anything about the variance of the
run times of the function. This change prints out s^2, the square of
the sample standard deviation (i.e. the sample variance), alongside
the average.
This change adds one entry to the profile record structure. This
increases the memory footprint of the function profiler by 1/3 on a
32-bit system, and by 1/5 on a 64-bit system when function graphing is
enabled, though the memory is only allocated when the profiler is turned
on. During the profiling, one extra line of code adds the squared
calltime to the new record entry, so this should not adversely affect
performance.
Note that the square of the sample standard deviation is printed because
there is no sqrt implementation for unsigned long long in the kernel.
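With the per-record sum of squared calltimes, the variance can be
derived at read time; a sketch with assumed field names:

    u64 avg, variance;

    avg = rec->time;
    do_div(avg, rec->counter);

    /* s^2 = (sum(x^2) - n * mean^2) / (n - 1) */
    variance = rec->time_squared - rec->counter * avg * avg;
    do_div(variance, rec->counter - 1);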
Signed-off-by: Chase Douglas <chase.douglas@canonical.com>
LKML-Reference: <1272304925-2436-1-git-send-email-chase.douglas@canonical.com>
[ fixed comment about ns^2 -> us^2 conversion ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Michael Chan [Tue, 27 Apr 2010 11:28:10 +0000 (11:28 +0000)]
bnx2: Prevent "scheduling while atomic" warning with cnic, bonding and vlan.
The bonding driver calls ndo_vlan_rx_register() while holding bond->lock.
The bnx2 driver calls bnx2_netif_stop() to stop the rx handling while
changing the vlgrp. The call also stops the cnic driver which sleeps
while the bond->lock is held, causing the warning.
This code path only needs to stop the NAPI rx handling while we are
changing the vlgrp. Since no reset is going to occur, there is no need
to stop cnic in this case. By adding a parameter to bnx2_netif_stop()
to skip stopping cnic, we can avoid the warning.
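The shape of the fix is a flag on the helper; a sketch with an assumed
parameter name:

    static void bnx2_netif_stop(struct bnx2 *bp, bool stop_cnic)
    {
            if (stop_cnic)
                    bnx2_cnic_stop(bp); /* may sleep; skip under bond->lock */
            bnx2_napi_disable(bp);      /* NAPI rx handling always stops */
            netif_tx_disable(bp->dev);
    }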
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Tue, 27 Apr 2010 11:28:09 +0000 (11:28 +0000)]
bnx2: Fix lost MSI-X problem on 5709 NICs.
It has been reported that under certain heavy traffic conditions in MSI-X
mode, the driver can lose an MSI-X vector causing all packets in the
associated rx/tx ring pair to be dropped. The problem is caused by
the chip dropping the write to unmask the MSI-X vector by the kernel
(when migrating the IRQ for example).
This can be prevented by increasing the GRC timeout value for these
register read and write operations.
Thanks to Dell for helping us debug this problem.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chase Douglas <chase.douglas@canonical.com>
LKML-Reference: <1272045759-32018-1-git-send-email-chase.douglas@canonical.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>