Steven Rostedt [Tue, 30 Sep 2008 03:02:39 +0000 (23:02 -0400)]
ring_buffer: add paranoid check for buffer page
If for some strange reason the buffer_page gets bigger, or the page struct
gets smaller, I want to know this ASAP. The best way is to not let the
kernel compile.
This patch adds code to test the size of the struct buffer_page against the
page struct and will cause compile issues if the buffer_page ever gets bigger
than the page struct.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
The minimum size of an event is 8 bytes. All events are 4 byte
aligned inside the buffer.
There are 4 types (all internal use for the ring buffer, only
the data type is exported to the interface users).
RINGBUF_TYPE_PADDING: this type is used to note extra space at the end
of a buffer page.
RINGBUF_TYPE_TIME_EXTENT: This type is used when the time between events
is greater than the 27 bit delta can hold. We add another
32 bits, and record that in its own event (8 byte size).
RINGBUF_TYPE_TIME_STAMP: (Not implemented yet). This will hold data to
help keep the buffer timestamps in sync.
RINGBUF_TYPE_DATA: The event actually holds user data.
The "len" field is only three bits. Since the data must be
4 byte aligned, this field is shifted left by 2, giving a
max length of 28 bytes. If the data load is greater than 28
bytes, the first array field holds the full length of the
data load and the len field is set to zero.
Example, data size of 7 bytes:
type = RINGBUF_TYPE_DATA
len = 2
time_delta: <time-stamp> - <prev_event-time-stamp>
array[0..1]: <7 bytes of data> <1 byte empty>
This event is saved in 12 bytes of the buffer.
An event with 82 bytes of data:
type = RINGBUF_TYPE_DATA
len = 0
time_delta: <time-stamp> - <prev_event-time-stamp>
array[0]: 84 (Note the alignment)
array[1..14]: <82 bytes of data> <2 bytes empty>
The above event is saved in 92 bytes (if my math is correct).
82 bytes of data, 2 bytes empty, 4 byte header, 4 byte length.
Do not reference the above event struct directly. Use the following
functions to gain access to the event table, since the
ring_buffer_event structure may change in the future.
ring_buffer_event_length(event): get the length of the event.
This is the size of the memory used to record this
event, and not the size of the data pay load.
ring_buffer_time_delta(event): get the time delta of the event
This returns the delta time stamp since the last event.
Note: Even though this is in the header, there should
be no reason to access this directly, accept
for debugging.
ring_buffer_event_data(event): get the data from the event
This is the function to use to get the actual data
from the event. Note, it is only a pointer to the
data inside the buffer. This data must be copied to
another location otherwise you risk it being written
over in the buffer.
ring_buffer_lock: A way to lock the entire buffer.
ring_buffer_unlock: unlock the buffer.
ring_buffer_alloc: create a new ring buffer. Can choose between
overwrite or consumer/producer mode. Overwrite will
overwrite old data, where as consumer producer will
throw away new data if the consumer catches up with the
producer. The consumer/producer is the default.
ring_buffer_free: free the ring buffer.
ring_buffer_resize: resize the buffer. Changes the size of each cpu
buffer. Note, it is up to the caller to provide that
the buffer is not being used while this is happening.
This requirement may go away but do not count on it.
ring_buffer_lock_reserve: locks the ring buffer and allocates an
entry on the buffer to write to.
ring_buffer_unlock_commit: unlocks the ring buffer and commits it to
the buffer.
ring_buffer_write: writes some data into the ring buffer.
ring_buffer_peek: Look at a next item in the cpu buffer.
ring_buffer_consume: get the next item in the cpu buffer and
consume it. That is, this function increments the head
pointer.
ring_buffer_read_start: Start an iterator of a cpu buffer.
For now, this disables the cpu buffer, until you issue
a finish. This is just because we do not want the iterator
to be overwritten. This restriction may change in the future.
But note, this is used for static reading of a buffer which
is usually done "after" a trace. Live readings would want
to use the ring_buffer_consume above, which will not
disable the ring buffer.
ring_buffer_read_finish: Finishes the read iterator and reenables
the ring buffer.
ring_buffer_iter_peek: Look at the next item in the cpu iterator.
ring_buffer_read: Read the iterator and increment it.
ring_buffer_iter_reset: Reset the iterator to point to the beginning
of the cpu buffer.
ring_buffer_iter_empty: Returns true if the iterator is at the end
of the cpu buffer.
ring_buffer_size: returns the size in bytes of each cpu buffer.
Note, the real size is this times the number of CPUs.
ring_buffer_reset_cpu: Sets the cpu buffer to empty
ring_buffer_reset: sets all cpu buffers to empty
ring_buffer_swap_cpu: swaps a cpu buffer from one buffer with a
cpu buffer of another buffer. This is handy when you
want to take a snap shot of a running trace on just one
cpu. Having a backup buffer, to swap with facilitates this.
Ftrace max latencies use this.
ring_buffer_empty: Returns true if the ring buffer is empty.
ring_buffer_empty_cpu: Returns true if the cpu buffer is empty.
ring_buffer_record_disable: disable all cpu buffers (read only)
ring_buffer_record_disable_cpu: disable a single cpu buffer (read only)
ring_buffer_record_enable: enable all cpu buffers.
ring_buffer_record_enabl_cpu: enable a single cpu buffer.
ring_buffer_entries: The number of entries in a ring buffer.
ring_buffer_overruns: The number of entries removed due to writing wrap.
ring_buffer_time_stamp: Get the time stamp used by the ring buffer
ring_buffer_normalize_time_stamp: normalize the ring buffer time stamp
into nanosecs.
I still need to implement the GTOD feature. But we need support from
the cpu frequency infrastructure. But this can be done at a later
time without affecting the ring buffer interface.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Tue, 30 Sep 2008 03:02:37 +0000 (23:02 -0400)]
ftrace: give time for wakeup test to run
It is possible that the testing thread in the ftrace wakeup test does not
run before we stop the trace. This will cause the trace to fail since nothing
will be in the buffers.
This patch adds a small wait in the wakeup test to allow for the woken task
to run and be traced.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
ftrace/fastboot: disable tracers self-tests when boot tracer is selected
The tracing engine resets the ring buffer and the tracers touch it
too during self-tests. These self-tests happen during tracers registering
and work against boot tracing which is logging initcalls.
We have to disable tracing self-tests if the boot-tracer is selected.
tracing/ftrace: launch boot tracing after pre-smp initcalls
Launch the boot tracing inside the initcall_debug area. Old printk
have not been removed to keep the old way of initcall tracing for
backward compatibility.
It's primary purpose is to be able to trace the initcalls.
It is intended to be used with scripts/bootgraph.pl after some small
improvements.
Note that it is not active after its init. To avoid tracing (and so
crashing) before the whole tracing engine init, you have to explicitly
call start_boot_trace() after do_pre_smp_initcalls() to enable it.
Arjan van de Ven [Sat, 13 Sep 2008 16:36:06 +0000 (09:36 -0700)]
tracing/fastboot: add a script to visualize the kernel boot process / time
When optimizing the kernel boot time, it's very valuable to visualize
what is going on at which time. In addition, with the fastboot asynchronous
initcall level, it's very valuable to see which initcall gets run where
and when.
This patch adds a script to turn a dmesg into a SVG graph (that can be
shown with tools such as InkScape, Gimp or Firefox) and a small change
to the initcall code to print the PID of the thread calling the initcall
(so that the script can work out the parallelism).
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Lai Jiangshan [Fri, 10 Oct 2008 06:43:57 +0000 (14:43 +0800)]
markers: bit-field is not thread-safe nor smp-safe
bit-field is not thread-safe nor smp-safe.
struct marker_entry.rcu_pending is not protected by any lock
in rcu-callback free_old_closure().
so we must turn it into a safe type.
detail:
I suppose rcu_pending and ptype are store in struct marker_entry.tmp1
free_old_closure() side: change ptype side:
| load struct marker_entry.tmp1
--------------------------------|--------------------------------
| change ptype bit in tmp1
load struct marker_entry.tmp1 |
change rcu_pending bit in tmp1 |
store tmp1 |
--------------------------------|--------------------------------
| store tmp1
now this result equals that free_old_closure() do not change rcu_pending
bit, bug! This bug will cause redundant rcu_barrier_sched() called.
not too harmful.
----- corresponding:
free_old_closure() side: change ptype side:
load struct marker_entry.tmp1 |
--------------------------------|--------------------------------
| load struct marker_entry.tmp1
change rcu_pending bit in tmp1 |
| change ptype bit in tmp1
| store tmp1
--------------------------------|--------------------------------
store tmp1 |
now this result equals that change ptype side do not change ptype
bit, bug! this bug cause marker_probe_cb() access to invalid memory.
oops!
Lai Jiangshan discovered a reentrancy issue with markers and fixed it by
adding synchronize_sched() calls at each registration/unregistraiton.
It works, but it removes the ability to do batch
registration/unregistration and can cause registration of ~100 markers
to take about 30 seconds on a loaded machine (synchronize_sched() is
much slower on such workloads).
This patch implements a version of the fix which won't slow down marker batch
registration/unregistration. It also go back to the original non-synchronized
reg/unreg.
We need a marker_synchronize_unregister() before the end of exit() to make sure
every probe callers have exited the non preemptible section and thus are not
executing the probe code anymore.
Document the need for a marker_synchronize_unregister() before the end of
exit() to make sure every probe callers have exited the non preemptible
section and thus are not executing the probe code anymore.
Need a marker_synchronize_unregister() before the end of exit() to make sure
every probe callers have exited the non preemptible section and thus are not
executing the probe code anymore.
markers: fix unregister bug and reenter bug, cleanup
Use the new rcu_read_lock_sched/unlock_sched() in marker code around the call
site instead of preempt_disable/enable(). It helps reviewing the code more
easily.
Create marker_synchronize_unregister() which must be called before the end of
exit() to make sure every probe callers have exited the non preemptible section
and thus are not executing the probe code anymore.
Lai Jiangshan [Mon, 29 Sep 2008 08:00:05 +0000 (16:00 +0800)]
markers: fix unregister bug and reenter bug
unregister bug:
codes using makers are typically calling marker_probe_unregister()
and then destroying the data that marker_probe_func needs(or
unloading this module). This is bug when the corresponding
marker_probe_func is still running(on other cpus),
it is using the destroying/ed data.
we should call synchronize_sched() after marker_update_probes().
reenter bug:
marker_probe_register(), marker_probe_unregister() and
marker_probe_unregister_private_data() are not reentrant safe
functions. these 3 functions release markers_mutex and then
require it again and do "entry->oldptr = old; ...", but entry->oldptr
maybe is using now for these 3 functions may reenter when markers_mutex
is released.
we use synchronize_sched() instead of call_rcu_sched() to fix
this bug. actually we can do:
"
if (entry->rcu_pending)
rcu_barrier_sched();
"
after require markers_mutex again. but synchronize_sched()
is better and simpler. For these 3 functions are not critical path.
It is because ftrace_modify_code() calls copy_to_user and
copy_from_user.
These functions have been inserted after guessing that there
couldn't be any race condition but copy_[to/from]_user might
sleep and __ftrace_update_code is called with local_irq_saved.
Harvey Harrison [Tue, 23 Sep 2008 22:05:46 +0000 (15:05 -0700)]
x86: suppress trivial sparse signedness warnings
Could just as easily change the three casts to cast to the correct
type...this patch changes the type of ftrace_nop instead.
Supresses sparse warnings:
arch/x86/kernel/ftrace.c:157:14: warning: incorrect type in assignment (different signedness)
arch/x86/kernel/ftrace.c:157:14: expected long *static [toplevel] ftrace_nop
arch/x86/kernel/ftrace.c:157:14: got unsigned long *<noident>
arch/x86/kernel/ftrace.c:161:14: warning: incorrect type in assignment (different signedness)
arch/x86/kernel/ftrace.c:161:14: expected long *static [toplevel] ftrace_nop
arch/x86/kernel/ftrace.c:161:14: got unsigned long *<noident>
arch/x86/kernel/ftrace.c:165:14: warning: incorrect type in assignment (different signedness)
arch/x86/kernel/ftrace.c:165:14: expected long *static [toplevel] ftrace_nop
arch/x86/kernel/ftrace.c:165:14: got unsigned long *<noident>
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Pekka Paalanen [Tue, 16 Sep 2008 19:06:42 +0000 (22:06 +0300)]
ftrace: inject markers via trace_marker file
Allow a user to inject a marker (TRACE_PRINT entry) into the trace ring
buffer. The related file operations are derived from code by Frédéric
Weisbecker <fweisbec@gmail.com>.
Signed-off-by: Pekka Paalanen <pq@iki.fi> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Pekka Paalanen [Tue, 16 Sep 2008 18:58:24 +0000 (21:58 +0300)]
ftrace: add trace_vprintk()
trace_vprintk() for easier implementation of tracer specific *_printk
functions. Add check check for no_tracer, and implement
__ftrace_printk() as a wrapper.
Signed-off-by: Pekka Paalanen <pq@iki.fi> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Pekka Paalanen [Tue, 16 Sep 2008 18:56:41 +0000 (21:56 +0300)]
ftrace: move mmiotrace functions out of trace.c
Moves the mmiotrace specific functions from trace.c to
trace_mmiotrace.c. Functions trace_wake_up(), tracing_get_trace_entry(),
and tracing_generic_entry_update() are therefore made available outside
trace.c.
Signed-off-by: Pekka Paalanen <pq@iki.fi> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
make most of the tracers depend on DEBUG_KERNEL - that's their intended
purpose. (most distributions have DEBUG_KERNEL enabled anyway so this is
not a practical limitation - but it simplifies the tracing menu in the
normal case)
Steven Rostedt [Wed, 3 Sep 2008 21:42:51 +0000 (17:42 -0400)]
ftrace: print continue index fix
An item in the trace buffer that is bigger than one entry may be split
up using the TRACE_CONT entry. This makes it a virtual single entry.
The current code increments the iterator index even while traversing
TRACE_CONT entries, making it look like the iterator is further than
it actually is.
This patch adds code to not increment the iterator index while skipping
over TRACE_CONT entries.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Wed, 3 Sep 2008 21:42:50 +0000 (17:42 -0400)]
ftrace: binary and not logical for continue test
Peter Zijlstra provided me with a nice brown paper bag while letting me know
that I was doing a logical AND and not a binary one, making a condition
true more often than it should be.
Luckily, a false true is handled by the calling function and no harm is
done. But this needs to be fixed regardless.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Fri, 29 Aug 2008 20:51:43 +0000 (16:51 -0400)]
ftrace: stack trace add indexes
This patch adds indexes into the stack that the functions in the
stack dump were found at. As an added bonus, I also added a diff
to show which function is the most notorious consumer of the stack.
Steven Rostedt [Wed, 27 Aug 2008 17:02:01 +0000 (13:02 -0400)]
ftrace: remove warning of old objcopy and local functions
The warning messages about old objcopy and local functions spam the
user quite drastically. Remove the warning until we can find a nicer
way of tell the user to upgrade their objcopy.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Thu, 28 Aug 2008 03:24:15 +0000 (23:24 -0400)]
ftrace: remove direct reference to mcount in trace code
The mcount record method of ftrace scans objdump for references to mcount.
Using mcount as the reference to test if the calls to mcount being replaced
are indeed calls to mcount, this use of mcount was also caught as a
location to change. Using a variable that points to the mcount address
moves this reference into the data section that is not scanned, and
we do not use a false location to try and modify.
The warn on code was what was used to detect this bug.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Thu, 28 Aug 2008 03:31:01 +0000 (23:31 -0400)]
ftrace: add stack tracer
This is another tracer using the ftrace infrastructure, that examines
at each function call the size of the stack. If the stack use is greater
than the previous max it is recorded.
You can always see (and set) the max stack size seen. By setting it
to zero will start the recording again. The backtrace is also available.
Steven Rostedt [Mon, 25 Aug 2008 18:52:11 +0000 (14:52 -0400)]
ftrace: objcopy version test for local symbols
The --globalize-symbols option came out in objcopy version 2.17.
If the kernel is being compiled on a system with a lower version of
objcopy, then we can not use the globalize / localize trick to
link to symbols pointing to local functions.
This patch tests the version of objcopy and will only use the trick
if the version is greater than or equal to 2.17. Otherwise, if an
object has only local functions within a section, it will give a
nice warning and recommend the user to upgrade their objcopy.
Leaving the symbols unrecorded is not that big of a deal, since the
mcount record method changes the actual mcount code to be a simple
"ret" without recording registers or anything.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Stephen Rothwell [Mon, 25 Aug 2008 03:08:44 +0000 (13:08 +1000)]
ftrace: fix build failure
After disabling FTRACE_MCOUNT_RECORD via a patch, a dormant build
failure surfaced:
kernel/trace/ftrace.c: In function 'ftrace_record_ip':
kernel/trace/ftrace.c:416: error: incompatible type for argument 1 of '_spin_lock_irqsave'
kernel/trace/ftrace.c:433: error: incompatible type for argument 1 of '_spin_lock_irqsave'
Steven Rostedt [Wed, 20 Aug 2008 16:55:07 +0000 (12:55 -0400)]
ftrace: x86 use copy to and from user functions
The modification of code is performed either by kstop_machine, before
SMP starts, or on module code before the module is executed. There is
no reason to do the modifications from assembly. The copy to and from
user functions are sufficient and produces cleaner and easier to read
code.
Thanks to Benjamin Herrenschmidt for suggesting the idea.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Wed, 20 Aug 2008 14:07:35 +0000 (10:07 -0400)]
ftrace: handle weak symbol functions
During tests and checks, I've discovered that there were failures to
convert mcount callers into nops. Looking deeper into these failures,
code that was attempted to be changed was not an mcount caller.
The current code only updates if the code being changed is what it expects,
but I still investigate any time there is a failure.
What was happening is that a weak symbol was being used as a reference
for other mcount callers. That weak symbol was also referenced elsewhere
so the offsets were using the strong symbol and not the function symbol
that it was referenced from.
This patch changes the setting up of the mcount_loc section to search
for a global function that is not weak. It will pick a local over a weak
but if only a weak is found in a section, a warning is printed and the
mcount location is not recorded (just to be safe).
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Fri, 15 Aug 2008 15:40:24 +0000 (11:40 -0400)]
ftrace: update recordmount.pl arch changes
I'm trying to keep all the arch changes in recordmcount.pl in one place.
I moved your code into that area, by adding the flags to the commands
that were passed in.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
make[3]: *** [arch/x86/kernel/time_32.o] Error 1
/bin/sh: scripts/recordmcount.pl: No such file or directory
make[3]: *** [arch/x86/kernel/irq_32.o] Error 1
/bin/sh: scripts/recordmcount.pl: No such file or directory
make[3]: *** [arch/x86/kernel/ldt.o] Error 1
/bin/sh: scripts/recordmcount.pl: No such file or directory
make[3]: *** [arch/x86/kernel/i8259.o] Error 1
/bin/sh: scripts/recordmcount.pl: No such file or directory
Steven Rostedt [Sat, 16 Aug 2008 01:40:05 +0000 (21:40 -0400)]
ftrace: add necessary locking for ftrace records
The new design of pre-recorded mcounts and updating the code outside of
kstop_machine has changed the way the records themselves are protected.
This patch uses the ftrace_lock to protect the records. Note, the lock
still does not need to be taken within calls that are only called via
kstop_machine, since the that code can not run while the spin lock is held.
Also removed the hash_lock needed for the daemon when MCOUNT_RECORD is
configured. Also did a slight cleanup of an unused variable.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar [Mon, 9 Jun 2008 18:54:22 +0000 (20:54 +0200)]
ftrace: scripts/recordmcount.pl cross-build hack
hack around:
ld: Relocatable linking with relocations from format elf32-i386 (init/.tmp_gl_calibrate.o) to format elf64-x86-64 (init/.tmp_mx_calibrate.o) i CC arch/x86/mm/extable.o
objcopy: 'init/.tmp_mx_calibrate.o': No such file
rm: cannot remove `init/.tmp_mx_calibrate.o': No such file or directory
ld: Relocatable linking with relocations from format elf32-i386 (arch/x86/mm/extable.o) to format elf64-x86-64 (arch/x86/mm/.tmp_mx_extable.o) is not supported
mv: cannot stat `arch/x86/mm/.tmp_mx_extable.o': No such file or directory
ld: Relocatable linking with relocations from format elf32-i386 (arch/x86/mm/fault.o) to format elf64-x86-64 (arch/x86/mm/.tmp_mx_fault.o) is not supported
Ingo Molnar [Fri, 15 Aug 2008 15:48:02 +0000 (17:48 +0200)]
ftrace: build fix
fix:
In file included from init/main.c:65:
include/linux/ftrace.h:166: error: expected ‘,' or ‘;' before ‘{' token
make[1]: *** [init/main.o] Error 1
make: *** [init/main.o] Error 2
Steven Rostedt [Thu, 31 Jul 2008 02:36:46 +0000 (22:36 -0400)]
ftrace: dump out ftrace buffers to console on panic
At OLS I had a lot of interest to be able to have the ftrace buffers
dumped on panic. Usually one would expect to uses kexec and examine
the buffers after a new kernel is loaded. But sometimes the resources
do not permit kdump and kexec, so having an option to still see the
sequence of events up to the crash is very advantageous.
This patch adds the option to have the ftrace buffers dumped to the
console in the latency_trace format on a panic. When the option is set,
the default entries per CPU buffer are lowered to 16384, since the writing
to the serial (if that is the console) may take an awful long time
otherwise.
[
Changes since -v1:
Got alpine to send correctly (as well as spell check working).
Removed config option.
Moved the static variables into ftrace_dump itself.
Gave printk a log level.
]
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Fri, 1 Aug 2008 20:45:49 +0000 (16:45 -0400)]
ftrace: ftrace_printk doc moved
Based on Randy Dunlap's suggestion, the ftrace_printk kernel-doc belongs
with the ftrace_printk macro that should be used. Not with the
__ftrace_printk internal function.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Fri, 1 Aug 2008 16:26:41 +0000 (12:26 -0400)]
ftrace: printk formatting infrastructure
This patch adds a feature that can help kernel developers debug their
code using ftrace.
int ftrace_printk(const char *fmt, ...);
This records into the ftrace buffer using printf formatting. The entry
size in the buffers are still a fixed length. A new type has been added
that allows for more entries to be used for a single recording.
The start of the print is still the same as the other entries.
It returns the number of characters written to the ftrace buffer.
For example:
Having a module with the following code:
static int __init ftrace_print_test(void)
{
ftrace_printk("jiffies are %ld\n", jiffies);
return 0;
}
Gives me:
insmod-5441 3...1 7569us : ftrace_print_test: jiffies are 4296626666
for the latency_trace file and:
insmod-5441 [03] 1959.370498: ftrace_print_test jiffies are 4296626666
for the trace file.
Note: Only the infrastructure should go into the kernel. It is to help
facilitate debugging for other kernel developers. Calls to ftrace_printk
is not intended to be left in the kernel, and should be frowned upon just
like scattering printks around in the code.
But having this easily at your fingertips helps the debugging go faster
and bugs be solved quicker.
Maybe later on, we can hook this with markers and have their printf format
be sucked into ftrace output.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Fri, 1 Aug 2008 16:26:40 +0000 (12:26 -0400)]
ftrace: new continue entry - separate out from trace_entry
Some tracers will need to work with more than one entry. In order to do this
the trace_entry structure was split into two fields. One for the start of
all entries, and one to continue an existing entry.
The trace_entry structure now has a "field" entry that consists of the previous
content of the trace_entry, and a "cont" entry that is just a string buffer
the size of the "field" entry.
Thanks to Andrew Morton for suggesting this idea.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Fri, 15 Aug 2008 02:47:19 +0000 (22:47 -0400)]
ftrace: remove old pointers to mcount
When a mcount pointer is recorded into a table, it is used to add or
remove calls to mcount (replacing them with nops). If the code is removed
via removing a module, the pointers still exist. At modifying the code
a check is always made to make sure the code being replaced is the code
expected. In-other-words, the code being replaced is compared to what
it is expected to be before being replaced.
There is a very small chance that the code being replaced just happens
to look like code that calls mcount (very small since the call to mcount
is relative). To remove this chance, this patch adds ftrace_release to
allow module unloading to remove the pointers to mcount within the module.
Another change for init calls is made to not trace calls marked with
__init. The tracing can not be started until after init is done anyway.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Thu, 14 Aug 2008 22:05:05 +0000 (18:05 -0400)]
ftrace: use only 5 byte nops for x86
Mathieu Desnoyers revealed a bug in the original code. The nop that is
used to relpace the mcount caller can be a two part nop. This runs the
risk where a process can be preempted after executing the first nop, but
before the second part of the nop.
The ftrace code calls kstop_machine to keep multiple CPUs from executing
code that is being modified, but it does not protect against a task preempting
in the middle of a two part nop.
If the above preemption happens and the tracer is enabled, after the
kstop_machine runs, all those nops will be calls to the trace function.
If the preempted process that was preempted between the two nops is executed
again, it will execute half of the call to the trace function, and this
might crash the system.
This patch instead uses what both the latest Intel and AMD spec suggests.
That is the P6_NOP5 sequence of "0x0f 0x1f 0x44 0x00 0x00".
Note, some older CPUs and QEMU might fault on this nop, so this nop
is executed with fault handling first. If it detects a fault, it will then
use the code "0x66 0x66 0x66 0x66 0x90". If that faults, it will then
default to a simple "jmp 1f; .byte 0x00 0x00 0x00; 1:". The jmp is
not optimal but will do if the first two can not be executed.
TODO: Examine the cpuid to determine the nop to use.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Thu, 14 Aug 2008 19:45:12 +0000 (15:45 -0400)]
ftrace: x86 mcount stub
x86 now sets up the mcount locations through the build and no longer
needs to record the ip when the function is executed. This patch changes
the initial mcount to simply return. There's no need to do any other work.
If the ftrace start up test fails, the original mcount will be what everything
will use, so having this as fast as possible is a good thing.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Thu, 14 Aug 2008 19:45:08 +0000 (15:45 -0400)]
ftrace: mcount call site on boot nops core
This is the infrastructure to the converting the mcount call sites
recorded by the __mcount_loc section into nops on boot. It also allows
for using these sites to enable tracing as normal. When the __mcount_loc
section is used, the "ftraced" kernel thread is disabled.
This uses the current infrastructure to record the mcount call sites
as well as convert them to nops. The mcount function is kept as a stub
on boot up and not converted to the ftrace_record_ip function. We use the
ftrace_record_ip to only record from the table.
This patch does not handle modules. That comes with a later patch.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Thu, 14 Aug 2008 19:45:07 +0000 (15:45 -0400)]
ftrace: create __mcount_loc section
This patch creates a section in the kernel called "__mcount_loc".
This will hold a list of pointers to the mcount relocation for
each call site of mcount.
For example:
objdump -dr init/main.o
[...]
Disassembly of section .text:
The offset to of the mcount call site in init_post is an offset from
the start of the section, and not the start of the function init_post.
The mcount relocation is at the call site 0x185 from the start of the
.text section.
.text + 0x185 == init_post + 0xa
We need a way to add this __mcount_loc section in a way that we do not
lose the relocations after final link. The .text section here will
be attached to all other .text sections after final link and the
offsets will be meaningless. We need to keep track of where these
.text sections are.
To do this, we use the start of the first function in the section.
do_one_initcall. We can make a tmp.s file with this function as a reference
to the start of the .text section.
But we have a problem. What happens if the first function in a section
is not exported, and is a static function. The linker will not let
the tmp.o use it. This case exists in main.o as well.
The lowercase 't' means that set_reset_devices is local and is not exported.
If we simply try to link the tmp.o with the set_reset_devices we end
up with two symbols: one local and one global.
We still have an undefined reference to set_reset_devices, and if we try
to compile the kernel, we will end up with an undefined reference to
set_reset_devices, or even worst, it could be exported someplace else,
and then we will have a reference to the wrong location.
To handle this case, we make an intermediate step using objcopy.
We convert set_reset_devices into a global exported symbol before linking
it with tmp.o and set it back afterwards.
Now we have a section in main.o called __mcount_loc that we can place
somewhere in the kernel using vmlinux.ld.S and access it to convert
all these locations that call mcount into nops before starting SMP
and thus, eliminating the need to do this with kstop_machine.
Note, A well documented perl script (scripts/recordmcount.pl) is used
to do all this in one location.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
kprobes already has an extensive list of annotations for functions
that should not be instrumented. Add notrace annotations to these
functions as well.
This is particularly useful for functions called by the NMI path.
Instrument the scheduler activity (sched_switch, migration, wakeups,
wait for a task, signal delivery) and process/thread
creation/destruction (fork, exit, kthread stop). Actually, kthread
creation is not instrumented in this patch because it is architecture
dependent. It allows to connect tracers such as ftrace which detects
scheduling latencies, good/bad scheduler decisions. Tools like LTTng can
export this scheduler information along with instrumentation of the rest
of the kernel activity to perform post-mortem analysis on the scheduler
activity.
About the performance impact of tracepoints (which is comparable to
markers), even without immediate values optimizations, tests done by
Hideo Aoki on ia64 show no regression. His test case was using hackbench
on a kernel where scheduler instrumentation (about 5 events in code
scheduler code) was added. See the "Tracepoints" patch header for
performance result detail.
Changelog :
- Change instrumentation location and parameter to match ftrace
instrumentation, previously done with kernel markers.
Implementation of kernel tracepoints. Inspired from the Linux Kernel
Markers. Allows complete typing verification by declaring both tracing
statement inline functions and probe registration/unregistration static
inline functions within the same macro "DEFINE_TRACE". No format string
is required. See the tracepoint Documentation and Samples patches for
usage examples.
Taken from the documentation patch :
"A tracepoint placed in code provides a hook to call a function (probe)
that you can provide at runtime. A tracepoint can be "on" (a probe is
connected to it) or "off" (no probe is attached). When a tracepoint is
"off" it has no effect, except for adding a tiny time penalty (checking
a condition for a branch) and space penalty (adding a few bytes for the
function call at the end of the instrumented function and adds a data
structure in a separate section). When a tracepoint is "on", the
function you provide is called each time the tracepoint is executed, in
the execution context of the caller. When the function provided ends its
execution, it returns to the caller (continuing from the tracepoint
site).
You can put tracepoints at important locations in the code. They are
lightweight hooks that can pass an arbitrary number of parameters, which
prototypes are described in a tracepoint declaration placed in a header
file."
Addition and removal of tracepoints is synchronized by RCU using the
scheduler (and preempt_disable) as guarantees to find a quiescent state
(this is really RCU "classic"). The update side uses rcu_barrier_sched()
with call_rcu_sched() and the read/execute side uses
"preempt_disable()/preempt_enable()".
We make sure the previous array containing probes, which has been
scheduled for deletion by the rcu callback, is indeed freed before we
proceed to the next update. It therefore limits the rate of modification
of a single tracepoint to one update per RCU period. The objective here
is to permit fast batch add/removal of probes on _different_
tracepoints.
Changelog :
- Use #name ":" #proto as string to identify the tracepoint in the
tracepoint table. This will make sure not type mismatch happens due to
connexion of a probe with the wrong type to a tracepoint declared with
the same name in a different header.
- Add tracepoint_entry_free_old.
- Change __TO_TRACE to get rid of the 'i' iterator.
Masami Hiramatsu <mhiramat@redhat.com> :
Tested on x86-64.
Performance impact of a tracepoint : same as markers, except that it
adds about 70 bytes of instructions in an unlikely branch of each
instrumented function (the for loop, the stack setup and the function
call). It currently adds a memory read, a test and a conditional branch
at the instrumentation site (in the hot path). Immediate values will
eventually change this into a load immediate, test and branch, which
removes the memory read which will make the i-cache impact smaller
(changing the memory read for a load immediate removes 3-4 bytes per
site on x86_32 (depending on mov prefixes), or 7-8 bytes on x86_64, it
also saves the d-cache hit).
About the performance impact of tracepoints (which is comparable to
markers), even without immediate values optimizations, tests done by
Hideo Aoki on ia64 show no regression. His test case was using hackbench
on a kernel where scheduler instrumentation (about 5 events in code
scheduler code) was added.
Quoting Hideo Aoki about Markers :
I evaluated overhead of kernel marker using linux-2.6-sched-fixes git
tree, which includes several markers for LTTng, using an ia64 server.
While the immediate trace mark feature isn't implemented on ia64, there
is no major performance regression. So, I think that we don't have any
issues to propose merging marker point patches into Linus's tree from
the viewpoint of performance impact.
I prepared two kernels to evaluate. The first one was compiled without
CONFIG_MARKERS. The second one was enabled CONFIG_MARKERS.
I downloaded the original hackbench from the following URL:
http://devresources.linux-foundation.org/craiger/hackbench/src/hackbench.c
I ran hackbench 5 times in each condition and calculated the average and
difference between the kernels.
The parameter of hackbench: every 50 from 50 to 800
The number of CPUs of the server: 2, 4, and 8
Below is the results. As you can see, major performance regression
wasn't found in any case. Even if number of processes increases,
differences between marker-enabled kernel and marker- disabled kernel
doesn't increase. Moreover, if number of CPUs increases, the differences
doesn't increase either.
Curiously, marker-enabled kernel is better than marker-disabled kernel
in more than half cases, although I guess it comes from the difference
of memory access pattern.
Linus Torvalds [Mon, 13 Oct 2008 16:54:35 +0000 (09:54 -0700)]
Merge phase #5 (misc) of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
Merges oprofile, timers/hpet, x86/traps, x86/time, and x86/core misc items.
* 'x86-core-v4-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (132 commits)
x86: change early_ioremap to use slots instead of nesting
x86: adjust dependencies for CONFIG_X86_CMOV
dumpstack: x86: various small unification steps, fix
x86: remove additional_cpus
x86: remove additional_cpus configurability
x86: improve UP kernel when CPU-hotplug and SMP is enabled
dumpstack: x86: various small unification steps
dumpstack: i386: make kstack= an early boot-param and add oops=panic
dumpstack: x86: use log_lvl and unify trace formatting
dumptrace: x86: consistently include loglevel, print stack switch
dumpstack: x86: add "end" parameter to valid_stack_ptr and print_context_stack
dumpstack: x86: make printk_address equal
dumpstack: x86: move die_nmi to dumpstack_32.c
traps: x86: finalize unification of traps.c
traps: x86: make traps_32.c and traps_64.c equal
traps: x86: various noop-changes preparing for unification of traps_xx.c
traps: x86_64: use task_pid_nr(tsk) instead of tsk->pid in do_general_protection
traps: i386: expand clear_mem_error and remove from mach_traps.h
traps: x86_64: make io_check_error equal to the one on i386
traps: i386: use preempt_conditional_sti/cli in do_int3
...
David S. Miller [Mon, 13 Oct 2008 09:45:26 +0000 (10:45 +0100)]
serial: fix device name reporting when minor space is shared between drivers
The multiple drivers share the minor space occupied by a particular major
number, the actual index within the device name's space is indicated by
the tty_driver->name_base + uart_port->line
Another usable formula is (uart_driver->minor - MINOR_BASE) + port->line
Use those to print the device names properly in such situations in
serial_core.c and 8250.c
Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>