Documentation/trace/ftrace.txt

   1                 ftrace - Function Tracer
   2                 ========================
   3
   4 Copyright 2008 Red Hat Inc.
   5    Author:   Steven Rostedt <srostedt@redhat.com>
   6   License:   The GNU Free Documentation License, Version 1.2
   7                (dual licensed under the GPL v2)
   8 Reviewers:   Elias Oltmanns, Randy Dunlap, Andrew Morton,
   9              John Kacur, and David Teigland.
  10 Written for: 2.6.28-rc2
  11 Updated for: 3.10
  12
  13 Introduction
  14 ------------
  15
  16 Ftrace is an internal tracer designed to help out developers and
  17 designers of systems to find what is going on inside the kernel.
  18 It can be used for debugging or analyzing latencies and
  19 performance issues that take place outside of user-space.
  20
  21 Although ftrace is typically considered the function tracer, it
  22 is really a frame work of several assorted tracing utilities.
  23 There's latency tracing to examine what occurs between interrupts
  24 disabled and enabled, as well as for preemption and from a time
  25 a task is woken to the task is actually scheduled in.
  26
  27 One of the most common uses of ftrace is the event tracing.
  28 Through out the kernel is hundreds of static event points that
  29 can be enabled via the debugfs file system to see what is
  30 going on in certain parts of the kernel.
  31
  32
  33 Implementation Details
  34 ----------------------
  35
  36 See ftrace-design.txt for details for arch porters and such.
  37
  38
  39 The File System
  40 ---------------
  41
  42 Ftrace uses the debugfs file system to hold the control files as
  43 well as the files to display output.
  44
  45 When debugfs is configured into the kernel (which selecting any ftrace
  46 option will do) the directory /sys/kernel/debug will be created. To mount
  47 this directory, you can add to your /etc/fstab file:
  48
  49  debugfs       /sys/kernel/debug          debugfs defaults        0       0
  50
  51 Or you can mount it at run time with:
  52
  53  mount -t debugfs nodev /sys/kernel/debug
  54
  55 For quicker access to that directory you may want to make a soft link to
  56 it:
  57
  58  ln -s /sys/kernel/debug /debug
  59
  60 Any selected ftrace option will also create a directory called tracing
  61 within the debugfs. The rest of the document will assume that you are in
  62 the ftrace directory (cd /sys/kernel/debug/tracing) and will only concentrate
  63 on the files within that directory and not distract from the content with
  64 the extended "/sys/kernel/debug/tracing" path name.
  65
  66 That's it! (assuming that you have ftrace configured into your kernel)
  67
  68 After mounting debugfs, you can see a directory called
  69 "tracing".  This directory contains the control and output files
  70 of ftrace. Here is a list of some of the key files:
  71
  72
  73  Note: all time values are in microseconds.
  74
  75   current_tracer:
  76
  77         This is used to set or display the current tracer
  78         that is configured.
  79
  80   available_tracers:
  81
  82         This holds the different types of tracers that
  83         have been compiled into the kernel. The
  84         tracers listed here can be configured by
  85         echoing their name into current_tracer.
  86
  87   tracing_on:
  88
  89         This sets or displays whether writing to the trace
  90         ring buffer is enabled. Echo 0 into this file to disable
  91         the tracer or 1 to enable it. Note, this only disables
  92         writing to the ring buffer, the tracing overhead may
  93         still be occurring.
  94
  95   trace:
  96
  97         This file holds the output of the trace in a human
  98         readable format (described below).
  99
 100   trace_pipe:
 101
 102         The output is the same as the "trace" file but this
 103         file is meant to be streamed with live tracing.
 104         Reads from this file will block until new data is
 105         retrieved.  Unlike the "trace" file, this file is a
 106         consumer. This means reading from this file causes
 107         sequential reads to display more current data. Once
 108         data is read from this file, it is consumed, and
 109         will not be read again with a sequential read. The
 110         "trace" file is static, and if the tracer is not
 111         adding more data, it will display the same
 112         information every time it is read.
 113
 114   trace_options:
 115
 116         This file lets the user control the amount of data
 117         that is displayed in one of the above output
 118         files. Options also exist to modify how a tracer
 119         or events work (stack traces, timestamps, etc).
 120
 121   options:
 122
 123         This is a directory that has a file for every available
 124         trace option (also in trace_options). Options may also be set
 125         or cleared by writing a "1" or "0" respectively into the
 126         corresponding file with the option name.
 127
 128   tracing_max_latency:
 129
 130         Some of the tracers record the max latency.
 131         For example, the time interrupts are disabled.
 132         This time is saved in this file. The max trace
 133         will also be stored, and displayed by "trace".
 134         A new max trace will only be recorded if the
 135         latency is greater than the value in this
 136         file. (in microseconds)
 137
 138   tracing_thresh:
 139
 140         Some latency tracers will record a trace whenever the
 141         latency is greater than the number in this file.
 142         Only active when the file contains a number greater than 0.
 143         (in microseconds)
 144
 145   buffer_size_kb:
 146
 147         This sets or displays the number of kilobytes each CPU
 148         buffer holds. By default, the trace buffers are the same size
 149         for each CPU. The displayed number is the size of the
 150         CPU buffer and not total size of all buffers. The
 151         trace buffers are allocated in pages (blocks of memory
 152         that the kernel uses for allocation, usually 4 KB in size).
 153         If the last page allocated has room for more bytes
 154         than requested, the rest of the page will be used,
 155         making the actual allocation bigger than requested.
 156         ( Note, the size may not be a multiple of the page size
 157           due to buffer management meta-data. )
 158
 159   buffer_total_size_kb:
 160
 161         This displays the total combined size of all the trace buffers.
 162
 163   free_buffer:
 164
 165         If a process is performing the tracing, and the ring buffer
 166         should be shrunk "freed" when the process is finished, even
 167         if it were to be killed by a signal, this file can be used
 168         for that purpose. On close of this file, the ring buffer will
 169         be resized to its minimum size. Having a process that is tracing
 170         also open this file, when the process exits its file descriptor
 171         for this file will be closed, and in doing so, the ring buffer
 172         will be "freed".
 173
 174         It may also stop tracing if disable_on_free option is set.
 175
 176   tracing_cpumask:
 177
 178         This is a mask that lets the user only trace
 179         on specified CPUs. The format is a hex string
 180         representing the CPUs.
 181
 182   set_ftrace_filter:
 183
 184         When dynamic ftrace is configured in (see the
 185         section below "dynamic ftrace"), the code is dynamically
 186         modified (code text rewrite) to disable calling of the
 187         function profiler (mcount). This lets tracing be configured
 188         in with practically no overhead in performance.  This also
 189         has a side effect of enabling or disabling specific functions
 190         to be traced. Echoing names of functions into this file
 191         will limit the trace to only those functions.
 192
 193         This interface also allows for commands to be used. See the
 194         "Filter commands" section for more details.
 195
 196   set_ftrace_notrace:
 197
 198         This has an effect opposite to that of
 199         set_ftrace_filter. Any function that is added here will not
 200         be traced. If a function exists in both set_ftrace_filter
 201         and set_ftrace_notrace, the function will _not_ be traced.
 202
 203   set_ftrace_pid:
 204
 205         Have the function tracer only trace a single thread.
 206
 207   set_event_pid:
 208
 209         Have the events only trace a task with a PID listed in this file.
 210         Note, sched_switch and sched_wake_up will also trace events
 211         listed in this file.
 212
 213         To have the PIDs of children of tasks with their PID in this file
 214         added on fork, enable the "event-fork" option. That option will also
 215         cause the PIDs of tasks to be removed from this file when the task
 216         exits.
 217
 218   set_graph_function:
 219
 220         Set a "trigger" function where tracing should start
 221         with the function graph tracer (See the section
 222         "dynamic ftrace" for more details).
 223
 224   available_filter_functions:
 225
 226         This lists the functions that ftrace
 227         has processed and can trace. These are the function
 228         names that you can pass to "set_ftrace_filter" or
 229         "set_ftrace_notrace". (See the section "dynamic ftrace"
 230         below for more details.)
 231
 232   enabled_functions:
 233
 234         This file is more for debugging ftrace, but can also be useful
 235         in seeing if any function has a callback attached to it.
 236         Not only does the trace infrastructure use ftrace function
 237         trace utility, but other subsystems might too. This file
 238         displays all functions that have a callback attached to them
 239         as well as the number of callbacks that have been attached.
 240         Note, a callback may also call multiple functions which will
 241         not be listed in this count.
 242
 243         If the callback registered to be traced by a function with
 244         the "save regs" attribute (thus even more overhead), a 'R'
 245         will be displayed on the same line as the function that
 246         is returning registers.
 247
 248         If the callback registered to be traced by a function with
 249         the "ip modify" attribute (thus the regs->ip can be changed),
 250         an 'I' will be displayed on the same line as the function that
 251         can be overridden.
 252
 253   function_profile_enabled:
 254
 255         When set it will enable all functions with either the function
 256         tracer, or if enabled, the function graph tracer. It will
 257         keep a histogram of the number of functions that were called
 258         and if run with the function graph tracer, it will also keep
 259         track of the time spent in those functions. The histogram
 260         content can be displayed in the files:
 261
 262         trace_stats/function<cpu> ( function0, function1, etc).
 263
 264   trace_stats:
 265
 266         A directory that holds different tracing stats.
 267
 268   kprobe_events:
 269
 270         Enable dynamic trace points. See kprobetrace.txt.
 271
 272   kprobe_profile:
 273
 274         Dynamic trace points stats. See kprobetrace.txt.
 275
 276   max_graph_depth:
 277
 278         Used with the function graph tracer. This is the max depth
 279         it will trace into a function. Setting this to a value of
 280         one will show only the first kernel function that is called
 281         from user space.
 282
 283   printk_formats:
 284
 285         This is for tools that read the raw format files. If an event in
 286         the ring buffer references a string (currently only trace_printk()
 287         does this), only a pointer to the string is recorded into the buffer
 288         and not the string itself. This prevents tools from knowing what
 289         that string was. This file displays the string and address for
 290         the string allowing tools to map the pointers to what the
 291         strings were.
 292
 293   saved_cmdlines:
 294
 295         Only the pid of the task is recorded in a trace event unless
 296         the event specifically saves the task comm as well. Ftrace
 297         makes a cache of pid mappings to comms to try to display
 298         comms for events. If a pid for a comm is not listed, then
 299         "<...>" is displayed in the output.
 300
 301   snapshot:
 302
 303         This displays the "snapshot" buffer and also lets the user
 304         take a snapshot of the current running trace.
 305         See the "Snapshot" section below for more details.
 306
 307   stack_max_size:
 308
 309         When the stack tracer is activated, this will display the
 310         maximum stack size it has encountered.
 311         See the "Stack Trace" section below.
 312
 313   stack_trace:
 314
 315         This displays the stack back trace of the largest stack
 316         that was encountered when the stack tracer is activated.
 317         See the "Stack Trace" section below.
 318
 319   stack_trace_filter:
 320
 321         This is similar to "set_ftrace_filter" but it limits what
 322         functions the stack tracer will check.
 323
 324   trace_clock:
 325
 326         Whenever an event is recorded into the ring buffer, a
 327         "timestamp" is added. This stamp comes from a specified
 328         clock. By default, ftrace uses the "local" clock. This
 329         clock is very fast and strictly per cpu, but on some
 330         systems it may not be monotonic with respect to other
 331         CPUs. In other words, the local clocks may not be in sync
 332         with local clocks on other CPUs.
 333
 334         Usual clocks for tracing:
 335
 336           # cat trace_clock
 337           [local] global counter x86-tsc
 338
 339           local: Default clock, but may not be in sync across CPUs
 340
 341           global: This clock is in sync with all CPUs but may
 342                   be a bit slower than the local clock.
 343
 344           counter: This is not a clock at all, but literally an atomic
 345                    counter. It counts up one by one, but is in sync
 346                    with all CPUs. This is useful when you need to
 347                    know exactly the order events occurred with respect to
 348                    each other on different CPUs.
 349
 350           uptime: This uses the jiffies counter and the time stamp
 351                   is relative to the time since boot up.
 352
 353           perf: This makes ftrace use the same clock that perf uses.
 354                 Eventually perf will be able to read ftrace buffers
 355                 and this will help out in interleaving the data.
 356
 357           x86-tsc: Architectures may define their own clocks. For
 358                    example, x86 uses its own TSC cycle clock here.
 359
 360           ppc-tb: This uses the powerpc timebase register value.
 361                   This is in sync across CPUs and can also be used
 362                   to correlate events across hypervisor/guest if
 363                   tb_offset is known.
 364
 365           mono: This uses the fast monotonic clock (CLOCK_MONOTONIC)
 366                 which is monotonic and is subject to NTP rate adjustments.
 367
 368           mono_raw:
 369                 This is the raw monotonic clock (CLOCK_MONOTONIC_RAW)
 370                 which is montonic but is not subject to any rate adjustments
 371                 and ticks at the same rate as the hardware clocksource.
 372
 373           boot: This is the boot clock (CLOCK_BOOTTIME) and is based on the
 374                 fast monotonic clock, but also accounts for time spent in
 375                 suspend. Since the clock access is designed for use in
 376                 tracing in the suspend path, some side effects are possible
 377                 if clock is accessed after the suspend time is accounted before
 378                 the fast mono clock is updated. In this case, the clock update
 379                 appears to happen slightly sooner than it normally would have.
 380                 Also on 32-bit systems, it's possible that the 64-bit boot offset
 381                 sees a partial update. These effects are rare and post
 382                 processing should be able to handle them. See comments in the
 383                 ktime_get_boot_fast_ns() function for more information.
 384
 385         To set a clock, simply echo the clock name into this file.
 386
 387           echo global > trace_clock
 388
 389   trace_marker:
 390
 391         This is a very useful file for synchronizing user space
 392         with events happening in the kernel. Writing strings into
 393         this file will be written into the ftrace buffer.
 394
 395         It is useful in applications to open this file at the start
 396         of the application and just reference the file descriptor
 397         for the file.
 398
 399         void trace_write(const char *fmt, ...)
 400         {
 401                 va_list ap;
 402                 char buf[256];
 403                 int n;
 404
 405                 if (trace_fd < 0)
 406                         return;
 407
 408                 va_start(ap, fmt);
 409                 n = vsnprintf(buf, 256, fmt, ap);
 410                 va_end(ap);
 411
 412                 write(trace_fd, buf, n);
 413         }
 414
 415         start:
 416
 417                 trace_fd = open("trace_marker", WR_ONLY);
 418
 419   uprobe_events:
 420
 421         Add dynamic tracepoints in programs.
 422         See uprobetracer.txt
 423
 424   uprobe_profile:
 425
 426         Uprobe statistics. See uprobetrace.txt
 427
 428   instances:
 429
 430         This is a way to make multiple trace buffers where different
 431         events can be recorded in different buffers.
 432         See "Instances" section below.
 433
 434   events:
 435
 436         This is the trace event directory. It holds event tracepoints
 437         (also known as static tracepoints) that have been compiled
 438         into the kernel. It shows what event tracepoints exist
 439         and how they are grouped by system. There are "enable"
 440         files at various levels that can enable the tracepoints
 441         when a "1" is written to them.
 442
 443         See events.txt for more information.
 444
 445   per_cpu:
 446
 447         This is a directory that contains the trace per_cpu information.
 448
 449   per_cpu/cpu0/buffer_size_kb:
 450
 451         The ftrace buffer is defined per_cpu. That is, there's a separate
 452         buffer for each CPU to allow writes to be done atomically,
 453         and free from cache bouncing. These buffers may have different
 454         size buffers. This file is similar to the buffer_size_kb
 455         file, but it only displays or sets the buffer size for the
 456         specific CPU. (here cpu0).
 457
 458   per_cpu/cpu0/trace:
 459
 460         This is similar to the "trace" file, but it will only display
 461         the data specific for the CPU. If written to, it only clears
 462         the specific CPU buffer.
 463
 464   per_cpu/cpu0/trace_pipe
 465
 466         This is similar to the "trace_pipe" file, and is a consuming
 467         read, but it will only display (and consume) the data specific
 468         for the CPU.
 469
 470   per_cpu/cpu0/trace_pipe_raw
 471
 472         For tools that can parse the ftrace ring buffer binary format,
 473         the trace_pipe_raw file can be used to extract the data
 474         from the ring buffer directly. With the use of the splice()
 475         system call, the buffer data can be quickly transferred to
 476         a file or to the network where a server is collecting the
 477         data.
 478
 479         Like trace_pipe, this is a consuming reader, where multiple
 480         reads will always produce different data.
 481
 482   per_cpu/cpu0/snapshot:
 483
 484         This is similar to the main "snapshot" file, but will only
 485         snapshot the current CPU (if supported). It only displays
 486         the content of the snapshot for a given CPU, and if
 487         written to, only clears this CPU buffer.
 488
 489   per_cpu/cpu0/snapshot_raw:
 490
 491         Similar to the trace_pipe_raw, but will read the binary format
 492         from the snapshot buffer for the given CPU.
 493
 494   per_cpu/cpu0/stats:
 495
 496         This displays certain stats about the ring buffer:
 497
 498          entries: The number of events that are still in the buffer.
 499
 500          overrun: The number of lost events due to overwriting when
 501                   the buffer was full.
 502
 503          commit overrun: Should always be zero.
 504                 This gets set if so many events happened within a nested
 505                 event (ring buffer is re-entrant), that it fills the
 506                 buffer and starts dropping events.
 507
 508          bytes: Bytes actually read (not overwritten).
 509
 510          oldest event ts: The oldest timestamp in the buffer
 511
 512          now ts: The current timestamp
 513
 514          dropped events: Events lost due to overwrite option being off.
 515
 516          read events: The number of events read.
 517
 518 The Tracers
 519 -----------
 520
 521 Here is the list of current tracers that may be configured.
 522
 523   "function"
 524
 525         Function call tracer to trace all kernel functions.
 526
 527   "function_graph"
 528
 529         Similar to the function tracer except that the
 530         function tracer probes the functions on their entry
 531         whereas the function graph tracer traces on both entry
 532         and exit of the functions. It then provides the ability
 533         to draw a graph of function calls similar to C code
 534         source.
 535
 536   "irqsoff"
 537
 538         Traces the areas that disable interrupts and saves
 539         the trace with the longest max latency.
 540         See tracing_max_latency. When a new max is recorded,
 541         it replaces the old trace. It is best to view this
 542         trace with the latency-format option enabled.
 543
 544   "preemptoff"
 545
 546         Similar to irqsoff but traces and records the amount of
 547         time for which preemption is disabled.
 548
 549   "preemptirqsoff"
 550
 551         Similar to irqsoff and preemptoff, but traces and
 552         records the largest time for which irqs and/or preemption
 553         is disabled.
 554
 555   "wakeup"
 556
 557         Traces and records the max latency that it takes for
 558         the highest priority task to get scheduled after
 559         it has been woken up.
 560         Traces all tasks as an average developer would expect.
 561
 562   "wakeup_rt"
 563
 564         Traces and records the max latency that it takes for just
 565         RT tasks (as the current "wakeup" does). This is useful
 566         for those interested in wake up timings of RT tasks.
 567
 568   "nop"
 569
 570         This is the "trace nothing" tracer. To remove all
 571         tracers from tracing simply echo "nop" into
 572         current_tracer.
 573
 574
 575 Examples of using the tracer
 576 ----------------------------
 577
 578 Here are typical examples of using the tracers when controlling
 579 them only with the debugfs interface (without using any
 580 user-land utilities).
 581
 582 Output format:
 583 --------------
 584
 585 Here is an example of the output format of the file "trace"
 586
 587                              --------
 588 # tracer: function
 589 #
 590 # entries-in-buffer/entries-written: 140080/250280   #P:4
 591 #
 592 #                              _-----=> irqs-off
 593 #                             / _----=> need-resched
 594 #                            | / _---=> hardirq/softirq
 595 #                            || / _--=> preempt-depth
 596 #                            ||| /     delay
 597 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 598 #              | |       |   ||||       |         |
 599             bash-1977  [000] .... 17284.993652: sys_close <-system_call_fastpath
 600             bash-1977  [000] .... 17284.993653: __close_fd <-sys_close
 601             bash-1977  [000] .... 17284.993653: _raw_spin_lock <-__close_fd
 602             sshd-1974  [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
 603             bash-1977  [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
 604             bash-1977  [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
 605             bash-1977  [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
 606             bash-1977  [000] .... 17284.993657: filp_close <-__close_fd
 607             bash-1977  [000] .... 17284.993657: dnotify_flush <-filp_close
 608             sshd-1974  [003] .... 17284.993658: sys_select <-system_call_fastpath
 609                              --------
 610
 611 A header is printed with the tracer name that is represented by
 612 the trace. In this case the tracer is "function". Then it shows the
 613 number of events in the buffer as well as the total number of entries
 614 that were written. The difference is the number of entries that were
 615 lost due to the buffer filling up (250280 - 140080 = 110200 events
 616 lost).
 617
 618 The header explains the content of the events. Task name "bash", the task
 619 PID "1977", the CPU that it was running on "000", the latency format
 620 (explained below), the timestamp in <secs>.<usecs> format, the
 621 function name that was traced "sys_close" and the parent function that
 622 called this function "system_call_fastpath". The timestamp is the time
 623 at which the function was entered.
 624
 625 Latency trace format
 626 --------------------
 627
 628 When the latency-format option is enabled or when one of the latency
 629 tracers is set, the trace file gives somewhat more information to see
 630 why a latency happened. Here is a typical trace.
 631
 632 # tracer: irqsoff
 633 #
 634 # irqsoff latency trace v1.1.5 on 3.8.0-test+
 635 # --------------------------------------------------------------------
 636 # latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
 637 #    -----------------
 638 #    | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
 639 #    -----------------
 640 #  => started at: __lock_task_sighand
 641 #  => ended at:   _raw_spin_unlock_irqrestore
 642 #
 643 #
 644 #                  _------=> CPU#
 645 #                 / _-----=> irqs-off
 646 #                | / _----=> need-resched
 647 #                || / _---=> hardirq/softirq
 648 #                ||| / _--=> preempt-depth
 649 #                |||| /     delay
 650 #  cmd     pid   ||||| time  |   caller
 651 #     \   /      |||||  \    |   /
 652       ps-6143    2d...    0us!: trace_hardirqs_off <-__lock_task_sighand
 653       ps-6143    2d..1  259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
 654       ps-6143    2d..1  263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
 655       ps-6143    2d..1  306us : <stack trace>
 656  => trace_hardirqs_on_caller
 657  => trace_hardirqs_on
 658  => _raw_spin_unlock_irqrestore
 659  => do_task_stat
 660  => proc_tgid_stat
 661  => proc_single_show
 662  => seq_read
 663  => vfs_read
 664  => sys_read
 665  => system_call_fastpath
 666
 667
 668 This shows that the current tracer is "irqsoff" tracing the time
 669 for which interrupts were disabled. It gives the trace version (which
 670 never changes) and the version of the kernel upon which this was executed on
 671 (3.10). Then it displays the max latency in microseconds (259 us). The number
 672 of trace entries displayed and the total number (both are four: #4/4).
 673 VP, KP, SP, and HP are always zero and are reserved for later use.
 674 #P is the number of online CPUs (#P:4).
 675
 676 The task is the process that was running when the latency
 677 occurred. (ps pid: 6143).
 678
 679 The start and stop (the functions in which the interrupts were
 680 disabled and enabled respectively) that caused the latencies:
 681
 682  __lock_task_sighand is where the interrupts were disabled.
 683  _raw_spin_unlock_irqrestore is where they were enabled again.
 684
 685 The next lines after the header are the trace itself. The header
 686 explains which is which.
 687
 688   cmd: The name of the process in the trace.
 689
 690   pid: The PID of that process.
 691
 692   CPU#: The CPU which the process was running on.
 693
 694   irqs-off: 'd' interrupts are disabled. '.' otherwise.
 695             Note: If the architecture does not support a way to
 696                   read the irq flags variable, an 'X' will always
 697                   be printed here.
 698
 699   need-resched:
 700         'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
 701         'n' only TIF_NEED_RESCHED is set,
 702         'p' only PREEMPT_NEED_RESCHED is set,
 703         '.' otherwise.
 704
 705   hardirq/softirq:
 706         'H' - hard irq occurred inside a softirq.
 707         'h' - hard irq is running
 708         's' - soft irq is running
 709         '.' - normal context.
 710
 711   preempt-depth: The level of preempt_disabled
 712
 713 The above is mostly meaningful for kernel developers.
 714
 715   time: When the latency-format option is enabled, the trace file
 716         output includes a timestamp relative to the start of the
 717         trace. This differs from the output when latency-format
 718         is disabled, which includes an absolute timestamp.
 719
 720   delay: This is just to help catch your eye a bit better. And
 721          needs to be fixed to be only relative to the same CPU.
 722          The marks are determined by the difference between this
 723          current trace and the next trace.
 724           '$' - greater than 1 second
 725           '@' - greater than 100 milisecond
 726           '*' - greater than 10 milisecond
 727           '#' - greater than 1000 microsecond
 728           '!' - greater than 100 microsecond
 729           '+' - greater than 10 microsecond
 730           ' ' - less than or equal to 10 microsecond.
 731
 732   The rest is the same as the 'trace' file.
 733
 734   Note, the latency tracers will usually end with a back trace
 735   to easily find where the latency occurred.
 736
 737 trace_options
 738 -------------
 739
 740 The trace_options file (or the options directory) is used to control
 741 what gets printed in the trace output, or manipulate the tracers.
 742 To see what is available, simply cat the file:
 743
 744   cat trace_options
 745 print-parent
 746 nosym-offset
 747 nosym-addr
 748 noverbose
 749 noraw
 750 nohex
 751 nobin
 752 noblock
 753 trace_printk
 754 nobranch
 755 annotate
 756 nouserstacktrace
 757 nosym-userobj
 758 noprintk-msg-only
 759 context-info
 760 nolatency-format
 761 sleep-time
 762 graph-time
 763 record-cmd
 764 overwrite
 765 nodisable_on_free
 766 irq-info
 767 markers
 768 noevent-fork
 769 function-trace
 770 nodisplay-graph
 771 nostacktrace
 772
 773 To disable one of the options, echo in the option prepended with
 774 "no".
 775
 776   echo noprint-parent > trace_options
 777
 778 To enable an option, leave off the "no".
 779
 780   echo sym-offset > trace_options
 781
 782 Here are the available options:
 783
 784   print-parent - On function traces, display the calling (parent)
 785                  function as well as the function being traced.
 786
 787   print-parent:
 788    bash-4000  [01]  1477.606694: simple_strtoul <-kstrtoul
 789
 790   noprint-parent:
 791    bash-4000  [01]  1477.606694: simple_strtoul
 792
 793
 794   sym-offset - Display not only the function name, but also the
 795                offset in the function. For example, instead of
 796                seeing just "ktime_get", you will see
 797                "ktime_get+0xb/0x20".
 798
 799   sym-offset:
 800    bash-4000  [01]  1477.606694: simple_strtoul+0x6/0xa0
 801
 802   sym-addr - this will also display the function address as well
 803              as the function name.
 804
 805   sym-addr:
 806    bash-4000  [01]  1477.606694: simple_strtoul <c0339346>
 807
 808   verbose - This deals with the trace file when the
 809             latency-format option is enabled.
 810
 811     bash  4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
 812     (+0.000ms): simple_strtoul (kstrtoul)
 813
 814   raw - This will display raw numbers. This option is best for
 815         use with user applications that can translate the raw
 816         numbers better than having it done in the kernel.
 817
 818   hex - Similar to raw, but the numbers will be in a hexadecimal
 819         format.
 820
 821   bin - This will print out the formats in raw binary.
 822
 823   block - When set, reading trace_pipe will not block when polled.
 824
 825   trace_printk - Can disable trace_printk() from writing into the buffer.
 826
 827   branch - Enable branch tracing with the tracer.
 828
 829   annotate - It is sometimes confusing when the CPU buffers are full
 830              and one CPU buffer had a lot of events recently, thus
 831              a shorter time frame, were another CPU may have only had
 832              a few events, which lets it have older events. When
 833              the trace is reported, it shows the oldest events first,
 834              and it may look like only one CPU ran (the one with the
 835              oldest events). When the annotate option is set, it will
 836              display when a new CPU buffer started:
 837
 838           <idle>-0     [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
 839           <idle>-0     [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
 840           <idle>-0     [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
 841 ##### CPU 2 buffer started ####
 842           <idle>-0     [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
 843           <idle>-0     [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
 844           <idle>-0     [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock
 845
 846   userstacktrace - This option changes the trace. It records a
 847                    stacktrace of the current userspace thread.
 848
 849   sym-userobj - when user stacktrace are enabled, look up which
 850                 object the address belongs to, and print a
 851                 relative address. This is especially useful when
 852                 ASLR is on, otherwise you don't get a chance to
 853                 resolve the address to object/file/line after
 854                 the app is no longer running
 855
 856                 The lookup is performed when you read
 857                 trace,trace_pipe. Example:
 858
 859                 a.out-1623  [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
 860 x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
 861
 862
 863   printk-msg-only - When set, trace_printk()s will only show the format
 864                     and not their parameters (if trace_bprintk() or
 865                     trace_bputs() was used to save the trace_printk()).
 866
 867   context-info - Show only the event data. Hides the comm, PID,
 868                  timestamp, CPU, and other useful data.
 869
 870   latency-format - This option changes the trace. When
 871                    it is enabled, the trace displays
 872                    additional information about the
 873                    latencies, as described in "Latency
 874                    trace format".
 875
 876   sleep-time - When running function graph tracer, to include
 877                the time a task schedules out in its function.
 878                When enabled, it will account time the task has been
 879                scheduled out as part of the function call.
 880
 881   graph-time - When running function profiler with function graph tracer,
 882                to include the time to call nested functions. When this is
 883                not set, the time reported for the function will only
 884                include the time the function itself executed for, not the
 885                time for functions that it called.
 886
 887   record-cmd - When any event or tracer is enabled, a hook is enabled
 888                in the sched_switch trace point to fill comm cache
 889                with mapped pids and comms. But this may cause some
 890                overhead, and if you only care about pids, and not the
 891                name of the task, disabling this option can lower the
 892                impact of tracing.
 893
 894   overwrite - This controls what happens when the trace buffer is
 895               full. If "1" (default), the oldest events are
 896               discarded and overwritten. If "0", then the newest
 897               events are discarded.
 898                 (see per_cpu/cpu0/stats for overrun and dropped)
 899
 900   disable_on_free - When the free_buffer is closed, tracing will
 901                     stop (tracing_on set to 0).
 902
 903   irq-info - Shows the interrupt, preempt count, need resched data.
 904              When disabled, the trace looks like:
 905
 906 # tracer: function
 907 #
 908 # entries-in-buffer/entries-written: 144405/9452052   #P:4
 909 #
 910 #           TASK-PID   CPU#      TIMESTAMP  FUNCTION
 911 #              | |       |          |         |
 912           <idle>-0     [002]  23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up
 913           <idle>-0     [002]  23636.756054: activate_task <-ttwu_do_activate.constprop.89
 914           <idle>-0     [002]  23636.756055: enqueue_task <-activate_task
 915
 916
 917   markers - When set, the trace_marker is writable (only by root).
 918             When disabled, the trace_marker will error with EINVAL
 919             on write.
 920
 921   event-fork - When set, tasks with PIDs listed in set_event_pid will have
 922                the PIDs of their children added to set_event_pid when those
 923                tasks fork. Also, when tasks with PIDs in set_event_pid exit,
 924                their PIDs will be removed from the file.
 925
 926   function-trace - The latency tracers will enable function tracing
 927             if this option is enabled (default it is). When
 928             it is disabled, the latency tracers do not trace
 929             functions. This keeps the overhead of the tracer down
 930             when performing latency tests.
 931
 932   display-graph - When set, the latency tracers (irqsoff, wakeup, etc) will
 933                   use function graph tracing instead of function tracing.
 934
 935   stacktrace - This is one of the options that changes the trace
 936                itself. When a trace is recorded, so is the stack
 937                of functions. This allows for back traces of
 938                trace sites.
 939
 940  Note: Some tracers have their own options. They only appear in this
 941        file when the tracer is active. They always appear in the
 942        options directory.
 943
 944
 945
 946 irqsoff
 947 -------
 948
 949 When interrupts are disabled, the CPU can not react to any other
 950 external event (besides NMIs and SMIs). This prevents the timer
 951 interrupt from triggering or the mouse interrupt from letting
 952 the kernel know of a new mouse event. The result is a latency
 953 with the reaction time.
 954
 955 The irqsoff tracer tracks the time for which interrupts are
 956 disabled. When a new maximum latency is hit, the tracer saves
 957 the trace leading up to that latency point so that every time a
 958 new maximum is reached, the old saved trace is discarded and the
 959 new trace is saved.
 960
 961 To reset the maximum, echo 0 into tracing_max_latency. Here is
 962 an example:
 963
 964  # echo 0 > options/function-trace
 965  # echo irqsoff > current_tracer
 966  # echo 1 > tracing_on
 967  # echo 0 > tracing_max_latency
 968  # ls -ltr
 969  [...]
 970  # echo 0 > tracing_on
 971  # cat trace
 972 # tracer: irqsoff
 973 #
 974 # irqsoff latency trace v1.1.5 on 3.8.0-test+
 975 # --------------------------------------------------------------------
 976 # latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
 977 #    -----------------
 978 #    | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
 979 #    -----------------
 980 #  => started at: run_timer_softirq
 981 #  => ended at:   run_timer_softirq
 982 #
 983 #
 984 #                  _------=> CPU#
 985 #                 / _-----=> irqs-off
 986 #                | / _----=> need-resched
 987 #                || / _---=> hardirq/softirq
 988 #                ||| / _--=> preempt-depth
 989 #                |||| /     delay
 990 #  cmd     pid   ||||| time  |   caller
 991 #     \   /      |||||  \    |   /
 992   <idle>-0       0d.s2    0us+: _raw_spin_lock_irq <-run_timer_softirq
 993   <idle>-0       0dNs3   17us : _raw_spin_unlock_irq <-run_timer_softirq
 994   <idle>-0       0dNs3   17us+: trace_hardirqs_on <-run_timer_softirq
 995   <idle>-0       0dNs3   25us : <stack trace>
 996  => _raw_spin_unlock_irq
 997  => run_timer_softirq
 998  => __do_softirq
 999  => call_softirq
1000  => do_softirq
1001  => irq_exit
1002  => smp_apic_timer_interrupt
1003  => apic_timer_interrupt
1004  => rcu_idle_exit
1005  => cpu_idle
1006  => rest_init
1007  => start_kernel
1008  => x86_64_start_reservations
1009  => x86_64_start_kernel
1010
1011 Here we see that that we had a latency of 16 microseconds (which is
1012 very good). The _raw_spin_lock_irq in run_timer_softirq disabled
1013 interrupts. The difference between the 16 and the displayed
1014 timestamp 25us occurred because the clock was incremented
1015 between the time of recording the max latency and the time of
1016 recording the function that had that latency.
1017
1018 Note the above example had function-trace not set. If we set
1019 function-trace, we get a much larger output:
1020
1021  with echo 1 > options/function-trace
1022
1023 # tracer: irqsoff
1024 #
1025 # irqsoff latency trace v1.1.5 on 3.8.0-test+
1026 # --------------------------------------------------------------------
1027 # latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1028 #    -----------------
1029 #    | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0)
1030 #    -----------------
1031 #  => started at: ata_scsi_queuecmd
1032 #  => ended at:   ata_scsi_queuecmd
1033 #
1034 #
1035 #                  _------=> CPU#
1036 #                 / _-----=> irqs-off
1037 #                | / _----=> need-resched
1038 #                || / _---=> hardirq/softirq
1039 #                ||| / _--=> preempt-depth
1040 #                |||| /     delay
1041 #  cmd     pid   ||||| time  |   caller
1042 #     \   /      |||||  \    |   /
1043     bash-2042    3d...    0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd
1044     bash-2042    3d...    0us : add_preempt_count <-_raw_spin_lock_irqsave
1045     bash-2042    3d..1    1us : ata_scsi_find_dev <-ata_scsi_queuecmd
1046     bash-2042    3d..1    1us : __ata_scsi_find_dev <-ata_scsi_find_dev
1047     bash-2042    3d..1    2us : ata_find_dev.part.14 <-__ata_scsi_find_dev
1048     bash-2042    3d..1    2us : ata_qc_new_init <-__ata_scsi_queuecmd
1049     bash-2042    3d..1    3us : ata_sg_init <-__ata_scsi_queuecmd
1050     bash-2042    3d..1    4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd
1051     bash-2042    3d..1    4us : ata_build_rw_tf <-ata_scsi_rw_xlat
1052 [...]
1053     bash-2042    3d..1   67us : delay_tsc <-__delay
1054     bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
1055     bash-2042    3d..2   67us : sub_preempt_count <-delay_tsc
1056     bash-2042    3d..1   67us : add_preempt_count <-delay_tsc
1057     bash-2042    3d..2   68us : sub_preempt_count <-delay_tsc
1058     bash-2042    3d..1   68us+: ata_bmdma_start <-ata_bmdma_qc_issue
1059     bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1060     bash-2042    3d..1   71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1061     bash-2042    3d..1   72us+: trace_hardirqs_on <-ata_scsi_queuecmd
1062     bash-2042    3d..1  120us : <stack trace>
1063  => _raw_spin_unlock_irqrestore
1064  => ata_scsi_queuecmd
1065  => scsi_dispatch_cmd
1066  => scsi_request_fn
1067  => __blk_run_queue_uncond
1068  => __blk_run_queue
1069  => blk_queue_bio
1070  => generic_make_request
1071  => submit_bio
1072  => submit_bh
1073  => __ext3_get_inode_loc
1074  => ext3_iget
1075  => ext3_lookup
1076  => lookup_real
1077  => __lookup_hash
1078  => walk_component
1079  => lookup_last
1080  => path_lookupat
1081  => filename_lookup
1082  => user_path_at_empty
1083  => user_path_at
1084  => vfs_fstatat
1085  => vfs_stat
1086  => sys_newstat
1087  => system_call_fastpath
1088
1089
1090 Here we traced a 71 microsecond latency. But we also see all the
1091 functions that were called during that time. Note that by
1092 enabling function tracing, we incur an added overhead. This
1093 overhead may extend the latency times. But nevertheless, this
1094 trace has provided some very helpful debugging information.
1095
1096
1097 preemptoff
1098 ----------
1099
1100 When preemption is disabled, we may be able to receive
1101 interrupts but the task cannot be preempted and a higher
1102 priority task must wait for preemption to be enabled again
1103 before it can preempt a lower priority task.
1104
1105 The preemptoff tracer traces the places that disable preemption.
1106 Like the irqsoff tracer, it records the maximum latency for
1107 which preemption was disabled. The control of preemptoff tracer
1108 is much like the irqsoff tracer.
1109
1110  # echo 0 > options/function-trace
1111  # echo preemptoff > current_tracer
1112  # echo 1 > tracing_on
1113  # echo 0 > tracing_max_latency
1114  # ls -ltr
1115  [...]
1116  # echo 0 > tracing_on
1117  # cat trace
1118 # tracer: preemptoff
1119 #
1120 # preemptoff latency trace v1.1.5 on 3.8.0-test+
1121 # --------------------------------------------------------------------
1122 # latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1123 #    -----------------
1124 #    | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0)
1125 #    -----------------
1126 #  => started at: do_IRQ
1127 #  => ended at:   do_IRQ
1128 #
1129 #
1130 #                  _------=> CPU#
1131 #                 / _-----=> irqs-off
1132 #                | / _----=> need-resched
1133 #                || / _---=> hardirq/softirq
1134 #                ||| / _--=> preempt-depth
1135 #                |||| /     delay
1136 #  cmd     pid   ||||| time  |   caller
1137 #     \   /      |||||  \    |   /
1138     sshd-1991    1d.h.    0us+: irq_enter <-do_IRQ
1139     sshd-1991    1d..1   46us : irq_exit <-do_IRQ
1140     sshd-1991    1d..1   47us+: trace_preempt_on <-do_IRQ
1141     sshd-1991    1d..1   52us : <stack trace>
1142  => sub_preempt_count
1143  => irq_exit
1144  => do_IRQ
1145  => ret_from_intr
1146
1147
1148 This has some more changes. Preemption was disabled when an
1149 interrupt came in (notice the 'h'), and was enabled on exit.
1150 But we also see that interrupts have been disabled when entering
1151 the preempt off section and leaving it (the 'd'). We do not know if
1152 interrupts were enabled in the mean time or shortly after this
1153 was over.
1154
1155 # tracer: preemptoff
1156 #
1157 # preemptoff latency trace v1.1.5 on 3.8.0-test+
1158 # --------------------------------------------------------------------
1159 # latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1160 #    -----------------
1161 #    | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0)
1162 #    -----------------
1163 #  => started at: wake_up_new_task
1164 #  => ended at:   task_rq_unlock
1165 #
1166 #
1167 #                  _------=> CPU#
1168 #                 / _-----=> irqs-off
1169 #                | / _----=> need-resched
1170 #                || / _---=> hardirq/softirq
1171 #                ||| / _--=> preempt-depth
1172 #                |||| /     delay
1173 #  cmd     pid   ||||| time  |   caller
1174 #     \   /      |||||  \    |   /
1175     bash-1994    1d..1    0us : _raw_spin_lock_irqsave <-wake_up_new_task
1176     bash-1994    1d..1    0us : select_task_rq_fair <-select_task_rq
1177     bash-1994    1d..1    1us : __rcu_read_lock <-select_task_rq_fair
1178     bash-1994    1d..1    1us : source_load <-select_task_rq_fair
1179     bash-1994    1d..1    1us : source_load <-select_task_rq_fair
1180 [...]
1181     bash-1994    1d..1   12us : irq_enter <-smp_apic_timer_interrupt
1182     bash-1994    1d..1   12us : rcu_irq_enter <-irq_enter
1183     bash-1994    1d..1   13us : add_preempt_count <-irq_enter
1184     bash-1994    1d.h1   13us : exit_idle <-smp_apic_timer_interrupt
1185     bash-1994    1d.h1   13us : hrtimer_interrupt <-smp_apic_timer_interrupt
1186     bash-1994    1d.h1   13us : _raw_spin_lock <-hrtimer_interrupt
1187     bash-1994    1d.h1   14us : add_preempt_count <-_raw_spin_lock
1188     bash-1994    1d.h2   14us : ktime_get_update_offsets <-hrtimer_interrupt
1189 [...]
1190     bash-1994    1d.h1   35us : lapic_next_event <-clockevents_program_event
1191     bash-1994    1d.h1   35us : irq_exit <-smp_apic_timer_interrupt
1192     bash-1994    1d.h1   36us : sub_preempt_count <-irq_exit
1193     bash-1994    1d..2   36us : do_softirq <-irq_exit
1194     bash-1994    1d..2   36us : __do_softirq <-call_softirq
1195     bash-1994    1d..2   36us : __local_bh_disable <-__do_softirq
1196     bash-1994    1d.s2   37us : add_preempt_count <-_raw_spin_lock_irq
1197     bash-1994    1d.s3   38us : _raw_spin_unlock <-run_timer_softirq
1198     bash-1994    1d.s3   39us : sub_preempt_count <-_raw_spin_unlock
1199     bash-1994    1d.s2   39us : call_timer_fn <-run_timer_softirq
1200 [...]
1201     bash-1994    1dNs2   81us : cpu_needs_another_gp <-rcu_process_callbacks
1202     bash-1994    1dNs2   82us : __local_bh_enable <-__do_softirq
1203     bash-1994    1dNs2   82us : sub_preempt_count <-__local_bh_enable
1204     bash-1994    1dN.2   82us : idle_cpu <-irq_exit
1205     bash-1994    1dN.2   83us : rcu_irq_exit <-irq_exit
1206     bash-1994    1dN.2   83us : sub_preempt_count <-irq_exit
1207     bash-1994    1.N.1   84us : _raw_spin_unlock_irqrestore <-task_rq_unlock
1208     bash-1994    1.N.1   84us+: trace_preempt_on <-task_rq_unlock
1209     bash-1994    1.N.1  104us : <stack trace>
1210  => sub_preempt_count
1211  => _raw_spin_unlock_irqrestore
1212  => task_rq_unlock
1213  => wake_up_new_task
1214  => do_fork
1215  => sys_clone
1216  => stub_clone
1217
1218
1219 The above is an example of the preemptoff trace with
1220 function-trace set. Here we see that interrupts were not disabled
1221 the entire time. The irq_enter code lets us know that we entered
1222 an interrupt 'h'. Before that, the functions being traced still
1223 show that it is not in an interrupt, but we can see from the
1224 functions themselves that this is not the case.
1225
1226 preemptirqsoff
1227 --------------
1228
1229 Knowing the locations that have interrupts disabled or
1230 preemption disabled for the longest times is helpful. But
1231 sometimes we would like to know when either preemption and/or
1232 interrupts are disabled.
1233
1234 Consider the following code:
1235
1236     local_irq_disable();
1237     call_function_with_irqs_off();
1238     preempt_disable();
1239     call_function_with_irqs_and_preemption_off();
1240     local_irq_enable();
1241     call_function_with_preemption_off();
1242     preempt_enable();
1243
1244 The irqsoff tracer will record the total length of
1245 call_function_with_irqs_off() and
1246 call_function_with_irqs_and_preemption_off().
1247
1248 The preemptoff tracer will record the total length of
1249 call_function_with_irqs_and_preemption_off() and
1250 call_function_with_preemption_off().
1251
1252 But neither will trace the time that interrupts and/or
1253 preemption is disabled. This total time is the time that we can
1254 not schedule. To record this time, use the preemptirqsoff
1255 tracer.
1256
1257 Again, using this trace is much like the irqsoff and preemptoff
1258 tracers.
1259
1260  # echo 0 > options/function-trace
1261  # echo preemptirqsoff > current_tracer
1262  # echo 1 > tracing_on
1263  # echo 0 > tracing_max_latency
1264  # ls -ltr
1265  [...]
1266  # echo 0 > tracing_on
1267  # cat trace
1268 # tracer: preemptirqsoff
1269 #
1270 # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
1271 # --------------------------------------------------------------------
1272 # latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1273 #    -----------------
1274 #    | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0)
1275 #    -----------------
1276 #  => started at: ata_scsi_queuecmd
1277 #  => ended at:   ata_scsi_queuecmd
1278 #
1279 #
1280 #                  _------=> CPU#
1281 #                 / _-----=> irqs-off
1282 #                | / _----=> need-resched
1283 #                || / _---=> hardirq/softirq
1284 #                ||| / _--=> preempt-depth
1285 #                |||| /     delay
1286 #  cmd     pid   ||||| time  |   caller
1287 #     \   /      |||||  \    |   /
1288       ls-2230    3d...    0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd
1289       ls-2230    3...1  100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1290       ls-2230    3...1  101us+: trace_preempt_on <-ata_scsi_queuecmd
1291       ls-2230    3...1  111us : <stack trace>
1292  => sub_preempt_count
1293  => _raw_spin_unlock_irqrestore
1294  => ata_scsi_queuecmd
1295  => scsi_dispatch_cmd
1296  => scsi_request_fn
1297  => __blk_run_queue_uncond
1298  => __blk_run_queue
1299  => blk_queue_bio
1300  => generic_make_request
1301  => submit_bio
1302  => submit_bh
1303  => ext3_bread
1304  => ext3_dir_bread
1305  => htree_dirblock_to_tree
1306  => ext3_htree_fill_tree
1307  => ext3_readdir
1308  => vfs_readdir
1309  => sys_getdents
1310  => system_call_fastpath
1311
1312
1313 The trace_hardirqs_off_thunk is called from assembly on x86 when
1314 interrupts are disabled in the assembly code. Without the
1315 function tracing, we do not know if interrupts were enabled
1316 within the preemption points. We do see that it started with
1317 preemption enabled.
1318
1319 Here is a trace with function-trace set:
1320
1321 # tracer: preemptirqsoff
1322 #
1323 # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
1324 # --------------------------------------------------------------------
1325 # latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1326 #    -----------------
1327 #    | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0)
1328 #    -----------------
1329 #  => started at: schedule
1330 #  => ended at:   mutex_unlock
1331 #
1332 #
1333 #                  _------=> CPU#
1334 #                 / _-----=> irqs-off
1335 #                | / _----=> need-resched
1336 #                || / _---=> hardirq/softirq
1337 #                ||| / _--=> preempt-depth
1338 #                |||| /     delay
1339 #  cmd     pid   ||||| time  |   caller
1340 #     \   /      |||||  \    |   /
1341 kworker/-59      3...1    0us : __schedule <-schedule
1342 kworker/-59      3d..1    0us : rcu_preempt_qs <-rcu_note_context_switch
1343 kworker/-59      3d..1    1us : add_preempt_count <-_raw_spin_lock_irq
1344 kworker/-59      3d..2    1us : deactivate_task <-__schedule
1345 kworker/-59      3d..2    1us : dequeue_task <-deactivate_task
1346 kworker/-59      3d..2    2us : update_rq_clock <-dequeue_task
1347 kworker/-59      3d..2    2us : dequeue_task_fair <-dequeue_task
1348 kworker/-59      3d..2    2us : update_curr <-dequeue_task_fair
1349 kworker/-59      3d..2    2us : update_min_vruntime <-update_curr
1350 kworker/-59      3d..2    3us : cpuacct_charge <-update_curr
1351 kworker/-59      3d..2    3us : __rcu_read_lock <-cpuacct_charge
1352 kworker/-59      3d..2    3us : __rcu_read_unlock <-cpuacct_charge
1353 kworker/-59      3d..2    3us : update_cfs_rq_blocked_load <-dequeue_task_fair
1354 kworker/-59      3d..2    4us : clear_buddies <-dequeue_task_fair
1355 kworker/-59      3d..2    4us : account_entity_dequeue <-dequeue_task_fair
1356 kworker/-59      3d..2    4us : update_min_vruntime <-dequeue_task_fair
1357 kworker/-59      3d..2    4us : update_cfs_shares <-dequeue_task_fair
1358 kworker/-59      3d..2    5us : hrtick_update <-dequeue_task_fair
1359 kworker/-59      3d..2    5us : wq_worker_sleeping <-__schedule
1360 kworker/-59      3d..2    5us : kthread_data <-wq_worker_sleeping
1361 kworker/-59      3d..2    5us : put_prev_task_fair <-__schedule
1362 kworker/-59      3d..2    6us : pick_next_task_fair <-pick_next_task
1363 kworker/-59      3d..2    6us : clear_buddies <-pick_next_task_fair
1364 kworker/-59      3d..2    6us : set_next_entity <-pick_next_task_fair
1365 kworker/-59      3d..2    6us : update_stats_wait_end <-set_next_entity
1366       ls-2269    3d..2    7us : finish_task_switch <-__schedule
1367       ls-2269    3d..2    7us : _raw_spin_unlock_irq <-finish_task_switch
1368       ls-2269    3d..2    8us : do_IRQ <-ret_from_intr
1369       ls-2269    3d..2    8us : irq_enter <-do_IRQ
1370       ls-2269    3d..2    8us : rcu_irq_enter <-irq_enter
1371       ls-2269    3d..2    9us : add_preempt_count <-irq_enter
1372       ls-2269    3d.h2    9us : exit_idle <-do_IRQ
1373 [...]
1374       ls-2269    3d.h3   20us : sub_preempt_count <-_raw_spin_unlock
1375       ls-2269    3d.h2   20us : irq_exit <-do_IRQ
1376       ls-2269    3d.h2   21us : sub_preempt_count <-irq_exit
1377       ls-2269    3d..3   21us : do_softirq <-irq_exit
1378       ls-2269    3d..3   21us : __do_softirq <-call_softirq
1379       ls-2269    3d..3   21us+: __local_bh_disable <-__do_softirq
1380       ls-2269    3d.s4   29us : sub_preempt_count <-_local_bh_enable_ip
1381       ls-2269    3d.s5   29us : sub_preempt_count <-_local_bh_enable_ip
1382       ls-2269    3d.s5   31us : do_IRQ <-ret_from_intr
1383       ls-2269    3d.s5   31us : irq_enter <-do_IRQ
1384       ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
1385 [...]
1386       ls-2269    3d.s5   31us : rcu_irq_enter <-irq_enter
1387       ls-2269    3d.s5   32us : add_preempt_count <-irq_enter
1388       ls-2269    3d.H5   32us : exit_idle <-do_IRQ
1389       ls-2269    3d.H5   32us : handle_irq <-do_IRQ
1390       ls-2269    3d.H5   32us : irq_to_desc <-handle_irq
1391       ls-2269    3d.H5   33us : handle_fasteoi_irq <-handle_irq
1392 [...]
1393       ls-2269    3d.s5  158us : _raw_spin_unlock_irqrestore <-rtl8139_poll
1394       ls-2269    3d.s3  158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action
1395       ls-2269    3d.s3  159us : __local_bh_enable <-__do_softirq
1396       ls-2269    3d.s3  159us : sub_preempt_count <-__local_bh_enable
1397       ls-2269    3d..3  159us : idle_cpu <-irq_exit
1398       ls-2269    3d..3  159us : rcu_irq_exit <-irq_exit
1399       ls-2269    3d..3  160us : sub_preempt_count <-irq_exit
1400       ls-2269    3d...  161us : __mutex_unlock_slowpath <-mutex_unlock
1401       ls-2269    3d...  162us+: trace_hardirqs_on <-mutex_unlock
1402       ls-2269    3d...  186us : <stack trace>
1403  => __mutex_unlock_slowpath
1404  => mutex_unlock
1405  => process_output
1406  => n_tty_write
1407  => tty_write
1408  => vfs_write
1409  => sys_write
1410  => system_call_fastpath
1411
1412 This is an interesting trace. It started with kworker running and
1413 scheduling out and ls taking over. But as soon as ls released the
1414 rq lock and enabled interrupts (but not preemption) an interrupt
1415 triggered. When the interrupt finished, it started running softirqs.
1416 But while the softirq was running, another interrupt triggered.
1417 When an interrupt is running inside a softirq, the annotation is 'H'.
1418
1419
1420 wakeup
1421 ------
1422
1423 One common case that people are interested in tracing is the
1424 time it takes for a task that is woken to actually wake up.
1425 Now for non Real-Time tasks, this can be arbitrary. But tracing
1426 it none the less can be interesting.
1427
1428 Without function tracing:
1429
1430  # echo 0 > options/function-trace
1431  # echo wakeup > current_tracer
1432  # echo 1 > tracing_on
1433  # echo 0 > tracing_max_latency
1434  # chrt -f 5 sleep 1
1435  # echo 0 > tracing_on
1436  # cat trace
1437 # tracer: wakeup
1438 #
1439 # wakeup latency trace v1.1.5 on 3.8.0-test+
1440 # --------------------------------------------------------------------
1441 # latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1442 #    -----------------
1443 #    | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0)
1444 #    -----------------
1445 #
1446 #                  _------=> CPU#
1447 #                 / _-----=> irqs-off
1448 #                | / _----=> need-resched
1449 #                || / _---=> hardirq/softirq
1450 #                ||| / _--=> preempt-depth
1451 #                |||| /     delay
1452 #  cmd     pid   ||||| time  |   caller
1453 #     \   /      |||||  \    |   /
1454   <idle>-0       3dNs7    0us :      0:120:R   + [003]   312:100:R kworker/3:1H
1455   <idle>-0       3dNs7    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
1456   <idle>-0       3d..3   15us : __schedule <-schedule
1457   <idle>-0       3d..3   15us :      0:120:R ==> [003]   312:100:R kworker/3:1H
1458
1459 The tracer only traces the highest priority task in the system
1460 to avoid tracing the normal circumstances. Here we see that
1461 the kworker with a nice priority of -20 (not very nice), took
1462 just 15 microseconds from the time it woke up, to the time it
1463 ran.
1464
1465 Non Real-Time tasks are not that interesting. A more interesting
1466 trace is to concentrate only on Real-Time tasks.
1467
1468 wakeup_rt
1469 ---------
1470
1471 In a Real-Time environment it is very important to know the
1472 wakeup time it takes for the highest priority task that is woken
1473 up to the time that it executes. This is also known as "schedule
1474 latency". I stress the point that this is about RT tasks. It is
1475 also important to know the scheduling latency of non-RT tasks,
1476 but the average schedule latency is better for non-RT tasks.
1477 Tools like LatencyTop are more appropriate for such
1478 measurements.
1479
1480 Real-Time environments are interested in the worst case latency.
1481 That is the longest latency it takes for something to happen,
1482 and not the average. We can have a very fast scheduler that may
1483 only have a large latency once in a while, but that would not
1484 work well with Real-Time tasks.  The wakeup_rt tracer was designed
1485 to record the worst case wakeups of RT tasks. Non-RT tasks are
1486 not recorded because the tracer only records one worst case and
1487 tracing non-RT tasks that are unpredictable will overwrite the
1488 worst case latency of RT tasks (just run the normal wakeup
1489 tracer for a while to see that effect).
1490
1491 Since this tracer only deals with RT tasks, we will run this
1492 slightly differently than we did with the previous tracers.
1493 Instead of performing an 'ls', we will run 'sleep 1' under
1494 'chrt' which changes the priority of the task.
1495
1496  # echo 0 > options/function-trace
1497  # echo wakeup_rt > current_tracer
1498  # echo 1 > tracing_on
1499  # echo 0 > tracing_max_latency
1500  # chrt -f 5 sleep 1
1501  # echo 0 > tracing_on
1502  # cat trace
1503 # tracer: wakeup
1504 #
1505 # tracer: wakeup_rt
1506 #
1507 # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1508 # --------------------------------------------------------------------
1509 # latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1510 #    -----------------
1511 #    | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5)
1512 #    -----------------
1513 #
1514 #                  _------=> CPU#
1515 #                 / _-----=> irqs-off
1516 #                | / _----=> need-resched
1517 #                || / _---=> hardirq/softirq
1518 #                ||| / _--=> preempt-depth
1519 #                |||| /     delay
1520 #  cmd     pid   ||||| time  |   caller
1521 #     \   /      |||||  \    |   /
1522   <idle>-0       3d.h4    0us :      0:120:R   + [003]  2389: 94:R sleep
1523   <idle>-0       3d.h4    1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
1524   <idle>-0       3d..3    5us : __schedule <-schedule
1525   <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep
1526
1527
1528 Running this on an idle system, we see that it only took 5 microseconds
1529 to perform the task switch.  Note, since the trace point in the schedule
1530 is before the actual "switch", we stop the tracing when the recorded task
1531 is about to schedule in. This may change if we add a new marker at the
1532 end of the scheduler.
1533
1534 Notice that the recorded task is 'sleep' with the PID of 2389
1535 and it has an rt_prio of 5. This priority is user-space priority
1536 and not the internal kernel priority. The policy is 1 for
1537 SCHED_FIFO and 2 for SCHED_RR.
1538
1539 Note, that the trace data shows the internal priority (99 - rtprio).
1540
1541   <idle>-0       3d..3    5us :      0:120:R ==> [003]  2389: 94:R sleep
1542
1543 The 0:120:R means idle was running with a nice priority of 0 (120 - 20)
1544 and in the running state 'R'. The sleep task was scheduled in with
1545 2389: 94:R. That is the priority is the kernel rtprio (99 - 5 = 94)
1546 and it too is in the running state.
1547
1548 Doing the same with chrt -r 5 and function-trace set.
1549
1550   echo 1 > options/function-trace
1551
1552 # tracer: wakeup_rt
1553 #
1554 # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1555 # --------------------------------------------------------------------
1556 # latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1557 #    -----------------
1558 #    | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5)
1559 #    -----------------
1560 #
1561 #                  _------=> CPU#
1562 #                 / _-----=> irqs-off
1563 #                | / _----=> need-resched
1564 #                || / _---=> hardirq/softirq
1565 #                ||| / _--=> preempt-depth
1566 #                |||| /     delay
1567 #  cmd     pid   ||||| time  |   caller
1568 #     \   /      |||||  \    |   /
1569   <idle>-0       3d.h4    1us+:      0:120:R   + [003]  2448: 94:R sleep
1570   <idle>-0       3d.h4    2us : ttwu_do_activate.constprop.87 <-try_to_wake_up
1571   <idle>-0       3d.h3    3us : check_preempt_curr <-ttwu_do_wakeup
1572   <idle>-0       3d.h3    3us : resched_curr <-check_preempt_curr
1573   <idle>-0       3dNh3    4us : task_woken_rt <-ttwu_do_wakeup
1574   <idle>-0       3dNh3    4us : _raw_spin_unlock <-try_to_wake_up
1575   <idle>-0       3dNh3    4us : sub_preempt_count <-_raw_spin_unlock
1576   <idle>-0       3dNh2    5us : ttwu_stat <-try_to_wake_up
1577   <idle>-0       3dNh2    5us : _raw_spin_unlock_irqrestore <-try_to_wake_up
1578   <idle>-0       3dNh2    6us : sub_preempt_count <-_raw_spin_unlock_irqrestore
1579   <idle>-0       3dNh1    6us : _raw_spin_lock <-__run_hrtimer
1580   <idle>-0       3dNh1    6us : add_preempt_count <-_raw_spin_lock
1581   <idle>-0       3dNh2    7us : _raw_spin_unlock <-hrtimer_interrupt
1582   <idle>-0       3dNh2    7us : sub_preempt_count <-_raw_spin_unlock
1583   <idle>-0       3dNh1    7us : tick_program_event <-hrtimer_interrupt
1584   <idle>-0       3dNh1    7us : clockevents_program_event <-tick_program_event
1585   <idle>-0       3dNh1    8us : ktime_get <-clockevents_program_event
1586   <idle>-0       3dNh1    8us : lapic_next_event <-clockevents_program_event
1587   <idle>-0       3dNh1    8us : irq_exit <-smp_apic_timer_interrupt
1588   <idle>-0       3dNh1    9us : sub_preempt_count <-irq_exit
1589   <idle>-0       3dN.2    9us : idle_cpu <-irq_exit
1590   <idle>-0       3dN.2    9us : rcu_irq_exit <-irq_exit
1591   <idle>-0       3dN.2   10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit
1592   <idle>-0       3dN.2   10us : sub_preempt_count <-irq_exit
1593   <idle>-0       3.N.1   11us : rcu_idle_exit <-cpu_idle
1594   <idle>-0       3dN.1   11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit
1595   <idle>-0       3.N.1   11us : tick_nohz_idle_exit <-cpu_idle
1596   <idle>-0       3dN.1   12us : menu_hrtimer_cancel <-tick_nohz_idle_exit
1597   <idle>-0       3dN.1   12us : ktime_get <-tick_nohz_idle_exit
1598   <idle>-0       3dN.1   12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit
1599   <idle>-0       3dN.1   13us : cpu_load_update_nohz <-tick_nohz_idle_exit
1600   <idle>-0       3dN.1   13us : _raw_spin_lock <-cpu_load_update_nohz
1601   <idle>-0       3dN.1   13us : add_preempt_count <-_raw_spin_lock
1602   <idle>-0       3dN.2   13us : __cpu_load_update <-cpu_load_update_nohz
1603   <idle>-0       3dN.2   14us : sched_avg_update <-__cpu_load_update
1604   <idle>-0       3dN.2   14us : _raw_spin_unlock <-cpu_load_update_nohz
1605   <idle>-0       3dN.2   14us : sub_preempt_count <-_raw_spin_unlock
1606   <idle>-0       3dN.1   15us : calc_load_exit_idle <-tick_nohz_idle_exit
1607   <idle>-0       3dN.1   15us : touch_softlockup_watchdog <-tick_nohz_idle_exit
1608   <idle>-0       3dN.1   15us : hrtimer_cancel <-tick_nohz_idle_exit
1609   <idle>-0       3dN.1   15us : hrtimer_try_to_cancel <-hrtimer_cancel
1610   <idle>-0       3dN.1   16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel
1611   <idle>-0       3dN.1   16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
1612   <idle>-0       3dN.1   16us : add_preempt_count <-_raw_spin_lock_irqsave
1613   <idle>-0       3dN.2   17us : __remove_hrtimer <-remove_hrtimer.part.16
1614   <idle>-0       3dN.2   17us : hrtimer_force_reprogram <-__remove_hrtimer
1615   <idle>-0       3dN.2   17us : tick_program_event <-hrtimer_force_reprogram
1616   <idle>-0       3dN.2   18us : clockevents_program_event <-tick_program_event
1617   <idle>-0       3dN.2   18us : ktime_get <-clockevents_program_event
1618   <idle>-0       3dN.2   18us : lapic_next_event <-clockevents_program_event
1619   <idle>-0       3dN.2   19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
1620   <idle>-0       3dN.2   19us : sub_preempt_count <-_raw_spin_unlock_irqrestore
1621   <idle>-0       3dN.1   19us : hrtimer_forward <-tick_nohz_idle_exit
1622   <idle>-0       3dN.1   20us : ktime_add_safe <-hrtimer_forward
1623   <idle>-0       3dN.1   20us : ktime_add_safe <-hrtimer_forward
1624   <idle>-0       3dN.1   20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
1625   <idle>-0       3dN.1   20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns
1626   <idle>-0       3dN.1   21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns
1627   <idle>-0       3dN.1   21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
1628   <idle>-0       3dN.1   21us : add_preempt_count <-_raw_spin_lock_irqsave
1629   <idle>-0       3dN.2   22us : ktime_add_safe <-__hrtimer_start_range_ns
1630   <idle>-0       3dN.2   22us : enqueue_hrtimer <-__hrtimer_start_range_ns
1631   <idle>-0       3dN.2   22us : tick_program_event <-__hrtimer_start_range_ns
1632   <idle>-0       3dN.2   23us : clockevents_program_event <-tick_program_event
1633   <idle>-0       3dN.2   23us : ktime_get <-clockevents_program_event
1634   <idle>-0       3dN.2   23us : lapic_next_event <-clockevents_program_event
1635   <idle>-0       3dN.2   24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns
1636   <idle>-0       3dN.2   24us : sub_preempt_count <-_raw_spin_unlock_irqrestore
1637   <idle>-0       3dN.1   24us : account_idle_ticks <-tick_nohz_idle_exit
1638   <idle>-0       3dN.1   24us : account_idle_time <-account_idle_ticks
1639   <idle>-0       3.N.1   25us : sub_preempt_count <-cpu_idle
1640   <idle>-0       3.N..   25us : schedule <-cpu_idle
1641   <idle>-0       3.N..   25us : __schedule <-preempt_schedule
1642   <idle>-0       3.N..   26us : add_preempt_count <-__schedule
1643   <idle>-0       3.N.1   26us : rcu_note_context_switch <-__schedule
1644   <idle>-0       3.N.1   26us : rcu_sched_qs <-rcu_note_context_switch
1645   <idle>-0       3dN.1   27us : rcu_preempt_qs <-rcu_note_context_switch
1646   <idle>-0       3.N.1   27us : _raw_spin_lock_irq <-__schedule
1647   <idle>-0       3dN.1   27us : add_preempt_count <-_raw_spin_lock_irq
1648   <idle>-0       3dN.2   28us : put_prev_task_idle <-__schedule
1649   <idle>-0       3dN.2   28us : pick_next_task_stop <-pick_next_task
1650   <idle>-0       3dN.2   28us : pick_next_task_rt <-pick_next_task
1651   <idle>-0       3dN.2   29us : dequeue_pushable_task <-pick_next_task_rt
1652   <idle>-0       3d..3   29us : __schedule <-preempt_schedule
1653   <idle>-0       3d..3   30us :      0:120:R ==> [003]  2448: 94:R sleep
1654
1655 This isn't that big of a trace, even with function tracing enabled,
1656 so I included the entire trace.
1657
1658 The interrupt went off while when the system was idle. Somewhere
1659 before task_woken_rt() was called, the NEED_RESCHED flag was set,
1660 this is indicated by the first occurrence of the 'N' flag.
1661
1662 Latency tracing and events
1663 --------------------------
1664 As function tracing can induce a much larger latency, but without
1665 seeing what happens within the latency it is hard to know what
1666 caused it. There is a middle ground, and that is with enabling
1667 events.
1668
1669  # echo 0 > options/function-trace
1670  # echo wakeup_rt > current_tracer
1671  # echo 1 > events/enable
1672  # echo 1 > tracing_on
1673  # echo 0 > tracing_max_latency
1674  # chrt -f 5 sleep 1
1675  # echo 0 > tracing_on
1676  # cat trace
1677 # tracer: wakeup_rt
1678 #
1679 # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1680 # --------------------------------------------------------------------
1681 # latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1682 #    -----------------
1683 #    | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5)
1684 #    -----------------
1685 #
1686 #                  _------=> CPU#
1687 #                 / _-----=> irqs-off
1688 #                | / _----=> need-resched
1689 #                || / _---=> hardirq/softirq
1690 #                ||| / _--=> preempt-depth
1691 #                |||| /     delay
1692 #  cmd     pid   ||||| time  |   caller
1693 #     \   /      |||||  \    |   /
1694   <idle>-0       2d.h4    0us :      0:120:R   + [002]  5882: 94:R sleep
1695   <idle>-0       2d.h4    0us : ttwu_do_activate.constprop.87 <-try_to_wake_up
1696   <idle>-0       2d.h4    1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002
1697   <idle>-0       2dNh2    1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8
1698   <idle>-0       2.N.2    2us : power_end: cpu_id=2
1699   <idle>-0       2.N.2    3us : cpu_idle: state=4294967295 cpu_id=2
1700   <idle>-0       2dN.3    4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0
1701   <idle>-0       2dN.3    4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000
1702   <idle>-0       2.N.2    5us : rcu_utilization: Start context switch
1703   <idle>-0       2.N.2    5us : rcu_utilization: End context switch
1704   <idle>-0       2d..3    6us : __schedule <-schedule
1705   <idle>-0       2d..3    6us :      0:120:R ==> [002]  5882: 94:R sleep
1706
1707
1708 function
1709 --------
1710
1711 This tracer is the function tracer. Enabling the function tracer
1712 can be done from the debug file system. Make sure the
1713 ftrace_enabled is set; otherwise this tracer is a nop.
1714 See the "ftrace_enabled" section below.
1715
1716  # sysctl kernel.ftrace_enabled=1
1717  # echo function > current_tracer
1718  # echo 1 > tracing_on
1719  # usleep 1
1720  # echo 0 > tracing_on
1721  # cat trace
1722 # tracer: function
1723 #
1724 # entries-in-buffer/entries-written: 24799/24799   #P:4
1725 #
1726 #                              _-----=> irqs-off
1727 #                             / _----=> need-resched
1728 #                            | / _---=> hardirq/softirq
1729 #                            || / _--=> preempt-depth
1730 #                            ||| /     delay
1731 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
1732 #              | |       |   ||||       |         |
1733             bash-1994  [002] ....  3082.063030: mutex_unlock <-rb_simple_write
1734             bash-1994  [002] ....  3082.063031: __mutex_unlock_slowpath <-mutex_unlock
1735             bash-1994  [002] ....  3082.063031: __fsnotify_parent <-fsnotify_modify
1736             bash-1994  [002] ....  3082.063032: fsnotify <-fsnotify_modify
1737             bash-1994  [002] ....  3082.063032: __srcu_read_lock <-fsnotify
1738             bash-1994  [002] ....  3082.063032: add_preempt_count <-__srcu_read_lock
1739             bash-1994  [002] ...1  3082.063032: sub_preempt_count <-__srcu_read_lock
1740             bash-1994  [002] ....  3082.063033: __srcu_read_unlock <-fsnotify
1741 [...]
1742
1743
1744 Note: function tracer uses ring buffers to store the above
1745 entries. The newest data may overwrite the oldest data.
1746 Sometimes using echo to stop the trace is not sufficient because
1747 the tracing could have overwritten the data that you wanted to
1748 record. For this reason, it is sometimes better to disable
1749 tracing directly from a program. This allows you to stop the
1750 tracing at the point that you hit the part that you are
1751 interested in. To disable the tracing directly from a C program,
1752 something like following code snippet can be used:
1753
1754 int trace_fd;
1755 [...]
1756 int main(int argc, char *argv[]) {
1757         [...]
1758         trace_fd = open(tracing_file("tracing_on"), O_WRONLY);
1759         [...]
1760         if (condition_hit()) {
1761                 write(trace_fd, "0", 1);
1762         }
1763         [...]
1764 }
1765
1766
1767 Single thread tracing
1768 ---------------------
1769
1770 By writing into set_ftrace_pid you can trace a
1771 single thread. For example:
1772
1773 # cat set_ftrace_pid
1774 no pid
1775 # echo 3111 > set_ftrace_pid
1776 # cat set_ftrace_pid
1777 3111
1778 # echo function > current_tracer
1779 # cat trace | head
1780  # tracer: function
1781  #
1782  #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
1783  #              | |       |          |         |
1784      yum-updatesd-3111  [003]  1637.254676: finish_task_switch <-thread_return
1785      yum-updatesd-3111  [003]  1637.254681: hrtimer_cancel <-schedule_hrtimeout_range
1786      yum-updatesd-3111  [003]  1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel
1787      yum-updatesd-3111  [003]  1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel
1788      yum-updatesd-3111  [003]  1637.254685: fget_light <-do_sys_poll
1789      yum-updatesd-3111  [003]  1637.254686: pipe_poll <-do_sys_poll
1790 # echo > set_ftrace_pid
1791 # cat trace |head
1792  # tracer: function
1793  #
1794  #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
1795  #              | |       |          |         |
1796  ##### CPU 3 buffer started ####
1797      yum-updatesd-3111  [003]  1701.957688: free_poll_entry <-poll_freewait
1798      yum-updatesd-3111  [003]  1701.957689: remove_wait_queue <-free_poll_entry
1799      yum-updatesd-3111  [003]  1701.957691: fput <-free_poll_entry
1800      yum-updatesd-3111  [003]  1701.957692: audit_syscall_exit <-sysret_audit
1801      yum-updatesd-3111  [003]  1701.957693: path_put <-audit_syscall_exit
1802
1803 If you want to trace a function when executing, you could use
1804 something like this simple program:
1805
1806 #include <stdio.h>
1807 #include <stdlib.h>
1808 #include <sys/types.h>
1809 #include <sys/stat.h>
1810 #include <fcntl.h>
1811 #include <unistd.h>
1812 #include <string.h>
1813
1814 #define _STR(x) #x
1815 #define STR(x) _STR(x)
1816 #define MAX_PATH 256
1817
1818 const char *find_debugfs(void)
1819 {
1820        static char debugfs[MAX_PATH+1];
1821        static int debugfs_found;
1822        char type[100];
1823        FILE *fp;
1824
1825        if (debugfs_found)
1826                return debugfs;
1827
1828        if ((fp = fopen("/proc/mounts","r")) == NULL) {
1829                perror("/proc/mounts");
1830                return NULL;
1831        }
1832
1833        while (fscanf(fp, "%*s %"
1834                      STR(MAX_PATH)
1835                      "s %99s %*s %*d %*d\n",
1836                      debugfs, type) == 2) {
1837                if (strcmp(type, "debugfs") == 0)
1838                        break;
1839        }
1840        fclose(fp);
1841
1842        if (strcmp(type, "debugfs") != 0) {
1843                fprintf(stderr, "debugfs not mounted");
1844                return NULL;
1845        }
1846
1847        strcat(debugfs, "/tracing/");
1848        debugfs_found = 1;
1849
1850        return debugfs;
1851 }
1852
1853 const char *tracing_file(const char *file_name)
1854 {
1855        static char trace_file[MAX_PATH+1];
1856        snprintf(trace_file, MAX_PATH, "%s/%s", find_debugfs(), file_name);
1857        return trace_file;
1858 }
1859
1860 int main (int argc, char **argv)
1861 {
1862         if (argc < 1)
1863                 exit(-1);
1864
1865         if (fork() > 0) {
1866                 int fd, ffd;
1867                 char line[64];
1868                 int s;
1869
1870                 ffd = open(tracing_file("current_tracer"), O_WRONLY);
1871                 if (ffd < 0)
1872                         exit(-1);
1873                 write(ffd, "nop", 3);
1874
1875                 fd = open(tracing_file("set_ftrace_pid"), O_WRONLY);
1876                 s = sprintf(line, "%d\n", getpid());
1877                 write(fd, line, s);
1878
1879                 write(ffd, "function", 8);
1880
1881                 close(fd);
1882                 close(ffd);
1883
1884                 execvp(argv[1], argv+1);
1885         }
1886
1887         return 0;
1888 }
1889
1890 Or this simple script!
1891
1892 ------
1893 #!/bin/bash
1894
1895 debugfs=`sed -ne 's/^debugfs \(.*\) debugfs.*/\1/p' /proc/mounts`
1896 echo nop > $debugfs/tracing/current_tracer
1897 echo 0 > $debugfs/tracing/tracing_on
1898 echo $$ > $debugfs/tracing/set_ftrace_pid
1899 echo function > $debugfs/tracing/current_tracer
1900 echo 1 > $debugfs/tracing/tracing_on
1901 exec "$@"
1902 ------
1903
1904
1905 function graph tracer
1906 ---------------------------
1907
1908 This tracer is similar to the function tracer except that it
1909 probes a function on its entry and its exit. This is done by
1910 using a dynamically allocated stack of return addresses in each
1911 task_struct. On function entry the tracer overwrites the return
1912 address of each function traced to set a custom probe. Thus the
1913 original return address is stored on the stack of return address
1914 in the task_struct.
1915
1916 Probing on both ends of a function leads to special features
1917 such as:
1918
1919 - measure of a function's time execution
1920 - having a reliable call stack to draw function calls graph
1921
1922 This tracer is useful in several situations:
1923
1924 - you want to find the reason of a strange kernel behavior and
1925   need to see what happens in detail on any areas (or specific
1926   ones).
1927
1928 - you are experiencing weird latencies but it's difficult to
1929   find its origin.
1930
1931 - you want to find quickly which path is taken by a specific
1932   function
1933
1934 - you just want to peek inside a working kernel and want to see
1935   what happens there.
1936
1937 # tracer: function_graph
1938 #
1939 # CPU  DURATION                  FUNCTION CALLS
1940 # |     |   |                     |   |   |   |
1941
1942  0)               |  sys_open() {
1943  0)               |    do_sys_open() {
1944  0)               |      getname() {
1945  0)               |        kmem_cache_alloc() {
1946  0)   1.382 us    |          __might_sleep();
1947  0)   2.478 us    |        }
1948  0)               |        strncpy_from_user() {
1949  0)               |          might_fault() {
1950  0)   1.389 us    |            __might_sleep();
1951  0)   2.553 us    |          }
1952  0)   3.807 us    |        }
1953  0)   7.876 us    |      }
1954  0)               |      alloc_fd() {
1955  0)   0.668 us    |        _spin_lock();
1956  0)   0.570 us    |        expand_files();
1957  0)   0.586 us    |        _spin_unlock();
1958
1959
1960 There are several columns that can be dynamically
1961 enabled/disabled. You can use every combination of options you
1962 want, depending on your needs.
1963
1964 - The cpu number on which the function executed is default
1965   enabled.  It is sometimes better to only trace one cpu (see
1966   tracing_cpu_mask file) or you might sometimes see unordered
1967   function calls while cpu tracing switch.
1968
1969         hide: echo nofuncgraph-cpu > trace_options
1970         show: echo funcgraph-cpu > trace_options
1971
1972 - The duration (function's time of execution) is displayed on
1973   the closing bracket line of a function or on the same line
1974   than the current function in case of a leaf one. It is default
1975   enabled.
1976
1977         hide: echo nofuncgraph-duration > trace_options
1978         show: echo funcgraph-duration > trace_options
1979
1980 - The overhead field precedes the duration field in case of
1981   reached duration thresholds.
1982
1983         hide: echo nofuncgraph-overhead > trace_options
1984         show: echo funcgraph-overhead > trace_options
1985         depends on: funcgraph-duration
1986
1987   ie:
1988
1989   3) # 1837.709 us |          } /* __switch_to */
1990   3)               |          finish_task_switch() {
1991   3)   0.313 us    |            _raw_spin_unlock_irq();
1992   3)   3.177 us    |          }
1993   3) # 1889.063 us |        } /* __schedule */
1994   3) ! 140.417 us  |      } /* __schedule */
1995   3) # 2034.948 us |    } /* schedule */
1996   3) * 33998.59 us |  } /* schedule_preempt_disabled */
1997
1998   [...]
1999
2000   1)   0.260 us    |              msecs_to_jiffies();
2001   1)   0.313 us    |              __rcu_read_unlock();
2002   1) + 61.770 us   |            }
2003   1) + 64.479 us   |          }
2004   1)   0.313 us    |          rcu_bh_qs();
2005   1)   0.313 us    |          __local_bh_enable();
2006   1) ! 217.240 us  |        }
2007   1)   0.365 us    |        idle_cpu();
2008   1)               |        rcu_irq_exit() {
2009   1)   0.417 us    |          rcu_eqs_enter_common.isra.47();
2010   1)   3.125 us    |        }
2011   1) ! 227.812 us  |      }
2012   1) ! 457.395 us  |    }
2013   1) @ 119760.2 us |  }
2014
2015   [...]
2016
2017   2)               |    handle_IPI() {
2018   1)   6.979 us    |                  }
2019   2)   0.417 us    |      scheduler_ipi();
2020   1)   9.791 us    |                }
2021   1) + 12.917 us   |              }
2022   2)   3.490 us    |    }
2023   1) + 15.729 us   |            }
2024   1) + 18.542 us   |          }
2025   2) $ 3594274 us  |  }
2026
2027   + means that the function exceeded 10 usecs.
2028   ! means that the function exceeded 100 usecs.
2029   # means that the function exceeded 1000 usecs.
2030   * means that the function exceeded 10 msecs.
2031   @ means that the function exceeded 100 msecs.
2032   $ means that the function exceeded 1 sec.
2033
2034
2035 - The task/pid field displays the thread cmdline and pid which
2036   executed the function. It is default disabled.
2037
2038         hide: echo nofuncgraph-proc > trace_options
2039         show: echo funcgraph-proc > trace_options
2040
2041   ie:
2042
2043   # tracer: function_graph
2044   #
2045   # CPU  TASK/PID        DURATION                  FUNCTION CALLS
2046   # |    |    |           |   |                     |   |   |   |
2047   0)    sh-4802     |               |                  d_free() {
2048   0)    sh-4802     |               |                    call_rcu() {
2049   0)    sh-4802     |               |                      __call_rcu() {
2050   0)    sh-4802     |   0.616 us    |                        rcu_process_gp_end();
2051   0)    sh-4802     |   0.586 us    |                        check_for_new_grace_period();
2052   0)    sh-4802     |   2.899 us    |                      }
2053   0)    sh-4802     |   4.040 us    |                    }
2054   0)    sh-4802     |   5.151 us    |                  }
2055   0)    sh-4802     | + 49.370 us   |                }
2056
2057
2058 - The absolute time field is an absolute timestamp given by the
2059   system clock since it started. A snapshot of this time is
2060   given on each entry/exit of functions
2061
2062         hide: echo nofuncgraph-abstime > trace_options
2063         show: echo funcgraph-abstime > trace_options
2064
2065   ie:
2066
2067   #
2068   #      TIME       CPU  DURATION                  FUNCTION CALLS
2069   #       |         |     |   |                     |   |   |   |
2070   360.774522 |   1)   0.541 us    |                                          }
2071   360.774522 |   1)   4.663 us    |                                        }
2072   360.774523 |   1)   0.541 us    |                                        __wake_up_bit();
2073   360.774524 |   1)   6.796 us    |                                      }
2074   360.774524 |   1)   7.952 us    |                                    }
2075   360.774525 |   1)   9.063 us    |                                  }
2076   360.774525 |   1)   0.615 us    |                                  journal_mark_dirty();
2077   360.774527 |   1)   0.578 us    |                                  __brelse();
2078   360.774528 |   1)               |                                  reiserfs_prepare_for_journal() {
2079   360.774528 |   1)               |                                    unlock_buffer() {
2080   360.774529 |   1)               |                                      wake_up_bit() {
2081   360.774529 |   1)               |                                        bit_waitqueue() {
2082   360.774530 |   1)   0.594 us    |                                          __phys_addr();
2083
2084
2085 The function name is always displayed after the closing bracket
2086 for a function if the start of that function is not in the
2087 trace buffer.
2088
2089 Display of the function name after the closing bracket may be
2090 enabled for functions whose start is in the trace buffer,
2091 allowing easier searching with grep for function durations.
2092 It is default disabled.
2093
2094         hide: echo nofuncgraph-tail > trace_options
2095         show: echo funcgraph-tail > trace_options
2096
2097   Example with nofuncgraph-tail (default):
2098   0)               |      putname() {
2099   0)               |        kmem_cache_free() {
2100   0)   0.518 us    |          __phys_addr();
2101   0)   1.757 us    |        }
2102   0)   2.861 us    |      }
2103
2104   Example with funcgraph-tail:
2105   0)               |      putname() {
2106   0)               |        kmem_cache_free() {
2107   0)   0.518 us    |          __phys_addr();
2108   0)   1.757 us    |        } /* kmem_cache_free() */
2109   0)   2.861 us    |      } /* putname() */
2110
2111 You can put some comments on specific functions by using
2112 trace_printk() For example, if you want to put a comment inside
2113 the __might_sleep() function, you just have to include
2114 <linux/ftrace.h> and call trace_printk() inside __might_sleep()
2115
2116 trace_printk("I'm a comment!\n")
2117
2118 will produce:
2119
2120  1)               |             __might_sleep() {
2121  1)               |                /* I'm a comment! */
2122  1)   1.449 us    |             }
2123
2124
2125 You might find other useful features for this tracer in the
2126 following "dynamic ftrace" section such as tracing only specific
2127 functions or tasks.
2128
2129 dynamic ftrace
2130 --------------
2131
2132 If CONFIG_DYNAMIC_FTRACE is set, the system will run with
2133 virtually no overhead when function tracing is disabled. The way
2134 this works is the mcount function call (placed at the start of
2135 every kernel function, produced by the -pg switch in gcc),
2136 starts of pointing to a simple return. (Enabling FTRACE will
2137 include the -pg switch in the compiling of the kernel.)
2138
2139 At compile time every C file object is run through the
2140 recordmcount program (located in the scripts directory). This
2141 program will parse the ELF headers in the C object to find all
2142 the locations in the .text section that call mcount. (Note, only
2143 white listed .text sections are processed, since processing other
2144 sections like .init.text may cause races due to those sections
2145 being freed unexpectedly).
2146
2147 A new section called "__mcount_loc" is created that holds
2148 references to all the mcount call sites in the .text section.
2149 The recordmcount program re-links this section back into the
2150 original object. The final linking stage of the kernel will add all these
2151 references into a single table.
2152
2153 On boot up, before SMP is initialized, the dynamic ftrace code
2154 scans this table and updates all the locations into nops. It
2155 also records the locations, which are added to the
2156 available_filter_functions list.  Modules are processed as they
2157 are loaded and before they are executed.  When a module is
2158 unloaded, it also removes its functions from the ftrace function
2159 list. This is automatic in the module unload code, and the
2160 module author does not need to worry about it.
2161
2162 When tracing is enabled, the process of modifying the function
2163 tracepoints is dependent on architecture. The old method is to use
2164 kstop_machine to prevent races with the CPUs executing code being
2165 modified (which can cause the CPU to do undesirable things, especially
2166 if the modified code crosses cache (or page) boundaries), and the nops are
2167 patched back to calls. But this time, they do not call mcount
2168 (which is just a function stub). They now call into the ftrace
2169 infrastructure.
2170
2171 The new method of modifying the function tracepoints is to place
2172 a breakpoint at the location to be modified, sync all CPUs, modify
2173 the rest of the instruction not covered by the breakpoint. Sync
2174 all CPUs again, and then remove the breakpoint with the finished
2175 version to the ftrace call site.
2176
2177 Some archs do not even need to monkey around with the synchronization,
2178 and can just slap the new code on top of the old without any
2179 problems with other CPUs executing it at the same time.
2180
2181 One special side-effect to the recording of the functions being
2182 traced is that we can now selectively choose which functions we
2183 wish to trace and which ones we want the mcount calls to remain
2184 as nops.
2185
2186 Two files are used, one for enabling and one for disabling the
2187 tracing of specified functions. They are:
2188
2189   set_ftrace_filter
2190
2191 and
2192
2193   set_ftrace_notrace
2194
2195 A list of available functions that you can add to these files is
2196 listed in:
2197
2198    available_filter_functions
2199
2200  # cat available_filter_functions
2201 put_prev_task_idle
2202 kmem_cache_create
2203 pick_next_task_rt
2204 get_online_cpus
2205 pick_next_task_fair
2206 mutex_lock
2207 [...]
2208
2209 If I am only interested in sys_nanosleep and hrtimer_interrupt:
2210
2211  # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
2212  # echo function > current_tracer
2213  # echo 1 > tracing_on
2214  # usleep 1
2215  # echo 0 > tracing_on
2216  # cat trace
2217 # tracer: function
2218 #
2219 # entries-in-buffer/entries-written: 5/5   #P:4
2220 #
2221 #                              _-----=> irqs-off
2222 #                             / _----=> need-resched
2223 #                            | / _---=> hardirq/softirq
2224 #                            || / _--=> preempt-depth
2225 #                            ||| /     delay
2226 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2227 #              | |       |   ||||       |         |
2228           usleep-2665  [001] ....  4186.475355: sys_nanosleep <-system_call_fastpath
2229           <idle>-0     [001] d.h1  4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
2230           usleep-2665  [001] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2231           <idle>-0     [003] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2232           <idle>-0     [002] d.h1  4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt
2233
2234 To see which functions are being traced, you can cat the file:
2235
2236  # cat set_ftrace_filter
2237 hrtimer_interrupt
2238 sys_nanosleep
2239
2240
2241 Perhaps this is not enough. The filters also allow simple wild
2242 cards. Only the following are currently available
2243
2244   <match>*  - will match functions that begin with <match>
2245   *<match>  - will match functions that end with <match>
2246   *<match>* - will match functions that have <match> in it
2247
2248 These are the only wild cards which are supported.
2249
2250   <match>*<match> will not work.
2251
2252 Note: It is better to use quotes to enclose the wild cards,
2253       otherwise the shell may expand the parameters into names
2254       of files in the local directory.
2255
2256  # echo 'hrtimer_*' > set_ftrace_filter
2257
2258 Produces:
2259
2260 # tracer: function
2261 #
2262 # entries-in-buffer/entries-written: 897/897   #P:4
2263 #
2264 #                              _-----=> irqs-off
2265 #                             / _----=> need-resched
2266 #                            | / _---=> hardirq/softirq
2267 #                            || / _--=> preempt-depth
2268 #                            ||| /     delay
2269 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2270 #              | |       |   ||||       |         |
2271           <idle>-0     [003] dN.1  4228.547803: hrtimer_cancel <-tick_nohz_idle_exit
2272           <idle>-0     [003] dN.1  4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel
2273           <idle>-0     [003] dN.2  4228.547805: hrtimer_force_reprogram <-__remove_hrtimer
2274           <idle>-0     [003] dN.1  4228.547805: hrtimer_forward <-tick_nohz_idle_exit
2275           <idle>-0     [003] dN.1  4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
2276           <idle>-0     [003] d..1  4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt
2277           <idle>-0     [003] d..1  4228.547859: hrtimer_start <-__tick_nohz_idle_enter
2278           <idle>-0     [003] d..2  4228.547860: hrtimer_force_reprogram <-__rem
2279
2280 Notice that we lost the sys_nanosleep.
2281
2282  # cat set_ftrace_filter
2283 hrtimer_run_queues
2284 hrtimer_run_pending
2285 hrtimer_init
2286 hrtimer_cancel
2287 hrtimer_try_to_cancel
2288 hrtimer_forward
2289 hrtimer_start
2290 hrtimer_reprogram
2291 hrtimer_force_reprogram
2292 hrtimer_get_next_event
2293 hrtimer_interrupt
2294 hrtimer_nanosleep
2295 hrtimer_wakeup
2296 hrtimer_get_remaining
2297 hrtimer_get_res
2298 hrtimer_init_sleeper
2299
2300
2301 This is because the '>' and '>>' act just like they do in bash.
2302 To rewrite the filters, use '>'
2303 To append to the filters, use '>>'
2304
2305 To clear out a filter so that all functions will be recorded
2306 again:
2307
2308  # echo > set_ftrace_filter
2309  # cat set_ftrace_filter
2310  #
2311
2312 Again, now we want to append.
2313
2314  # echo sys_nanosleep > set_ftrace_filter
2315  # cat set_ftrace_filter
2316 sys_nanosleep
2317  # echo 'hrtimer_*' >> set_ftrace_filter
2318  # cat set_ftrace_filter
2319 hrtimer_run_queues
2320 hrtimer_run_pending
2321 hrtimer_init
2322 hrtimer_cancel
2323 hrtimer_try_to_cancel
2324 hrtimer_forward
2325 hrtimer_start
2326 hrtimer_reprogram
2327 hrtimer_force_reprogram
2328 hrtimer_get_next_event
2329 hrtimer_interrupt
2330 sys_nanosleep
2331 hrtimer_nanosleep
2332 hrtimer_wakeup
2333 hrtimer_get_remaining
2334 hrtimer_get_res
2335 hrtimer_init_sleeper
2336
2337
2338 The set_ftrace_notrace prevents those functions from being
2339 traced.
2340
2341  # echo '*preempt*' '*lock*' > set_ftrace_notrace
2342
2343 Produces:
2344
2345 # tracer: function
2346 #
2347 # entries-in-buffer/entries-written: 39608/39608   #P:4
2348 #
2349 #                              _-----=> irqs-off
2350 #                             / _----=> need-resched
2351 #                            | / _---=> hardirq/softirq
2352 #                            || / _--=> preempt-depth
2353 #                            ||| /     delay
2354 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2355 #              | |       |   ||||       |         |
2356             bash-1994  [000] ....  4342.324896: file_ra_state_init <-do_dentry_open
2357             bash-1994  [000] ....  4342.324897: open_check_o_direct <-do_last
2358             bash-1994  [000] ....  4342.324897: ima_file_check <-do_last
2359             bash-1994  [000] ....  4342.324898: process_measurement <-ima_file_check
2360             bash-1994  [000] ....  4342.324898: ima_get_action <-process_measurement
2361             bash-1994  [000] ....  4342.324898: ima_match_policy <-ima_get_action
2362             bash-1994  [000] ....  4342.324899: do_truncate <-do_last
2363             bash-1994  [000] ....  4342.324899: should_remove_suid <-do_truncate
2364             bash-1994  [000] ....  4342.324899: notify_change <-do_truncate
2365             bash-1994  [000] ....  4342.324900: current_fs_time <-notify_change
2366             bash-1994  [000] ....  4342.324900: current_kernel_time <-current_fs_time
2367             bash-1994  [000] ....  4342.324900: timespec_trunc <-current_fs_time
2368
2369 We can see that there's no more lock or preempt tracing.
2370
2371
2372 Dynamic ftrace with the function graph tracer
2373 ---------------------------------------------
2374
2375 Although what has been explained above concerns both the
2376 function tracer and the function-graph-tracer, there are some
2377 special features only available in the function-graph tracer.
2378
2379 If you want to trace only one function and all of its children,
2380 you just have to echo its name into set_graph_function:
2381
2382  echo __do_fault > set_graph_function
2383
2384 will produce the following "expanded" trace of the __do_fault()
2385 function:
2386
2387  0)               |  __do_fault() {
2388  0)               |    filemap_fault() {
2389  0)               |      find_lock_page() {
2390  0)   0.804 us    |        find_get_page();
2391  0)               |        __might_sleep() {
2392  0)   1.329 us    |        }
2393  0)   3.904 us    |      }
2394  0)   4.979 us    |    }
2395  0)   0.653 us    |    _spin_lock();
2396  0)   0.578 us    |    page_add_file_rmap();
2397  0)   0.525 us    |    native_set_pte_at();
2398  0)   0.585 us    |    _spin_unlock();
2399  0)               |    unlock_page() {
2400  0)   0.541 us    |      page_waitqueue();
2401  0)   0.639 us    |      __wake_up_bit();
2402  0)   2.786 us    |    }
2403  0) + 14.237 us   |  }
2404  0)               |  __do_fault() {
2405  0)               |    filemap_fault() {
2406  0)               |      find_lock_page() {
2407  0)   0.698 us    |        find_get_page();
2408  0)               |        __might_sleep() {
2409  0)   1.412 us    |        }
2410  0)   3.950 us    |      }
2411  0)   5.098 us    |    }
2412  0)   0.631 us    |    _spin_lock();
2413  0)   0.571 us    |    page_add_file_rmap();
2414  0)   0.526 us    |    native_set_pte_at();
2415  0)   0.586 us    |    _spin_unlock();
2416  0)               |    unlock_page() {
2417  0)   0.533 us    |      page_waitqueue();
2418  0)   0.638 us    |      __wake_up_bit();
2419  0)   2.793 us    |    }
2420  0) + 14.012 us   |  }
2421
2422 You can also expand several functions at once:
2423
2424  echo sys_open > set_graph_function
2425  echo sys_close >> set_graph_function
2426
2427 Now if you want to go back to trace all functions you can clear
2428 this special filter via:
2429
2430  echo > set_graph_function
2431
2432
2433 ftrace_enabled
2434 --------------
2435
2436 Note, the proc sysctl ftrace_enable is a big on/off switch for the
2437 function tracer. By default it is enabled (when function tracing is
2438 enabled in the kernel). If it is disabled, all function tracing is
2439 disabled. This includes not only the function tracers for ftrace, but
2440 also for any other uses (perf, kprobes, stack tracing, profiling, etc).
2441
2442 Please disable this with care.
2443
2444 This can be disable (and enabled) with:
2445
2446   sysctl kernel.ftrace_enabled=0
2447   sysctl kernel.ftrace_enabled=1
2448
2449  or
2450
2451   echo 0 > /proc/sys/kernel/ftrace_enabled
2452   echo 1 > /proc/sys/kernel/ftrace_enabled
2453
2454
2455 Filter commands
2456 ---------------
2457
2458 A few commands are supported by the set_ftrace_filter interface.
2459 Trace commands have the following format:
2460
2461 <function>:<command>:<parameter>
2462
2463 The following commands are supported:
2464
2465 - mod
2466   This command enables function filtering per module. The
2467   parameter defines the module. For example, if only the write*
2468   functions in the ext3 module are desired, run:
2469
2470    echo 'write*:mod:ext3' > set_ftrace_filter
2471
2472   This command interacts with the filter in the same way as
2473   filtering based on function names. Thus, adding more functions
2474   in a different module is accomplished by appending (>>) to the
2475   filter file. Remove specific module functions by prepending
2476   '!':
2477
2478    echo '!writeback*:mod:ext3' >> set_ftrace_filter
2479
2480   Mod command supports module globbing. Disable tracing for all
2481   functions except a specific module:
2482
2483    echo '!*:mod:!ext3' >> set_ftrace_filter
2484
2485   Disable tracing for all modules, but still trace kernel:
2486
2487    echo '!*:mod:*' >> set_ftrace_filter
2488
2489   Enable filter only for kernel:
2490
2491    echo '*write*:mod:!*' >> set_ftrace_filter
2492
2493   Enable filter for module globbing:
2494
2495    echo '*write*:mod:*snd*' >> set_ftrace_filter
2496
2497 - traceon/traceoff
2498   These commands turn tracing on and off when the specified
2499   functions are hit. The parameter determines how many times the
2500   tracing system is turned on and off. If unspecified, there is
2501   no limit. For example, to disable tracing when a schedule bug
2502   is hit the first 5 times, run:
2503
2504    echo '__schedule_bug:traceoff:5' > set_ftrace_filter
2505
2506   To always disable tracing when __schedule_bug is hit:
2507
2508    echo '__schedule_bug:traceoff' > set_ftrace_filter
2509
2510   These commands are cumulative whether or not they are appended
2511   to set_ftrace_filter. To remove a command, prepend it by '!'
2512   and drop the parameter:
2513
2514    echo '!__schedule_bug:traceoff:0' > set_ftrace_filter
2515
2516     The above removes the traceoff command for __schedule_bug
2517     that have a counter. To remove commands without counters:
2518
2519    echo '!__schedule_bug:traceoff' > set_ftrace_filter
2520
2521 - snapshot
2522   Will cause a snapshot to be triggered when the function is hit.
2523
2524    echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter
2525
2526   To only snapshot once:
2527
2528    echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter
2529
2530   To remove the above commands:
2531
2532    echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter
2533    echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter
2534
2535 - enable_event/disable_event
2536   These commands can enable or disable a trace event. Note, because
2537   function tracing callbacks are very sensitive, when these commands
2538   are registered, the trace point is activated, but disabled in
2539   a "soft" mode. That is, the tracepoint will be called, but
2540   just will not be traced. The event tracepoint stays in this mode
2541   as long as there's a command that triggers it.
2542
2543    echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \
2544          set_ftrace_filter
2545
2546   The format is:
2547
2548     <function>:enable_event:<system>:<event>[:count]
2549     <function>:disable_event:<system>:<event>[:count]
2550
2551   To remove the events commands:
2552
2553
2554    echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \
2555          set_ftrace_filter
2556    echo '!schedule:disable_event:sched:sched_switch' > \
2557          set_ftrace_filter
2558
2559 - dump
2560   When the function is hit, it will dump the contents of the ftrace
2561   ring buffer to the console. This is useful if you need to debug
2562   something, and want to dump the trace when a certain function
2563   is hit. Perhaps its a function that is called before a tripple
2564   fault happens and does not allow you to get a regular dump.
2565
2566 - cpudump
2567   When the function is hit, it will dump the contents of the ftrace
2568   ring buffer for the current CPU to the console. Unlike the "dump"
2569   command, it only prints out the contents of the ring buffer for the
2570   CPU that executed the function that triggered the dump.
2571
2572 trace_pipe
2573 ----------
2574
2575 The trace_pipe outputs the same content as the trace file, but
2576 the effect on the tracing is different. Every read from
2577 trace_pipe is consumed. This means that subsequent reads will be
2578 different. The trace is live.
2579
2580  # echo function > current_tracer
2581  # cat trace_pipe > /tmp/trace.out &
2582 [1] 4153
2583  # echo 1 > tracing_on
2584  # usleep 1
2585  # echo 0 > tracing_on
2586  # cat trace
2587 # tracer: function
2588 #
2589 # entries-in-buffer/entries-written: 0/0   #P:4
2590 #
2591 #                              _-----=> irqs-off
2592 #                             / _----=> need-resched
2593 #                            | / _---=> hardirq/softirq
2594 #                            || / _--=> preempt-depth
2595 #                            ||| /     delay
2596 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2597 #              | |       |   ||||       |         |
2598
2599  #
2600  # cat /tmp/trace.out
2601             bash-1994  [000] ....  5281.568961: mutex_unlock <-rb_simple_write
2602             bash-1994  [000] ....  5281.568963: __mutex_unlock_slowpath <-mutex_unlock
2603             bash-1994  [000] ....  5281.568963: __fsnotify_parent <-fsnotify_modify
2604             bash-1994  [000] ....  5281.568964: fsnotify <-fsnotify_modify
2605             bash-1994  [000] ....  5281.568964: __srcu_read_lock <-fsnotify
2606             bash-1994  [000] ....  5281.568964: add_preempt_count <-__srcu_read_lock
2607             bash-1994  [000] ...1  5281.568965: sub_preempt_count <-__srcu_read_lock
2608             bash-1994  [000] ....  5281.568965: __srcu_read_unlock <-fsnotify
2609             bash-1994  [000] ....  5281.568967: sys_dup2 <-system_call_fastpath
2610
2611
2612 Note, reading the trace_pipe file will block until more input is
2613 added.
2614
2615 trace entries
2616 -------------
2617
2618 Having too much or not enough data can be troublesome in
2619 diagnosing an issue in the kernel. The file buffer_size_kb is
2620 used to modify the size of the internal trace buffers. The
2621 number listed is the number of entries that can be recorded per
2622 CPU. To know the full size, multiply the number of possible CPUs
2623 with the number of entries.
2624
2625  # cat buffer_size_kb
2626 1408 (units kilobytes)
2627
2628 Or simply read buffer_total_size_kb
2629
2630  # cat buffer_total_size_kb
2631 5632
2632
2633 To modify the buffer, simple echo in a number (in 1024 byte segments).
2634
2635  # echo 10000 > buffer_size_kb
2636  # cat buffer_size_kb
2637 10000 (units kilobytes)
2638
2639 It will try to allocate as much as possible. If you allocate too
2640 much, it can cause Out-Of-Memory to trigger.
2641
2642  # echo 1000000000000 > buffer_size_kb
2643 -bash: echo: write error: Cannot allocate memory
2644  # cat buffer_size_kb
2645 85
2646
2647 The per_cpu buffers can be changed individually as well:
2648
2649  # echo 10000 > per_cpu/cpu0/buffer_size_kb
2650  # echo 100 > per_cpu/cpu1/buffer_size_kb
2651
2652 When the per_cpu buffers are not the same, the buffer_size_kb
2653 at the top level will just show an X
2654
2655  # cat buffer_size_kb
2656 X
2657
2658 This is where the buffer_total_size_kb is useful:
2659
2660  # cat buffer_total_size_kb
2661 12916
2662
2663 Writing to the top level buffer_size_kb will reset all the buffers
2664 to be the same again.
2665
2666 Snapshot
2667 --------
2668 CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
2669 available to all non latency tracers. (Latency tracers which
2670 record max latency, such as "irqsoff" or "wakeup", can't use
2671 this feature, since those are already using the snapshot
2672 mechanism internally.)
2673
2674 Snapshot preserves a current trace buffer at a particular point
2675 in time without stopping tracing. Ftrace swaps the current
2676 buffer with a spare buffer, and tracing continues in the new
2677 current (=previous spare) buffer.
2678
2679 The following debugfs files in "tracing" are related to this
2680 feature:
2681
2682   snapshot:
2683
2684         This is used to take a snapshot and to read the output
2685         of the snapshot. Echo 1 into this file to allocate a
2686         spare buffer and to take a snapshot (swap), then read
2687         the snapshot from this file in the same format as
2688         "trace" (described above in the section "The File
2689         System"). Both reads snapshot and tracing are executable
2690         in parallel. When the spare buffer is allocated, echoing
2691         0 frees it, and echoing else (positive) values clear the
2692         snapshot contents.
2693         More details are shown in the table below.
2694
2695         status\input  |     0      |     1      |    else    |
2696         --------------+------------+------------+------------+
2697         not allocated |(do nothing)| alloc+swap |(do nothing)|
2698         --------------+------------+------------+------------+
2699         allocated     |    free    |    swap    |   clear    |
2700         --------------+------------+------------+------------+
2701
2702 Here is an example of using the snapshot feature.
2703
2704  # echo 1 > events/sched/enable
2705  # echo 1 > snapshot
2706  # cat snapshot
2707 # tracer: nop
2708 #
2709 # entries-in-buffer/entries-written: 71/71   #P:8
2710 #
2711 #                              _-----=> irqs-off
2712 #                             / _----=> need-resched
2713 #                            | / _---=> hardirq/softirq
2714 #                            || / _--=> preempt-depth
2715 #                            ||| /     delay
2716 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2717 #              | |       |   ||||       |         |
2718           <idle>-0     [005] d...  2440.603828: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2242 next_prio=120
2719            sleep-2242  [005] d...  2440.603846: sched_switch: prev_comm=snapshot-test-2 prev_pid=2242 prev_prio=120 prev_state=R ==> next_comm=kworker/5:1 next_pid=60 next_prio=120
2720 [...]
2721           <idle>-0     [002] d...  2440.707230: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2229 next_prio=120
2722
2723  # cat trace
2724 # tracer: nop
2725 #
2726 # entries-in-buffer/entries-written: 77/77   #P:8
2727 #
2728 #                              _-----=> irqs-off
2729 #                             / _----=> need-resched
2730 #                            | / _---=> hardirq/softirq
2731 #                            || / _--=> preempt-depth
2732 #                            ||| /     delay
2733 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2734 #              | |       |   ||||       |         |
2735           <idle>-0     [007] d...  2440.707395: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2243 next_prio=120
2736  snapshot-test-2-2229  [002] d...  2440.707438: sched_switch: prev_comm=snapshot-test-2 prev_pid=2229 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
2737 [...]
2738
2739
2740 If you try to use this snapshot feature when current tracer is
2741 one of the latency tracers, you will get the following results.
2742
2743  # echo wakeup > current_tracer
2744  # echo 1 > snapshot
2745 bash: echo: write error: Device or resource busy
2746  # cat snapshot
2747 cat: snapshot: Device or resource busy
2748
2749
2750 Instances
2751 ---------
2752 In the debugfs tracing directory is a directory called "instances".
2753 This directory can have new directories created inside of it using
2754 mkdir, and removing directories with rmdir. The directory created
2755 with mkdir in this directory will already contain files and other
2756 directories after it is created.
2757
2758  # mkdir instances/foo
2759  # ls instances/foo
2760 buffer_size_kb  buffer_total_size_kb  events  free_buffer  per_cpu
2761 set_event  snapshot  trace  trace_clock  trace_marker  trace_options
2762 trace_pipe  tracing_on
2763
2764 As you can see, the new directory looks similar to the tracing directory
2765 itself. In fact, it is very similar, except that the buffer and
2766 events are agnostic from the main director, or from any other
2767 instances that are created.
2768
2769 The files in the new directory work just like the files with the
2770 same name in the tracing directory except the buffer that is used
2771 is a separate and new buffer. The files affect that buffer but do not
2772 affect the main buffer with the exception of trace_options. Currently,
2773 the trace_options affect all instances and the top level buffer
2774 the same, but this may change in future releases. That is, options
2775 may become specific to the instance they reside in.
2776
2777 Notice that none of the function tracer files are there, nor is
2778 current_tracer and available_tracers. This is because the buffers
2779 can currently only have events enabled for them.
2780
2781  # mkdir instances/foo
2782  # mkdir instances/bar
2783  # mkdir instances/zoot
2784  # echo 100000 > buffer_size_kb
2785  # echo 1000 > instances/foo/buffer_size_kb
2786  # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb
2787  # echo function > current_trace
2788  # echo 1 > instances/foo/events/sched/sched_wakeup/enable
2789  # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable
2790  # echo 1 > instances/foo/events/sched/sched_switch/enable
2791  # echo 1 > instances/bar/events/irq/enable
2792  # echo 1 > instances/zoot/events/syscalls/enable
2793  # cat trace_pipe
2794 CPU:2 [LOST 11745 EVENTS]
2795             bash-2044  [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist
2796             bash-2044  [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave
2797             bash-2044  [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist
2798             bash-2044  [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist
2799             bash-2044  [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock
2800             bash-2044  [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype
2801             bash-2044  [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist
2802             bash-2044  [002] d... 10594.481034: zone_statistics <-get_page_from_freelist
2803             bash-2044  [002] d... 10594.481034: __inc_zone_state <-zone_statistics
2804             bash-2044  [002] d... 10594.481034: __inc_zone_state <-zone_statistics
2805             bash-2044  [002] .... 10594.481035: arch_dup_task_struct <-copy_process
2806 [...]
2807
2808  # cat instances/foo/trace_pipe
2809             bash-1998  [000] d..4   136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
2810             bash-1998  [000] dN.4   136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
2811           <idle>-0     [003] d.h3   136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003
2812           <idle>-0     [003] d..3   136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120
2813      rcu_preempt-9     [003] d..3   136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
2814             bash-1998  [000] d..4   136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
2815             bash-1998  [000] dN.4   136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
2816             bash-1998  [000] d..3   136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120
2817      kworker/0:1-59    [000] d..4   136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001
2818      kworker/0:1-59    [000] d..3   136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120
2819 [...]
2820
2821  # cat instances/bar/trace_pipe
2822      migration/1-14    [001] d.h3   138.732674: softirq_raise: vec=3 [action=NET_RX]
2823           <idle>-0     [001] dNh3   138.732725: softirq_raise: vec=3 [action=NET_RX]
2824             bash-1998  [000] d.h1   138.733101: softirq_raise: vec=1 [action=TIMER]
2825             bash-1998  [000] d.h1   138.733102: softirq_raise: vec=9 [action=RCU]
2826             bash-1998  [000] ..s2   138.733105: softirq_entry: vec=1 [action=TIMER]
2827             bash-1998  [000] ..s2   138.733106: softirq_exit: vec=1 [action=TIMER]
2828             bash-1998  [000] ..s2   138.733106: softirq_entry: vec=9 [action=RCU]
2829             bash-1998  [000] ..s2   138.733109: softirq_exit: vec=9 [action=RCU]
2830             sshd-1995  [001] d.h1   138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4
2831             sshd-1995  [001] d.h1   138.733280: irq_handler_exit: irq=21 ret=unhandled
2832             sshd-1995  [001] d.h1   138.733281: irq_handler_entry: irq=21 name=eth0
2833             sshd-1995  [001] d.h1   138.733283: irq_handler_exit: irq=21 ret=handled
2834 [...]
2835
2836  # cat instances/zoot/trace
2837 # tracer: nop
2838 #
2839 # entries-in-buffer/entries-written: 18996/18996   #P:4
2840 #
2841 #                              _-----=> irqs-off
2842 #                             / _----=> need-resched
2843 #                            | / _---=> hardirq/softirq
2844 #                            || / _--=> preempt-depth
2845 #                            ||| /     delay
2846 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
2847 #              | |       |   ||||       |         |
2848             bash-1998  [000] d...   140.733501: sys_write -> 0x2
2849             bash-1998  [000] d...   140.733504: sys_dup2(oldfd: a, newfd: 1)
2850             bash-1998  [000] d...   140.733506: sys_dup2 -> 0x1
2851             bash-1998  [000] d...   140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0)
2852             bash-1998  [000] d...   140.733509: sys_fcntl -> 0x1
2853             bash-1998  [000] d...   140.733510: sys_close(fd: a)
2854             bash-1998  [000] d...   140.733510: sys_close -> 0x0
2855             bash-1998  [000] d...   140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8)
2856             bash-1998  [000] d...   140.733515: sys_rt_sigprocmask -> 0x0
2857             bash-1998  [000] d...   140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8)
2858             bash-1998  [000] d...   140.733516: sys_rt_sigaction -> 0x0
2859
2860 You can see that the trace of the top most trace buffer shows only
2861 the function tracing. The foo instance displays wakeups and task
2862 switches.
2863
2864 To remove the instances, simply delete their directories:
2865
2866  # rmdir instances/foo
2867  # rmdir instances/bar
2868  # rmdir instances/zoot
2869
2870 Note, if a process has a trace file open in one of the instance
2871 directories, the rmdir will fail with EBUSY.
2872
2873
2874 Stack trace
2875 -----------
2876 Since the kernel has a fixed sized stack, it is important not to
2877 waste it in functions. A kernel developer must be conscience of
2878 what they allocate on the stack. If they add too much, the system
2879 can be in danger of a stack overflow, and corruption will occur,
2880 usually leading to a system panic.
2881
2882 There are some tools that check this, usually with interrupts
2883 periodically checking usage. But if you can perform a check
2884 at every function call that will become very useful. As ftrace provides
2885 a function tracer, it makes it convenient to check the stack size
2886 at every function call. This is enabled via the stack tracer.
2887
2888 CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
2889 To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.
2890
2891  # echo 1 > /proc/sys/kernel/stack_tracer_enabled
2892
2893 You can also enable it from the kernel command line to trace
2894 the stack size of the kernel during boot up, by adding "stacktrace"
2895 to the kernel command line parameter.
2896
2897 After running it for a few minutes, the output looks like:
2898
2899  # cat stack_max_size
2900 2928
2901
2902  # cat stack_trace
2903         Depth    Size   Location    (18 entries)
2904         -----    ----   --------
2905   0)     2928     224   update_sd_lb_stats+0xbc/0x4ac
2906   1)     2704     160   find_busiest_group+0x31/0x1f1
2907   2)     2544     256   load_balance+0xd9/0x662
2908   3)     2288      80   idle_balance+0xbb/0x130
2909   4)     2208     128   __schedule+0x26e/0x5b9
2910   5)     2080      16   schedule+0x64/0x66
2911   6)     2064     128   schedule_timeout+0x34/0xe0
2912   7)     1936     112   wait_for_common+0x97/0xf1
2913   8)     1824      16   wait_for_completion+0x1d/0x1f
2914   9)     1808     128   flush_work+0xfe/0x119
2915  10)     1680      16   tty_flush_to_ldisc+0x1e/0x20
2916  11)     1664      48   input_available_p+0x1d/0x5c
2917  12)     1616      48   n_tty_poll+0x6d/0x134
2918  13)     1568      64   tty_poll+0x64/0x7f
2919  14)     1504     880   do_select+0x31e/0x511
2920  15)      624     400   core_sys_select+0x177/0x216
2921  16)      224      96   sys_select+0x91/0xb9
2922  17)      128     128   system_call_fastpath+0x16/0x1b
2923
2924 Note, if -mfentry is being used by gcc, functions get traced before
2925 they set up the stack frame. This means that leaf level functions
2926 are not tested by the stack tracer when -mfentry is used.
2927
2928 Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.
2929
2930 ---------
2931
2932 More details can be found in the source code, in the
2933 kernel/trace/*.c files.