Yunlong Song [Tue, 31 Mar 2015 13:46:29 +0000 (21:46 +0800)]
perf sched replay: Increase the MAX_PID value to fix assertion failure problem
Current MAX_PID is only 65536, which will cause assertion failure problem
when CPU cores are more than 64 in x86_64.
This is because the pid_max value in x86_64 is at least
PIDS_PER_CPU_DEFAULT * num_possible_cpus() (see function pidmap_init
defined in kernel/pid.c), where PIDS_PER_CPU_DEFAULT is 1024 (defined in
include/linux/threads.h).
Thus for MAX_PID = 65536, the correspoinding CPU cores are
65536/1024=64. This is obviously not enough at all for x86_64, and will
cause an assertion failure problem due to BUG_ON(pid >= MAX_PID) in the
codes.
We increase MAX_PID value from 65536 to 1024*1000, which can be used in
x86_64 with 1000 cores.
This number is finally decided according to the limitation of stack size
of calling process.
Use 'ulimit -a', the result shows the stack size of any process is 8192
Kbytes, which is defined in include/uapi/linux/resource.h (#define
_STK_LIM (8*1024*1024)).
Thus we choose a large enough value for MAX_PID, and make it satisfy to
the limitation of the stack size, i.e., making the perf process take up
a memory space just smaller than 8192 Kbytes.
We have calculated and tested that 1024*1000 is OK for MAX_PID.
This means perf sched replay can now be used with at most 1000 cores in
x86_64 without any assertion failure problem.
Example:
Test environment: x86_64 with 160 cores
$ cat /proc/sys/kernel/pid_max
163840
Before this patch:
$ perf sched replay
run measurement overhead: 240 nsecs
sleep measurement overhead: 55379 nsecs
the run test took 1000004 nsecs
the sleep test took 1059424 nsecs
perf: builtin-sched.c:330: register_pid: Assertion `!(pid >= 65536)'
failed.
Aborted
After this patch:
$ perf sched replay
run measurement overhead: 221 nsecs
sleep measurement overhead: 55397 nsecs
the run test took 999920 nsecs
the sleep test took 1053313 nsecs
nr_run_events: 10
nr_sleep_events: 1562
nr_wakeup_events: 5
task 0 ( :1: 1), nr_events: 1
task 1 ( :2: 2), nr_events: 1
task 2 ( :3: 3), nr_events: 1
task 3 ( :5: 5), nr_events: 1
...
Yunlong Song [Tue, 31 Mar 2015 13:46:28 +0000 (21:46 +0800)]
perf sched replay: Use struct task_desc instead of struct task_task for correct meaning
There is no struct task_task at all, thus it is a typo error in the old
commits, now fix it to what it should be in order to avoid unnecessary
misunderstanding.
Jiri Olsa [Mon, 6 Apr 2015 05:36:08 +0000 (14:36 +0900)]
perf kmem: Respect -i option
Currently the perf kmem does not respect -i option.
Initializing the file.path properly after options get parsed.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1428298576-9785-2-git-send-email-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Mon, 6 Apr 2015 05:36:16 +0000 (14:36 +0900)]
tools lib traceevent: Honor operator priority
Currently it ignores operator priority and just sets processed args as a
right operand. But it could result in priority inversion in case that
the right operand is also a operator arg and its priority is lower.
For example, following print format is from new kmem events.
In this case, the right arg was '?' operator which has lower priority.
But it just sets the whole arg so making the output confusing - page was
always 0 or 1 since that's the result of logical operation.
With this patch, it can handle it properly like following:
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/1428298576-9785-10-git-send-email-namhyung@kernel.org
[ Replaced 'swap' with 'rotate' in a comment as requested by Steve and agreed by Namhyung ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Wang Nan [Tue, 7 Apr 2015 08:22:45 +0000 (08:22 +0000)]
perf kmaps: Check kmaps to make code more robust
This patch add checks in places where map__kmap is used to get kmaps
from struct kmap.
Error messages are added at map__kmap to warn invalid accessing of kmap
(for the case of !map->dso->kernel, kmap(map) does not exists at all).
Also, introduces map__kmaps() to warn uninitialized kmaps.
Reviewed-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Wang Nan <wangnan0@huawei.com> Cc: pi3orama@163.com Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Zefan Li <lizefan@huawei.com> Link: http://lkml.kernel.org/r/1428394966-131044-2-git-send-email-wangnan0@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
He Kuang [Tue, 7 Apr 2015 09:31:10 +0000 (17:31 +0800)]
perf evlist: Fix inverted logic in perf_mmap__empty
perf_evlist__mmap_consume() uses perf_mmap__empty() to judge whether
perf_mmap is empty and can be released. But the result is inverted so
fix it.
Signed-off-by: He Kuang <hekuang@huawei.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1428399071-7141-1-git-send-email-hekuang@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Yunlong Song [Thu, 2 Apr 2015 13:47:19 +0000 (21:47 +0800)]
perf data: Support using -f to override perf.data file ownership for 'convert'
Enable perf data convert to use perf.data when it is not owned by
current user or root.
Example:
# perf record ls
# chown Yunlong.Song:Yunlong.Song perf.data
# ls -al perf.data
-rw------- 1 Yunlong.Song Yunlong.Song 28260 Apr 2 17:35 perf.data
# id
uid=0(root) gid=0(root) groups=0(root),64(pkcs11)
Before this patch:
# perf data convert --to-ctf=./ctf-data/
File perf.data not owned by current user or root (use -f to override)
# perf data convert --to-ctf=./ctf-data/ -f
Error: unknown switch `f'
usage: perf data convert [<options>]
-v, --verbose be more verbose
-i, --input <file> input file name
--to-ctf ... Convert to CTF format
After this patch:
# perf data convert --to-ctf=./ctf-data/
File perf.data not owned by current user or root (use -f to override)
# perf data convert --to-ctf=./ctf-data/ -f
# ls ctf-data/
metadata perf_stream_0
--event <event> event selector. use 'perf list' to list
available events
--comm show the thread COMM next to its id
--tool_stats show tool stats
-e, --expr <expr> list of events to trace
-o, --output <file> output file name
-i, --input <file> Analyze events in file
-p, --pid <pid> trace events on existing process id
-t, --tid <tid> trace events on existing thread id
--filter-pids <float>
...
As shown above, the -f option does not work at all.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427982439-27388-10-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Yunlong Song [Thu, 2 Apr 2015 13:47:17 +0000 (21:47 +0800)]
perf timechart: Support using -f to override perf.data file ownership
Enable perf timechart to use perf.data when it is not owned by current
user or root.
Example:
# perf timechart record ls
# chown Yunlong.Song:Yunlong.Song perf.data
# ls -al perf.data
-rw------- 1 Yunlong.Song Yunlong.Song 5471744 Apr 2 15:15 perf.data
# id
uid=0(root) gid=0(root) groups=0(root),64(pkcs11)
Before this patch:
# perf timechart
File perf.data not owned by current user or root (use -f to override)
# perf timechart -f
Error: unknown switch `f'
usage: perf timechart [<options>] {record}
-i, --input <file> input file name
-o, --output <file> output file name
-w, --width <n> page width
--highlight <duration or task name>
highlight tasks. Pass duration in ns or process name.
-P, --power-only output power data only
-T, --tasks-only output processes data only
-p, --process <process>
process selector. Pass a pid or process name.
--symfs <directory>
Look for files with symbols relative to this directory
-n, --proc-num <n> min. number of tasks to print
-t, --topology sort CPUs according to topology
--io-skip-eagain skip EAGAIN errors
--io-min-time <time>
all IO faster than min-time will visually appear longer
--io-merge-dist <time>
merge events that are merge-dist us apart
As shown above, the -f option does not work at all.
After this patch:
# perf timechart
File perf.data not owned by current user or root (use -f to override)
# perf timechart -f
Written 0.0 seconds of trace to output.svg.
# cat output.svg
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg SYSTEM "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="1000" height="10110" version="1.1" xmlns="http://www.w3.org/2000/svg">
<defs>
<style type="text/css">
<![CDATA[
rect { stroke-width: 1; }
...
...
As shown above, the -f option really works now.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427982439-27388-9-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Yunlong Song [Thu, 2 Apr 2015 13:47:16 +0000 (21:47 +0800)]
perf script: Support using -f to override perf.data file ownership
Enable perf script to use perf.data when it is not owned by current user
or root. Change the short option name of --fields to -F to avoid confusion
with --force.
Example:
# perf record ls
# chown Yunlong.Song:Yunlong.Song perf.data
# ls -al perf.data
-rw------- 1 Yunlong.Song Yunlong.Song 28360 Apr 2 14:53 perf.data
# id
uid=0(root) gid=0(root) groups=0(root),64(pkcs11)
Before this patch:
# perf script
File perf.data not owned by current user or root (use -f to override)
# perf script -f
Error: switch `f' requires a value
As shown above, the -f option does not work at all. And -f is already
taken up by --fields, which makes --force confused, so change the short
option name of --fields to -F like what other perf commands do (e.g.
perf report -F) and use -f as the short option name of --force.
After this patch:
# perf script
File perf.data not owned by current user or root (use -f to override)
# perf script -f
:41298 41298 2590086.564226: 1 cycles: ffffffff8103efc6
native_write_msr_safe ([kernel.kallsyms])
:41298 41298 2590086.564244: 1 cycles: ffffffff8103efc6
native_write_msr_safe ([kernel.kallsyms])
:41298 41298 2590086.564249: 7 cycles: ffffffff8103efc6
native_write_msr_safe ([kernel.kallsyms])
:41298 41298 2590086.564255: 176 cycles: ffffffff8103efc6
native_write_msr_safe ([kernel.kallsyms])
ls 41298 2590086.567346: 4059 cycles: ffffffff8105a592
raise_softirq ([kernel.kallsyms])
ls 41298 2590086.567353: 3717 cycles: ffffffff8105a592
raise_softirq ([kernel.kallsyms])
ls 41298 2590086.567358: 63058 cycles: ffffffff8105a592
raise_softirq ([kernel.kallsyms])
ls 41298 2590086.567448: 1706255 cycles: 406ae0
[unknown] (/usr/bin/ls)
As shown above, the -f option really works now.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427982439-27388-8-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Yunlong Song [Thu, 2 Apr 2015 13:47:15 +0000 (21:47 +0800)]
perf mem: Support using -f to override perf.data file ownership
Enable perf mem to use perf.data when it is not owned by current user or
root.
Example:
# perf mem -t load record ls
# chown Yunlong.Song:Yunlong.Song perf.data
# ls -al perf.data
-rw------- 1 Yunlong.Song Yunlong.Song 16392 Apr 2 14:34 perf.data
# id
uid=0(root) gid=0(root) groups=0(root),64(pkcs11)
Before this patch:
# perf mem -D report
File perf.data not owned by current user or root (use -f to override)
# perf mem -D -f report
Error: unknown switch `f'
usage: perf mem [<options>] {record|report}
-t, --type <type> memory operations(load,store) Default load,store
-D, --dump-raw-samples
dump raw samples in ASCII
-U, --hide-unresolved
Only display entries resolved to a symbol
-i, --input <file> input file name
-C, --cpu <cpu> list of cpus to profile
-x, --field-separator <separator>
separator for columns, no spaces will be added
between columns '.' is reserved.
As shown above, the -f option does not work at all.
After this patch:
# perf mem -D report
File perf.data not owned by current user or root (use -f to override)
# perf mem -D -f report
# PID, TID, IP, ADDR, LOCAL WEIGHT, DSRC, SYMBOL
39095 39095 0xffffffff81127e40 0x016ffff887f45148338 8 0x68100142
/proc/kcore:perf_event_aux
39095 39095 0xffffffff8100a3fe 0xffff89007f8cb7d0 6 0x68100142
/proc/kcore:native_sched_clock
39095 39095 0xffffffff81309139 0xffff88bf44c9ded8 6 0x68100142
/proc/kcore:acpi_map_lookup
39095 39095 0xffffffff810f8c4c 0xffff89007f8ccd88 6 0x68100142
/proc/kcore:rcu_nmi_exit
39095 39095 0xffffffff81136346 0xffff88fea995dd50 6 0x68100142
/proc/kcore:unlock_page
39095 39095 0xffffffff812a64a2 0xffff88fea995dcc8 6 0x68100142
/proc/kcore:half_md4_transform
39095 39095 0x7f0cf877c7e9 0x25dfb94 6 0x68100142
/lib64/libc-2.19.so:__readdir64
39095 39095 0x7f0cf87575a3 0x7f0cf9163731 6 0x68100142
/lib64/libc-2.19.so:__strcoll_l
39095 39095 0xffffffff8116910e 0xffffea01c1bfbd50 23 0x68100242
/proc/kcore:page_remove_rmap
As shown above, the -f option really works now.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427982439-27388-7-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Yunlong Song [Thu, 2 Apr 2015 13:47:14 +0000 (21:47 +0800)]
perf lock: Support using -f to override perf.data file ownership
Enable perf lock to use perf.data when it is not owned by current user
or root.
Example:
# perf lock record ls
# chown Yunlong.Song:Yunlong.Song perf.data
# ls -al perf.data
-rw------- 1 Yunlong.Song Yunlong.Song 4880686 Apr 2 14:14 perf.data
# id
uid=0(root) gid=0(root) groups=0(root),64(pkcs11)
Before this patch:
# perf lock report
File perf.data not owned by current user or root (use -f to override)
Initializing perf session failed
# perf lock report -f
Error: unknown switch `f'
As shown above, the -f option does not work at all.
After this patch:
# perf lock report
File perf.data not owned by current user or root (use -f to override)
Initializing perf session failed
# perf lock report -f
Name acquired contended avg wait (ns) total wait (ns) ...
Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427982439-27388-6-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Yunlong Song [Thu, 2 Apr 2015 13:47:13 +0000 (21:47 +0800)]
perf kvm: Support using -f to override perf.data.guest file ownership
Enable perf kvm to use perf.data.guest when it is not owned by current
user or root.
Example:
# perf kvm stat record ls
# chown Yunlong.Song:Yunlong.Song perf.data.guest
# ls -al perf.data.guest
-rw------- 1 Yunlong.Song Yunlong.Song 4128937 Apr 2 11:05 perf.data.guest
# id
uid=0(root) gid=0(root) groups=0(root),64(pkcs11)
Before this patch:
# perf kvm stat report
File perf.data.guest not owned by current user or root (use -f to override)
Initializing perf session failed
# perf kvm stat report -f
Error: unknown switch `f'
usage: perf kvm stat report [<options>]
--event <report event>
event for reporting: vmexit, mmio (x86 only),
ioport (x86 only)
--vcpu <n> vcpu id to report
-k, --key <sort-key> key for sorting: sample(sort by samples
number) time (sort by avg time)
-p, --pid <pid> analyze events only for given process id(s)
As shown above, the -f option does not work at all.
After this patch:
# perf kvm stat report
File perf.data.guest not owned by current user or root (use -f to override)
Initializing perf session failed
# perf kvm stat report -f
Analyze events for all VMs, all VCPUs:
VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
Total Samples:0, Total events handled time:0.00us.
As shown above, the -f option really works now. Since we have not
launched any KVM related process, the result shows 0 sample here.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427982439-27388-5-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Yunlong Song [Thu, 2 Apr 2015 13:47:12 +0000 (21:47 +0800)]
perf kmem: Support using -f to override perf.data file ownership
Enable perf kmem to use perf.data when it is not owned by current user
or root.
Example:
# perf kmem record ls
# chown Yunlong.Song:Yunlong.Song perf.data
# ls -al perf.data
-rw------- 1 Yunlong.Song Yunlong.Song 5315665 Apr 2 10:54 perf.data
# id
uid=0(root) gid=0(root) groups=0(root),64(pkcs11)
Before this patch:
# perf kmem stat
File perf.data not owned by current user or root (use -f to override)
# perf kmem stat -f
Error: unknown switch `f'
usage: perf kmem [<options>] {record|stat}
-i, --input <file> input file name
-v, --verbose be more verbose (show symbol address, etc)
--caller show per-callsite statistics
--alloc show per-allocation statistics
-s, --sort <key[,key2...]>
sort by keys: ptr, call_site, bytes, hit,
pingpong, frag
-l, --line <num> show n lines
--raw-ip show raw ip instead of symbol
As shown above, the -f option does not work at all.
After this patch:
# perf kmem stat
File perf.data not owned by current user or root (use -f to override)
# perf kmem stat -f
SUMMARY
=======
Total bytes requested: 437599
Total bytes allocated: 615472
Total bytes wasted on internal fragmentation: 177873
Internal fragmentation: 28.900259%
Cross CPU allocations: 6/1192
As shown above, the -f option really works now.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427982439-27388-4-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Yunlong Song [Thu, 2 Apr 2015 13:47:11 +0000 (21:47 +0800)]
perf inject: Support using -f to override perf.data file ownership
Enable perf inject to use perf.data when it is not owned by current user
or root.
Example:
# perf record ls
# chown Yunlong.Song:Yunlong.Song perf.data
# ls -al perf.data
-rw------- 1 Yunlong.Song Yunlong.Song 28260 Apr 2 10:37 perf.data
# id
uid=0(root) gid=0(root) groups=0(root),64(pkcs11)
Before this patch:
# perf inject -v -b -i perf.data -o perf.data.new
File perf.data not owned by current user or root (use -f to override)
# perf inject -v -b -i perf.data -o perf.data.new -f
Error: unknown switch `f'
usage: perf inject [<options>]
-b, --build-ids Inject build-ids into the output stream
-i, --input <file> input file name
-o, --output <file> output file name
-s, --sched-stat Merge sched-stat and sched-switch for getting
events where and how long tasks slept
-v, --verbose be more verbose (show build ids, etc)
--kallsyms <file>
kallsyms pathname
As shown above, the -f option does not work at all.
After this patch:
# perf inject -v -b -i perf.data -o perf.data.new
File perf.data not owned by current user or root (use -f to override)
# perf inject -v -b -i perf.data -o perf.data.new -f
build id event received for [kernel.kallsyms]: f6dcb66d8b98f1c0d9eb87bf043444b69f91d30c
symsrc__init: cannot get elf header.
Looking at the vmlinux_path (7 entries long)
Using /proc/kcore for kernel object code
Using /proc/kallsyms for symbols
As shown above, the -f option really works now.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427982439-27388-3-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Yunlong Song [Thu, 2 Apr 2015 13:47:10 +0000 (21:47 +0800)]
perf evlist: Support using -f to override perf.data file ownership
Enable perf evlist to use perf.data when it is not owned by current user
or root.
Example:
# perf record ls
# chown Yunlong.Song:Yunlong.Song perf.data
# ls -al perf.data
-rw------- 1 Yunlong.Song Yunlong.Song 28260 Apr 2 10:18 perf.data
# id
uid=0(root) gid=0(root) groups=0(root),64(pkcs11)
Before this patch:
# perf evlist
File perf.data not owned by current user or root (use -f to override)
# perf evlist -f
Error: unknown switch `f'
usage: perf evlist [<options>]
-i, --input <file> Input file name
-F, --freq Show the sample frequency
-v, --verbose Show all event attr details
-g, --group Show event group information
As shown above, the -f option does not work at all.
After this patch:
# perf evlist
File perf.data not owned by current user or root (use -f to override)
# perf evlist -f
cycles
As shown above, the -f option really works now.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/1427982439-27388-2-git-send-email-yunlong.song@huawei.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf probe: Fix to track down unnamed union/structure members
Fix 'perf probe' to track down unnamed union/structure members.
perf probe did not track down the tree of unnamed union/structure
members, since it just failed to find given "name" in a parent
structure/union. To solve this issue, I've introduced 2 changes.
- Fix die_find_member() to track down the type-DIE if it is
unnamed, and if it contains the specified member, returns the
unnamed member.
(note that we don't return found member, since unnamed member
has the offset in the parent structure)
- Fix convert_variable_fields() to track down the unnamed union/
structure (one-by-one).
With this patch, perf probe can access unnamed fields:
-----
#./perf probe -nfx ./perf lock__delete ops 'locked_ops=ops->locked.ops'
Added new event:
probe_perf:lock__delete (on lock__delete in /home/mhiramat/ksrc/linux-3/tools/perf/perf with ops locked_ops=ops->locked.ops)
You can now use it in all perf tools, such as:
perf record -e probe_perf:lock__delete -aR sleep 1
-----
Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org> Link: https://lkml.org/lkml/2015/3/5/431 Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150402073312.14482.37942.stgit@localhost.localdomain Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
arch/x86/kernel/cpu/perf_event_intel_pt.c:413:5: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
arch/x86/kernel/cpu/perf_event_intel_bts.c:162:24: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
Fix it. The code should probably be (re-)tested on 32-bit systems to make
sure all is fine.
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: linux-kernel@vger.kernel.org Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andi Kleen [Fri, 20 Mar 2015 17:11:23 +0000 (10:11 -0700)]
perf/x86/intel: Streamline LBR MSR handling in PMI
The perf PMI currently does unnecessary MSR accesses when
LBRs are enabled. We use LBR freezing, or when in callstack
mode force the LBRs to only filter on ring 3.
So there is no need to disable the LBRs explicitely in the
PMI handler.
Also we always unnecessarily rewrite LBR_SELECT in the LBR
handler, even though it can never change.
5) | /* write_msr: MSR_LBR_SELECT(1c8), value 0 */
5) | /* read_msr: MSR_IA32_DEBUGCTLMSR(1d9), value 1801 */
5) | /* write_msr: MSR_IA32_DEBUGCTLMSR(1d9), value 1801 */
5) | /* write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 70000000f */
5) | /* write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0 */
5) | /* write_msr: MSR_LBR_SELECT(1c8), value 0 */
5) | /* read_msr: MSR_IA32_DEBUGCTLMSR(1d9), value 1801 */
5) | /* write_msr: MSR_IA32_DEBUGCTLMSR(1d9), value 1801 */
This patch:
- Avoids disabling already frozen LBRs unnecessarily in the PMI
- Avoids changing LBR_SELECT in the PMI
Andi Kleen [Fri, 27 Feb 2015 17:48:32 +0000 (09:48 -0800)]
perf/x86: Only dump PEBS register when PEBS has been detected
Technically PEBS_ENABLED is only guaranteed to exist when we
detected PEBS. So add a check for this to the PMU dump function.
I don't think it can happen on a real CPU, but could in a VM.
Stephane Eranian [Mon, 17 Nov 2014 19:07:04 +0000 (20:07 +0100)]
perf/x86/intel: Make the HT bug workaround conditional on HT enabled
This patch disables the PMU HT bug when Hyperthreading (HT)
is disabled. We cannot do this test immediately when perf_events
is initialized. We need to wait until the topology information
is setup properly. As such, we register a later initcall, check
the topology and potentially disable the workaround. To do this,
we need to ensure there is no user of the PMU. At this point of
the boot, the only user is the NMI watchdog, thus we disable
it during the switch and re-enable it right after.
Having the workaround disabled when it is not needed provides
some benefits by limiting the overhead is time and space.
The workaround still ensures correct scheduling of the corrupting
memory events (0xd0, 0xd1, 0xd2) when HT is off. Those events
can only be measured on counters 0-3. Something else the current
kernel did not handle correctly.
Stephane Eranian [Mon, 17 Nov 2014 19:07:02 +0000 (20:07 +0100)]
perf/x86/intel: Limit to half counters when the HT workaround is enabled, to avoid exclusive mode starvation
This patch limits the number of counters available to each CPU when
the HT bug workaround is enabled.
This is necessary to avoid situation of counter starvation. Such can
arise from configuration where one HT thread, HT0, is using all 4 counters
with corrupting events which require exclusion the the sibling HT, HT1.
In such case, HT1 would not be able to schedule any event until HT0
is done. To mitigate this problem, this patch artificially limits
the number of counters to 2.
That way, we can gurantee that at least 2 counters are not in exclusive
mode and therefore allow the sibling thread to schedule events of the
same type (system vs. per-thread). The 2 counters are not determined
in advance. We simply set the limit to two events per HT.
This helps mitigate starvation in case of events with specific counter
constraints such a PREC_DIST.
Note that this does not elimintate the starvation is all cases. But
it is better than not having it.
This patch implements a software workaround for a HW erratum
on Intel SandyBridge, IvyBridge and Haswell processors
with Hyperthreading enabled. The errata are documented for
each processor in their respective specification update
documents:
The bug causes silent counter corruption across hyperthreads only
when measuring certain memory events (0xd0, 0xd1, 0xd2, 0xd3).
Counters measuring those events may leak counts to the sibling
counter. For instance, counter 0, thread 0 measuring event 0xd0,
may leak to counter 0, thread 1, regardless of the event measured
there. The size of the leak is not predictible. It all depends on
the workload and the state of each sibling hyper-thread. The
corrupting events do undercount as a consequence of the leak. The
leak is compensated automatically only when the sibling counter measures
the exact same corrupting event AND the workload is on the two threads
is the same. Given, there is no way to guarantee this, a work-around
is necessary. Furthermore, there is a serious problem if the leaked count
is added to a low-occurrence event. In that case the corruption on
the low occurrence event can be very large, e.g., orders of magnitude.
There is no HW or FW workaround for this problem.
The bug is very easy to reproduce on a loaded system.
Here is an example on a Haswell client, where CPU0, CPU4
are siblings. We load the CPUs with a simple triad app
streaming large floating-point vector. We use 0x81d0
corrupting event (MEM_UOPS_RETIRED:ALL_LOADS) and
0x20cc (ROB_MISC_EVENTS:LBR_INSERTS). Given we are not
using the LBR, the 0x20cc event should be zero.
$ taskset -c 0 triad &
$ taskset -c 4 triad &
$ perf stat -a -C 0 -e r81d0 sleep 100 &
$ perf stat -a -C 4 -r20cc sleep 10
Performance counter stats for 'system wide':
139 277 291 r20cc
10,000969126 seconds time elapsed
In this example, 0x81d0 and r20cc ar eusing sinling counters
on CPU0 and CPU4. 0x81d0 leaks into 0x20cc and corrupts it
from 0 to 139 millions occurrences.
This patch provides a software workaround to this problem by modifying the
way events are scheduled onto counters by the kernel. The patch forces
cross-thread mutual exclusion between counters in case a corrupting event
is measured by one of the hyper-threads. If thread 0, counter 0 is measuring
event 0xd0, then nothing can be measured on counter 0, thread 1. If no corrupting
event is measured on any hyper-thread, event scheduling proceeds as before.
The same example run with the workaround enabled, yield the correct answer:
$ taskset -c 0 triad &
$ taskset -c 4 triad &
$ perf stat -a -C 0 -e r81d0 sleep 100 &
$ perf stat -a -C 4 -r20cc sleep 10
Performance counter stats for 'system wide':
0 r20cc
10,000969126 seconds time elapsed
The patch does provide correctness for all non-corrupting events. It does not
"repatriate" the leaked counts back to the leaking counter. This is planned
for a second patch series. This patch series makes this repatriation more
easy by guaranteeing the sibling counter is not measuring any useful event.
The patch introduces dynamic constraints for events. That means that events which
did not have constraints, i.e., could be measured on any counters, may now be
constrained to a subset of the counters depending on what is going on the sibling
thread. The algorithm is similar to a cache coherency protocol. We call it XSU
in reference to Exclusive, Shared, Unused, the 3 possible states of a PMU
counter.
As a consequence of the workaround, users may see an increased amount of event
multiplexing, even in situtations where there are fewer events than counters
measured on a CPU.
Patch has been tested on all three impacted processors. Note that when
HT is off, there is no corruption. However, the workaround is still enabled,
yet not costing too much. Adding a dynamic detection of HT on turned out to
be complex are requiring too much to code to be justified.
This patch addresses the issue when PEBS is not used. A subsequent patch
fixes the problem when PEBS is used.
This patch adds a new shared_regs style structure to the
per-cpu x86 state (cpuc). It is used to coordinate access
between counters which must be used with exclusion across
HyperThreads on Intel processors. This new struct is not
needed on each PMU, thus is is allocated on demand.
Stephane Eranian [Mon, 17 Nov 2014 19:06:56 +0000 (20:06 +0100)]
perf/x86: Add 'index' param to get_event_constraint() callback
This patch adds an index parameter to the get_event_constraint()
x86_pmu callback. It is expected to represent the index of the
event in the cpuc->event_list[] array. When the callback is used
for fake_cpuc (evnet validation), then the index must be -1. The
motivation for passing the index is to use it to index into another
cpuc array.
This patch adds 3 new PMU model specific callbacks
during the event scheduling done by x86_schedule_events().
->start_scheduling(): invoked when entering the schedule routine.
->stop_scheduling(): invoked at the end of the schedule routine
->commit_scheduling(): invoked for each committed event
Add support for Branch Trace Store (BTS) via kernel perf event infrastructure.
The difference with the existing implementation of BTS support is that this
one is a separate PMU that exports events' trace buffers to userspace by means
of AUX area of the perf buffer, which is zero-copy mapped into userspace.
The immediate benefit is that the buffer size can be much bigger, resulting in
fewer interrupts and no kernel side copying is involved and little to no trace
data loss. Also, kernel code can be traced with this driver.
The old way of collecting BTS traces still works.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1422614435-114702-1-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add support for Intel Processor Trace (PT) to kernel's perf events.
PT is an extension of Intel Architecture that collects information about
software execuction such as control flow, execution modes and timings and
formats it into highly compressed binary packets. Even being compressed,
these packets are generated at hundreds of megabytes per second per core,
which makes it impractical to decode them on the fly in the kernel.
This driver exports trace data by through AUX space in the perf ring
buffer, which is zero-copy mapped into userspace for faster data retrieval.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1422614392-114498-1-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf/x86: Mark Intel PT and LBR/BTS as mutually exclusive
Intel PT cannot be used at the same time as LBR or BTS and will cause a
general protection fault if they are used together. In order to avoid
fixing up GPs in the fast path, instead we disallow creating LBR/BTS
events when PT events are present and vice versa.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-12-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf: Add ITRACE_START record to indicate that tracing has started
For counters that generate AUX data that is bound to the context of a
running task, such as instruction tracing, the decoder needs to know
exactly which task is running when the event is first scheduled in,
before the first sched_switch. The decoder's need to know this stems
from the fact that instruction flow trace decoding will almost always
require program's object code in order to reconstruct said flow and
for that we need at least its pid/tid in the perf stream.
To single out such instruction tracing pmus, this patch introduces
ITRACE PMU capability. The reason this is not part of RECORD_AUX
record is that not all pmus capable of generating AUX data need this,
and the opposite is *probably* also true.
While sched_switch covers for most cases, there are two problems with it:
the consumer will need to process events out of order (that is, having
found RECORD_AUX, it will have to skip forward to the nearest sched_switch
to figure out which task it was, then go back to the actual trace to
decode it) and it completely misses the case when the tracing is enabled
and disabled before sched_switch, for example, via PERF_EVENT_IOC_DISABLE.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-15-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf: Add wakeup watermark control to the AUX area
When AUX area gets a certain amount of new data, we want to wake up
userspace to collect it. This adds a new control to specify how much
data will cause a wakeup. This is then passed down to pmu drivers via
output handle's "wakeup" field, so that the driver can find the nearest
point where it can generate an interrupt.
We repurpose __reserved_2 in the event attribute for this, even though
it was never checked to be zero before, aux_watermark will only matter
for new AUX-aware code, so the old code should still be fine.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-10-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
This adds support for overwrite mode in the AUX area, which means "keep
collecting data till you're stopped", turning AUX area into a circular
buffer, where new data overwrites old data. It does not depend on data
buffer's overwrite mode, so that it doesn't lose sideband data that is
instrumental for processing AUX data.
Overwrite mode is enabled at mapping AUX area read only. Even though
aux_tail in the buffer's user page might be user writable, it will be
ignored in this mode.
A PERF_RECORD_AUX with PERF_AUX_FLAG_OVERWRITE set is written to the perf
data stream every time an event writes new data to the AUX area. The pmu
driver might not be able to infer the exact beginning of the new data in
each snapshot, some drivers will only provide the tail, which is
aux_offset + aux_size in the AUX record. Consumer has to be able to tell
the new data from the old one, for example, by means of time stamps if
such are provided in the trace.
Consumer is also responsible for disabling any events that might write
to the AUX area (thus potentially racing with the consumer) before
collecting the data.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-9-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
For pmus that wish to write data to ring buffer's AUX area, provide
perf_aux_output_{begin,end}() calls to initiate/commit data writes,
similarly to perf_output_{begin,end}. These also use the same output
handle structure. Also, similarly to software counterparts, these
will direct inherited events' output to parents' ring buffers.
After the perf_aux_output_begin() returns successfully, handle->size
is set to the maximum amount of data that can be written wrt aux_tail
pointer, so that no data that the user hasn't seen will be overwritten,
therefore this should always be called before hardware writing is
enabled. On success, this will return the pointer to pmu driver's
private structure allocated for this aux area by pmu::setup_aux. Same
pointer can also be retrieved using perf_get_aux() while hardware
writing is enabled.
PMU driver should pass the actual amount of data written as a parameter
to perf_aux_output_end(). All hardware writes should be completed and
visible before this one is called.
Additionally, perf_aux_output_skip() will adjust output handle and
aux_head in case some part of the buffer has to be skipped over to
maintain hardware's alignment constraints.
Nested writers are forbidden and guards are in place to catch such
attempts.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-8-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
When there's new data in the AUX space, output a record indicating its
offset and size and a set of flags, such as PERF_AUX_FLAG_TRUNCATED, to
mean the described data was truncated to fit in the ring buffer.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-7-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Usually, pmus that do, for example, instruction tracing, would only ever
be able to have one event per task per cpu (or per perf_event_context). For
such pmus it makes sense to disallow creating conflicting events early on,
so as to provide consistent behavior for the user.
This patch adds a pmu capability that indicates such constraint on event
creation.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1422613866-113186-1-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf: Add a capability for AUX_NO_SG pmus to do software double buffering
For pmus that don't support scatter-gather for AUX data in hardware, it
might still make sense to implement software double buffering to avoid
losing data while the user is reading data out. For this purpose, add
a pmu capability that guarantees multiple high-order chunks for AUX buffer,
so that the pmu driver can do switchover tricks.
To make use of this feature, add PERF_PMU_CAP_AUX_SW_DOUBLEBUF to your
pmu's capability mask. This will make the ring buffer AUX allocation code
ensure that the biggest high order allocation for the aux buffer pages is
no bigger than half of the total requested buffer size, thus making sure
that the buffer has at least two high order allocations.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-5-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf: Support high-order allocations for AUX space
Some pmus (such as BTS or Intel PT without multiple-entry ToPA capability)
don't support scatter-gather and will prefer larger contiguous areas for
their output regions.
This patch adds a new pmu capability to request higher order allocations.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-4-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Wed, 14 Jan 2015 12:18:11 +0000 (14:18 +0200)]
perf: Add AUX area to ring buffer for raw data streams
This patch introduces "AUX space" in the perf mmap buffer, intended for
exporting high bandwidth data streams to userspace, such as instruction
flow traces.
AUX space is a ring buffer, defined by aux_{offset,size} fields in the
user_page structure, and read/write pointers aux_{head,tail}, which abide
by the same rules as data_* counterparts of the main perf buffer.
In order to allocate/mmap AUX, userspace needs to set up aux_offset to
such an offset that will be greater than data_offset+data_size and
aux_size to be the desired buffer size. Both need to be page aligned.
Then, same aux_offset and aux_size should be passed to mmap() call and
if everything adds up, you should have an AUX buffer as a result.
Pages that are mapped into this buffer also come out of user's mlock
rlimit plus perf_event_mlock_kb allowance.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-3-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the actual perf ring buffer is one page into the mmap area,
following the user page and the userspace follows this convention. This
patch adds data_{offset,size} fields to user_page that can be used by
userspace instead for locating perf data in the mmap area. This is also
helpful when mapping existing or shared buffers if their size is not
known in advance.
Right now, it is made to follow the existing convention that
data_offset == PAGE_SIZE and
data_offset + data_size == mmap_size.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-2-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some of the CYCLE_ACTIVITY.* events can only be scheduled on
counter 2. Due to a typo Haswell matched those with
INTEL_EVENT_CONSTRAINT, which lead to the events never
matching as the comparison does not expect anything
in the umask too. Fix the typo.
Kan Liang [Fri, 27 Mar 2015 14:38:25 +0000 (10:38 -0400)]
perf/x86/intel: Filter branches for PEBS event
For supporting Intel LBR branches filtering, Intel LBR sharing logic
mechanism is introduced from commit b36817e88630 ("perf/x86: Add Intel
LBR sharing logic"). It modifies __intel_shared_reg_get_constraints() to
config lbr_sel, which is finally used to set LBR_SELECT.
However, the intel_shared_regs_constraints() function is called after
intel_pebs_constraints(). The PEBS event will return immediately after
intel_pebs_constraints(). So it's impossible to filter branches for PEBS
events.
This patch moves intel_shared_regs_constraints() ahead of
intel_pebs_constraints().
We can safely do that because the intel_shared_regs_constraints() function
only returns empty constraint if its rejecting the event, otherwise it
returns NULL such that we continue calling intel_pebs_constraints() and
x86_get_event_constraint().
bpf: Fix the build on BPF_SYSCALL=y && !CONFIG_TRACING kernels, make it more configurable
So bpf_tracing.o depends on CONFIG_BPF_SYSCALL - but that's not its only
dependency, it also depends on the tracing infrastructure and on kprobes,
without which it will fail to build with:
In file included from kernel/trace/bpf_trace.c:14:0:
kernel/trace/trace.h: In function ‘trace_test_and_set_recursion’:
kernel/trace/trace.h:491:28: error: ‘struct task_struct’ has no member named ‘trace_recursion’
unsigned int val = current->trace_recursion;
[...]
It took quite some time to trigger this build failure, because right now
BPF_SYSCALL is very obscure, depends on CONFIG_EXPERT. So also make BPF_SYSCALL
more configurable, not just under CONFIG_EXPERT.
If BPF_SYSCALL, tracing and kprobes are enabled then enable the bpf_tracing
gateway as well.
We might want to make this an interactive option later on, although
I'd not complicate it unnecessarily: enabling BPF_SYSCALL is enough of
an indicator that the user wants BPF support.
Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
One BPF program attaches to kmem_cache_alloc_node() and
remembers all allocated objects in the map.
Another program attaches to kmem_cache_free() and deletes
corresponding object from the map.
User space walks the map every second and prints any objects
which are older than 1 second.
Usage:
$ sudo tracex4
Then start few long living processes. The 'tracex4' will print
something like this:
obj 0xffff880465928000 is 13sec old was allocated at ip ffffffff8105dc32
obj 0xffff88043181c280 is 13sec old was allocated at ip ffffffff8105dc32
obj 0xffff880465848000 is 8sec old was allocated at ip ffffffff8105dc32
obj 0xffff8804338bc280 is 15sec old was allocated at ip ffffffff8105dc32
$ addr2line -fispe vmlinux ffffffff8105dc32
do_fork at fork.c:1665
As soon as processes exit the memory is reclaimed and 'tracex4'
prints nothing.
Similar experiment can be done with the __kmalloc()/kfree() pair.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1427312966-8434-10-git-send-email-ast@plumgrid.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
BPF C program attaches to
blk_mq_start_request()/blk_update_request() kprobe events to
calculate IO latency.
For every completed block IO event it computes the time delta
in nsec and records in a histogram map:
map[log10(delta)*10]++
User space reads this histogram map every 2 seconds and prints
it as a 'heatmap' using gray shades of text terminal. Black
spaces have many events and white spaces have very few events.
Left most space is the smallest latency, right most space is
the largest latency in the range.
Usage:
$ sudo ./tracex3
and do 'sudo dd if=/dev/sda of=/dev/null' in other terminal.
Observe IO latencies and how different activity (like 'make
kernel') affects it.
Similar experiments can be done for network transmit latencies,
syscalls, etc.
'-t' flag prints the heatmap using normal ascii characters:
samples/bpf: Add counting example for kfree_skb() function calls and the write() syscall
this example has two probes in one C file that attach to
different kprove events and use two different maps.
1st probe is x64 specific equivalent of dropmon. It attaches to
kfree_skb, retrevies 'ip' address of kfree_skb() caller and
counts number of packet drops at that 'ip' address. User space
prints 'location - count' map every second.
2nd probe attaches to kprobe:sys_write and computes a histogram
of different write sizes
samples/bpf: Add simple non-portable kprobe filter example
tracex1_kern.c - C program compiled into BPF.
It attaches to kprobe:netif_receive_skb()
When skb->dev->name == "lo", it prints sample debug message into
trace_pipe via bpf_trace_printk() helper function.
tracex1_user.c - corresponding user space component that:
- loads BPF program via bpf() syscall
- opens kprobes:netif_receive_skb event via perf_event_open()
syscall
- attaches the program to event via ioctl(event_fd,
PERF_EVENT_IOC_SET_BPF, prog_fd);
- prints from trace_pipe
Note, this BPF program is non-portable. It must be recompiled
with current kernel headers. kprobe is not a stable ABI and
BPF+kprobe scripts may no longer be meaningful when kernel
internals change.
No matter in what way the kernel changes, neither the kprobe,
nor the BPF program can ever crash or corrupt the kernel,
assuming the kprobes, perf and BPF subsystem has no bugs.
The verifier will detect that the program is using
bpf_trace_printk() and the kernel will print 'this is a DEBUG
kernel' warning banner, which means that bpf_trace_printk()
should be used for debugging of the BPF program only.
tracing: Allow BPF programs to call bpf_trace_printk()
Debugging of BPF programs needs some form of printk from the
program, so let programs call limited trace_printk() with %d %u
%x %p modifiers only.
Similar to kernel modules, during program load verifier checks
whether program is calling bpf_trace_printk() and if so, kernel
allocates trace_printk buffers and emits big 'this is debug
only' banner.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1427312966-8434-6-git-send-email-ast@plumgrid.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
tracing, perf: Implement BPF programs attached to kprobes
BPF programs, attached to kprobes, provide a safe way to execute
user-defined BPF byte-code programs without being able to crash or
hang the kernel in any way. The BPF engine makes sure that such
programs have a finite execution time and that they cannot break
out of their sandbox.
The user interface is to attach to a kprobe via the perf syscall:
'prog_fd' is a file descriptor associated with BPF program
previously loaded.
'event_id' is an ID of the kprobe created.
Closing 'event_fd':
close(event_fd);
... automatically detaches BPF program from it.
BPF programs can call in-kernel helper functions to:
- lookup/update/delete elements in maps
- probe_read - wraper of probe_kernel_read() used to access any
kernel data structures
BPF programs receive 'struct pt_regs *' as an input ('struct pt_regs' is
architecture dependent) and return 0 to ignore the event and 1 to store
kprobe event into the ring buffer.
Note, kprobes are a fundamentally _not_ a stable kernel ABI,
so BPF programs attached to kprobes must be recompiled for
every kernel version and user must supply correct LINUX_VERSION_CODE
in attr.kern_version during bpf_prog_load() call.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1427312966-8434-4-git-send-email-ast@plumgrid.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
add TRACE_EVENT_FL_KPROBE flag to differentiate kprobe type of
tracepoints, since bpf programs can only be attached to kprobe
type of PERF_TYPE_TRACEPOINT perf events.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1427312966-8434-3-git-send-email-ast@plumgrid.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Daniel Borkmann [Wed, 25 Mar 2015 19:49:18 +0000 (12:49 -0700)]
bpf: Make internal bpf API independent of CONFIG_BPF_SYSCALL #ifdefs
Socket filter code and other subsystems with upcoming eBPF
support should not need to deal with the fact that we have
CONFIG_BPF_SYSCALL defined or not.
Having the bpf syscall as a config option is a nice thing and
I'd expect it to stay that way for expert users (I presume one
day the default setting of it might change, though), but code
making use of it should not care if it's actually enabled or
not.
Instead, hide this via header files and let the rest deal with it.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1427312966-8434-2-git-send-email-ast@plumgrid.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
By keeping pointers to machines, evlist and tool in ordered_events.
But that was more a transitional patch while moving stuff out from
perf_session.c to ordered_events.c and possibly not even needed by then,
as we could use the container_of() method and instead of having the
nr_unordered_samples stats in events_stats, we can have it in
ordered_samples.
Based-on-a-patch-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: David Ahern <dsahern@gmail.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/n/tip-4lk0t9js82g0tfc0x1onpkjt@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Even when it is not used to actually reorder events, some of its fields
are used, like session->ordered_events->tool, to shorten function
signatures where tool, for instance, was being passed, as the tool is
needed for the ordered_events code, we need it there and might as well
use it for other perf_session needs.
This fixes a problem where 'perf script' had some condition that made
session->ordered_events not to be initialized even with its
script->tool ordered_events related flags asking for it to be, which
looks like another bug and needs to be investigated further.
Always initializing session->ordered_events at least leaves the current
assumptions in place, so do it now.
Reported-by: David Ahern <dsahern@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Tested-by: David Ahern <dsahern@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Don Zickus <dzickus@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/n/tip-b1xxk0rwkz2a0gip1uufmjqg@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
David Ahern [Mon, 30 Mar 2015 20:35:58 +0000 (14:35 -0600)]
perf tools: Fix ppid for synthesized fork events
363b785f38 added synthesized fork events and set a thread's parent id to
itself. Since we are already processing /proc/<pid>/status the ppid can
be determined properly. Make it so.
perf callchain: Fix kernel symbol resolution by remembering the cpumode
Commit 2e77784bb7d8 ("perf callchain: Move cpumode resolve code to
add_callchain_ip") promised "No change in behavior.".
As this commit breaks callchains on s390x (symbols not getting resolved,
observed when profiling the kernel), this statement is wrong. The cpumode
must be kept when iterating over all ips, otherwise the default
(PERF_RECORD_MISC_USER) will be used by error.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: David Hildenbrand <dahi@linux.vnet.ibm.com> Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1427703060-59883-1-git-send-email-dahi@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Sat, 28 Mar 2015 10:30:30 +0000 (11:30 +0100)]
perf build: Disable libbabeltrace check by default
Disabling libbabeltrace check by default and replacing the
NO_LIBBABELTRACE make variable with LIBBABELTRACE.
Users wanting the libbabeltrace feature need to build via:
$ make LIBBABELTRACE=1
The reason for this is that the libababeltrace interface we use (version
1.3) hasn't been packaged/released yet, thus the failing feature check
only slows down build and confuses other (non CTF) developers.
Requested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jeremie Galarneau <jgalar@efficios.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/20150328103030.GA8431@krava.redhat.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Linus Torvalds [Sun, 29 Mar 2015 22:09:31 +0000 (15:09 -0700)]
Merge tag 'armsoc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"The latest and greatest fixes for ARM platform code. Worth pointing
out are:
- Lines-wise, largest is a PXA fix for dealing with interrupts on DT
that was quite broken. It's still newish code so while we could
have held this off, it seemed appropriate to include now
- Some GPIO fixes for OMAP platforms added a few lines. This was
also fixes for code recently added (this release).
- Small OMAP timer fix to behave better with partially upstreamed
platforms, which is quite welcome.
- Allwinner fixes about operating point control, reducing
overclocking in some cases for better stability.
plus a handful of other smaller fixes across the map"
* tag 'armsoc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
arm64: juno: Fix misleading name of UART reference clock
ARM: dts: sunxi: Remove overclocked/overvoltaged OPP
ARM: dts: sun4i: a10-lime: Override and remove 1008MHz OPP setting
ARM: socfpga: dts: fix spi1 interrupt
ARM: dts: Fix gpio interrupts for dm816x
ARM: dts: dra7: remove ti,hwmod property from pcie phy
ARM: OMAP: dmtimer: disable pm runtime on remove
ARM: OMAP: dmtimer: check for pm_runtime_get_sync() failure
ARM: OMAP2+: Fix socbus family info for AM33xx devices
ARM: dts: omap3: Add missing dmas for crypto
ARM: dts: rockchip: disable gmac by default in rk3288.dtsi
MAINTAINERS: add rockchip regexp to the ARM/Rockchip entry
ARM: pxa: fix pxa interrupts handling in DT
ARM: pxa: Fix typo in zeus.c
ARM: sunxi: Have ARCH_SUNXI select RESET_CONTROLLER for clock driver usage
Olof Johansson [Sun, 29 Mar 2015 21:00:53 +0000 (14:00 -0700)]
Merge tag 'sunxi-fixes-for-4.0' of https://git.kernel.org/pub/scm/linux/kernel/git/mripard/linux into fixes
Allwinner fixes for 4.0
There's a few fixes to merge for 4.0, one to add a select in the machine
Kconfig option to fix a potential build failure, and two fixing cpufreq related
issues.
* tag 'sunxi-fixes-for-4.0' of https://git.kernel.org/pub/scm/linux/kernel/git/mripard/linux:
ARM: dts: sunxi: Remove overclocked/overvoltaged OPP
ARM: dts: sun4i: a10-lime: Override and remove 1008MHz OPP setting
ARM: sunxi: Have ARCH_SUNXI select RESET_CONTROLLER for clock driver usage
Olof Johansson [Sun, 29 Mar 2015 20:58:54 +0000 (13:58 -0700)]
Merge tag 'fixes-v4.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap into fixes
Fixes for omaps for the -rc cycle:
- Fix a device tree based booting vs legacy booting regression for
omap3 crypto hardware by adding the missing DMA channels.
- Fix /sys/bus/soc/devices/soc0/family for am33xx devices.
- Fix two timer issues that can cause hangs if the timer related
hwmod data is missing like it often initially is for new SoCs.
- Remove pcie hwmods entry from dts as that causes runtime PM to
fail for the PHYs.
- A paper bag type dts configuration fix for dm816x GPIO
interrupts that I just noticed. This is most of the changes
diffstat wise, but as it's a basic feature for connecting
devices and things work otherwise, it should be fixed.
* tag 'fixes-v4.0-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap:
ARM: dts: Fix gpio interrupts for dm816x
ARM: dts: dra7: remove ti,hwmod property from pcie phy
ARM: OMAP: dmtimer: disable pm runtime on remove
ARM: OMAP: dmtimer: check for pm_runtime_get_sync() failure
ARM: OMAP2+: Fix socbus family info for AM33xx devices
ARM: dts: omap3: Add missing dmas for crypto
Olof Johansson [Sun, 29 Mar 2015 20:47:21 +0000 (13:47 -0700)]
Merge tag 'fixes-for-v4.0-rc5' of https://github.com/rjarzmik/linux into fixes
arm: pxa: fixes for v4.0-rc5
There are only 2 fixes, one for the zeus board about the regulator changes,
where a typo prevented the zeus board from having a working can regulator,
and one regression triggered by the interrupts IRQ shift of 16 affecting all
boards.
* tag 'fixes-for-v4.0-rc5' of https://github.com/rjarzmik/linux:
ARM: pxa: fix pxa interrupts handling in DT
ARM: pxa: Fix typo in zeus.c
Linus Torvalds [Sat, 28 Mar 2015 18:25:04 +0000 (11:25 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Ingo Molnar:
"Fix x86 syscall exit code bug that resulted in spurious non-execution
of TIF-driven user-return worklets, causing big trouble for things
like KVM that rely on user notifiers for correctness of their vcpu
model, causing crashes like double faults"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm/entry: Check for syscall exit work with IRQs disabled
Linus Torvalds [Sat, 28 Mar 2015 17:58:53 +0000 (10:58 -0700)]
Merge branch 'parisc-4.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parsic fixes from Helge Deller:
"One patch from Mikulas fixes a bug on parisc by artifically
incrementing the counter in pmd_free when the kernel tries to free
the preallocated pmd.
Other than that we now prevent that syscalls gets added without
incrementing __NR_Linux_syscalls and fix the initial pmd setup code
if a default page size greater than 4k has been selected"
* 'parisc-4.0-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Fix pmd code to depend on PT_NLEVELS value, not on CONFIG_64BIT
parisc: mm: don't count preallocated pmds
parisc: Add compile-time check when adding new syscalls
Linus Torvalds [Sat, 28 Mar 2015 17:47:27 +0000 (09:47 -0800)]
Merge tag 'arc-4.0-fixes-part-2' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Pull ARC fixes from Vineet Gupta:
"We found some issues with signal handling taking down the system. I
know its late, but these are important and all marked for stable.
ARC signal handling related fixes uncovered during recent testing of
NPTL tools"
* tag 'arc-4.0-fixes-part-2' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: signal handling robustify
ARC: SA_SIGINFO ucontext regs off-by-one
Linus Torvalds [Fri, 27 Mar 2015 21:38:02 +0000 (14:38 -0700)]
Merge tag 'sound-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Three trivial oneliner fixes for HD-audio.
Two are device-specific quirks while one is a generic fix for recent
Realtek codecs"
* tag 'sound-4.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda - Add one more node in the EAPD supporting candidate list
ALSA: hda_intel: apply the Seperate stream_tag for Sunrise Point
ALSA: hda - Add dock support for Thinkpad T450s (17aa:5036)
Peter Zijlstra [Fri, 20 Feb 2015 13:05:38 +0000 (14:05 +0100)]
perf: Add per event clockid support
While thinking on the whole clock discussion it occurred to me we have
two distinct uses of time:
1) the tracking of event/ctx/cgroup enabled/running/stopped times
which includes the self-monitoring support in struct
perf_event_mmap_page.
2) the actual timestamps visible in the data records.
And we've been conflating them.
The first is all about tracking time deltas, nobody should really care
in what time base that happens, its all relative information, as long
as its internally consistent it works.
The second however is what people are worried about when having to
merge their data with external sources. And here we have the
discussion on MONOTONIC vs MONOTONIC_RAW etc..
Where MONOTONIC is good for correlating between machines (static
offset), MONOTNIC_RAW is required for correlating against a fixed rate
hardware clock.
This means configurability; now 1) makes that hard because it needs to
be internally consistent across groups of unrelated events; which is
why we had to have a global perf_clock().
However, for 2) it doesn't really matter, perf itself doesn't care
what it writes into the buffer.
The below patch makes the distinction between these two cases by
adding perf_event_clock() which is used for the second case. It
further makes this configurable on a per-event basis, but adds a few
sanity checks such that we cannot combine events with different clocks
in confusing ways.
And since we then have per-event configurability we might as well
retain the 'legacy' behaviour as a default.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ingo Molnar [Fri, 27 Mar 2015 09:09:21 +0000 (10:09 +0100)]
Merge branch 'timers/core' into perf/timer, to apply dependent patch
An upcoming patch will depend on tai_ns() and NMI-safe ktime_get_raw_fast(),
so merge timers/core here in a separate topic branch until it's all cooked
and timers/core is merged upstream.