perf probe: Fix --funcs to show correct symbols for offline module
Fix --funcs (-F) option to show correct symbols for offline module.
Since previous perf-probe uses machine__findnew_module_map() for offline
module, even if user passes a module file (with full path) which is for
other architecture, perf-probe always tries to load symbol map for
current kernel module.
This fix uses dso__new_map() to load the map from given binary as same
as a map for user applications.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/148350053478.19001.15435255244512631545.stgit@devbox Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf symbols: Robustify reading of build-id from sysfs
Markus reported that perf segfaults when reading /sys/kernel/notes from
a kernel linked with GNU gold, due to what looks like a gold bug, so do
some bounds checking to avoid crashing in that case.
perf tools: Install tools/lib/traceevent plugins with install-bin
Those are binaries as well, so should be installed by:
make -C tools/perf install-bin'
too.
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/n/tip-3841b37u05evxrs1igkyu6ks@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
tools lib traceevent: Fix prev/next_prio for deadline tasks
Currently, the sched:sched_switch tracepoint reports deadline tasks with
priority -1. But when reading the trace via perf script I've got the
following output:
# ./d & # (d is a deadline task, see [1])
# perf record -e sched:sched_switch -a sleep 1
# perf script
...
swapper 0 [000] 2146.962441: sched:sched_switch: swapper/0:0 [120] R ==> d:2593 [4294967295]
d 2593 [000] 2146.972472: sched:sched_switch: d:2593 [4294967295] R ==> g:2590 [4294967295]
The task d reports the wrong priority [4294967295]. This happens because
the "int prio" is stored in an unsigned long long val. Although it is
set as a %lld, as int is shorter than unsigned long long,
trace_seq_printf prints it as a positive number.
The fix is just to cast the val as an int, and print it as a %d,
as in the sched:sched_switch tracepoint's "format".
The output with the fix is:
# ./d &
# perf record -e sched:sched_switch -a sleep 1
# perf script
...
swapper 0 [000] 4306.374037: sched:sched_switch: swapper/0:0 [120] R ==> d:10941 [-1]
d 10941 [000] 4306.383823: sched:sched_switch: d:10941 [-1] R ==> swapper/0:0 [120]
Got the program from the provided URL, http://bristot.me/lkml/d.c,
trimmed it and included in the cset log above, so that we have
everything needed to test it in one place.
Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/866ef75bcebf670ae91c6a96daa63597ba981f0d.1483443552.git.bristot@redhat.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Tue, 3 Jan 2017 08:19:56 +0000 (09:19 +0100)]
perf record: Fix --switch-output documentation and comment
There's no --signal-trigger option, also adding the code comment into
record man page.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Wang Nan <wangnan0@huawei.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1483431600-19887-4-git-send-email-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Tue, 3 Jan 2017 08:19:55 +0000 (09:19 +0100)]
perf record: Make __record_options static
There's no need for this one to be global.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Wang Nan <wangnan0@huawei.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1483431600-19887-3-git-send-email-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
To allow string options with a default argument and variable set when
the option is used.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Wang Nan <wangnan0@huawei.com> Cc: David Ahern <dsahern@gmail.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1483431600-19887-2-git-send-email-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
perf probe: Fix to get correct modname from elf header
Since 'perf probe' supports cross-arch probes, it is possible to analyze
different arch kernel image which has different bits-per-long.
In that case, it fails to get the module name because it uses the
MOD_NAME_OFFSET macro based on the host machine bits-per-long, instead
of the target arch bits-per-long.
This fixes above issue by changing modname-offset based on the target
archs bit width. This is ok because linux kernel uses LP64 model on
64bit arch.
E.g. without this (on x86_64, and target module is arm32):
$ perf probe -m build-arm/fs/configfs/configfs.ko -D configfs_lookup
p:probe/configfs_lookup :configfs_lookup+0
^-Here is an empty module name.
samples/bpf trace_output_user: Remove duplicate sys/ioctl.h include
Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Joe Stringer <joe@ovn.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-3awp0nv8tpnblatojmwjww7z@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
samples/bpf sock_example: Avoid getting ethhdr from two includes
To avoid the following build failure on Alpine Linux 3.4, that has
clang-3.8 with the bpf target:
HOSTCC samples/bpf/sock_example.o
In file included from /usr/include/net/ethernet.h:10:0,
from /git/linux/samples/bpf/sock_example.h:7,
from /git/linux/samples/bpf/sock_example.c:30:
/usr/include/netinet/if_ether.h:96:8: error: redefinition of 'struct
ethhdr'
struct ethhdr {
^
In file included from /git/linux/samples/bpf/sock_example.c:26:0:
./usr/include/linux/if_ether.h:144:8: note: originally defined here
struct ethhdr {
^
scripts/Makefile.host:124: recipe for target
'samples/bpf/sock_example.o' failed
make[2]: *** [samples/bpf/sock_example.o] Error 1
/git/linux/Makefile:1658: recipe for target 'samples/bpf/' failed
So include net/if_ether.h for the needs of sock_example.h, using the
same include that sock_example.c uses.
Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Joe Stringer <joe@ovn.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-m9avekl1b651qe1r1zd5tzz9@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Thu, 22 Dec 2016 06:03:50 +0000 (15:03 +0900)]
perf sched timehist: Show total scheduling time
Show length of analyzed sample time and rate of idle task running.
This also takes care of time range given by --time option.
$ perf sched timehist -sI | tail
Samples do not have callchains.
Idle stats:
CPU 0 idle for 930.316 msec ( 92.93%)
CPU 1 idle for 963.614 msec ( 96.25%)
CPU 2 idle for 885.482 msec ( 88.45%)
CPU 3 idle for 938.635 msec ( 93.76%)
Total number of unique tasks: 118
Total number of context switches: 2337
Total run time (msec): 3718.048
Total scheduling time (msec): 1001.131 (x 4)
Suggested-by: David Ahern <dsahern@gmail.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20161222060350.17655-3-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Ingo Molnar [Fri, 23 Dec 2016 19:23:29 +0000 (20:23 +0100)]
Merge tag 'perf-urgent-for-mingo-20161222' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent
Pull perf/urgent fixes from Arnaldo Carvalho de Melo:
Fixes for 'perf sched timehist': (Namhyung Kim)
- Define a larger initial alignment value for the COMM column and
make it be more consistently honoured, for instance in the header.
- Fix invalid period calculation when using the --time option to
select a time slice, when events outside that slice were being
considered for the per cpu idle stats summary.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
Namhyung Kim [Thu, 22 Dec 2016 06:03:49 +0000 (15:03 +0900)]
perf sched timehist: Fix invalid period calculation
When --time option is given with a value outside recorded time, the last
sample time (tprev) was set to that value and run time calculation might
be incorrect. This is a problem of the first samples for each cpus
since it would skip the runtime update when tprev is 0. But with --time
option it had non-zero (which is invalid) value so the calculation is
also incorrect.
For example, let's see the followging:
$ perf sched timehist
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
--------------- ------ ------------------------------ --------- --------- ---------
3195.968367 [0003] <idle> 0.000 0.000 0.000
3195.968386 [0002] Timer[4306/4277] 0.000 0.000 0.018
3195.968397 [0002] Web Content[4277] 0.000 0.000 0.000
3195.968595 [0001] JS Helper[4302/4277] 0.000 0.000 0.000
3195.969217 [0000] <idle> 0.000 0.000 0.621
3195.969251 [0001] kworker/1:1H[291] 0.000 0.000 0.033
The sample starts at 3195.968367 but when I gave a time interval from
3194 to 3196 (in sec) it will calculate the whole 2 second as runtime.
In below, 2 cpus accounted it as runtime, other 2 cpus accounted it as
idle time.
Before:
$ perf sched timehist --time 3194,3196 -s | tail
Idle stats:
CPU 0 idle for 1995.991 msec
CPU 1 idle for 20.793 msec
CPU 2 idle for 30.191 msec
CPU 3 idle for 1999.852 msec
Total number of unique tasks: 23
Total number of context switches: 128
Total run time (msec): 3724.940
After:
$ perf sched timehist --time 3194,3196 -s | tail
Idle stats:
CPU 0 idle for 10.811 msec
CPU 1 idle for 20.793 msec
CPU 2 idle for 30.191 msec
CPU 3 idle for 18.337 msec
Total number of unique tasks: 23
Total number of context switches: 128
Total run time (msec): 18.139
Committer notes:
Further testing:
Before:
Idle stats:
CPU 0 idle for 229.785 msec
CPU 1 idle for 937.944 msec
CPU 2 idle for 188.931 msec
CPU 3 idle for 986.185 msec
Idle stats:
CPU 0 idle for 229.785 msec
CPU 1 idle for 175.407 msec
CPU 2 idle for 188.931 msec
CPU 3 idle for 223.657 msec
Total number of unique tasks: 68
Total number of context switches: 814
Total run time (msec): 97.688
# for cpu in `seq 0 3` ; do echo -n "CPU $cpu idle for " ; perf sched timehist --time 40602,40603 | grep "\[000${cpu}\].*\<idle\>" | tr -s ' ' | cut -d' ' -f7 | awk '{entries++ ; s+=$1} END {print s " msec (entries: " entries ")"}' ; done
CPU 0 idle for 229.721 msec (entries: 123)
CPU 1 idle for 175.381 msec (entries: 65)
CPU 2 idle for 188.903 msec (entries: 56)
CPU 3 idle for 223.61 msec (entries: 102)
Difference due to the idle stats being accounted at nanoseconds precision while
the <idle> entries in 'perf sched timehist' are trucated at msec.usec.
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Fixes: 853b74071110 ("perf sched timehist: Add option to specify time window of interest") Link: http://lkml.kernel.org/r/20161222060350.17655-2-namhyung@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Thu, 22 Dec 2016 06:03:48 +0000 (15:03 +0900)]
perf sched timehist: Remove hardcoded 'comm_width' check at print_summary
Now that the default 'comm_width' value is 30, no need to check that at
print_summary,
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20161222060350.17655-1-namhyung@kernel.org
[ Split from a larger patch ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Thu, 22 Dec 2016 06:03:48 +0000 (15:03 +0900)]
perf sched timehist: Enlarge default 'comm_width'
Current default value is 20 but it's easily changed to a bigger value as
task has a long name and different tid and pid. And it makes the output
not aligned. So change it to have a large value as summary shows.
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20161222060350.17655-1-namhyung@kernel.org
[ Split from a larger patch ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Thu, 22 Dec 2016 06:03:48 +0000 (15:03 +0900)]
perf sched timehist: Honour 'comm_width' when aligning the headers
Current default value is 20, but that may change in the future, so make
places where we have 20 hardcoded use 'comm_width'.
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20161222060350.17655-1-namhyung@kernel.org
[ Split from a larger patch ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Peter Zijlstra [Wed, 9 Nov 2016 15:51:53 +0000 (16:51 +0100)]
perf/x86: Fix overlap counter scheduling bug
Jiri reported the overlap scheduling exceeding its max stack.
Looking at the constraint that triggered this, it turns out the
overlap marker isn't needed.
The comment with EVENT_CONSTRAINT_OVERLAP states: "This is the case if
the counter mask of such an event is not a subset of any other counter
mask of a constraint with an equal or higher weight".
Esp. that latter part is of interest here I think, our overlapping mask
is 0x0e, that has 3 bits set and is the highest weight mask in on the
PMU, therefore it will be placed last. Can we still create a scenario
where we would need to rewind that?
The scenario for AMD Fam15h is we're having masks like:
0x3F -- 111111
0x38 -- 111000
0x07 -- 000111
0x09 -- 001001
And we mark 0x09 as overlapping, because it is not a direct subset of
0x38 or 0x07 and has less weight than either of those. This means we'll
first try and place the 0x09 event, then try and place 0x38/0x07 events.
Now imagine we have:
3 * 0x07 + 0x09
and the initial pick for the 0x09 event is counter 0, then we'll fail to
place all 0x07 events. So we'll pop back, try counter 4 for the 0x09
event, and then re-try all 0x07 events, which will now work.
The masks on the PMU in question are:
0x01 - 0001
0x03 - 0011
0x0e - 1110
0x0c - 1100
But since all the masks that have overlap (0xe -> {0xc,0x3}) and (0x3 ->
0x1) are of heavier weight, it should all work out.
Reported-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Liang Kan <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vince@deater.net> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/20161109155153.GQ3142@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
Stephane Eranian [Thu, 22 Dec 2016 08:29:26 +0000 (00:29 -0800)]
perf/x86/pebs: Fix handling of PEBS buffer overflows
This patch solves a race condition between PEBS and the PMU handler.
In case multiple PEBS events are sampled at the same time,
it is possible to have GLOBAL_STATUS bit 62 set indicating
PEBS buffer overflow and also seeing at most 3 PEBS counters
having their bits set in the status register. This is a sign
that there was at least one PEBS record pending at the time
of the PMU interrupt. PEBS counters must only be processed
via the drain_pebs() calls, and not via the regular sample
processing loop coming after that the function, otherwise
phony regular samples may be generated in the sampling buffer
not marked with the EXACT tag.
Another possibility is to have one PEBS event and at least
one non-PEBS event whic hoverflows while PEBS has armed. In this
case, bit 62 of GLOBAL_STATUS will not be set, yet the overflow
status bit for the PEBS counter will be on Skylake.
To avoid this problem, we systematically ignore the PEBS-enabled
counters from the GLOBAL_STATUS mask and we always process PEBS
events via drain_pebs().
The problem manifested itself by having non-exact samples when
sampling only PEBS events, i.e., the PERF_SAMPLE_RECORD would
not have the EXACT flag set.
Note that this problem is only present on Skylake processor.
This fix is harmless on older processors.
Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1482395366-8992-1-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Joe Stringer [Fri, 9 Dec 2016 02:46:20 +0000 (18:46 -0800)]
samples/bpf: Move open_raw_sock to separate header
This function was declared in libbpf.c and was the only remaining
function in this library, but has nothing to do with BPF. Shift it out
into a new header, sock_example.h, and include it from the relevant
samples.
Signed-off-by: Joe Stringer <joe@ovn.org> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/20161209024620.31660-8-joe@ovn.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Also tested the offwaketime resulting from the rebuild, seems to work as
before.
Signed-off-by: Joe Stringer <joe@ovn.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/20161209024620.31660-7-joe@ovn.org
[ Use -I$(srctree)/tools/lib/ to support out of source code tree builds ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
[root@f5065a7d6272 linux]# make -j4 O=/tmp/build/linux samples/bpf/
<SNIP>
HOSTCC samples/bpf/fds_example.o
/git/linux/samples/bpf/fds_example.c: In function 'bpf_prog_create':
/git/linux/samples/bpf/fds_example.c:63:6: warning: passing argument 2 of 'bpf_load_program' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
insns, insns_cnt, "GPL", 0,
^~~~~
In file included from /git/linux/samples/bpf/libbpf.h:5:0,
from /git/linux/samples/bpf/bpf_load.h:4,
from /git/linux/samples/bpf/fds_example.c:15:
/git/linux/tools/lib/bpf/bpf.h:31:5: note: expected 'struct bpf_insn *' but argument is of type 'const struct bpf_insn *'
int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
^~~~~~~~~~~~~~~~
HOSTCC samples/bpf/sockex1_user.o
So just ditch that 'const' to reduce build noise, leaving changing the
bpf_load_program() bpf_insn parameter to const to a later patch, if deemed
adequate.
Cc: Joe Stringer <joe@ovn.org> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-1z5xee8n3oa66jf62bpv16ed@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Joe Stringer [Wed, 14 Dec 2016 22:05:26 +0000 (14:05 -0800)]
tools lib bpf: Add bpf_prog_{attach,detach}
Commit d8c5b17f2bc0 ("samples: bpf: add userspace example for attaching
eBPF programs to cgroups") added these functions to samples/libbpf, but
during this merge all of the samples libbpf functionality is shifting to
tools/lib/bpf. Shift these functions there.
Committer notes:
Use bzero + attr.FIELD = value instead of 'attr = { .FIELD = value, just
like the other wrapper calls to sys_bpf with bpf_attr to make this build
in older toolchais, such as the ones in CentOS 5 and 6.
Joe Stringer [Wed, 14 Dec 2016 22:43:39 +0000 (14:43 -0800)]
samples/bpf: Switch over to libbpf
Now that libbpf under tools/lib/bpf/* is synced with the version from
samples/bpf, we can get rid most of the libbpf library here.
Committer notes:
Built it in a docker fedora rawhide container and ran it in the f25 host, seems
to work just like it did before this patch, i.e. the switch to tools/lib/bpf/
doesn't seem to have introduced problems and Joe said he tested it with
all the entries in samples/bpf/ and other code he found:
[root@f5065a7d6272 linux]# make -j4 O=/tmp/build/linux headers_install
<SNIP>
[root@f5065a7d6272 linux]# rm -rf /tmp/build/linux/samples/bpf/
[root@f5065a7d6272 linux]# make -j4 O=/tmp/build/linux samples/bpf/
make[1]: Entering directory '/tmp/build/linux'
CHK include/config/kernel.release
HOSTCC scripts/basic/fixdep
GEN ./Makefile
CHK include/generated/uapi/linux/version.h
Using /git/linux as source for kernel
CHK include/generated/utsrelease.h
HOSTCC scripts/basic/bin2c
HOSTCC arch/x86/tools/relocs_32.o
HOSTCC arch/x86/tools/relocs_64.o
LD samples/bpf/built-in.o
<SNIP>
HOSTCC samples/bpf/fds_example.o
HOSTCC samples/bpf/sockex1_user.o
/git/linux/samples/bpf/fds_example.c: In function 'bpf_prog_create':
/git/linux/samples/bpf/fds_example.c:63:6: warning: passing argument 2 of 'bpf_load_program' discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
insns, insns_cnt, "GPL", 0,
^~~~~
In file included from /git/linux/samples/bpf/libbpf.h:5:0,
from /git/linux/samples/bpf/bpf_load.h:4,
from /git/linux/samples/bpf/fds_example.c:15:
/git/linux/tools/lib/bpf/bpf.h:31:5: note: expected 'struct bpf_insn *' but argument is of type 'const struct bpf_insn *'
int bpf_load_program(enum bpf_prog_type type, struct bpf_insn *insns,
^~~~~~~~~~~~~~~~
HOSTCC samples/bpf/sockex2_user.o
<SNIP>
HOSTCC samples/bpf/xdp_tx_iptunnel_user.o
clang -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/6.2.1/include -I/git/linux/arch/x86/include -I./arch/x86/include/generated/uapi -I./arch/x86/include/generated -I/git/linux/include -I./include -I/git/linux/arch/x86/include/uapi -I/git/linux/include/uapi -I./include/generated/uapi -include /git/linux/include/linux/kconfig.h \
-D__KERNEL__ -D__ASM_SYSREG_H -Wno-unused-value -Wno-pointer-sign \
-Wno-compare-distinct-pointer-types \
-Wno-gnu-variable-sized-type-not-at-end \
-Wno-address-of-packed-member -Wno-tautological-compare \
-O2 -emit-llvm -c /git/linux/samples/bpf/sockex1_kern.c -o -| llc -march=bpf -filetype=obj -o samples/bpf/sockex1_kern.o
HOSTLD samples/bpf/tc_l2_redirect
<SNIP>
HOSTLD samples/bpf/lwt_len_hist
HOSTLD samples/bpf/xdp_tx_iptunnel
make[1]: Leaving directory '/tmp/build/linux'
[root@f5065a7d6272 linux]#
And then, in the host:
[root@jouet bpf]# mount | grep "docker.*devicemapper\/"
/dev/mapper/docker-253:0-1705076-9bd8aa1e0af33adce89ff42090847868ca676932878942be53941a06ec5923f9 on /var/lib/docker/devicemapper/mnt/9bd8aa1e0af33adce89ff42090847868ca676932878942be53941a06ec5923f9 type xfs (rw,relatime,context="system_u:object_r:container_file_t:s0:c73,c276",nouuid,attr2,inode64,sunit=1024,swidth=1024,noquota)
[root@jouet bpf]# cd /var/lib/docker/devicemapper/mnt/9bd8aa1e0af33adce89ff42090847868ca676932878942be53941a06ec5923f9/rootfs/tmp/build/linux/samples/bpf/
[root@jouet bpf]# file offwaketime
offwaketime: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=f423d171e0487b2f802b6a792657f0f3c8f6d155, not stripped
[root@jouet bpf]# readelf -SW offwaketime
offwaketime offwaketime_kern.o offwaketime_user.o
[root@jouet bpf]# readelf -SW offwaketime_kern.o
There are 11 section headers, starting at offset 0x700:
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .strtab STRTAB 0000000000000000 000658 0000a8 00 0 0 1
[ 2] .text PROGBITS 0000000000000000 000040 000000 00 AX 0 0 4
[ 3] kprobe/try_to_wake_up PROGBITS 0000000000000000 000040 0000d8 00 AX 0 0 8
[ 4] .relkprobe/try_to_wake_up REL 0000000000000000 0005a8 000020 10 10 3 8
[ 5] tracepoint/sched/sched_switch PROGBITS 0000000000000000 000118 000318 00 AX 0 0 8
[ 6] .reltracepoint/sched/sched_switch REL 0000000000000000 0005c8 000090 10 10 5 8
[ 7] maps PROGBITS 0000000000000000 000430 000050 00 WA 0 0 4
[ 8] license PROGBITS 0000000000000000 000480 000004 00 WA 0 0 1
[ 9] version PROGBITS 0000000000000000 000484 000004 00 WA 0 0 4
[10] .symtab SYMTAB 0000000000000000 000488 000120 18 1 4 8
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings)
I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
O (extra OS processing required) o (OS specific), p (processor specific)
[root@jouet bpf]# ./offwaketime | head -3
qemu-system-x86;entry_SYSCALL_64_fastpath;sys_ppoll;do_sys_poll;poll_schedule_timeout;schedule_hrtimeout_range;schedule_hrtimeout_range_clock;schedule;__schedule;-;try_to_wake_up;hrtimer_wakeup;__hrtimer_run_queues;hrtimer_interrupt;local_apic_timer_interrupt;smp_apic_timer_interrupt;__irqentry_text_start;cpuidle_enter_state;cpuidle_enter;call_cpuidle;cpu_startup_entry;rest_init;start_kernel;x86_64_start_reservations;x86_64_start_kernel;start_cpu;;swapper/0 4
firefox;entry_SYSCALL_64_fastpath;sys_poll;do_sys_poll;poll_schedule_timeout;schedule_hrtimeout_range;schedule_hrtimeout_range_clock;schedule;__schedule;-;try_to_wake_up;pollwake;__wake_up_common;__wake_up_sync_key;pipe_write;__vfs_write;vfs_write;sys_write;entry_SYSCALL_64_fastpath;;Timer 1
swapper/2;start_cpu;start_secondary;cpu_startup_entry;schedule_preempt_disabled;schedule;__schedule;-;---;; 61
[root@jouet bpf]#
Kan Liang [Tue, 13 Dec 2016 15:29:44 +0000 (10:29 -0500)]
perf diff: Do not overwrite valid build id
Fixes a perf diff regression issue which was introduced by commit 5baecbcd9c9a ("perf symbols: we can now read separate debug-info files
based on a build ID")
The binary name could be same when perf diff different binaries. Build
id is used to distinguish between them.
However, the previous patch assumes the same binary name has same build
id. So it overwrites the build id according to the binary name,
regardless of whether the build id is set or not.
Check the has_build_id in dso__load. If the build id is already set, use
it.
Signed-off-by: Kan Liang <kan.liang@intel.com> Cc: Andi Kleen <andi@firstfloor.org> CC: Dima Kogan <dima@secretsauce.net> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Fixes: 5baecbcd9c9a ("perf symbols: we can now read separate debug-info files based on a build ID") Link: http://lkml.kernel.org/r/1481642984-13593-1-git-send-email-kan.liang@intel.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Ravi Bangoria [Tue, 22 Nov 2016 08:40:50 +0000 (14:10 +0530)]
perf annotate: Don't throw error for zero length symbols
'perf report --tui' exits with error when it finds a sample of zero
length symbol (i.e. addr == sym->start == sym->end). Actually these are
valid samples. Don't exit TUI and show report with such symbols.
Reported-and-Tested-by: Anton Blanchard <anton@samba.org> Link: https://lkml.org/lkml/2016/10/8/189 Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Chris Riyder <chris.ryder@arm.com> Cc: linuxppc-dev@lists.ozlabs.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: stable@kernel.org # v4.9+ Link: http://lkml.kernel.org/r/1479804050-5028-1-git-send-email-ravi.bangoria@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Thu, 15 Dec 2016 19:56:54 +0000 (20:56 +0100)]
perf trace: Check if MAP_32BIT is defined (again)
There might be systems where MAP_32BIT is not defined, like some some
RHEL7 powerpc versions.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: David Ahern <dsahern@gmail.com> Cc: Kyle McMartin <kyle@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Fixes: 256763b01741 ("perf trace beauty mmap: Add more conditional defines") Link: http://lkml.kernel.org/r/1481831814-23683-1-git-send-email-jolsa@kernel.org
[ Changed the Fixme cset to the one removing the conditional switch case for MAP_32BIT ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
While testing Joe's conversion of samples/bpf/ to use tools/lib/bpf/ I noticed
some warnings building samples/bpf/ on a Fedora Rawhide container, with
clang/llvm 3.9 I noticed this:
[root@1e797fdfbf4f linux]# make -j4 O=/tmp/build/linux/ samples/bpf/
make[1]: Entering directory '/tmp/build/linux'
CHK include/config/kernel.release
GEN ./Makefile
CHK include/generated/uapi/linux/version.h
Using /git/linux as source for kernel
<SNIP>
HOSTCC samples/bpf/trace_output_user.o
/git/linux/samples/bpf/trace_output_user.c:64:6: warning: no previous
prototype for 'perf_event_read' [-Wmissing-prototypes]
void perf_event_read(print_fn fn)
^~~~~~~~~~~~~~~
HOSTLD samples/bpf/trace_output
make[1]: Leaving directory '/tmp/build/linux'
Shut up the compiler by making that function static.
Acked-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@fb.com> Cc: Joe Stringer <joe@ovn.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/r/20161215152927.GC6866@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
uprobes: Fix uprobes on MIPS, allow for a cache flush after ixol breakpoint creation
Commit:
72e6ae285a1d ('ARM: 8043/1: uprobes need icache flush after xol write'
... has introduced an arch-specific method to ensure all caches are
flushed appropriately after an instruction is written to an XOL page.
However, when the XOL area is created and the out-of-line breakpoint
instruction is copied, caches are not flushed at all and stale data may
be found in icache.
Replace a simple copy_to_page() with arch_uprobe_copy_ixol() to allow
the arch to ensure all caches are updated accordingly.
This change fixes uprobes on MIPS InterAptiv (tested on Creator Ci40).
Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Victor Kamensky <victor.kamensky@linaro.org> Cc: linux-mips@linux-mips.org Link: http://lkml.kernel.org/r/1481625657-22850-1-git-send-email-marcin.nowakowski@imgtec.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Joe Stringer [Wed, 14 Dec 2016 22:43:38 +0000 (14:43 -0800)]
samples/bpf: Make samples more libbpf-centric
Switch all of the sample code to use the function names from
tools/lib/bpf so that they're consistent with that, and to declare their
own log buffers. This allow the next commit to be purely devoted to
getting rid of the duplicate library in samples/bpf.
Committer notes:
Testing it:
On a fedora rawhide container, with clang/llvm 3.9, sharing the host
linux kernel git tree:
# make O=/tmp/build/linux/ headers_install
# make O=/tmp/build/linux -C samples/bpf/
Since I forgot to make it privileged, just tested it outside the
container, using what it generated:
Joe Stringer [Fri, 9 Dec 2016 02:46:16 +0000 (18:46 -0800)]
tools lib bpf: Add flags to bpf_create_map()
Commit 6c905981743 ("bpf: pre-allocate hash map elements") introduces
map_flags to bpf_attr for BPF_MAP_CREATE command. Expose this new
parameter in libbpf.
By exposing it, users can access flags such as whether or not to
preallocate the map.
Signed-off-by: Joe Stringer <joe@ovn.org> Acked-by: Wang Nan <wangnan0@huawei.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Link: http://lkml.kernel.org/r/20161209024620.31660-4-joe@ovn.org
[ Added clarifying comment made by Wang Nan ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Joe Stringer [Fri, 9 Dec 2016 02:46:15 +0000 (18:46 -0800)]
tools lib bpf: use __u32 from linux/types.h
Fixes the following issue when building without access to 'u32' type:
./tools/lib/bpf/bpf.h:27:23: error: unknown type name ‘u32’
Signed-off-by: Joe Stringer <joe@ovn.org> Acked-by: Wang Nan <wangnan0@huawei.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Link: http://lkml.kernel.org/r/20161209024620.31660-3-joe@ovn.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
The tools version of this header is out of date; update it to the latest
version from the kernel headers.
Signed-off-by: Joe Stringer <joe@ovn.org> Acked-by: Wang Nan <wangnan0@huawei.com> Cc: Alexei Starovoitov <ast@fb.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Link: http://lkml.kernel.org/r/20161209024620.31660-2-joe@ovn.org
[ Sync it harder, after merging with what was in net-next via perf/urgent via torvalds/master to get BPG_PROG_(AT|DE)TACH, etc ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Ravi Bangoria [Mon, 5 Dec 2016 15:56:47 +0000 (21:26 +0530)]
perf annotate: Fix jump target outside of function address range
If jump target is outside of function range, perf is not handling it
correctly. Especially when target address is lesser than function start
address, target offset will be negative. But, target address declared to
be unsigned, converts negative number into 2's complement. See below
example. Here target of 'jumpq' instruction at 34cf8 is 34ac0 which is
lesser than function start address(34cf0).
Ravi Bangoria [Mon, 5 Dec 2016 15:56:46 +0000 (21:26 +0530)]
perf annotate: Support jump instruction with target as second operand
Architectures like PowerPC have jump instructions that includes a target
address as a second operand. For example, 'bne cr7,0xc0000000000f6154'.
Add support for such instruction in perf annotate.
Jiri Olsa [Mon, 12 Dec 2016 10:35:43 +0000 (11:35 +0100)]
perf record: Force ignore_missing_thread for uid option
Enable perf_evsel::ignore_missing_thread for -u option to ignore
complete failure if any of the user's processes die between its
enumeration and time we open the event.
Committer notes:
While doing a 'make -j4 allmodconfig' we sometimes get into the race:
Before:
# perf record -u acme
Error:
The sys_perf_event_open() syscall returned with 3 (No such process) for event (cycles:ppp).
/bin/dmesg may provide additional information.
No CONFIG_PERF_EVENTS=y kernel support configured?
#
After:
[root@jouet ~]# perf record -u acme
WARNING: Ignored open failure for pid 9888
WARNING: Ignored open failure for pid 18059
[root@jouet ~]#
Which is an improvement, with the races not preventing the remaining threads
for the specified user from being monitored, but the message probably needs
further clarification.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1481538943-21874-6-git-send-email-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
When set true, it allows perf to ignore error of missing pid of perf
event syscall.
We remove missing thread id from the thread_map, so the rest of the
processing like ioctl and mmap won't get disturbed with -1 fd.
The reason for supporting this is to ease up monitoring group of pids,
that 'disappear' before perf opens their event. This currently leads
perf to report error and exit and makes perf record's -u option unusable
under certain setup.
With this change we will allow this race and ignore such failure with
following warning:
WARNING: Ignored open failure for pid 8605
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20161213074622.GA3084@krava Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Mon, 12 Dec 2016 10:35:41 +0000 (11:35 +0100)]
perf thread_map: Add thread_map__remove function
Add thread_map__remove function to remove thread from thread map.
Add automated test also.
Committer notes:
Testing it:
# perf test "Remove thread map"
39: Remove thread map : Ok
# perf test -v "Remove thread map"
39: Remove thread map :
--- start ---
test child forked, pid 4483
2 threads: 4482, 4483
1 thread: 4483
0 thread:
test child finished with 0
---- end ----
Remove thread map: Ok
#
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1481538943-21874-4-git-send-email-jolsa@kernel.org
[ Added stdlib.h, to get the free() declaration ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Mon, 12 Dec 2016 10:35:40 +0000 (11:35 +0100)]
perf evsel: Use variable instead of repeating lengthy FD macro
It's more readable and will ease up following patches.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1481538943-21874-3-git-send-email-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
I.e. those parameters/functions _are_ used, so ditch that misleading attribute.
Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: http://lkml.kernel.org/n/tip-13cqtjh0yojg5gzvpq1zzpl0@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Thu, 8 Dec 2016 14:47:54 +0000 (23:47 +0900)]
perf sched timehist: Add -I/--idle-hist option
The --idle-hist option is to analyze system idle state so which process
makes cpu to go idle. If this option is specified, non-idle events will
be skipped and processes switching to/from idle will be shown.
This option is mostly useful when used with --summary(-only) option. In
the idle-time summary view, idle time is accounted to previous thread
which is run before idle task.
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: David Ahern <dsahern@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20161208144755.16673-6-namhyung@kernel.org Link: http://lkml.kernel.org/r/20161213080632.19099-2-namhyung@kernel.org
[ Merged fix sent by Namhyumg, as posted in the second Link: tag ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Thu, 8 Dec 2016 14:47:52 +0000 (23:47 +0900)]
perf sched timehist: Save callchain when entering idle
In order to investigate the idleness reason, it is necessary to keep the
callchains when entering idle. This can be identified by the
sched:sched_switch event having the next_pid field as 0.
The struct idle_time_data is to keep idle stats with callchains entering
to the idle task. The normal thread_runtime calculation is done
transparently since it extends the struct thread_runtime.
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: David Ahern <dsahern@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20161208144755.16673-3-namhyung@kernel.org
[ Align struct field names ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Namhyung Kim [Thu, 8 Dec 2016 14:47:50 +0000 (23:47 +0900)]
perf sched timehist: Split is_idle_sample()
The is_idle_sample() function actually does more than determining
whether sample come from idle task. Split the callchain part into
save_task_callchain() to make it clearer.
Also checking prev_pid from trace data looks preferred than just
checking sample->pid since it's possible, although rare, to have invalid
0 pid/tid on scheduling an exiting task.
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: David Ahern <dsahern@gmail.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20161208144755.16673-2-namhyung@kernel.org
[ Remove some needless () in some return statements ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Jiri Olsa [Tue, 6 Dec 2016 13:18:51 +0000 (14:18 +0100)]
perf tools: Move headers check into bash script
To make it nicer and easily maintainable.
Also moving the check into fixdep sub make, so its output is not
scattered around the build output.
Removing extra $$ from mman*.h checks.
Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1481030331-31944-5-git-send-email-jolsa@kernel.org
[ Use /bin/sh, and 'function check() {' -> 'check () {' to make it work with busybox, in Alpine Linux, for instance ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Linus Torvalds [Tue, 13 Dec 2016 04:50:02 +0000 (20:50 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
- various misc bits
- most of MM (quite a lot of MM material is awaiting the merge of
linux-next dependencies)
- kasan
- printk updates
- procfs updates
- MAINTAINERS
- /lib updates
- checkpatch updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (123 commits)
init: reduce rootwait polling interval time to 5ms
binfmt_elf: use vmalloc() for allocation of vma_filesz
checkpatch: don't emit unified-diff error for rename-only patches
checkpatch: don't check c99 types like uint8_t under tools
checkpatch: avoid multiple line dereferences
checkpatch: don't check .pl files, improve absolute path commit log test
scripts/checkpatch.pl: fix spelling
checkpatch: don't try to get maintained status when --no-tree is given
lib/ida: document locking requirements a bit better
lib/rbtree.c: fix typo in comment of ____rb_erase_color
lib/Kconfig.debug: make CONFIG_STRICT_DEVMEM depend on CONFIG_DEVMEM
MAINTAINERS: add drm and drm/i915 irc channels
MAINTAINERS: add "C:" for URI for chat where developers hang out
MAINTAINERS: add drm and drm/i915 bug filing info
MAINTAINERS: add "B:" for URI where to file bugs
get_maintainer: look for arbitrary letter prefixes in sections
printk: add Kconfig option to set default console loglevel
printk/sound: handle more message headers
printk/btrfs: handle more message headers
printk/kdb: handle more message headers
...
Linus Torvalds [Tue, 13 Dec 2016 04:23:11 +0000 (20:23 -0800)]
Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"The irq department provides:
- a major update to the auto affinity management code, which is used
by multi-queue devices
- move of the microblaze irq chip driver into the common driver code
so it can be shared between microblaze, powerpc and MIPS
- a series of updates to the ARM GICV3 interrupt controller
- the usual pile of fixes and small improvements all over the place"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
powerpc/virtex: Use generic xilinx irqchip driver
irqchip/xilinx: Try to fall back if xlnx,kind-of-intr not provided
irqchip/xilinx: Add support for parent intc
irqchip/xilinx: Rename get_irq to xintc_get_irq
irqchip/xilinx: Restructure and use jump label api
irqchip/xilinx: Clean up print messages
microblaze/irqchip: Move intc driver to irqchip
ARM: virt: Select ARM_GIC_V3_ITS
ARM: gic-v3-its: Add 32bit support to GICv3 ITS
irqchip/gic-v3-its: Specialise readq and writeq accesses
irqchip/gic-v3-its: Specialise flush_dcache operation
irqchip/gic-v3-its: Narrow down Entry Size when used as a divider
irqchip/gic-v3-its: Change unsigned types for AArch32 compatibility
irqchip/gic-v3: Use nops macro for Cavium ThunderX erratum 23154
irqchip/gic-v3: Convert arm64 GIC accessors to {read,write}_sysreg_s
genirq/msi: Drop artificial PCI dependency
irqchip/bcm7038-l1: Implement irq_cpu_offline() callback
genirq/affinity: Use default affinity mask for reserved vectors
genirq/affinity: Take reserved vectors into account when spreading irqs
PCI: Remove the irq_affinity mask from struct pci_dev
...
Linus Torvalds [Tue, 13 Dec 2016 03:56:15 +0000 (19:56 -0800)]
Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"The time/timekeeping/timer folks deliver with this update:
- Fix a reintroduced signed/unsigned issue and cleanup the whole
signed/unsigned mess in the timekeeping core so this wont happen
accidentaly again.
- Add a new trace clock based on boot time
- Prevent injection of random sleep times when PM tracing abuses the
RTC for storage
- Make posix timers configurable for real tiny systems
- Add tracepoints for the alarm timer subsystem so timer based
suspend wakeups can be instrumented
- The usual pile of fixes and updates to core and drivers"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
timekeeping: Use mul_u64_u32_shr() instead of open coding it
timekeeping: Get rid of pointless typecasts
timekeeping: Make the conversion call chain consistently unsigned
timekeeping_Force_unsigned_clocksource_to_nanoseconds_conversion
alarmtimer: Add tracepoints for alarm timers
trace: Update documentation for mono, mono_raw and boot clock
trace: Add an option for boot clock as trace clock
timekeeping: Add a fast and NMI safe boot clock
timekeeping/clocksource_cyc2ns: Document intended range limitation
timekeeping: Ignore the bogus sleep time if pm_trace is enabled
selftests/timers: Fix spelling mistake "Asyncrhonous" -> "Asynchronous"
clocksource/drivers/bcm2835_timer: Unmap region obtained by of_iomap
clocksource/drivers/arm_arch_timer: Map frame with of_io_request_and_map()
arm64: dts: rockchip: Arch counter doesn't tick in system suspend
clocksource/drivers/arm_arch_timer: Don't assume clock runs in suspend
posix-timers: Make them configurable
posix_cpu_timers: Move the add_device_randomness() call to a proper place
timer: Move sys_alarm from timer.c to itimer.c
ptp_clock: Allow for it to be optional
Kconfig: Regenerate *.c_shipped files after previous changes
...
Linus Torvalds [Tue, 13 Dec 2016 03:25:04 +0000 (19:25 -0800)]
Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp hotplug updates from Thomas Gleixner:
"This is the final round of converting the notifier mess to the state
machine. The removal of the notifiers and the related infrastructure
will happen around rc1, as there are conversions outstanding in other
trees.
The whole exercise removed about 2000 lines of code in total and in
course of the conversion several dozen bugs got fixed. The new
mechanism allows to test almost every hotplug step standalone, so
usage sites can exercise all transitions extensively.
There is more room for improvement, like integrating all the
pointlessly different architecture mechanisms of synchronizing,
setting cpus online etc into the core code"
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
tracing/rb: Init the CPU mask on allocation
soc/fsl/qbman: Convert to hotplug state machine
soc/fsl/qbman: Convert to hotplug state machine
zram: Convert to hotplug state machine
KVM/PPC/Book3S HV: Convert to hotplug state machine
arm64/cpuinfo: Convert to hotplug state machine
arm64/cpuinfo: Make hotplug notifier symmetric
mm/compaction: Convert to hotplug state machine
iommu/vt-d: Convert to hotplug state machine
mm/zswap: Convert pool to hotplug state machine
mm/zswap: Convert dst-mem to hotplug state machine
mm/zsmalloc: Convert to hotplug state machine
mm/vmstat: Convert to hotplug state machine
mm/vmstat: Avoid on each online CPU loops
mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead()
tracing/rb: Convert to hotplug state machine
oprofile/nmi timer: Convert to hotplug state machine
net/iucv: Use explicit clean up labels in iucv_init()
x86/pci/amd-bus: Convert to hotplug state machine
x86/oprofile/nmi: Convert to hotplug state machine
...
Jungseung Lee [Tue, 13 Dec 2016 00:46:43 +0000 (16:46 -0800)]
init: reduce rootwait polling interval time to 5ms
For several devices, the rootwait time is sensitive because it directly
affects booting time. The polling interval of rootwait is currently
100ms. To save unnessesary waiting time, reduce the polling interval to
5 ms.
[akpm@linux-foundation.org: remove used-once #define] Link: http://lkml.kernel.org/r/20161207060743.1728-1-js07.lee@samsung.com Signed-off-by: Jungseung Lee <js07.lee@samsung.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jason Baron [Tue, 13 Dec 2016 00:46:40 +0000 (16:46 -0800)]
binfmt_elf: use vmalloc() for allocation of vma_filesz
We have observed page allocations failures of order 4 during core dump
while trying to allocate vma_filesz. This results in a useless core
file of size 0. To improve reliability use vmalloc().
Note that the vmalloc() allocation is bounded by sysctl_max_map_count,
which is 65,530 by default. So with a 4k page size, and 8 bytes per
seg, this is a max of 128 pages or an order 7 allocation. Other parts
of the core dump path, such as fill_files_note() are already using
vmalloc() for presumably similar reasons.
Andrew Jeffery [Tue, 13 Dec 2016 00:46:37 +0000 (16:46 -0800)]
checkpatch: don't emit unified-diff error for rename-only patches
I generated a patch with `git format-patch` which checkpatch thinks is
invalid:
$ ./scripts/checkpatch.pl lpc-dt/0006-mfd-dt-Move-syscon-bindings-to-syscon-subdirectory.patch
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
Documentation/devicetree/bindings/mfd/{ => syscon}/aspeed-scu.txt | 0
ERROR: Does not appear to be a unified-diff format patch
total: 1 errors, 1 warnings, 0 lines checked
NOTE: For some of the reported defects, checkpatch may be able to
mechanically convert to the typical style using --fix or --fix-inplace.
lpc-dt/0006-mfd-dt-Move-syscon-bindings-to-syscon-subdirectory.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
The patch in question was all renames with no edits, giving 100%
similarity and thus no diff markers.
Set '$is_patch = 1;' in the add/remove/rename detection to avoid
generating spurious warnings.
Link: http://lkml.kernel.org/r/20161205232224.22685-1-andrew@aj.id.au Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Acked-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Daniel Vetter [Tue, 13 Dec 2016 00:46:20 +0000 (16:46 -0800)]
lib/ida: document locking requirements a bit better
I wanted to wrap a bunch of ida_simple_get calls into their own locking,
until I dug around and read the original commit message. Stuff like
this should imo be added to the kernel doc, let's do that.
Link: http://lkml.kernel.org/r/20161027072216.20411-1-daniel.vetter@ffwll.ch Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jani Nikula [Tue, 13 Dec 2016 00:46:11 +0000 (16:46 -0800)]
MAINTAINERS: add drm and drm/i915 irc channels
Link: http://lkml.kernel.org/r/1476966135-26943-4-git-send-email-jani.nikula@intel.com Signed-off-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jani Nikula [Tue, 13 Dec 2016 00:46:08 +0000 (16:46 -0800)]
MAINTAINERS: add "C:" for URI for chat where developers hang out
Make it easier to find the developer chat for the subsystem or driver.
Link: http://lkml.kernel.org/r/1476966135-26943-3-git-send-email-jani.nikula@intel.com Signed-off-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jani Nikula [Tue, 13 Dec 2016 00:46:05 +0000 (16:46 -0800)]
MAINTAINERS: add drm and drm/i915 bug filing info
Link: http://lkml.kernel.org/r/1476966135-26943-2-git-send-email-jani.nikula@intel.com Signed-off-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jani Nikula [Tue, 13 Dec 2016 00:46:02 +0000 (16:46 -0800)]
MAINTAINERS: add "B:" for URI where to file bugs
Different subsystems and drivers have different preferences for where to
file bugs and what information to include. Add "B:" entry for
specifying the URI for the bug tracker directly, a web page for detailed
info on filing bugs, or a mailto: URI.
Link: http://lkml.kernel.org/r/1476966135-26943-1-git-send-email-jani.nikula@intel.com Signed-off-by: Jani Nikula <jani.nikula@intel.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Daniel Vetter <daniel@ffwll.ch> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joe Perches [Tue, 13 Dec 2016 00:45:59 +0000 (16:45 -0800)]
get_maintainer: look for arbitrary letter prefixes in sections
Jani Nikula proposes patches to add a few new letter prefixes for "B:"
bug reporting and "C:" maintainer chatting to the various sections of
MAINTAINERS.
Add a generic mechanism to get_maintainer.pl to find sections that have
any combination of "[A-Z]" letter prefix types in a section.
Link: http://lkml.kernel.org/r/1477332323.1984.8.camel@perches.com Signed-off-by: Joe Perches <joe@perches.com> Cc: Jani Nikula <jani.nikula@intel.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Olof Johansson [Tue, 13 Dec 2016 00:45:56 +0000 (16:45 -0800)]
printk: add Kconfig option to set default console loglevel
Add a configuration option to set the default console loglevel. This
is, as before, still possible to override at runtime through bootargs
(loglevel=<x>), sysrq and /proc/printk.
There are cases where adding additional arguments on the commandline is
impractical, and changing the default for the kernel when being built
makes more sense. Provide such a method here, for those who choose to
do so.
Also, while touching this code, clarify the difference between
MESSAGE_LOGLEVEL_DEFAULT and CONSOLE_LOGLEVEL_DEFAULT.
Petr Mladek [Tue, 13 Dec 2016 00:45:53 +0000 (16:45 -0800)]
printk/sound: handle more message headers
Commit 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing
continuation lines") allows to define more message headers for a single
message. The motivation is that continuous lines might get mixed.
Therefore it make sense to define the right log level for every piece of
a cont line.
This patch allows to copy only the real message level. We should ignore
KERN_CONT because <filename:line> is added for each message. By other
words, we want to know where each piece of the line comes from.
[pmladek@suse.com: fix a check of the valid message level] Link: http://lkml.kernel.org/r/20161111183444.GE2145@dhcp128.suse.cz Link: http://lkml.kernel.org/r/1478695291-12169-5-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Joe Perches <joe@perches.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: David Sterba <dsterba@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Petr Mladek [Tue, 13 Dec 2016 00:45:50 +0000 (16:45 -0800)]
printk/btrfs: handle more message headers
Commit 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing
continuation lines") allows to define more message headers for a single
message. The motivation is that continuous lines might get mixed.
Therefore it make sense to define the right log level for every piece of
a cont line.
The current btrfs_printk() macros do not support continuous lines at the
moment. But better be prepared for a custom messages and avoid
potential "lvl" buffer overflow.
This patch iterates over the entire message header. It is interested
only into the message level like the original code.
This patch also introduces PRINTK_MAX_SINGLE_HEADER_LEN. Three bytes
are enough for the message level header at the moment. But it used to
be three, see the commit 04d2c8c83d0e ("printk: convert the format for
KERN_<LEVEL> to a 2 byte pattern").
Also I fixed the default ratelimit level. It looked very strange when it
was different from the default log level.
[pmladek@suse.com: Fix a check of the valid message level] Link: http://lkml.kernel.org/r/20161111183236.GD2145@dhcp128.suse.cz Link: http://lkml.kernel.org/r/1478695291-12169-4-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: David Sterba <dsterba@suse.com> Cc: Joe Perches <joe@perches.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Petr Mladek [Tue, 13 Dec 2016 00:45:47 +0000 (16:45 -0800)]
printk/kdb: handle more message headers
Commit 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing
continuation lines") allows to define more message headers for a single
message. The motivation is that continuous lines might get mixed.
Therefore it make sense to define the right log level for every piece of
a cont line.
This patch introduces printk_skip_headers() that will skip all headers
and uses it in the kdb code instead of printk_skip_level().
This approach helps to fix other printk_skip_level() users
independently.
Link: http://lkml.kernel.org/r/1478695291-12169-3-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Joe Perches <joe@perches.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: David Sterba <dsterba@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Petr Mladek [Tue, 13 Dec 2016 00:45:44 +0000 (16:45 -0800)]
printk/NMI: handle continuous lines and missing newline
Commit 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing
continuation lines") added back KERN_CONT message header. As a result
it might appear in the middle of the line when the parts are squashed
via the temporary NMI buffer.
A reasonable solution seems to be to split the text in the NNI temporary
not only by newlines but also by the message headers.
Another solution would be to filter out KERN_CONT when writing to the
temporary buffer. But this would complicate the lockless handling.
Also it would not solve problems with a missing newline that was there
even before the KERN_CONT stuff.
This patch moves the temporary buffer handling into separate function.
I played with it and it seems that using the char pointers make the code
easier to read.
Also it prints the final newline as a continuous line.
Finally, it moves handling of the s->len overflow into the paranoid
check. And allows to recover from the disaster.
Link: http://lkml.kernel.org/r/1478695291-12169-2-git-send-email-pmladek@suse.com Signed-off-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: David Sterba <dsterba@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Petr Mladek [Tue, 13 Dec 2016 00:45:40 +0000 (16:45 -0800)]
printk/NMI: fix up handling of the full nmi log buffer
vsnprintf() adds the trailing '\0' but it does not count it into the
number of printed characters. The result is that there is one byte less
space for the real characters in the buffer.
The broken check for the free space might cause that we will repeatedly
try to print 1 character into the buffer, never reach the full buffer,
and do not count the messages as missed.
Also vsnprintf() returns the number of characters that would be printed
if the buffer was big enough. As a result, s->len might be bigger than
the size of the buffer[*]. And the printk() function might return
bigger len than it really printed. Both problems are fixed by using
vscnprintf() instead.
Note that I though about increasing the number of missed messages even
when the message was shrunken. But it made the code even more
complicated. I think that it is not worth it. Shrunken messages are
usually easy to recognize. And it should be a corner case.
[*] The overflown s->len value is crazy and unexpected. I "made a
mistake" and reported this situation as an internal error when fixed
handling of PR_CONT headers in some other patch.
Link: http://lkml.kernel.org/r/20161208174912.GA17042@linux.suse Signed-off-by: Petr Mladek <pmladek@suse.com>
CcL Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Chris Mason <clm@fb.com> Cc: David Sterba <dsterba@suse.com> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Joe Perches <joe@perches.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Takashi Iwai <tiwai@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Tue, 13 Dec 2016 00:45:35 +0000 (16:45 -0800)]
hung_task: decrement sysctl_hung_task_warnings only if it is positive
Since sysctl_hung_task_warnings == -1 is allowed (infinite warnings),
commit 48a6d64edadb ("hung_task: allow hung_task_panic when
hung_task_warnings is 0") should decrement it only when it is not -1.
This prevents the kernel from ceasing warnings after the first 4294967295 ;)
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: John Siddle <jsiddle@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rasmus Villemoes [Tue, 13 Dec 2016 00:45:25 +0000 (16:45 -0800)]
fs/proc/array.c: slightly improve render_sigset_t
format_decode and vsnprintf occasionally show up in perf top, so I went
looking for places that might not need the full printf power. With the
help of kprobes, I gathered some statistics on which format strings we
mostly pass to vsnprintf. On a trivial desktop workload, I hit "%x" 25%
of the time, so something apparently reads /proc/pid/status (which does
5*16 printf("%x") calls) a lot.
With this patch, reading /proc/pid/status is 30% faster according to
this microbenchmark:
char buf[4096];
int i, fd;
for (i = 0; i < 10000; ++i) {
fd = open("/proc/self/status", O_RDONLY);
read(fd, buf, sizeof(buf));
close(fd);
}
Kees Cook [Tue, 13 Dec 2016 00:45:05 +0000 (16:45 -0800)]
proc: report no_new_privs state
Similar to being able to examine if a process has been correctly
confined with seccomp, the state of no_new_privs is equally interesting,
so this adds it to /proc/$pid/status.
Link: http://lkml.kernel.org/r/20161103214041.GA58566@beast Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Jann Horn <jann@thejh.net> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Rodrigo Freire <rfreire@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Robert Ho <robert.hu@intel.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Richard W.M. Jones" <rjones@redhat.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zijun_hu [Tue, 13 Dec 2016 00:45:02 +0000 (16:45 -0800)]
mm/percpu.c: fix panic triggered by BUG_ON() falsely
As shown by pcpu_build_alloc_info(), the number of units within a percpu
group is deduced by rounding up the number of CPUs within the group to
@upa boundary/ Therefore, the number of CPUs isn't equal to the units's
if it isn't aligned to @upa normally. However, pcpu_page_first_chunk()
uses BUG_ON() to assert that one number is equal to the other roughly,
so a panic is maybe triggered by the BUG_ON() incorrectly.
In order to fix this issue, the number of CPUs is rounded up then
compared with units's and the BUG_ON() is replaced with a warning and
return of an error code as well, to keep system alive as much as
possible.
Andrey Ryabinin [Tue, 13 Dec 2016 00:44:59 +0000 (16:44 -0800)]
kasan: turn on -fsanitize-address-use-after-scope
In the upcoming gcc7 release, the -fsanitize=kernel-address option at
first implied new -fsanitize-address-use-after-scope option. This would
cause link errors on older kernels because they don't have two new
functions required for use-after-scope support. Therefore, gcc7 changed
default to -fno-sanitize-address-use-after-scope.
Now the kernel has everything required for that feature since commit 828347f8f9a5 ("kasan: support use-after-scope detection"). So, to make it
work, we just have to enable use-after-scope in CFLAGS.
Dmitry Vyukov [Tue, 13 Dec 2016 00:44:56 +0000 (16:44 -0800)]
kasan: eliminate long stalls during quarantine reduction
Currently we dedicate 1/32 of RAM for quarantine and then reduce it by
1/4 of total quarantine size. This can be a significant amount of
memory. For example, with 4GB of RAM total quarantine size is 128MB and
it is reduced by 32MB at a time. With 128GB of RAM total quarantine
size is 4GB and it is reduced by 1GB. This leads to several problems:
- freeing 1GB can take tens of seconds, causes rcu stall warnings and
just introduces unexpected long delays at random places
- if kmalloc() is called under a mutex, other threads stall on that
mutex while a thread reduces quarantine
- threads wait on quarantine_lock while one thread grabs a large batch
of objects to evict
- we walk the uncached list of object to free twice which makes all of
the above worse
- when a thread frees objects, they are already not accounted against
global_quarantine.bytes; as the result we can have quarantine_size
bytes in quarantine + unbounded amount of memory in large batches in
threads that are in process of freeing
Reduce size of quarantine in smaller batches to reduce the delays. The
only reason to reduce it in batches is amortization of overheads, the
new batch size of 1MB should be well enough to amortize spinlock
lock/unlock and few function calls.
Plus organize quarantine as a FIFO array of batches. This allows to not
walk the list in quarantine_reduce() under quarantine_lock, which in
turn reduces contention and is just faster.
This improves performance of heavy load (syzkaller fuzzing) by ~20% with
4 CPUs and 32GB of RAM. Also this eliminates frequent (every 5 sec)
drops of CPU consumption from ~400% to ~100% (one thread reduces
quarantine while others are waiting on a mutex).
Some reference numbers:
1. Machine with 4 CPUs and 4GB of memory. Quarantine size 128MB.
Currently we free 32MB at at time.
With new code we free 1MB at a time (1024 batches, ~128 are used).
2. Machine with 32 CPUs and 128GB of memory. Quarantine size 4GB.
Currently we free 1GB at at time.
With new code we free 8MB at a time (1024 batches, ~512 are used).
3. Machine with 4096 CPUs and 1TB of memory. Quarantine size 32GB.
Currently we free 8GB at at time.
With new code we free 4MB at a time (16K batches, ~8K are used).
Dmitry Vyukov [Tue, 13 Dec 2016 00:44:53 +0000 (16:44 -0800)]
kasan: support panic_on_warn
If user sets panic_on_warn, he wants kernel to panic if there is
anything barely wrong with the kernel. KASAN-detected errors are
definitely not less benign than an arbitrary kernel WARNING.
Panic after KASAN errors if panic_on_warn is set.
We use this for continuous fuzzing where we want kernel to stop and
reboot on any error.
Hugh Dickins [Tue, 13 Dec 2016 00:44:50 +0000 (16:44 -0800)]
mm: make transparent hugepage size public
Test programs want to know the size of a transparent hugepage. While it
is commonly the same as the size of a hugetlbfs page (shown as
Hugepagesize in /proc/meminfo), that is not always so: powerpc
implements transparent hugepages in a different way from hugetlbfs
pages, so it's coincidence when their sizes are the same; and x86 and
others can support more than one hugetlbfs page size.
Add /sys/kernel/mm/transparent_hugepage/hpage_pmd_size to show the THP
size in bytes - it's the same for Anonymous and Shmem hugepages. Call
it hpage_pmd_size (after HPAGE_PMD_SIZE) rather than hpage_size, in case
some transparent support for pud and pgd pages is added later.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052200290.13021@eggly.anvils Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 13 Dec 2016 00:44:47 +0000 (16:44 -0800)]
mm: add cond_resched() in gather_pte_stats()
The other pagetable walks in task_mmu.c have a cond_resched() after
walking their ptes: add a cond_resched() in gather_pte_stats() too, for
reading /proc/<id>/numa_maps. Only pagemap_pmd_range() has a
cond_resched() in its (unusually expensive) pmd_trans_huge case: more
should probably be added, but leave them unchanged for now.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052157400.13021@eggly.anvils Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 13 Dec 2016 00:44:44 +0000 (16:44 -0800)]
mm: add three more cond_resched() in swapoff
Add a cond_resched() in the unuse_pmd_range() loop (so as to call it
even when pmd none or trans_huge, like zap_pmd_range() does); and in the
unuse_mm() loop (since that might skip over many vmas). shmem_unuse()
and radix_tree_locate_item() look good enough already.
Those were the obvious places, but in fact the stalls came from
find_next_to_unuse(), which sometimes scans through many unused entries.
Apply scan_swap_map()'s LATENCY_LIMIT of 256 there too; and only go off
to test frontswap_map when a used entry is found.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1612052155140.13021@eggly.anvils Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Eric Dumazet <edumazet@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Tue, 13 Dec 2016 00:44:41 +0000 (16:44 -0800)]
mm, page_alloc: keep pcp count and list contents in sync if struct page is corrupted
Vlastimil Babka pointed out that commit 479f854a207c ("mm, page_alloc:
defer debugging checks of pages allocated from the PCP") will allow the
per-cpu list counter to be out of sync with the per-cpu list contents if
a struct page is corrupted.
The consequence is an infinite loop if the per-cpu lists get fully
drained by free_pcppages_bulk because all the lists are empty but the
count is positive. The infinite loop occurs here
do {
batch_free++;
if (++migratetype == MIGRATE_PCPTYPES)
migratetype = 0;
list = &pcp->lists[migratetype];
} while (list_empty(list));
What the user sees is a bad page warning followed by a soft lockup with
interrupts disabled in free_pcppages_bulk().
This patch keeps the accounting in sync.
Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP") Link: http://lkml.kernel.org/r/20161202112951.23346-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> [4.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Tue, 13 Dec 2016 00:44:38 +0000 (16:44 -0800)]
mm, rmap: handle anon_vma_prepare() common case inline
anon_vma_prepare() is mostly a large "if (unlikely(...))" block, as the
expected common case is that an anon_vma already exists. We could turn
the condition around and return 0, but it also makes sense to do it
inline and avoid a call for the common case.
Bloat-o-meter naturally shows that inlining the check has some code size
costs:
Vlastimil Babka [Tue, 13 Dec 2016 00:44:35 +0000 (16:44 -0800)]
mm, debug: print raw struct page data in __dump_page()
__dump_page() is used when a page metadata inconsistency is detected,
either by standard runtime checks, or extra checks in CONFIG_DEBUG_VM
builds. It prints some of the relevant metadata, but not the whole
struct page, which is based on unions and interpretation is dependent on
the context.
This means that sometimes e.g. a VM_BUG_ON_PAGE() checks certain field,
which is however not printed by __dump_page() and the resulting bug
report may then lack clues that could help in determining the root
cause. This patch solves the problem by simply printing the whole
struct page word by word, so no part is missing, but the interpretation
of the data is left to developers. This is similar to e.g. x86_64 raw
stack dumps.
Add arch specific callback in the generic THP page cache code that will
deposit and withdarw preallocated page table. Archs like ppc64 use this
preallocated table to store the hash pte slot information.
Testing:
kernel build of the patch series on tmpfs mounted with option huge=always