Namhyung Kim [Wed, 30 Oct 2013 08:05:55 +0000 (17:05 +0900)]
perf top: Add --children option
The --children option is for showing accumulated overhead (period)
value as well as self overhead. It should be used with one of -g or
--call-graph option.
Namhyung Kim [Tue, 7 Jan 2014 08:41:03 +0000 (17:41 +0900)]
perf top: Convert to hist_entry_iter
Reuse hist_entry_iter__add() function to share the similar code with
perf report. Note that it needs to be called with hists.lock so tweak
some internal functions not to deadlock or hold the lock too long.
Namhyung Kim [Tue, 7 Jan 2014 08:02:25 +0000 (17:02 +0900)]
perf tools: Add callback function to hist_entry_iter
The new ->add_entry_cb() will be called after an entry was added to
the histogram. It's used for code sharing between perf report and
perf top. Note that ops->add_*_entry() should set iter->he properly
in order to call the ->add_entry_cb.
Also pass @arg to the callback function. It'll be used by perf top
later.
Namhyung Kim [Thu, 20 Mar 2014 00:10:29 +0000 (09:10 +0900)]
perf tools: Do not auto-remove Children column if --fields given
Depending on the configuration perf inserts/removes the Children
column in the output automatically. But it might not be what user
wants if [s]he give --fields option explicitly.
Namhyung Kim [Tue, 22 Jan 2013 09:09:46 +0000 (18:09 +0900)]
perf report: Add report.children config option
Add report.children config option for setting default value of
callchain accumulation. It affects the report output only if
perf.data contains callchain info.
A user can write .perfconfig file like below to enable accumulation
by default:
Namhyung Kim [Thu, 31 Oct 2013 01:17:39 +0000 (10:17 +0900)]
perf tools: Apply percent-limit to cumulative percentage
If -g cumulative option is given, it needs to show entries which don't
have self overhead. So apply percent-limit to accumulated overhead
percentage in this case.
Namhyung Kim [Wed, 30 Oct 2013 07:06:59 +0000 (16:06 +0900)]
perf ui/hist: Add support to accumulated hist stat
Print accumulated stat of a hist entry if requested.
To do that, add new HPP_PERCENT_ACC_FNS macro and generate a
perf_hpp_fmt using it. The __hpp__sort_acc() function sorts entries
by accumulated period value. When accumulated periods of two entries
are same (i.e. single path callchain) put the caller above since
accumulation tends to put callers on higher position for obvious
reason.
Also add "overhead_children" output field to be selected by user.
Namhyung Kim [Thu, 31 Oct 2013 01:05:29 +0000 (10:05 +0900)]
perf report: Cache cumulative callchains
It is possble that a callchain has cycles or recursive calls. In that
case it'll end up having entries more than 100% overhead in the
output. In order to prevent such entries, cache each callchain node
and skip if same entry already cumulated.
Namhyung Kim [Thu, 31 Oct 2013 04:58:30 +0000 (13:58 +0900)]
perf tools: Update cpumode for each cumulative entry
The cpumode and level in struct addr_localtion was set for a sample
and but updated as cumulative callchains were added. This led to have
non-matching symbol and cpumode in the output.
Update it accordingly based on the fact whether the map is a part of
the kernel or not. This is a reverse of what thread__find_addr_map()
does.
Namhyung Kim [Tue, 11 Sep 2012 04:34:27 +0000 (13:34 +0900)]
perf hists: Check if accumulated when adding a hist entry
To support callchain accumulation, @entry should be recognized if it's
accumulated or not when add_hist_entry() called. The period of an
accumulated entry should be added to ->stat_acc but not ->stat. Add
@sample_self arg for that.
Namhyung Kim [Tue, 11 Sep 2012 04:15:07 +0000 (13:15 +0900)]
perf hists: Add support for accumulated stat of hist entry
Maintain accumulated stat information in hist_entry->stat_acc if
symbol_conf.cumulate_callchain is set. Fields in ->stat_acc have same
vaules initially, and will be updated as callchain is processed later.
Namhyung Kim [Wed, 30 Oct 2013 00:40:34 +0000 (09:40 +0900)]
perf tools: Introduce struct hist_entry_iter
There're some duplicate code when adding hist entries. They are
different in that some have branch info or mem info but generally do
same thing. So introduce new struct hist_entry_iter and add callbacks
to customize each case in general way.
The new perf_evsel__add_entry() function will look like:
iter->prepare_entry();
iter->add_single_entry();
while (iter->next_entry())
iter->add_next_entry();
iter->finish_entry();
This will help further work like the cumulative callchain patchset.
Signed-off-by: Namhyung Kim <namhyung@kernel.org> Tested-by: Arun Sharma <asharma@fb.com> Tested-by: Rodrigo Campos <rodrigo@sdfg.com.ar> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/1401335910-16832-3-git-send-email-namhyung@kernel.org Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Michael Lentine [Tue, 20 May 2014 09:48:50 +0000 (11:48 +0200)]
perf tools: Add automatic remapping of Android libraries
This patch automatically adjusts the path of MMAP records
associated with Android system libraries.
The Android system is organized with system libraries found in
/system/lib and user libraries in /data/app-lib. On the host system
(not running Android), system libraries can be found in the downloaded
NDK directory under ${NDK_ROOT}/platforms/${APP_PLATFORM}/arch-${ARCH}/usr/lib
and the user libraries are installed under libs/${APP_ABI} within
the apk build directory. This patch makes running the reporting
tools possible on the host system using the libraries from the NDK.
Signed-off-by: Michael Lentine <mlentine@google.com> Reviewed-by: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/1400579330-5043-3-git-send-email-eranian@google.com
[ fixed 'space required before the open parenthesis' checkpatch.pl errors ] Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Michael Lentine [Tue, 20 May 2014 09:48:49 +0000 (11:48 +0200)]
perf tools: Add cat as fallback pager
This patch adds a fallback to cat for the pager. This is useful
on environments, such as Android, where less does not exist.
It is better to default to cat than to abort.
Namhyung Kim [Mon, 19 May 2014 05:19:30 +0000 (14:19 +0900)]
perf tools: Get rid of obsolete hist_entry__sort_list
Now we moved to the perf_hpp_[_sort]_list so no need to keep the old
hist_entry__sort_list and sort__first_dimension. Also the
hist_entry__sort_snprintf() can be gone as hist_entry__snprintf()
provides the functionality.
Namhyung Kim [Wed, 16 Apr 2014 02:16:33 +0000 (11:16 +0900)]
perf report/tui: Fix a bug when --fields/sort is given
The hists__filter_entries() function is called when down arrow key is
pressed for navigating through the entries in TUI. It has a check for
filtering out entries that have very small overhead (under min_pcnt).
However it just assumed the entries are sorted by the overhead so when
it saw such a small overheaded entry, it just stopped navigating as an
optimization. But it's not true anymore due to new --fields and
--sort optoin behavior and this case users cannot go down to a next
entry if ther's an entry with small overhead in-between.
Namhyung Kim [Tue, 4 Mar 2014 02:01:41 +0000 (11:01 +0900)]
perf tools: Add ->sort() member to struct sort_entry
Currently, what the sort_entry does is just identifying hist entries
so that they can be grouped properly. However, with -F option
support, it indeed needs to sort entries appropriately to be shown to
users. So add ->sort() member to do it.
Namhyung Kim [Tue, 4 Mar 2014 01:46:34 +0000 (10:46 +0900)]
perf report: Add -F option to specify output fields
The -F/--fields option is to allow user setup output field in any
order. It can receive any sort keys and following (hpp) fields:
overhead, overhead_sys, overhead_us, sample and period
If guest profiling is enabled, overhead_guest_{sys,us} will be
available too.
The output fields also affect sort order unless you give -s/--sort
option. And any keys specified on -s option, will also be added to
the output field list automatically.
Namhyung Kim [Wed, 16 Apr 2014 02:04:51 +0000 (11:04 +0900)]
perf tools: Call perf_hpp__init() before setting up GUI browsers
So that it can be set properly prior to set up output fields. That
makes easy to handle/warn errors during the setup since it doesn't
need to be bothered with the GUI.
Namhyung Kim [Tue, 18 Mar 2014 02:31:39 +0000 (11:31 +0900)]
perf tools: Consolidate management of default sort orders
The perf uses different default sort orders for different use-cases,
and this was scattered throughout the code. Add get_default_sort_
order() function to handle this and change initial value of sort_order
to NULL to distinguish it from user-given one.
Namhyung Kim [Mon, 3 Mar 2014 08:05:19 +0000 (17:05 +0900)]
perf ui: Get rid of callback from __hpp__fmt()
The callback was used by TUI for determining color of folded sign
using percent of first field/column. But it cannot be used anymore
since it now support dynamic reordering of output field.
So move the logic to the hist_browser__show_entry().
Namhyung Kim [Mon, 3 Mar 2014 07:16:20 +0000 (16:16 +0900)]
perf tools: Consolidate output field handling to hpp format routines
Until now the hpp and sort functions do similar jobs different ways.
Since the sort functions converted/wrapped to hpp formats it can do
the job in a uniform way.
The perf_hpp__sort_list has a list of hpp formats to sort entries and
the perf_hpp__list has a list of hpp formats to print output result.
To have a backward compatibility, it automatically adds 'overhead'
field in front of sort list. And then all of fields in sort list
added to the output list (if it's not already there).
Stephane Eranian [Thu, 15 May 2014 15:56:44 +0000 (17:56 +0200)]
fix Haswell precise store data source encoding
This patch fixes a bug in precise_store_data_hsw() whereby
it would set the data source memory level to the wrong value.
As per the the SDM Vol 3b Table 18-41 (Layout of Data Linear
Address Information in PEBS Record), when status bit 0 is set
this is a L1 hit, otherwise this is a L1 miss.
This patch encodes the memory level according to the specification.
In V2, we added the filtering on the store events.
Only the following events produce L1 information:
* MEM_UOPS_RETIRED.STLB_MISS_STORES
* MEM_UOPS_RETIRED.LOCK_STORES
* MEM_UOPS_RETIRED.SPLIT_STORES
* MEM_UOPS_RETIRED.ALL_STORES
Cc: mingo@elte.hu Cc: acme@ghostprotocols.net Cc: jolsa@redhat.com Cc: jmario@redhat.com Cc: ak@linux.intel.com Tested-and-Reviewed-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140515155644.GA3884@quad Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Peter Zijlstra [Wed, 23 Apr 2014 10:22:54 +0000 (12:22 +0200)]
perf: Fix perf_event_open(.flags) test
Vince noticed that we test the (unsigned long) flags field against an
(unsigned int) constant. This would allow setting the high bits on 64bit
platforms and not get an error.
There is nothing that uses the high bits, so it should be entirely
harmless, but we don't want userspace to accidentally set them anyway,
so fix the constants.
Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140423102254.GL11096@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Jean Pihet [Fri, 16 May 2014 08:41:11 +0000 (10:41 +0200)]
perf tests: Add dwarf unwind test on ARM
Adding dwarf unwind test, that setups live machine data over
the perf test thread and does the remote unwind.
Need to use -fno-optimize-sibling-calls for test compilation,
otherwise 'krava_*' function calls are optimized into jumps
and omitted from the stack unwind.
So far it was enabled only for x86.
Signed-off-by: Jean Pihet <jean.pihet@linaro.org> Reviewed-by: Will Deacon <will.deacon@arm.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1400229672-16104-3-git-send-email-jean.pihet@linaro.org Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Masanari Iida [Wed, 14 May 2014 17:13:38 +0000 (02:13 +0900)]
perf session: Fix possible null pointer dereference in session.c
cppcheck detected following warning:
[tools/perf/util/session.c:1628] -> [tools/perf/util/session.c:1632]:
(warning) Possible null pointer dereference: session - otherwise it
is redundant to check it against null.
In order to avoide null pointer, check the pointer before use.
Dongsheng Yang [Tue, 13 May 2014 01:38:21 +0000 (10:38 +0900)]
perf sched: Remove nr_state_machine_bugs in perf latency
As we do not use .success in sched_wakeup event any more, then
we can not guarantee that the task when wakeup event happen is
out of run queue. So the message of nr_state_machine_bugs is
not correct.
Peter Zijlstra [Mon, 12 May 2014 18:19:46 +0000 (20:19 +0200)]
perf tools: Remove usage of trace_sched_wakeup(.success)
trace_sched_wakeup(.success) is a dead argument and has been for ages,
the only reason its still there is because of brain dead software, which
apparently includes perf tools
There's a few more instances in pearly snake shit, but that's not
supported as far as I care anyhow, so let that bitrot.
Namhyung Kim [Mon, 12 May 2014 00:56:42 +0000 (09:56 +0900)]
perf tools: Use tid for finding thread
I believe that passing pid (instead of tid) as the 3rd arg of the
machine__find*_thread() was to find a main thread so that it can
search proper map group for symbols. However with the map sharing
patch applied, it now can do it in any thread.
It fixes a bug when each thread has different name, it only reports a
main thread for samples in other threads.
Cc: Adrian Hunter <adrian.hunter@intel.com> Acked-by: David Ahern <dsahern@gmail.com> Acked-by: Stephane Eranian <eranian@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Link: http://lkml.kernel.org/r/1399856202-26221-1-git-send-email-namhyung@kernel.org Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Namhyung Kim [Mon, 12 May 2014 00:47:24 +0000 (09:47 +0900)]
perf record: Propagate exit status of a command line workload
Currently perf record doesn't propagate the exit status of a workload
given by the command line. But sometimes it'd useful if it's
propagated so that a monitoring script can handle errors
appropriately.
To do that, it moves most of logic out of the exit handlers and run
them directly in the __cmd_record(). The only thing needs to be done
in the handler is propagating terminating signal so that the shell can
terminate its loop properly when Ctrl-C was pressed. Also it cleaned
up the resource management code in record__exit().
With this change, perf record returns the child exit status in case of
normal termination and send signal to itself when terminated by signal.
Example run of Stephane's case:
$ perf record true && echo yes || echo no
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.013 MB perf.data (~589 samples) ]
yes
$ perf record false && echo yes || echo no
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.013 MB perf.data (~589 samples) ]
no
Jiri's case (error in parent):
$ perf record -m 10G true && echo yes || echo no
rounding mmap pages size to 17179869184 bytes (4194304 pages)
failed to mmap with 12 (Cannot allocate memory)
no
$ ulimit -n 6
$ perf record sleep 1 && echo yes || echo no
failed to create 'go' pipe: Too many open files
Couldn't run the workload!
no
And Peter's case (interrupted by signal):
$ while :; do perf record sleep 1; done
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.014 MB perf.data (~593 samples) ]
Reported-by: Stephane Eranian <eranian@google.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Acked-by: Stephane Eranian <eranian@google.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@kernel.org> Link: http://lkml.kernel.org/r/1399855645-25815-1-git-send-email-namhyung@kernel.org Signed-off-by: Jiri Olsa <jolsa@kernel.org>
But B0, which is explained as swapper:0 did not appear in the
left part of output. Instead, we use '.' as the shortname of
swapper:0. So the comment of "B0 => swapper:0" is not easy to
understand.
This patch clarify the output of perf sched map with not allocating
one letter-number shortname for swapper:0 and print ". => swapper:0"
as the explanation for swapper:0.
Dongsheng [Mon, 5 May 2014 07:05:53 +0000 (16:05 +0900)]
perf tools: Add missing event for perf sched record.
We should record and process sched:sched_wakeup_new event in
perf sched tool, but currently, there is the process function
for it, without recording it in record subcommand.
This patch add -e sched:sched_wakeup_new to perf sched record.
Peter Zijlstra [Mon, 5 May 2014 10:11:24 +0000 (12:11 +0200)]
perf: Rework free paths
Primarily make perf_event_release_kernel() into put_event(), this will
allow kernel space to create per-task inherited events, and is safer
in general.
Peter Zijlstra [Mon, 5 May 2014 09:41:02 +0000 (11:41 +0200)]
perf: Always destroy groups on exit
Commit 38b435b16c36 ("perf: Fix tear-down of inherited group events")
states that we need to destroy groups for inherited events, but it
doesn't make any sense to not also destroy groups for normal events.
And while it usually makes no difference (the normal events won't
leak, and its very likely all the group events will die in quick
succession) it does make the code more consistent and closes a
potential hole for trouble.
Peter Zijlstra [Tue, 6 May 2014 07:59:34 +0000 (09:59 +0200)]
perf: Ensure consistent inherit state in groups
Make sure all events in a group have the same inherit state. It was
possible for group leaders to have inherit set while sibling events
would not have inherit set.
In this case we'd still inherit the siblings, leading to some
non-fatal weirdness.
Now, assuming the event is a sibling, it will be 'unreachable' for
things like ctx_sched_out() because that iterates the
groups->siblings, and we just unhooked the sibling.
So, if during <hole> we get ctx_sched_out(), it will miss the event
and not call event_sched_out() on it, leaving it programmed on the
PMU.
The subsequent perf_remove_from_context() call will find the ctx is
inactive and only call list_del_event() to remove the event from all
other lists.
Hereafter we can proceed to free the event; while still programmed!
Close this hole by moving perf_group_detach() inside the same
ctx->lock region(s) perf_remove_from_context() has.
The condition on inherited events only in __perf_event_exit_task() is
likely complete crap because non-inherited events are part of groups
too and we're tearing down just the same. But leave that for another
patch.
Most-likely-Fixes: e03a9a55b4e ("perf: Change close() semantics for group events") Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Much-staring-at-traces-by: Vince Weaver <vincent.weaver@maine.edu> Much-staring-at-traces-by: Thomas Gleixner <tglx@linutronix.de> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140505093124.GN17778@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linus Torvalds [Tue, 6 May 2014 20:07:41 +0000 (13:07 -0700)]
Merge branch 'akpm' (incoming from Andrew)
Merge misc fixes from Andrew Morton:
"13 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
agp: info leak in agpioc_info_wrap()
fs/affs/super.c: bugfix / double free
fanotify: fix -EOVERFLOW with large files on 64-bit
slub: use sysfs'es release mechanism for kmem_cache
revert "mm: vmscan: do not swap anon pages just because free+file is low"
autofs: fix lockref lookup
mm: filemap: update find_get_pages_tag() to deal with shadow entries
mm/compaction: make isolate_freepages start at pageblock boundary
MAINTAINERS: zswap/zbud: change maintainer email address
mm/page-writeback.c: fix divide by zero in pos_ratio_polynom
hugetlb: ensure hugepage access is denied if hugepages are not supported
slub: fix memcg_propagate_slab_attrs
drivers/rtc/rtc-pcf8523.c: fix month definition
Dan Carpenter [Tue, 6 May 2014 19:50:12 +0000 (12:50 -0700)]
agp: info leak in agpioc_info_wrap()
On 64 bit systems the agp_info struct has a 4 byte hole between
->agp_mode and ->aper_base. We need to clear it to avoid disclosing
stack information to userspace.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 842a859db26b ("affs: use ->kill_sb() to simplify ->put_super()
and failure exits of ->mount()") adds .kill_sb which frees sbi but
doesn't remove sbi free in case of parse_options error causing double
free+random crash.
Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> [3.14.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Will Woods [Tue, 6 May 2014 19:50:10 +0000 (12:50 -0700)]
fanotify: fix -EOVERFLOW with large files on 64-bit
On 64-bit systems, O_LARGEFILE is automatically added to flags inside
the open() syscall (also openat(), blkdev_open(), etc). Userspace
therefore defines O_LARGEFILE to be 0 - you can use it, but it's a
no-op. Everything should be O_LARGEFILE by default.
But: when fanotify does create_fd() it uses dentry_open(), which skips
all that. And userspace can't set O_LARGEFILE in fanotify_init()
because it's defined to 0. So if fanotify gets an event regarding a
large file, the read() will just fail with -EOVERFLOW.
This patch adds O_LARGEFILE to fanotify_init()'s event_f_flags on 64-bit
systems, using the same test as open()/openat()/etc.
Signed-off-by: Will Woods <wwoods@redhat.com> Acked-by: Eric Paris <eparis@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sysfs has a release mechanism. Use that to release the kmem_cache
structure if CONFIG_SYSFS is enabled.
Only slub is changed - slab currently only supports /proc/slabinfo and
not /sys/kernel/slab/*. We talked about adding that and someone was
working on it.
[akpm@linux-foundation.org: fix CONFIG_SYSFS=n build]
[akpm@linux-foundation.org: fix CONFIG_SYSFS=n build even more] Signed-off-by: Christoph Lameter <cl@linux.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Greg KH <greg@kroah.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Pekka Enberg <penberg@kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Bart Van Assche <bvanassche@acm.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Tue, 6 May 2014 19:50:07 +0000 (12:50 -0700)]
revert "mm: vmscan: do not swap anon pages just because free+file is low"
This reverts commit 0bf1457f0cfc ("mm: vmscan: do not swap anon pages
just because free+file is low") because it introduced a regression in
mostly-anonymous workloads, where reclaim would become ineffective and
trap every allocating task in direct reclaim.
The problem is that there is a runaway feedback loop in the scan balance
between file and anon, where the balance tips heavily towards a tiny
thrashing file LRU and anonymous pages are no longer being looked at.
The commit in question removed the safe guard that would detect such
situations and respond with forced anonymous reclaim.
This commit was part of a series to fix premature swapping in loads with
relatively little cache, and while it made a small difference, the cure
is obviously worse than the disease. Revert it.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: <stable@kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ian Kent [Tue, 6 May 2014 19:50:06 +0000 (12:50 -0700)]
autofs: fix lockref lookup
autofs needs to be able to see private data dentry flags for its dentrys
that are being created but not yet hashed and for its dentrys that have
been rmdir()ed but not yet freed. It needs to do this so it can block
processes in these states until a status has been returned to indicate
the given operation is complete.
It does this by keeping two lists, active and expring, of dentrys in
this state and uses ->d_release() to keep them stable while it checks
the reference count to determine if they should be used.
But with the recent lockref changes dentrys being freed sometimes don't
transition to a reference count of 0 before being freed so autofs can
occassionally use a dentry that is invalid which can lead to a panic.
Signed-off-by: Ian Kent <raven@themaw.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1343 /*
1344 * This function is never used on a shmem/tmpfs
1345 * mapping, so a swap entry won't be found here.
1346 */
1347 BUG();
After commit 0cd6144aadd2 ("mm + fs: prepare for non-page entries in
page cache radix trees") this comment and BUG() are out of date because
exceptional entries can now appear in all mappings - as shadows of
recently evicted pages.
However, as Hugh Dickins notes,
"it is truly surprising for a PAGECACHE_TAG_WRITEBACK (and probably
any other PAGECACHE_TAG_*) to appear on an exceptional entry.
I expect it comes down to an occasional race in RCU lookup of the
radix_tree: lacking absolute synchronization, we might sometimes
catch an exceptional entry, with the tag which really belongs with
the unexceptional entry which was there an instant before."
And indeed, not only is the tree walk lockless, the tags are also read
in chunks, one radix tree node at a time. There is plenty of time for
page reclaim to swoop in and replace a page that was already looked up
as tagged with a shadow entry.
Remove the BUG() and update the comment. While reviewing all other
lookup sites for whether they properly deal with shadow entries of
evicted pages, update all the comments and fix memcg file charge moving
to not miss shmem/tmpfs swapcache pages.
Fixes: 0cd6144aadd2 ("mm + fs: prepare for non-page entries in page cache radix trees") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Dave Jones <davej@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Tue, 6 May 2014 19:50:03 +0000 (12:50 -0700)]
mm/compaction: make isolate_freepages start at pageblock boundary
The compaction freepage scanner implementation in isolate_freepages()
starts by taking the current cc->free_pfn value as the first pfn. In a
for loop, it scans from this first pfn to the end of the pageblock, and
then subtracts pageblock_nr_pages from the first pfn to obtain the first
pfn for the next for loop iteration.
This means that when cc->free_pfn starts at offset X rather than being
aligned on pageblock boundary, the scanner will start at offset X in all
scanned pageblock, ignoring potentially many free pages. Currently this
can happen when
a) zone's end pfn is not pageblock aligned, or
b) through zone->compact_cached_free_pfn with CONFIG_HOLES_IN_ZONE
enabled and a hole spanning the beginning of a pageblock
This patch fixes the problem by aligning the initial pfn in
isolate_freepages() to pageblock boundary. This also permits replacing
the end-of-pageblock alignment within the for loop with a simple
pageblock_nr_pages increment.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Heesub Shin <heesub.shin@samsung.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Dongjun Shin <d.j.shin@samsung.com> Cc: Sunghwan Yun <sunghwan.yun@samsung.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rik van Riel [Tue, 6 May 2014 19:50:01 +0000 (12:50 -0700)]
mm/page-writeback.c: fix divide by zero in pos_ratio_polynom
It is possible for "limit - setpoint + 1" to equal zero, after getting
truncated to a 32 bit variable, and resulting in a divide by zero error.
Using the fully 64 bit divide functions avoids this problem. It also
will cause pos_ratio_polynom() to return the correct value when
(setpoint - limit) exceeds 2^32.
Also uninline pos_ratio_polynom, at Andrew's request.
hugetlb: ensure hugepage access is denied if hugepages are not supported
Currently, I am seeing the following when I `mount -t hugetlbfs /none
/dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's
related to the fact that hugetlbfs is properly not correctly setting
itself up in this state?:
Unable to handle kernel paging request for data at address 0x00000031
Faulting instruction address: 0xc000000000245710
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA pSeries
....
In KVM guests on Power, in a guest not backed by hugepages, we see the
following:
HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages
are not supported at boot-time, but this is only checked in
hugetlb_init(). Extract the check to a helper function, and use it in a
few relevant places.
This does make hugetlbfs not supported (not registered at all) in this
environment. I believe this is fine, as there are no valid hugepages
and that won't change at runtime.
[akpm@linux-foundation.org: use pr_info(), per Mel]
[akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined] Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After creating a cache for a memcg we should initialize its sysfs attrs
with the values from its parent. That's what memcg_propagate_slab_attrs
is for. Currently it's broken - we clearly muddled root-vs-memcg caches
there. Let's fix it up.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Tue, 6 May 2014 19:22:20 +0000 (12:22 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
"dcache fixes + kvfree() (uninlined, exported by mm/util.c) + posix_acl
bugfix from hch"
The dcache fixes are for a subtle LRU list corruption bug reported by
Miklos Szeredi, where people inside IBM saw list corruptions with the
LTP/host01 test.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
nick kvfree() from apparmor
posix_acl: handle NULL ACL in posix_acl_equiv_mode
dcache: don't need rcu in shrink_dentry_list()
more graceful recovery in umount_collect()
don't remove from shrink list in select_collect()
dentry_kill(): don't try to remove from shrink list
expand the call of dentry_lru_del() in dentry_kill()
new helper: dentry_free()
fold try_prune_one_dentry()
fold d_kill() and d_free()
fix races between __d_instantiate() and checks of dentry flags
posix_acl: handle NULL ACL in posix_acl_equiv_mode
Various filesystems don't bother checking for a NULL ACL in
posix_acl_equiv_mode, and thus can dereference a NULL pointer when it
gets passed one. This usually happens from the NFS server, as the ACL tools
never pass a NULL ACL, but instead of one representing the mode bits.
Instead of adding boilerplat to all filesystems put this check into one place,
which will allow us to remove the check from other filesystems as well later
on.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Ben Greear <greearb@candelatech.com> Reported-by: Marco Munderloh <munderl@tnt.uni-hannover.de>, Cc: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Tue, 6 May 2014 16:09:35 +0000 (09:09 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"This adds ctime update in the new cached writeback mode and also
fixes/simplifies the mtime update handling. Support for rename flags
(aka renameat2) is also added to the userspace API"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: add renameat2 support
fuse: clear MS_I_VERSION
fuse: clear FUSE_I_CTIME_DIRTY flag on setattr
fuse: trust kernel i_ctime only
fuse: remove .update_time
fuse: allow ctime flushing to userspace
fuse: fuse: add time_gran to INIT_OUT
fuse: add .write_inode
fuse: clean up fsync
fuse: fuse: fallocate: use file_update_time()
fuse: update mtime on open(O_TRUNC) in atomic_o_trunc mode
fuse: update mtime on truncate(2)
fuse: do not use uninitialized i_mode
fuse: fix mtime update error in fsync
fuse: check fallocate mode
fuse: add __exit to fuse_ctl_cleanup
Pull sparc fixes from David Miller:
"I've been auditing the THP support on sparc64 and found several bugs,
hopefully most of which are fixed completely here.
Also an RT kernel locking fix from Kirill Tkhai"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
sparc64: Give more detailed information in {pgd,pmd}_ERROR() and kill pte_ERROR().
sparc64: Add basic validations to {pud,pmd}_bad().
sparc64: Use 'ILOG2_4MB' instead of constant '22'.
sparc64: Fix range check in kern_addr_valid().
sparc64: Fix top-level fault handling bugs.
sparc64: Handle 32-bit tasks properly in compute_effective_address().
sparc64: Don't use _PAGE_PRESENT in pte_modify() mask.
sparc64: Fix hex values in comment above pte_modify().
sparc64: Fix bugs in get_user_pages_fast() wrt. THP.
sparc64: Fix huge PMD invalidation.
sparc64: Fix executable bit testing in set_pmd_at() paths.
sparc64: Normalize NMI watchdog logging and behavior.
sparc64: Make itc_sync_lock raw
sparc64: Fix argument sign extension for compat_sys_futex().
David Miller [Mon, 5 May 2014 20:20:04 +0000 (16:20 -0400)]
slab: Fix off by one in object max number tests.
If freelist_idx_t is a byte, SLAB_OBJ_MAX_NUM should be 255 not 256, and
likewise if freelist_idx_t is a short, then it should be 65535 not
65536.
This was leading to all kinds of random crashes on sparc64 where
PAGE_SIZE is 8192. One problem shown was that if spinlock debugging was
enabled, we'd get deadlocks in copy_pte_range() or do_wp_page() with the
same cpu already holding a lock it shouldn't hold, or the lock belonging
to a completely unrelated process.
Fixes: a41adfaa23df ("slab: introduce byte sized index for the freelist of a slab") Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joonsoo Kim [Fri, 18 Apr 2014 07:24:09 +0000 (16:24 +0900)]
slab: fix the type of the index on freelist index accessor
Commit a41adfaa23df ("slab: introduce byte sized index for the freelist
of a slab") changes the size of freelist index and also changes
prototype of accessor function to freelist index. And there was a
mistake.
The mistake is that although it changes the size of freelist index
correctly, it changes the size of the index of freelist index
incorrectly. With patch, freelist index can be 1 byte or 2 bytes, that
means that num of object on on a slab can be more than 255. So we need
more than 1 byte for the index to find the index of free object on
freelist. But, above patch makes this index type 1 byte, so slab which
have more than 255 objects cannot work properly and in consequence of
it, the system cannot boot.
This issue was reported by Steven King on m68knommu which would use
2 bytes freelist index:
https://lkml.org/lkml/2014/4/16/433
To fix is easy. To change the type of the index of freelist index on
accessor functions is enough to fix this bug. Although 2 bytes is
enough, I use 4 bytes since it have no bad effect and make things more
easier. This fix was suggested and tested by Steven in his original
report.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-and-acked-by: Steven King <sfking@fdwdc.com> Acked-by: Christoph Lameter <cl@linux.com> Tested-by: James Hogan <james.hogan@imgtec.com> Tested-by: David Miller <davem@davemloft.net> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2) ns_capable() check in sock_diag netlink code, from Andrew
Lutomirski.
3) Fix invalid queue pairs handling in virtio_net, from Amos Kong.
4) Checksum offloading busted in sxgbe driver due to incorrect
descriptor layout, fix from Byungho An.
5) Fix build failure with SMC_DEBUG set to 2 or larger, from Zi Shen
Lim.
6) Fix uninitialized A and X registers in BPF interpreter, from Alexei
Starovoitov.
7) Fix arch dependencies of candence driver.
8) Fix netlink capabilities checking tree-wide, from Eric W Biederman.
9) Don't dump IFLA_VF_PORTS if netlink request didn't ask for it in
IFLA_EXT_MASK, from David Gibson.
10) IPV6 FIB dump restart doesn't handle table changes that happen
meanwhile, causing the code to loop forever or emit dups, fix from
Kumar Sandararajan.
11) Memory leak on VF removal in bnx2x, from Yuval Mintz.
12) Bug fixes for new Altera TSE driver from Vince Bridgers.
13) Fix route lookup key in SCTP, from Xugeng Zhang.
14) Use BH blocking spinlocks in SLIP, as per a similar fix to CAN/SLCAN
driver. From Oliver Hartkopp.
15) TCP doesn't bump retransmit counters in some code paths, fix from
Eric Dumazet.
16) Clamp delayed_ack in tcp_cubic to prevent theoretical divides by
zero. Fix from Liu Yu.
17) Fix locking imbalance in error paths of HHF packet scheduler, from
John Fastabend.
18) Properly reference the transport module when vsock_core_init() runs,
from Andy King.
19) Fix buffer overflow in cdc_ncm driver, from Bjørn Mork.
20) IP_ECN_decapsulate() doesn't see a correct SKB network header in
ip_tunnel_rcv(), fix from Ying Cai.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (132 commits)
net: macb: Fix race between HW and driver
net: macb: Remove 'unlikely' optimization
net: macb: Re-enable RX interrupt only when RX is done
net: macb: Clear interrupt flags
net: macb: Pass same size to DMA_UNMAP as used for DMA_MAP
ip_tunnel: Set network header properly for IP_ECN_decapsulate()
e1000e: Restrict MDIO Slow Mode workaround to relevant parts
e1000e: Fix issue with link flap on 82579
e1000e: Expand workaround for 10Mb HD throughput bug
e1000e: Workaround for dropped packets in Gig/100 speeds on 82579
net/mlx4_core: Don't issue PCIe speed/width checks for VFs
net/mlx4_core: Load the Eth driver first
net/mlx4_core: Fix slave id computation for single port VF
net/mlx4_core: Adjust port number in qp_attach wrapper when detaching
net: cdc_ncm: fix buffer overflow
Altera TSE: ALTERA_TSE should depend on HAS_DMA
vsock: Make transport the proto owner
net: sched: lock imbalance in hhf qdisc
net: mvmdio: Check for a valid interrupt instead of an error
net phy: Check for aneg completion before setting state to PHY_RUNNING
...
Linus Torvalds [Mon, 5 May 2014 22:51:17 +0000 (15:51 -0700)]
Merge tag 'usb-3.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small fixes and device ids for 3.15-rc4.
All have been in linux-next just fine"
* tag 'usb-3.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: Nokia 5300 should be treated as unusual dev
USB: Nokia 305 should be treated as unusual dev
fsl-usb: do not test for PHY_CLK_VALID bit on controller version 1.6
usb: storage: shuttle_usbat: fix discs being detected twice
usb: qcserial: add a number of Dell devices
USB: OHCI: fix problem with global suspend on ATI controllers
usb: gadget: at91-udc: fix irq and iomem resource retrieval
usb: phy: fsm: change "|" to "||" for condition OTG_STATE_A_WAIT_BCON at statemachine
usb: phy: fsm: update OTG HNP state transition
Linus Torvalds [Mon, 5 May 2014 22:50:16 +0000 (15:50 -0700)]
Merge tag 'tty-3.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are some tty and serial driver fixes for things reported
recently"
* tag 'tty-3.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
tty: Fix lockless tty buffer race
Revert "tty: Fix race condition between __tty_buffer_request_room and flush_to_ldisc"
drivers/tty/hvc: don't free hvc_console_setup after init
n_tty: Fix n_tty_write crash when echoing in raw mode
tty: serial: 8250_core.c Bug fix for Exar chips.
Linus Torvalds [Mon, 5 May 2014 22:49:38 +0000 (15:49 -0700)]
Merge tag 'staging-3.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging / iio fixes from Greg KH:
"Here are some small IIO driver fixes for 3.15-rc4 that resolve some
reported issues"
* tag 'staging-3.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
iio: adc: Nothing in ADC should be a bool CONFIG
iio: exynos_adc: use indio_dev->dev structure to handle child nodes
iio:imu:mpu6050: Fixed segfault in Invensens MPU driver due to null dereference
staging:iio:ad2s1200 fix missing parenthesis in a for statment.
Linus Torvalds [Mon, 5 May 2014 22:17:02 +0000 (15:17 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
"First, there is a critical fix for the new primary-affinity function
that went into -rc1.
The second batch of patches from Zheng fix a range of problems with
directory fragmentation, readdir, and a few odds and ends for cephfs"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: reserve caps for file layout/lock MDS requests
ceph: avoid releasing caps that are being used
ceph: clear directory's completeness when creating file
libceph: fix non-default values check in apply_primary_affinity()
ceph: use fpos_cmp() to compare dentry positions
ceph: check directory's completeness before emitting directory entry
Soren Brinkmann [Sun, 4 May 2014 22:43:02 +0000 (15:43 -0700)]
net: macb: Fix race between HW and driver
Under "heavy" RX load, the driver cannot handle the descriptors fast
enough. In detail, when a descriptor is consumed, its used flag is
cleared and once the RX budget is consumed all descriptors with a
cleared used flag are prepared to receive more data. Under load though,
the HW may constantly receive more data and use those descriptors with a
cleared used flag before they are actually prepared for next usage.
The head and tail pointers into the RX-ring should always be valid and
we can omit clearing and checking of the used flag.
Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Soren Brinkmann [Sun, 4 May 2014 22:43:01 +0000 (15:43 -0700)]
net: macb: Remove 'unlikely' optimization
Coverage data suggests that the unlikely case of receiving data while
the receive handler is running may not be that unlikely.
Coverage data after running iperf for a while:
91320: 891: work_done = bp->macbgem_ops.mog_rx(bp, budget);
91320: 892: if (work_done < budget) {
2362: 893: napi_complete(napi);
-: 894:
-: 895: /* Packets received while interrupts were disabled */
4724: 896: status = macb_readl(bp, RSR);
2362: 897: if (unlikely(status)) {
762: 898: if (bp->caps & MACB_CAPS_ISR_CLEAR_ON_WRITE)
762: 899: macb_writel(bp, ISR, MACB_BIT(RCOMP));
-: 900: napi_reschedule(napi);
-: 901: } else {
1600: 902: macb_writel(bp, IER, MACB_RX_INT_FLAGS);
-: 903: }
-: 904: }
Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>