Wu Fengguang [Tue, 11 Oct 2011 23:06:33 +0000 (17:06 -0600)]
writeback: fix ppc compile warnings on do_div(long long, unsigned long)
Fix powerpc compile warnings
mm/page-writeback.c: In function 'bdi_position_ratio':
mm/page-writeback.c:622:3: warning: comparison of distinct pointer types lacks a cast [enabled by default]
page-writeback.c:635:4: warning: comparison of distinct pointer types lacks a cast [enabled by default]
Also fix gcc "uninitialized var" warnings.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Wu Fengguang [Thu, 18 Nov 2010 20:38:33 +0000 (14:38 -0600)]
writeback: per-bdi background threshold
One thing puzzled me is that in JBOD case, the per-disk writeout
performance is smaller than the corresponding single-disk case even
when they have comparable bdi_thresh. Tracing shows find that in single
disk case, bdi_writeback is always kept high while in JBOD case, it
could drop low from time to time and correspondingly bdi_reclaimable
could sometimes rush high.
The fix is to watch bdi_reclaimable and kick background writeback as
soon as it goes high. This resembles the global background threshold
but in per-bdi manner. The trick is, as long as bdi_reclaimable does
not go high, bdi_writeback naturally won't go low because
bdi_reclaimable+bdi_writeback ~= bdi_thresh.
With less fluctuated writeback pages, JBOD performance is observed to
increase noticeably in various cases.
Wu Fengguang [Fri, 5 Aug 2011 04:16:46 +0000 (22:16 -0600)]
writeback: dirty position control - bdi reserve area
Keep a minimal pool of dirty pages for each bdi, so that the disk IO
queues won't underrun. Also gently increase a small bdi_thresh to avoid
it stuck in 0 for some light dirtied bdi.
It's particularly useful for JBOD and small memory system.
It may result in (pos_ratio > 1) at the setpoint and push the dirty
pages high. This is more or less intended because the bdi is in the
danger of IO queue underflow.
Here dirty_ratelimit is preferred over task_ratelimit because it's
more stable.
It's also important to limit possible large transitional errors:
- bw is changing quickly
- pages_dirtied << nr_dirtied_pause on entering dirty exceeded area
- pages_dirtied >> nr_dirtied_pause on btrfs (to be improved by a
separate fix, but still expect non-trivial errors)
So we end up using the above formula inside clamp_val().
The best test case for this code is to run 100 "dd bs=4M" tasks on
btrfs and check its pause time distribution.
Wu Fengguang [Sun, 12 Jun 2011 01:21:43 +0000 (19:21 -0600)]
writeback: limit max dirty pause time
Apply two policies to scale down the max pause time for
1) small number of concurrent dirtiers
2) small memory system (comparing to storage bandwidth)
MAX_PAUSE=200ms may only be suitable for high end servers with lots of
concurrent dirtiers, where the large pause time can reduce much overheads.
Otherwise, smaller pause time is desirable whenever possible, so as to
get good responsiveness and smooth user experiences. It's actually
required for good disk utilization in the case when all the dirty pages
can be synced to disk within MAX_PAUSE=200ms.
Wu Fengguang [Sat, 28 Aug 2010 00:45:12 +0000 (18:45 -0600)]
writeback: IO-less balance_dirty_pages()
As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.
RATIONALS
=========
- disk seeks on concurrent writeback of multiple inodes (Dave Chinner)
If every thread doing writes and being throttled start foreground
writeback, it leads to N IO submitters from at least N different
inodes at the same time, end up with N different sets of IO being
issued with potentially zero locality to each other, resulting in
much lower elevator sort/merge efficiency and hence we seek the disk
all over the place to service the different sets of IO.
OTOH, if there is only one submission thread, it doesn't jump between
inodes in the same way when congestion clears - it keeps writing to
the same inode, resulting in large related chunks of sequential IOs
being issued to the disk. This is more efficient than the above
foreground writeback because the elevator works better and the disk
seeks less.
- lock contention and cache bouncing on concurrent IO submitters (Dave Chinner)
With this patchset, the fs_mark benchmark on a 12-drive software RAID0 goes
from CPU bound to IO bound, freeing "3-4 CPUs worth of spinlock contention".
* "CPU usage has dropped by ~55%", "it certainly appears that most of
the CPU time saving comes from the removal of contention on the
inode_wb_list_lock" (IMHO at least 10% comes from the reduction of
cacheline bouncing, because the new code is able to call much less
frequently into balance_dirty_pages() and hence access the global
page states)
* the user space "App overhead" is reduced by 20%, by avoiding the
cacheline pollution by the complex writeback code path
* "for a ~5% throughput reduction", "the number of write IOs have
dropped by ~25%", and the elapsed time reduced from 41:42.17 to
40:53.23.
* On a simple test of 100 dd, it reduces the CPU %system time from 30% to 3%,
and improves IO throughput from 38MB/s to 42MB/s.
- IO size too small for fast arrays and too large for slow USB sticks
The write_chunk used by current balance_dirty_pages() cannot be
directly set to some large value (eg. 128MB) for better IO efficiency.
Because it could lead to more than 1 second user perceivable stalls.
Even the current 4MB write size may be too large for slow USB sticks.
The fact that balance_dirty_pages() starts IO on itself couples the
IO size to wait time, which makes it hard to do suitable IO size while
keeping the wait time under control.
Now it's possible to increase writeback chunk size proportional to the
disk bandwidth. In a simple test of 50 dd's on XFS, 1-HDD, 3GB ram,
the larger writeback size dramatically reduces the seek count to 1/10
(far beyond my expectation) and improves the write throughput by 24%.
- long block time in balance_dirty_pages() hurts desktop responsiveness
Many of us may have the experience: it often takes a couple of seconds
or even long time to stop a heavy writing dd/cp/tar command with
Ctrl-C or "kill -9".
- IO pipeline broken by bumpy write() progress
There are a broad class of "loop {read(buf); write(buf);}" applications
whose read() pipeline will be under-utilized or even come to a stop if
the write()s have long latencies _or_ don't progress in a constant rate.
The current threshold based throttling inherently transfers the large
low level IO completion fluctuations to bumpy application write()s,
and further deteriorates with increasing number of dirtiers and/or bdi's.
For example, when doing 50 dd's + 1 remote rsync to an XFS partition,
the rsync progresses very bumpy in legacy kernel, and throughput is
improved by 67% by this patchset. (plus the larger write chunk size,
it will be 93% speedup).
The new rate based throttling can support 1000+ dd's with excellent
smoothness, low latency and low overheads.
For the above reasons, it's much better to do IO-less and low latency
pauses in balance_dirty_pages().
Jan Kara, Dave Chinner and me explored the scheme to let
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it's found to have two problems:
- in large NUMA systems, the per-cpu counters may have big accounting
errors, leading to big throttle wait time and jitters.
- NFS may kill large amount of unstable pages with one single COMMIT.
Because NFS server serves COMMIT with expensive fsync() IOs, it is
desirable to delay and reduce the number of COMMITs. So it's not
likely to optimize away such kind of bursty IO completions, and the
resulted large (and tiny) stall times in IO completion based throttling.
So here is a pause time oriented approach, which tries to control the
pause time in each balance_dirty_pages() invocations, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:
- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than 4ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times
It can control pause times at will. The default policy (in a followup
patch) will be to do ~10ms pauses in 1-dd case, and increase to ~100ms
in 1000-dd case.
BEHAVIOR CHANGE
===============
(1) dirty threshold
Users will notice that the applications will get throttled once crossing
the global (background + dirty)/2=15% threshold, and then balanced around
17.5%. Before patch, the behavior is to just throttle it at 20% dirtyable
memory in 1-dd case.
Since the task will be soft throttled earlier than before, it may be
perceived by end users as performance "slow down" if his application
happens to dirty more than 15% dirtyable memory.
(2) smoothness/responsiveness
Users will notice a more responsive system during heavy writeback.
"killall dd" will take effect instantly.
Wu Fengguang [Sun, 12 Jun 2011 00:10:12 +0000 (18:10 -0600)]
writeback: per task dirty rate limit
Add two fields to task_struct.
1) account dirtied pages in the individual tasks, for accuracy
2) per-task balance_dirty_pages() call intervals, for flexibility
The balance_dirty_pages() call interval (ie. nr_dirtied_pause) will
scale near-sqrt to the safety gap between dirty pages and threshold.
The main problem of per-task nr_dirtied is, if 1k+ tasks start dirtying
pages at exactly the same time, each task will be assigned a large
initial nr_dirtied_pause, so that the dirty threshold will be exceeded
long before each task reached its nr_dirtied_pause and hence call
balance_dirty_pages().
The solution is to watch for the number of pages dirtied on each CPU in
between the calls into balance_dirty_pages(). If it exceeds ratelimit_pages
(3% dirty threshold), force call balance_dirty_pages() for a chance to
set bdi->dirty_exceeded. In normal situations, this safeguarding
condition is not expected to trigger at all.
On the sqrt in dirty_poll_interval():
It will serve as an initial guess when dirty pages are still in the
freerun area.
When dirty pages are floating inside the dirty control scope [freerun,
limit], a followup patch will use some refined dirty poll interval to
get the desired pause time.
The above table means, given 1MB (or 1GB) gap and the dd tasks polling
balance_dirty_pages() on every 16 (or 512) pages, the dirty limit won't
be exceeded as long as there are less than 16 (or 512) concurrent dd's.
So sqrt naturally leads to less overheads and more safe concurrent tasks
for large memory servers, which have large (thresh-freerun) gaps.
peter: keep the per-CPU ratelimit for safeguarding the 1k+ tasks case
CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Andrea Righi <andrea@betterlinux.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Wu Fengguang [Fri, 26 Aug 2011 21:53:24 +0000 (15:53 -0600)]
writeback: stabilize bdi->dirty_ratelimit
There are some imperfections in balanced_dirty_ratelimit.
1) large fluctuations
The dirty_rate used for computing balanced_dirty_ratelimit is merely
averaged in the past 200ms (very small comparing to the 3s estimation
period for write_bw), which makes rather dispersed distribution of
balanced_dirty_ratelimit.
It's pretty hard to average out the singular points by increasing the
estimation period. Considering that the averaging technique will
introduce very undesirable time lags, I give it up totally. (btw, the 3s
write_bw averaging time lag is much more acceptable because its impact
is one-way and therefore won't lead to oscillations.)
The more practical way is filtering -- most singular
balanced_dirty_ratelimit points can be filtered out by remembering some
prev_balanced_rate and prev_prev_balanced_rate. However the more
reliable way is to guard balanced_dirty_ratelimit with task_ratelimit.
2) due to truncates and fs redirties, the (write_bw <=> dirty_rate)
match could become unbalanced, which may lead to large systematical
errors in balanced_dirty_ratelimit. The truncates, due to its possibly
bumpy nature, can hardly be compensated smoothly. So let's face it. When
some over-estimated balanced_dirty_ratelimit brings dirty_ratelimit
high, dirty pages will go higher than the setpoint. task_ratelimit will
in turn become lower than dirty_ratelimit. So if we consider both
balanced_dirty_ratelimit and task_ratelimit and update dirty_ratelimit
only when they are on the same side of dirty_ratelimit, the systematical
errors in balanced_dirty_ratelimit won't be able to bring
dirty_ratelimit far away.
The balanced_dirty_ratelimit estimation may also be inaccurate near
@limit or @freerun, however is less an issue.
3) since we ultimately want to
- keep the fluctuations of task ratelimit as small as possible
- keep the dirty pages around the setpoint as long time as possible
the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (task_ratelimit < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < balanced_dirty_ratelimit),
there is no point to bring up dirty_ratelimit in a hurry only to hurt
both the above two goals.
So, we make use of task_ratelimit to limit the update of dirty_ratelimit
in two ways:
1) avoid changing dirty rate when it's against the position control target
(the adjusted rate will slow down the progress of dirty pages going
back to setpoint).
2) limit the step size. task_ratelimit is changing values step by step,
leaving a consistent trace comparing to the randomly jumping
balanced_dirty_ratelimit. task_ratelimit also has the nice smaller
errors in stable state and typically larger errors when there are big
errors in rate. So it's a pretty good limiting factor for the step
size of dirty_ratelimit.
Note that bdi->dirty_ratelimit is always tracking balanced_dirty_ratelimit.
task_ratelimit is merely used as a limiting factor.
Estimation of balanced bdi->dirty_ratelimit
===========================================
balanced task_ratelimit
-----------------------
balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.
IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth, this yields a stable amount of dirty pages:
dirty_rate == write_bw (1)
The fairness requirement gives us:
task_ratelimit = balanced_dirty_ratelimit
== write_bw / N (2)
where N is the number of dd tasks. We don't know N beforehand, but
still can estimate balanced_dirty_ratelimit within 200ms.
Start by throttling each dd task at rate
task_ratelimit = task_ratelimit_0 (3)
(any non-zero initial value is OK)
After 200ms, we measured
dirty_rate = # of pages dirtied by all dd's / 200ms
write_bw = # of pages written to the disk / 200ms
For the aggressive dd dirtiers, the equality holds
dirty_rate == N * task_rate
== N * task_ratelimit_0 (4)
Or
task_ratelimit_0 == dirty_rate / N (5)
Now we conclude that the balanced task ratelimit can be estimated by
Then using the balanced task ratelimit we can compute task pause times like:
task_pause = task->nr_dirtied / task_ratelimit
task_ratelimit with position control
------------------------------------
However, while the above gives us means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit we need another control.
The dirty position control works by extending (2) to
where pos_ratio is a negative feedback function that subjects to
1) f(setpoint) = 1.0
2) df/dx < 0
That is, if the dirty pages are ABOVE the setpoint, we throttle each
task a bit more HEAVY than balanced_dirty_ratelimit, so that the dirty
pages are created less fast than they are cleaned, thus DROP to the
setpoints (and the reverse).
Based on (7) and the assumption that both dirty_ratelimit and pos_ratio
remains CONSTANT for the past 200ms, we get
Wu Fengguang [Wed, 2 Mar 2011 22:04:18 +0000 (16:04 -0600)]
writeback: dirty position control
bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulted task rate limit can drive the dirty pages back to the
global/bdi setpoints.
Old scheme is,
|
free run area | throttle area
----------------------------------------+---------------------------->
thresh^ dirty pages
1) large enough to pull the dirty pages to setpoint reasonably fast
2) small enough to avoid big fluctuations in the resulted pos_ratio and
hence task ratelimit
Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1-second worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.
Assume the bdi control line
pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
where k is the negative slope.
If targeting for 12.5% fluctuation range in pos_ratio when dirty pages
are fluctuating in range
Let pos_ratio(x_intercept) = 0, we get the parameter used in code:
x_intercept = bdi_setpoint + 8 * write_bw
The global/bdi slopes are nicely complementing each other when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):
1) slope of global control line => scaling to the control scope size
2) slope of main bdi control line => scaling to the writeout bandwidth
so that
- in memory tight systems, (1) becomes strong enough to squeeze dirty
pages inside the control scope
- in large memory systems where the "gravity" of (1) for pulling the
dirty pages to setpoint is too weak, (2) can back (1) up and drive
dirty pages to bdi_setpoint ~= setpoint reasonably fast.
Unfortunately in JBOD setups, the fluctuation range of bdi threshold
is related to memory size due to the interferences between disks. In
this case, the bdi slope will be weighted sum of write_bw and bdi_thresh.
Introduce the BDI_DIRTIED counter. It will be used for estimating the
bdi's dirty bandwidth.
CC: Jan Kara <jack@suse.cz> CC: Michael Rubin <mrubin@google.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Linus Torvalds [Sat, 1 Oct 2011 15:37:25 +0000 (08:37 -0700)]
Merge branches 'irq-urgent-for-linus', 'x86-urgent-for-linus' and 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip
* 'irq-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
irq: Fix check for already initialized irq_domain in irq_domain_add
irq: Add declaration of irq_domain_simple_ops to irqdomain.h
* 'x86-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
x86/rtc: Don't recursively acquire rtc_lock
* 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
posix-cpu-timers: Cure SMP wobbles
sched: Fix up wchan borkage
sched/rt: Migrate equal priority tasks to available CPUs
Peter Zijlstra [Thu, 1 Sep 2011 10:42:04 +0000 (12:42 +0200)]
posix-cpu-timers: Cure SMP wobbles
David reported:
Attached below is a watered-down version of rt/tst-cpuclock2.c from
GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or
similar.
Run it several times, and you will see cases where the main thread
will measure a process clock difference before and after the nanosleep
which is smaller than the cpu-burner thread's individual thread clock
difference. This doesn't make any sense since the cpu-burner thread
is part of the top-level process's thread group.
I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
64-bit binaries).
The diff of 'process' should always be >= the diff of 'thread'.
I make sure to wrap the 'thread' clock measurements the most tightly
around the nanosleep() call, and that the 'process' clock measurements
are the outer-most ones.
This is due to us using p->se.sum_exec_runtime in
thread_group_cputime() where we iterate the thread group and sum all
data. This does not take time since the last schedule operation (tick
or otherwise) into account. We can cure this by using
task_sched_runtime() at the cost of having to take locks.
This also means we can (and must) do away with
thread_group_sched_runtime() since the modified thread_group_cputime()
is now more accurate and would deadlock when called from
thread_group_sched_runtime().
Aside of that it makes the function safe on 32 bit systems. The old
code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a
64bit value and could be changed on another cpu at the same time.
Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins Tested-by: David Miller <davem@davemloft.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
ALSA: hda - Fix a regression of the position-buffer check
The commit a810364a0424c297242c6c66071a42f7675a5568
ALSA: hda - Handle -1 as invalid position, too
caused a regression on some machines that require the position-buffer
instead of LPIB, e.g. resulting in noises with mic recording with
PulseAudio.
This patch fixes the detection by delaying the test at the timing as
same as 3.0, i.e. doing the position check only when requested in
azx_position_ok().
Ram Pai [Thu, 22 Sep 2011 07:48:58 +0000 (15:48 +0800)]
Resource: fix wrong resource window calculation
__find_resource() incorrectly returns a resource window which overlaps
an existing allocated window. This happens when the parent's
resource-window spans 0x00000000 to 0xffffffff and is entirely allocated
to all its children resource-windows.
__find_resource() looks for gaps in resource allocation among the
children resource windows. When it encounters the last child window it
blindly tries the range next to one allocated to the last child. Since
the last child's window ends at 0xffffffff the calculation overflows,
leading the algorithm to believe that any window in the range 0x0000000
to 0xfffffff is available for allocation. This leads to a conflicting
window allocation.
Michal Ludvig reported this issue seen on his platform. The following
patch fixes the problem and has been verified by Michal. I believe this
bug has been there for ages. It got exposed by git commit 2bbc6942273b
("PCI : ability to relocate assigned pci-resources")
Signed-off-by: Ram Pai <linuxram@us.ibm.com> Tested-by: Michal Ludvig <mludvig@logix.net.nz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge branch 'v4l_for_linus' of git://linuxtv.org/mchehab/for_linus
* 'v4l_for_linus' of git://linuxtv.org/mchehab/for_linus:
[media] omap3isp: Fix build error in ispccdc.c
[media] uvcvideo: Fix crash when linking entities
[media] v4l: Make sure we hold a reference to the v4l2_device before using it
[media] v4l: Fix use-after-free case in v4l2_device_release
[media] uvcvideo: Set alternate setting 0 on resume if the bus has been reset
[media] OMAP_VOUT: Fix build break caused by update_mode removal in DSS2
Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
[S390] cio: fix cio_tpi ignoring adapter interrupts
[S390] gmap: always up mmap_sem properly
[S390] Do not clobber personality flags on exec
* git://github.com/davem330/sparc:
sparc64: Force the execute bit in OpenFirmware's translation entries.
sparc: Make '-p' boot option meaningful again.
sparc, exec: remove redundant addr_limit assignment
sparc64: Future proof Niagara cpu detection.
Merge branch 'drm-intel-fixes' of git://people.freedesktop.org/~keithp/linux
* 'drm-intel-fixes' of git://people.freedesktop.org/~keithp/linux:
drm/i915: FBC off for ironlake and older, otherwise on by default
drm/i915: Enable SDVO hotplug interrupts for HDMI and DVI
drm/i915: Enable dither whenever display bpc < frame buffer bpc
powerpc: Fix device-tree matching for Apple U4 bridge
Apple Quad G5 has some oddity in it's device-tree which causes the new
generic matching code to fail to relate nodes for PCI-E devices below U4
with their respective struct pci_dev. This breaks graphics on those
machines among others.
This fixes it using a quirk which copies the node pointer from the host
bridge for the root complex, which makes the generic code work for the
children afterward.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bootup: move 'usermodehelper_enable()' a little earlier
Commit d5767c53535a ("bootup: move 'usermodehelper_enable()' to the end
of do_basic_setup()") moved 'usermodehelper_enable()' to end of
do_basic_setup() to after the initcalls. But then I get failed to let
uvesafb work on my computer, and lose the splash boot.
So maybe we could start usermodehelper_enable a little early to make
some task work that need eary init with the help of user mode.
[ I would *really* prefer that initcalls not call into user space - even
the real 'init' hasn't been execve'd yet, after all! But for uvesafb
it really does look like we don't have much choice.
I considered doing this when we mount the root filesystem, but
depending on config options that is in multiple places. We could do
the usermode helper enable as a rootfs_initcall()..
So I'm just using wang yanqing's trivial patch. It's not wonderful,
but it's simple and should work. We should revisit this some day,
though. - Linus ]
Jiri Olsa [Thu, 29 Sep 2011 15:05:08 +0000 (17:05 +0200)]
perf tools: Fix raw sample reading
Wrong pointer is being passed for raw data sanity checking, when parsing
sample event.
This ends up with invalid event and perf record being stuck in
__perf_session__process_events function during processing build IDs
(process_buildids function).
Following command hangs up in my setup:
./perf record -e raw_syscalls:sys_enter ls
The fix is to use proper pointer to the raw data instead of the 'u'
union.
Reviewed-by: David Ahern <dsahern@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1317308709-9474-2-git-send-email-jolsa@redhat.com Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
David S. Miller [Thu, 29 Sep 2011 19:18:59 +0000 (12:18 -0700)]
sparc64: Force the execute bit in OpenFirmware's translation entries.
In the OF 'translations' property, the template TTEs in the mappings
never specify the executable bit. This is the case even though some
of these mappings are for OF's code segment.
Therefore, we need to force the execute bit on in every mapping.
This problem can only really trigger on Niagara/sun4v machines and the
history behind this is a little complicated.
Previous to sun4v, the sun4u TTE entries lacked a hardware execute
permission bit. So OF didn't have to ever worry about setting
anything to handle executable pages. Any valid TTE loaded into the
I-TLB would be respected by the chip.
But sun4v Niagara chips have a real hardware enforced executable bit
in their TTEs. So it has to be set or else the I-TLB throws an
instruction access exception with type code 6 (protection violation).
We've been extremely fortunate to not get bitten by this in the past.
The best I can tell is that the OF's mappings for it's executable code
were mapped using permanent locked mappings on sun4v in the past.
Therefore, the fact that we didn't have the exec bit set in the OF
translations we would use did not matter in practice.
Thanks to Greg Onufer for helping me track this down.
Signed-off-by: David S. Miller <davem@davemloft.net>
bootup: move 'usermodehelper_enable()' to the end of do_basic_setup()
Doing it just before starting to call into cpu_idle() made a sick kind
of sense only because the original bug we fixed (see commit 288d5abec831: "Boot up with usermodehelper disabled") was about problems
with some scheduler data structures not being initialized, and they had
better be initialized at that point.
But it really didn't make any other conceptual sense, and doing it after
the initial "schedule()" call for the idle thread actually opened up a
race: what if the main initialization thread did everything without
needing to sleep, and got all the way into user land too? Without
actually having scheduled back to the idle thread?
Now, in normal circumstances that doesn't ever happen, but it looks like
Richard Cochran triggered exactly that on his ARM IXP4xx machines:
"I have some ARM IXP4xx based machines that use the two on chip MAC
ports (aka NPEs). The NPE needs a firmware in order to function.
Ever since the following commit [that 288d5abec831 one], it is no
longer possible to bring up the interfaces during the init scripts."
with a call trace showing an ioctl coming from user space. Richard says:
"The init is busybox, and the startup script does mount, syslogd, and
then ifup, so that all can go by quickly."
The fix is to move the usermodehelper_enable() into the main 'init'
thread, and just put it after we've done all our initcalls. By then,
everything really should be up, but we've obviously not actually started
the user-mode portion of init yet.
Reported-and-tested-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sage Weil [Wed, 28 Sep 2011 17:11:04 +0000 (10:11 -0700)]
libceph: fix pg_temp mapping update
The incremental map updates have a record for each pg_temp mapping that is
to be add/updated (len > 0) or removed (len == 0). The old code was
written as if the updates were a complete enumeration; that was just wrong.
Update the code to remove 0-length entries and drop the rbtree traversal.
This avoids misdirected (and hung) requests that manifest as server
errors like
[WRN] client4104 10.0.1.219:0/275025290 misdirected client4104.1:129 0.1 to osd0 not [1,0] in e11/11
Sage Weil [Wed, 28 Sep 2011 17:08:27 +0000 (10:08 -0700)]
libceph: fix pg_temp mapping calculation
We need to apply the modulo pg_num calculation before looking up a pgid in
the pg_temp mapping rbtree. This fixes pg_temp mappings, and fixes
(some) misdirected requests that result in messages like
[WRN] client4104 10.0.1.219:0/275025290 misdirected client4104.1:129 0.1 to osd0 not [1,0] in e11/11
on the server and stall make the client block without getting a reply (at
least until the pg_temp mapping goes way, but that can take a long long
time).
* git://github.com/davem330/net:
ipv6-multicast: Fix memory leak in IPv6 multicast.
ipv6: check return value for dst_alloc
net: check return value for dst_alloc
ipv6-multicast: Fix memory leak in input path.
bnx2x: add missing break in bnx2x_dcbnl_get_cap
bnx2x: fix WOL by enablement PME in config space
bnx2x: fix hw attention handling
net: fix a typo in Documentation/networking/scaling.txt
ath9k: Fix a dma warning/memory leak
rtlwifi: rtl8192cu: Fix unitialized struct
iwlagn: fix dangling scan request
batman-adv: do_bcast has to be true for broadcast packets only
cfg80211: Fix validation of AKM suites
iwlegacy: do not use interruptible waits
iwlegacy: fix command queue timeout
ath9k_hw: Fix Rx DMA stuck for AR9003 chips
* git://bedivere.hansenpartnership.com/git/scsi-rc-fixes-2.6:
[SCSI] 3w-9xxx: fix iommu_iova leak
[SCSI] cxgb3i: convert cdev->l2opt to use rcu to prevent NULL dereference
[SCSI] scsi: qla4xxx needs libiscsi.o
[SCSI] libsas: fix failure to revalidate domain for anything but the first expander child.
[SCSI] aacraid: reset should disable MSI interrupt
Hannes Reinecke [Wed, 28 Sep 2011 14:07:01 +0000 (08:07 -0600)]
block: Free queue resources at blk_release_queue()
A kernel crash is observed when a mounted ext3/ext4 filesystem is
physically removed. The problem is that blk_cleanup_queue() frees up
some resources eg by calling elevator_exit(), which are not checked for
in normal operation. So we should rather move these calls to the
destructor function blk_release_queue() as at that point all remaining
references are gone. However, in doing so we have to ensure that any
externally supplied queue_lock is disconnected as the driver might free
up the lock after the call of blk_cleanup_queue(),
Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge branch 'for-linus' of git://github.com/tiwai/sound
* 'for-linus' of git://github.com/tiwai/sound:
ASoC: ssm2602: Re-enable oscillator after suspend
ALSA: usb-audio: Check for possible chip NULL pointer before clearing probing flag
ALSA: hda/realtek - Don't detect LO jack when identical with HP
ALSA: hda/realtek - Avoid bogus HP-pin assignment
ALSA: HDA: No power nids on 92HD93
ASoC: omap-mcbsp: Do not attempt to change DAI sysclk if stream is active
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
That flag no longer makes sense, since we don't look up automount points
as eagerly any more. Additionally, it turns out that the NO_AUTOMOUNT
handling was buggy to begin with: it would avoid automounting even for
cases where we really *needed* to do the automount handling, and could
return ENOENT for autofs entries that hadn't been instantiated yet.
With our new non-eager automount semantics, one discussion has been
about adding a AT_AUTOMOUNT flag to vfs_fstatat (and thus the
newfstatat() and fstatat64() system calls), but it's probably not worth
it: you can always force at least directory automounting by simply
adding the final '/' to the filename, which works for *all* of the stat
family system calls, old and new.
So AT_NO_AUTOMOUNT (and thus LOOKUP_NO_AUTOMOUNT) really were just a
result of our bad default behavior.
Acked-by: Ian Kent <raven@themaw.net> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the the internal oscillator is powered down when entering BIAS_OFF
state, but not re-enabled when going back to BIAS_STANDBY. As a result the
CODEC will stop working after suspend if the internal oscillator is used to
generate the sysclock signal. This patch fixes it by clearing the appropriate
bit in the power down register when the CODEC is re-enabled.
Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: stable@kernel.org
VFS: Fix the remaining automounter semantics regressions
The concensus seems to be that system calls such as stat() etc should
not trigger an automount. Neither should the l* versions.
This patch therefore adds a LOOKUP_AUTOMOUNT flag to tag those lookups
that _should_ trigger an automount on the last path element.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[ Edited to leave out the cases that are already covered by LOOKUP_OPEN,
LOOKUP_DIRECTORY and LOOKUP_CREATE - all of which also fundamentally
force automounting for their own reasons - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since we've now turned around and made LOOKUP_FOLLOW *not* force an
automount, we want to add the ability to force an automount event on
lookup even if we don't happen to have one of the other flags that force
it implicitly (LOOKUP_OPEN, LOOKUP_DIRECTORY, LOOKUP_PARENT..)
Most cases will never want to use this, since you'd normally want to
delay automounting as long as possible, which usually implies
LOOKUP_OPEN (when we open a file or directory, we really cannot avoid
the automount any more).
But Trond argued sufficiently forcefully that at a minimum bind mounting
a file and quotactl will want to force the automount lookup. Some other
cases (like nfs_follow_remote_path()) could use it too, although
LOOKUP_DIRECTORY would work there as well.
This commit just adds the flag and logic, no users yet, though. It also
doesn't actually touch the LOOKUP_NO_AUTOMOUNT flag that is related, and
was made irrelevant by the same change that made us not follow on
LOOKUP_FOLLOW.
Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Ian Kent <raven@themaw.net> Cc: Jeff Layton <jlayton@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg KH <gregkh@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ARM: EXYNOS4: Rename sclk_cam clocks for FIMC driver
The sclk_cam clocks are now controlled by the top level FIMC media
device driver bound to "s5p-fimc-md" platform device.
Rename sclk_cam clocks so they accessible by the corresponding
driver.
Signed-off-by: Sylwester Nawrocki <s.nawrocki@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
ARM: S5PV210: Rename sclk_cam clocks for FIMC media driver
The sclk_cam clocks are now controlled by the top level FIMC media
device driver bound to "s5p-fimc-md" platform device.
Rename sclk_cam clocks so they accessible by the corresponding
driver.
Signed-off-by: Sylwester Nawrocki <s.nawrocki@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
Merge branch 'hwmon-for-linus' of git://github.com/groeck/linux
* 'hwmon-for-linus' of git://github.com/groeck/linux:
hwmon: (coretemp) remove struct platform_data * parameter from create_core_data()
hwmon: (coretemp) constify static data
hwmon: (coretemp) don't use kernel assigned CPU number as platform device ID
hwmon: (ds620) Fix handling of negative temperatures
hwmon: (w83791d) rename prototype parameter from 'register' to 'reg'
hwmon: (coretemp) Don't use threshold registers for tempX_max
hwmon: (coretemp) Let the user force TjMax
hwmon: (coretemp) Drop duplicate function get_pkg_tjmax
Merge branch 'fixes' of http://ftp.arm.linux.org.uk/pub/linux/arm/kernel/git-cur/linux-2.6-arm
* 'fixes' of http://ftp.arm.linux.org.uk/pub/linux/arm/kernel/git-cur/linux-2.6-arm:
ARM: 7099/1: futex: preserve oldval in SMP __futex_atomic_op
ARM: dma-mapping: free allocated page if unable to map
ARM: fix vmlinux.lds.S discarding sections
ARM: nommu: fix warning with checksyscalls.sh
ARM: 7091/1: errata: D-cache line maintenance operation by MVA may not succeed
proper dma_unmapping and freeing of skb's has to be done in the rx
cleanup for EDMA chipsets when the device is unloaded and this also
seems to address the following warning which shows up occasionally when
the device is unloaded
Larry Finger [Fri, 23 Sep 2011 03:59:02 +0000 (22:59 -0500)]
rtlwifi: rtl8192cu: Fix unitialized struct
Driver rtl8192cu assigns a new struct rtl_tcb_desc object, but fails to
clear it.
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net> Cc: Stable <stable@kernel.org> [2.6.39+] Signed-off-by: John W. Linville <linville@tuxdriver.com>
Johannes Berg [Thu, 22 Sep 2011 21:59:04 +0000 (14:59 -0700)]
iwlagn: fix dangling scan request
If iwl_scan_initiate() fails for any reason,
priv->scan_request and priv->scan_vif are left
dangling. This can lead to a crash later when
iwl_bg_scan_completed() tries to run a pending
scan request.
In practice, this seems to be very rare due to
the STATUS_SCANNING check earlier. That check,
however, is wrong -- it should allow a scan to
be queued when a reset/roc scan is going on.
When a normal scan is already going on, a new
one can't be issued by mac80211, so that code
can be removed completely. I introduced this
bug when adding off-channel support in commit 266af4c745952e9bebf687dd68af58df553cb59d.
Cc: stable@kernel.org [3.0] Reported-by: Peng Yan <peng.yan@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Wey-Yi Guy <wey-yi.w.guy@intel.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
PM / Clocks: Do not acquire a mutex under a spinlock
Commit b7ab83e (PM: Use spinlock instead of mutex in clock
management functions) introduced a regression causing clocks_mutex
to be acquired under a spinlock. This happens because
pm_clk_suspend() and pm_clk_resume() call pm_clk_acquire() under
pcd->lock, but pm_clk_acquire() executes clk_get() which causes
clocks_mutex to be acquired. Similarly, __pm_clk_remove(),
executed under pcd->lock, calls clk_put(), which also causes
clocks_mutex to be acquired.
To fix those problems make pm_clk_add() call pm_clk_acquire(), so
that pm_clk_suspend() and pm_clk_resume() don't have to do that.
Change pm_clk_remove() and pm_clk_destroy() to separate
modifications of the pcd->clock_list list from the actual removal of
PM clock entry objects done by __pm_clk_remove().
Reported-and-tested-by: Guennadi Liakhovetski <g.liakhovetski@gmx.de> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Analog to git commit 59e4c3a2fe9cb1681bb2cff508ff79466f7585ba
do not clear the additional personality flags on exec. We
need to inherit the personality bits in PER_MASK across exec.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
James Bottomley [Sun, 18 Sep 2011 14:56:20 +0000 (18:56 +0400)]
[SCSI] 3w-9xxx: fix iommu_iova leak
Following reports on the list, it looks like the 3e-9xxx driver will leak dma
mappings every time we get a transient queueing error back from the card.
This is because it maps the sg list in the routine that sends the command, but
doesn't unmap again in the transient failure path (even though the command is
sent back to the block layer). Fix by unmapping before returning the status.
Reported-by: Chris Boot <bootc@bootc.net> Tested-by: Chris Boot <bootc@bootc.net> Acked-by: Adam Radford <aradford@gmail.com> Cc: stable@kernel.org Signed-off-by: James Bottomley <JBottomley@Parallels.com>
The root cause was an EEH error, which sent us down the offload_close path in
the cxgb3 driver, which in turn sets cdev->l2opt to NULL, without regard for
upper layer driver (like the cxgbi drivers) which might have execution contexts
in the middle of its use. The result is the oops above, when t3_l2t_get attempts
to dereference L2DATA(cdev)->nentries in arp_hash right after the EEH error handler sets it to NULL.
The fix is to prevent the setting of the NULL pointer until after there are no
further users of it. The t3cdev->l2opt pointer is now converted to be an rcu
pointer and the L2DATA macro is now called under the protection of the
rcu_read_lock(). When the EEH error path:
t3_adapter_error->offload_close->cxgb3_offload_deactivate
Is exectured, setting of that l2opt pointer to NULL, is now gated on an rcu
quiescence point, preventing, allowing L2DATA callers to safely check for a NULL
pointer without concern that the underlying data will be freeded before the
pointer is dereferenced.
This has been tested by the reporter and shown to fix the reproted oops
[nhorman: fix up unitinialised variable reported by Dan Carpenter] Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reviewed-by: Karen Xie <kxie@chelsio.com> Cc: stable@kernel.org Signed-off-by: James Bottomley <JBottomley@Parallels.com>
ALSA: hda/realtek - Don't detect LO jack when identical with HP
The spec->autocfg.line_out_pins[] may contain the same pins as hp_pins[]
depending on the configuration. When they are identical, detecting the
line_jack_present flag screws up the auto-mute because alc_line_automute()
is called unconditionally at initialization while it won't be triggered
by unsol events, thus the old line_jack_present flag is kept for the
whole run.
For fixing this buggy behavior, the driver needs to check whether the
line-outs are really individual, and skip if same as headphone jacks.
Will Deacon [Fri, 23 Sep 2011 13:34:12 +0000 (14:34 +0100)]
ARM: 7099/1: futex: preserve oldval in SMP __futex_atomic_op
The SMP implementation of __futex_atomic_op clobbers oldval with the
status flag from the exclusive store. This causes it to always read as
zero when performing the FUTEX_OP_CMP_* operation.
This patch updates the ARM __futex_atomic_op implementations to take a
tmp argument, allowing us to store the strex status flag without
overwriting the register containing oldval.
Cc: stable@kernel.org Reported-by: Minho Ban <mhban@samsung.com> Reviewed-by: Nicolas Pitre <nicolas.pitre@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Simon Kirby [Fri, 23 Sep 2011 00:03:46 +0000 (17:03 -0700)]
sched: Fix up wchan borkage
Commit c259e01a1ec ("sched: Separate the scheduler entry for
preemption") contained a boo-boo wrecking wchan output. It forgot to
put the new schedule() function in the __sched section and thereby
doesn't get properly ignored for things like wchan.
Tested-by: Simon Kirby <sim@hostway.ca> Cc: stable@kernel.org # 2.6.39+ Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110923000346.GA25425@hostway.ca Signed-off-by: Ingo Molnar <mingo@elte.hu>
When the headphone pin is assigned as primary output to line_out_pins[],
the automatic HP-pin assignment by ASSID must be suppressed. Otherwise
a wrong pin might be assigned to the headphone and breaks the auto-mute.
Russell King [Thu, 22 Sep 2011 09:32:25 +0000 (10:32 +0100)]
ARM: dma-mapping: free allocated page if unable to map
If the attempt to map a page for DMA fails (eg, because we're out of
mapping space) then we must not hold on to the page we allocated for
DMA - doing so will result in a memory leak.
Cc: <stable@kernel.org> Reported-by: Bryan Phillippe <bp@darkforest.org> Tested-by: Bryan Phillippe <bp@darkforest.org> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Marek Szyprowski [Mon, 26 Sep 2011 04:16:45 +0000 (13:16 +0900)]
ARM: S5P: fix incorrect loop iterator usage on gpio-interrupt
Loop iterator value after terminating list_for_each_entry()
is not NULL. This patch fixes incorrect iterator usage in
GPIO interrupt code for SAMSUNG S5P platforms.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
Merge branch 'spi/merge' of git://git.secretlab.ca/git/linux-2.6
* 'spi/merge' of git://git.secretlab.ca/git/linux-2.6:
spi: Fix WARN when removing spi-fsl-spi module
spi/imx: Fix spi-imx when the hardware SPI chipselects are used
Jeff Harris [Fri, 23 Sep 2011 15:49:36 +0000 (11:49 -0400)]
spi: Fix WARN when removing spi-fsl-spi module
If CPM mode is not used, the fsl_dummy_rx variable is never allocated. When
the cleanup attempts to free it, the reference count is zero and a WARN is
generated. The same CPM mode check used in the initialize is applied to the
free as well.
Tested on 2.6.33 with the previous spi_mpc8xxx driver. The renamed
spi-fsl-spi driver looks to have the same problem.
Signed-off-by: Jeff Harris <jeff_harris@kentrox.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Merge branch 'perf-tools-for-linus' of git://github.com/acmel/linux
* 'perf-tools-for-linus' of git://github.com/acmel/linux:
perf tools: Add support for disabling -Werror via WERROR=0
perf top: Fix userspace sample addr map offset
perf symbols: Fix issue with binaries using 16-bytes buildids (v2)
perf tool: Fix endianness handling of u32 data in samples
perf sort: Fix symbol sort output by separating unresolved samples by type
perf symbols: Synthesize anonymous mmap events
perf record: Create events initially disabled and enable after init
perf symbols: Add some heuristics for choosing the best duplicate symbol
perf symbols: Preserve symbol scope when parsing /proc/kallsyms
perf symbols: /proc/kallsyms does not sort module symbols
perf symbols: Fix ppc64 SEGV in dso__load_sym with debuginfo files
perf probe: Fix regression of variable finder
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/radeon/kms: fix DDIA enable on some rs690 systems
Revert "drm/radeon/kms: fix typo in r100_blit_copy"
Merge branch 'for-linus' of git://github.com/tiwai/sound
* 'for-linus' of git://github.com/tiwai/sound:
ALSA: usb-audio - clear chip->probing on error exit
ALSA: fm801: Gracefully handle failure of tuner auto-detect
ALSA: fm801: Fix double free in case of error in tuner detection
ASoC: Ensure we generate a driver name
ASoC: Remove bitrotted wm8962_resume()
ASoC: bf5xx-ad73311: Fix prototype for bf5xx_probe
Jan Beulich [Fri, 23 Sep 2011 10:40:08 +0000 (06:40 -0400)]
hwmon: (coretemp) remove struct platform_data * parameter from create_core_data()
The only caller of the function obtained the pointer solely for the
purpose of passing it to this function, while it can be easily
determined from the struct platform_device * parameter also passed.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Guenter Roeck <guenter.roeck@ericsson.com>
Jan Beulich [Fri, 23 Sep 2011 10:35:00 +0000 (06:35 -0400)]
hwmon: (coretemp) don't use kernel assigned CPU number as platform device ID
... as that has the potential to conflict with (particularly soft) CPU
hot removal and re-adding.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
[guenter.roeck@ericsson.com: use platform device ID as physical CPU id] Signed-off-by: Guenter Roeck <guenter.roeck@ericsson.com>
Darren Hart [Thu, 8 Sep 2011 20:42:39 +0000 (13:42 -0700)]
perf tools: Add support for disabling -Werror via WERROR=0
GCC often introduces new warnings with lots of false positives -
breaking -Werror builds. WERROR=0 allows one to build perf without much
fuss - while still encouraging people to send patches to avoid the fuss
of having to type WERROR=0.
Bisecting back to commits that produce a (mostly harmless) warning on
some compilers is more difficult. With WERROR=0 one could bisect without
worrying about harmless warnings.
The 'perf top' tool came from the kernel where we had each DSO (vmlinux,
modules) loaded just once at a time.
But userspace may have DSOs loaded in multiple addresses (shared
libraries), requiring that we use the just resolved map instead of the
first one found.
Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/n/tip-ag53wz0yllpgers0n2w7hchp@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Stephane Eranian [Fri, 22 Oct 2010 15:25:01 +0000 (17:25 +0200)]
perf symbols: Fix issue with binaries using 16-bytes buildids (v2)
Buildid can vary in size. According to the man page of ld, buildid can
be 160 bits (sha1) or 128 bits (md5, uuid). Perf assumes buildid size of
20 bytes (160 bits) regardless. When dealing with md5 buildids, it would
thus read more than needed and that would cause mismatches and samples
without symbols.
This patch fixes this by taking into account the actual buildid size as
encoded int he section header. The leftover bytes are also cleared.
This second version fixes a minor issue with the memset() base position.
Cc: David S. Miller <davem@davemloft.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robert Richter <robert.richter@amd.com> Cc: Stephane Eranian <eranian@gmail.com> Link: http://lkml.kernel.org/r/4cc1af3c.8ee7d80a.5a28.ffff868e@mx.google.com Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
David Ahern [Tue, 6 Sep 2011 15:12:26 +0000 (09:12 -0600)]
perf tool: Fix endianness handling of u32 data in samples
Currently, analyzing PPC data files on x86 the cpu field is always 0 and
the tid and pid are backwards. For example, analyzing a PPC file on PPC
the pid/tid fields show:
rsyslogd 1210/1212
and analyzing the same PPC file using an x86 perf binary shows:
rsyslogd 1212/1210
The problem is that the swap_op method for samples is
perf_event__all64_swap which assumes all elements in the sample_data
struct are u64s. cpu, tid and pid are u32s and need to be handled
individually. Given that the swap is done before the sample is parsed,
the simplest solution is to undo the 64-bit swap of those elements when
the sample is parsed and do the proper swap.
The RAW data field is generic and perf cannot have programmatic knowledge
of how to treat that data. Instead a warning is given to the user.
Thanks to Anton Blanchard for providing a data file for a mult-CPU
PPC system so I could verify the fix for the CPU fields.
v3 -> v4:
- fixed use of WARN_ONCE
v2 -> v3:
- used WARN_ONCE for message regarding raw data
- removed struct wrapper around union
- fixed whitespace issues
v1 -> v2:
- added a union for undoing the byte-swap on u64 and redoing swap on
u32's to address compiler errors (see git commit 65014ab3)
Cc: Anton Blanchard <anton@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1315321946-16993-1-git-send-email-dsahern@gmail.com Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Using perf stat to get the user/kernel/hypervisor breakdown contradicted
this.
The problem is we merge all unresolved samples into the one unknown
bucket. If add a comparison by sample type to sort__sym_cmp we get the
real picture:
This happens because we aren't consistent with our sorting. On
one hand we check to see if both symbols match and for two unresolved
samples sym is NULL so we match:
if (left->ms.sym == right->ms.sym)
return 0;
On the other hand we use sample IP for unresolved samples when
comparing against a symbol:
Acked-by: Eric B Munson <emunson@mgebm.net> Cc: Eric B Munson <emunson@mgebm.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ian Munsie <imunsie@au1.ibm.com> Link: http://lkml.kernel.org/r/20110831115145.4f598ab2@kryten Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>