Peter Zijlstra [Wed, 15 Apr 2015 09:41:58 +0000 (11:41 +0200)]
perf: Fix mux_interval hrtimer wreckage
Thomas stumbled over the hrtimer_forward_now() in
perf_event_mux_interval_ms_store() and noticed its broken-ness.
You cannot just change the expiry time of an active timer, it will
destroy the red-black tree order and cause havoc.
Change it to (re)start the timer instead, (re)starting a timer will
dequeue and enqueue a timer and therefore preserve rb-tree order.
Since we cannot enqueue remotely, wrap the thing in
cpu_function_call(), this however mandates that we restrict ourselves
to online cpus. Also serialize the entire setting so we don't get
multiple concurrent threads trying to update to different values.
Also fix a problem in perf_mux_hrtimer_restart(), checking against
hrtimer_active() can actually loose us the timer when timer->state ==
HRTIMER_STATE_CALLBACK and the callback has already decided NORESTART.
Furthermore it doesn't make any sense to test
hrtimer_callback_running() when we already tested hrtimer_active(),
but with the above change, we explicitly must call it when
callback_running.
Lastly, rename a few functions:
s/perf_cpu_hrtimer_/perf_mux_hrtimer_/ -- because I could not find
the mux timer function
s/\<hr\>/timer/ -- because that's the normal way of calling things.
Fixes: 62b856397927 ("perf: Add sysfs entry to adjust multiplexing interval per PMU") Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150415095011.863052571@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Peter Zijlstra [Wed, 15 Apr 2015 09:41:57 +0000 (11:41 +0200)]
sched: Cleanup bandwidth timers
Roman reported a 3 cpu lockup scenario involving __start_cfs_bandwidth().
The more I look at that code the more I'm convinced its crack, that
entire __start_cfs_bandwidth() thing is brain melting, we don't need to
cancel a timer before starting it, *hrtimer_start*() will happily remove
the timer for you if its still enqueued.
Removing that, removes a big part of the problem, no more ugly cancel
loop to get stuck in.
So now, if I understand things right, the entire reason you have this
cfs_b->lock guarded ->timer_active nonsense is to make sure we don't
accidentally lose the timer.
It appears to me that it should be possible to guarantee that same by
unconditionally (re)starting the timer when !queued. Because regardless
what hrtimer::function will return, if we beat it to (re)enqueue the
timer, it doesn't matter.
Now, because hrtimers don't come with any serialization guarantees we
must ensure both handler and (re)start loop serialize their access to
the hrtimer to avoid both trying to forward the timer at the same
time.
Update the rt bandwidth timer to match.
This effectively reverts: 09dc4ab03936 ("sched/fair: Fix
tg_set_cfs_bandwidth() deadlock on rq->lock").
Reported-by: Roman Gushchin <klamm@yandex-team.ru> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Ben Segall <bsegall@google.com> Cc: Paul Turner <pjt@google.com> Link: http://lkml.kernel.org/r/20150415095011.804589208@infradead.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Peter Zijlstra [Tue, 20 May 2014 13:49:48 +0000 (15:49 +0200)]
hrtimer: Allow concurrent hrtimer_start() for self restarting timers
Because we drop cpu_base->lock around calling hrtimer::function, it is
possible for hrtimer_start() to come in between and enqueue the timer.
If hrtimer::function then returns HRTIMER_RESTART we'll hit the BUG_ON
because HRTIMER_STATE_ENQUEUED will be set.
Since the above is a perfectly valid scenario, remove the BUG_ON and
make the enqueue_hrtimer() call conditional on the timer not being
enqueued already.
NOTE: in that concurrent scenario its entirely common for both sites
to want to modify the hrtimer, since hrtimers don't provide
serialization themselves be sure to provide some such that the
hrtimer::function and the hrtimer_start() caller don't both try and
fudge the expiration state at the same time.
To that effect, add a WARN when someone tries to forward an already
enqueued timer, the most common way to change the expiry of self
restarting timers. Ideally we'd put the WARN in everything modifying
the expiry but most of that is inlines and we don't need the bloat.
Fixes: 2d44ae4d7135 ("hrtimer: clean up cpu->base locking tricks") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ben Segall <bsegall@google.com> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Paul Turner <pjt@google.com> Link: http://lkml.kernel.org/r/20150415113105.GT5029@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 14 Apr 2015 21:09:25 +0000 (21:09 +0000)]
hrtimer: Avoid locking in hrtimer_cancel() if timer not active
We can do a lockless check for hrtimer_active before actually taking
the lock in hrtimer[_try_to]_cancel. This is useful for hotpath users
like nanosleep as they avoid the lock dance when the timer has
expired.
This is safe because active is true when the timer is enqueued or the
callback is running. Taking the hrtimer base lock does not protect
against concurrent hrtimer_start calls, the callsite has to do the
proper serialization itself.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20150414203503.580273114@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 14 Apr 2015 21:09:22 +0000 (21:09 +0000)]
tick: broadcast-hrtimer: Remove overly clever return value abuse
The assignment of bc_moved in the conditional construct relies on the
fact that in the case of hrtimer_start() invocation the return value
is always 0. It took me a while to understand it.
We want to get rid of the hrtimer_start() return value. Open code the
logic which makes it readable as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20150414203503.404751457@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The check for hrtimer_active() after starting the timer is
pointless. If the timer is inactive it has expired already and
therefor the task pointer is already NULL.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Link: http://lkml.kernel.org/r/20150414203503.165258315@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 14 Apr 2015 21:09:15 +0000 (21:09 +0000)]
rtmutex: Remove bogus hrtimer_active() check
The check for hrtimer_active() after starting the timer is
pointless. If the timer is inactive it has expired already and
therefor the task pointer is already NULL.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20150414203503.081830481@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 14 Apr 2015 21:09:13 +0000 (21:09 +0000)]
futex: Remove bogus hrtimer_active() check
The check for hrtimer_active() after starting the timer is
pointless. If the timer is inactive it has expired already and
therefor the task pointer is already NULL.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20150414203502.985825453@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 14 Apr 2015 21:09:11 +0000 (21:09 +0000)]
hrtimer: Remove bogus hrtimer_active() check
The check for hrtimer_active() after starting the timer is
pointless. If the timer is inactive it has expired already and
therefor the task pointer is already NULL.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20150414203502.907149271@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 14 Apr 2015 21:08:58 +0000 (21:08 +0000)]
tick: Nohz: Rework next timer evaluation
The evaluation of the next timer in the nohz code is based on jiffies
while all the tick internals are nano seconds based. We have also to
convert hrtimer nanoseconds to jiffies in the !highres case. That's
just wrong and introduces interesting corner cases.
Turn it around and convert the next timer wheel timer expiry and the
rcu event to clock monotonic and base all calculations on
nanoseconds. That identifies the case where no timer is pending
clearly with an absolute expiry value of KTIME_MAX.
Makes the code more readable and gets rid of the jiffies magic in the
nohz code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Link: http://lkml.kernel.org/r/20150414203502.184198593@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 14 Apr 2015 21:08:54 +0000 (21:08 +0000)]
tick: sched: Force tick interrupt and get rid of softirq magic
We already got rid of the hrtimer reprogramming loops and hoops as
hrtimer now enforces an interrupt if the enqueued time is in the past.
Do the same for the nohz non highres mode. That gets rid of the need
to raise the softirq which only serves the purpose of getting the
machine out of the inner idle loop.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Link: http://lkml.kernel.org/r/20150414203502.023464878@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 14 Apr 2015 21:08:51 +0000 (21:08 +0000)]
hrtimer: Get rid of hrtimer softirq
hrtimer softirq is a leftover from the initial implementation and
serves only the purpose to handle the enqueueing of already expired
timers in the high resolution timer mode. We discussed whether we
change the return value and force all start sites to handle that the
timer is already expired, but that would be a Herculean task and I'm
not sure whether its a good idea to enforce that handling on
everyone.
A simpler solution is to enforce a timer interrupt instead of raising
and scheduling a softirq. Just use the existing infrastructure to do
so and remove all the softirq leftovers.
The HRTIMER softirq enum is now unused, but kept around because trace
parsers rely on the existing numbering.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20150414203501.840834708@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 14 Apr 2015 21:08:49 +0000 (21:08 +0000)]
hrtimer: Keep pointer to first timer and simplify __remove_hrtimer()
__remove_hrtimer() needs to evaluate the expiry time to figure out
whether the timer which is removed is eventually the first expiring
timer on the cpu. Keep a pointer to it, which is lazily updated, so we
can avoid the evaluation dance and retrieve the information from there.
Generates slightly better code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20150414203501.752838019@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 14 Apr 2015 21:08:46 +0000 (21:08 +0000)]
timerqueue: Let timerqueue_add/del return information
The hrtimer code is interested whether the added timer is the first
one to expire and whether the removed timer was the last one in the
tree. The add/del routines have that information already. So we can
return it right away instead of reevaluating it at the call site.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Link: http://lkml.kernel.org/r/20150414203501.579063647@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 14 Apr 2015 21:08:44 +0000 (21:08 +0000)]
hrtimer: Align the hrtimer clock bases as well
We don't use cacheline_align here because that might waste lot of
space on 32bit machine with 64 bytes cachelines and on 64bit machines
with 128 bytes cachelines.
The size of struct hrtimer_clock_base is 64byte on 64bit and 32byte on
32bit machines. So we utilize the cache lines proper.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20150414203501.498165771@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 14 Apr 2015 21:08:41 +0000 (21:08 +0000)]
hrtimer: Use cpu_base->active_base for hotpath iterators
The active_bases field is guaranteed to be in sync with the timerqueue
of the corresponding clock base. So we can use it for iterating over
the clock bases. This allows to break out early if no more active
clock bases are available and avoids touching the cache lines of
inactive clock bases.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20150414203501.322887675@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 14 Apr 2015 21:08:37 +0000 (21:08 +0000)]
hrtimer: Make offset update smarter
On every tick/hrtimer interrupt we update the offset variables of the
clock bases. That's silly because these offsets change very seldom.
Add a sequence counter to the time keeping code which keeps track of
the offset updates (clock_was_set()). Have a sequence cache in the
hrtimer cpu bases to evaluate whether the offsets must be updated or
not. This allows us later to avoid pointless cacheline pollution.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Link: http://lkml.kernel.org/r/20150414203501.132820245@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org>
Thomas Gleixner [Tue, 14 Apr 2015 21:08:35 +0000 (21:08 +0000)]
hrtimer: Get rid of softirq time
The softirq time field in the clock bases is an optimization from the
early days of hrtimers. It provides a coarse "jiffies" like time
mostly for self rearming timers.
But that comes with a price:
- Larger code size
- Extra storage space
- Duplicated functions with really small differences
The benefit of this is optimization is marginal for contemporary
systems.
Consolidate everything on the high resolution timer
implementation. This makes further optimizations possible.
Text size reduction:
x8664 -95, i386 -356, ARM -148, ARM64 -40, power64 -16
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20150414203501.039977424@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 14 Apr 2015 21:08:30 +0000 (21:08 +0000)]
sound: Use hrtimer_resolution instead of hrtimer_get_res()
No point in converting a timespec now that the value is directly
accessible. Get rid of the null check while at it. Resolution is
guaranteed to be > 0.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Takashi Iwai <tiwai@suse.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: alsa-devel@alsa-project.org Link: http://lkml.kernel.org/r/20150414203500.799133359@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Thomas Gleixner [Tue, 14 Apr 2015 21:08:27 +0000 (21:08 +0000)]
hrtimer: Get rid of the resolution field in hrtimer_clock_base
The field has no value because all clock bases have the same
resolution. The resolution only changes when we switch to high
resolution timer mode. We can evaluate that from a single static
variable as well. In the !HIGHRES case its simply a constant.
Export the variable, so we can simplify the usage sites.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/20150414203500.645454122@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
hrtimer: Update active_bases before calling hrtimer_force_reprogram()
'active_bases' indicates which clock-base have active timer. The
intention of this bit field was to avoid evaluating inactive bases. It
was introduced with the introduction of the BOOTTIME and TAI clock
bases, but it was never brought into full use.
We want to use it now, but in __remove_hrtimer() the update happens
after the calling hrtimer_force_reprogram() which has to evaluate all
clock bases for the next expiring timer. So in case the last timer of
a clock base got removed we still see the active bit and therefor
evaluate the clock base for no value. There are further optimizations
possible when active_bases is updated in the right place.
Move the update before the call to hrtimer_force_reprogram()
Thomas Gleixner [Wed, 22 Apr 2015 09:44:15 +0000 (11:44 +0200)]
timekeeping: Remove stale function prototype
commit 61edec81d260 "timekeeping: Simplify timekeeping_clocktai()"
implemented timekeeping_clocktai() as an inline function, but left the
old extern prototype in the header file. Remove it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The function clocksource_get_next() was removed in commit 75c5158f70
(timekeeping: Update clocksource with stop_machine), but the
prototype was not removed with it. Remove the prototype.
Signed-off-by: Yingjoe Chen <yingjoe.chen@mediatek.com> Cc: <linux-arm-kernel@lists.infradead.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: <srv_heupstream@mediatek.com> Cc: John Stultz <john.stultz@linaro.org> Link: http://lkml.kernel.org/r/1428674150-1780-1-git-send-email-yingjoe.chen@mediatek.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"I'd like to say these were a set of regressions for the recent merge
window code. Unfortunately, they all predate the merge window code
(stable cc'd).
There are two fixes for data integrity (mostly only showing up on
module removal), an mvsas crash with expander attached SATA devices
which goes back to the dawn of the driver but is only just being
picked up as sas expanders become a standard item in low end server
hardware, an am53c974 one because the interrupt data isn't fully
initialised before the line is and a megaraid_sas one because it uses
smp_processor_id() to select MSI-X queues and that now triggers a
WARN_ON()"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
mvsas: fix panic on expander attached SATA devices
am53c974: Fix crash during modprobe
megaraid_sas: use raw_smp_processor_id()
sd: Fix missing ATO tag check
sd: Unregister integrity profile
Merge branch 'drm-next-merged' of git://people.freedesktop.org/~airlied/linux
Pull drm updates from Dave Airlie:
"Highlights:
Core:
- Virtual GEM layer merged, this has been around for a long time, and
it provides a software backed device that allows userspace to use
it as a GEM shared memory handler. This makes it a lot easier to
do certain things when you have no GPU but still have to deal with
DRI expectations.
- atomic helper updates.
- framebuffer modifier interface added.
- i2c over auxch displayport fixes.
- fb width/height confusion fixes.
- new driver for ps8622/ps8625 bridge chips
- lots of new panels
i915:
- more plane atomic conversion
- vGPU guest support for XenGT
- Skylake workarounds and fixes
- Y-tiling support
- work on dynamic pagetable allocation
- EU count report param for gen9+
- CHV fixes (no longer prelim)
- remove ilk rc6
- frontbuffer tracking for fbc
- Displayport link rate refactoring
- sprite colorkey refactor
radeon:
- Displayport MST support (not enabled by default)
- non-ATOM native hw auxch support (DCE5+)
- output csc support
- new queries for userspace debug support
- new VCE packet
nouveau:
- gk20a iommu support
- gm107 graphics support
- more gm20x bringup (waiting on signed nvidia fw).
amdkfd:
- multiple kgd instance support
- use 64-bit time accessors
msm:
- stolen memory support
- DSI and dual-DSI support
- snapdragon 410 support
exynos:
- cleanups for atomic and pageflip
imx-drm:
- more media-bus formats
- TV output prep
- drm panel support
tegra:
- hw vblank counter using host1x syncpoints
omap:
- universal plane support
- prep work for atomic modesetting
rcar-du:
- ported to atomic modesetting
atmel-hlcdc:
- ported to atomic modesetting
- added suspend/resume support
sti:
- ported to atomic modesetting
dwhdmi:
- more compliant audio support
- update rockchip phy support
* 'drm-next-merged' of git://people.freedesktop.org/~airlied/linux: (689 commits)
media-bus: Fixup RGB444_1X12, RGB565_1X16, and YUV8_1X24 media bus format
drm/i915: Dont enable CS_PARSER_ERROR interrupts at all
drm/i915: Move drm_framebuffer_unreference out of struct_mutex for takeover
drm: fix trivial typo mistake
drm: Make integer overflow checking cover universal cursor updates (v2)
drm/nouveau/bios: fix fetching from acpi on certain systems
drm/nouveau/gr/gm206: initial init+ctx code
drm/nouveau/ce/gm206: enable support via gm204 code
drm/nouveau/fifo/gm206: enable support via gm204 code
drm/nouveau/gr/gm204: initial init+ctx code
drm/nouveau: support for buffer moves via MaxwellDmaCopyA
drm/nouveau/ce/gm204: initial support
drm/nouveau: add support for gm20x fifo channels
drm/nouveau/fifo/gm204: initial support
drm/nouveau/gr/gk104-: prevent reading non-existent regs in intr handler
drm/nouveau/gr/gm107: very slightly demagic part of attrib cb setup
drm/nouveau/gr/gk104-: correct crop/zrop num_active_fbps setting
drm/nouveau/gr/gf100-: add symbolic names for classes
drm/nouveau/gr/gm107: support tpc "strand" ctxsw in gpccs ucode
drm/nouveau/gr/gf100-: support mmio access with gpc offset from gpccs ucode
...
Merge tag 'iommu-updates-v4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
"Not much this time, but the changes include:
- moving domain allocation into the iommu drivers to prepare for the
introduction of default domains for devices
- fixing the IO page-table code in the AMD IOMMU driver to correctly
encode large page sizes
- extension of the PCI support in the ARM-SMMU driver
- various fixes and cleanups"
* tag 'iommu-updates-v4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (34 commits)
iommu/amd: Correctly encode huge pages in iommu page tables
iommu/amd: Optimize amd_iommu_iova_to_phys for new fetch_pte interface
iommu/amd: Optimize alloc_new_range for new fetch_pte interface
iommu/amd: Optimize iommu_unmap_page for new fetch_pte interface
iommu/amd: Return the pte page-size in fetch_pte
iommu/amd: Add support for contiguous dma allocator
iommu/amd: Don't allocate with __GFP_ZERO in alloc_coherent
iommu/amd: Ignore BUS_NOTIFY_UNBOUND_DRIVER event
iommu/amd: Use BUS_NOTIFY_REMOVED_DEVICE
iommu/tegra: smmu: Compute PFN mask at runtime
iommu/tegra: gart: Set aperture at domain initialization time
iommu/tegra: Setup aperture
iommu: Remove domain_init and domain_free iommu_ops
iommu/fsl: Make use of domain_alloc and domain_free
iommu/rockchip: Make use of domain_alloc and domain_free
iommu/ipmmu-vmsa: Make use of domain_alloc and domain_free
iommu/shmobile: Make use of domain_alloc and domain_free
iommu/msm: Make use of domain_alloc and domain_free
iommu/tegra-gart: Make use of domain_alloc and domain_free
iommu/tegra-smmu: Make use of domain_alloc and domain_free
...
Merge tag 'cpumask-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull final removal of deprecated cpus_* cpumask functions from Rusty Russell:
"This is the final removal (after several years!) of the obsolete
cpus_* functions, prompted by their mis-use in staging.
With these function removed, all cpu functions should only iterate to
nr_cpu_ids, so we finally only allocate that many bits when cpumasks
are allocated offstack"
* tag 'cpumask-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (25 commits)
cpumask: remove __first_cpu / __next_cpu
cpumask: resurrect CPU_MASK_CPU0
linux/cpumask.h: add typechecking to cpumask_test_cpu
cpumask: only allocate nr_cpumask_bits.
Fix weird uses of num_online_cpus().
cpumask: remove deprecated functions.
mips: fix obsolete cpumask_of_cpu usage.
x86: fix more deprecated cpu function usage.
ia64: remove deprecated cpus_ usage.
powerpc: fix deprecated CPU_MASK_CPU0 usage.
CPU_MASK_ALL/CPU_MASK_NONE: remove from deprecated region.
staging/lustre/o2iblnd: Don't use cpus_weight
staging/lustre/libcfs: replace deprecated cpus_ calls with cpumask_
staging/lustre/ptlrpc: Do not use deprecated cpus_* functions
blackfin: fix up obsolete cpu function usage.
parisc: fix up obsolete cpu function usage.
tile: fix up obsolete cpu function usage.
arm64: fix up obsolete cpu function usage.
mips: fix up obsolete cpu function usage.
x86: fix up obsolete cpu function usage.
...
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Martin Schwidefsky:
"The big thing in this second merge for s390 is the new eBPF JIT from
Michael which replaces the old 32-bit backend.
The remaining commits are bug fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/pci: add locking for fmb access
s390/pci: extract software counters from fmb
s390/dasd: Fix unresumed device after suspend/resume having no paths
s390/dasd: fix unresumed device after suspend/resume
s390/dasd: fix inability to set a DASD device offline
s390/mm: Fix memory hotplug for unaligned standby memory
s390/bpf: Add s390x eBPF JIT compiler backend
s390: Use bool function return values of true/false not 1/0
Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
Pull m68k fixes from Greg Ungerer:
"Nothing big, spelling fixes and fix/cleanup for ColdFire eth device setup"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
m68knommu: fix fec setup warning for ColdFire 5271 builds
m68knommu: ColdFire 5271 only has a single FEC controller
m68k: Fix trivial typos in comments
Yes, it should work, but it's a bad idea. Not only did ARM64 not have
the 16-bit access code (there's a separate patch to add it), it's just
not a good atomic type. Some architectures fundamentally don't do
atomic accesses in them (alpha), and it's not like it saves any space
here anyway because of structure packing issues.
We normally should aim for flags to be "unsigned int" or "unsigned
long". And if space is at a premium, use a single byte (although that
causes problems on alpha again). There might be very special cases
where a 16-byte entity is really wanted, but this is not one of them.
OMAPDSS: Correct video ports description file path in DT binding doc
The doc refers to Documentation/devicetree/bindings/video/video-ports.txt
which does not exist. The documentation seems to be outdated and wants to
refer to Documentation/devicetree/bindings/graph.txt instead.
Signed-off-by: Javier Martinez Canillas <javier.martinez@collabora.co.uk> Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
Philipp Zabel [Fri, 17 Apr 2015 17:12:41 +0000 (19:12 +0200)]
media-bus: Fixup RGB444_1X12, RGB565_1X16, and YUV8_1X24 media bus format
Change the constant values for RGB444_1X12, RGB565_1X16, and YUV8_1X24 media
bus formats in anticipation of a merge conflict with the media tree, where
the old values are already taken by RBG888_1X24, RGB888_1X32_PADHI, and
VUY8_1X24, respectively.
Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de> Signed-off-by: Dave Airlie <airlied@redhat.com>
Merge branch 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux
Pull turbostat update from Len Brown:
"Updates to the turbostat utility.
Just one kernel dependency in this batch -- added a #define to
msr-index.h"
* 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
tools/power turbostat: correct dumped pkg-cstate-limit value
tools/power turbostat: calculate TSC frequency from CPUID(0x15) on SKL
tools/power turbostat: correct DRAM RAPL units on recent Xeon processors
tools/power turbostat: Initial Skylake support
tools/power turbostat: Use $(CURDIR) instead of $(PWD) and add support for O= option in Makefile
tools/power turbostat: modprobe msr, if needed
tools/power turbostat: dump MSR_TURBO_RATIO_LIMIT2
tools/power turbostat: use new MSR_TURBO_RATIO_LIMIT names
x86 msr-index: define MSR_TURBO_RATIO_LIMIT,1,2
tools/power turbostat: label base frequency
tools/power turbostat: update PERF_LIMIT_REASONS decoding
tools/power turbostat: simplify default output
The test_data_1_le[] array is a const array of const char *. To avoid
dropping any const information, we need to use "const char * const *",
not just "const char **".
I'm not sure why the different test arrays end up having different
const'ness, but let's make the pointer we use to traverse them as const
as possible, since we modify neither the array of pointers _or_ the
pointers we find in the array.
smp: Fix error case handling in smp_call_function_*()
Commit 8053871d0f7f ("smp: Fix smp_call_function_single_async()
locking") fixed the locking for the asynchronous smp-call case, but in
the process of moving the lock handling around, one of the error cases
ended up not unlocking the call data at all.
This went unnoticed on x86, because this is a "caller is buggy" case,
where the caller is trying to call a non-existent CPU. But apparently
ARM does that (at least under qemu-arm). Bindly doing cross-cpu calls
to random CPU's that aren't even online seems a bit fishy, but the error
handling was clearly not correct.
Simply add the missing "csd_unlock()" to the error path.
Pull sparc fixes from David Miller
"Unfortunately, I brown paper bagged the generic iommu pool allocator
by applying the wrong revision of the patch series.
This reverts the bad one, and puts the right one in"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
iommu-common: Fix PARISC compile-time warnings
sparc: Make LDC use common iommu poll management functions
sparc: Make sparc64 use scalable lib/iommu-common.c functions
Break up monolithic iommu table/lock into finer graularity pools and lock
sparc: Revert generic IOMMU allocator.
Merge tag 'for-linus-4.1-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
Pull 9pfs updates from Eric Van Hensbergen:
"Some accumulated cleanup patches for kerneldoc and unused variables as
well as some lock bug fixes and adding privateport option for RDMA"
* tag 'for-linus-4.1-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
net/9p: add a privport option for RDMA transport.
fs/9p: Initialize status in v9fs_file_do_lock.
net/9p: Initialize opts->privport as it should be.
net/9p: use memcpy() instead of snprintf() in p9_mount_tag_show()
9p: use unsigned integers for nwqid/count
9p: do not crash on unknown lock status code
9p: fix error handling in v9fs_file_do_lock
9p: remove unused variable in p9_fd_create()
9p: kerneldoc warning fixes
David S. Miller [Sat, 18 Apr 2015 19:35:09 +0000 (12:35 -0700)]
Merge branch 'iommu-generic-allocator'
Sowmini Varadhan says:
====================
Generic IOMMU pooled allocator
Investigation of network performance on Sparc shows a high
degree of locking contention in the IOMMU allocator, and it
was noticed that the PowerPC code has a better locking model.
This patch series tries to extract the generic parts of the
PowerPC code so that it can be shared across multiple PCI
devices and architectures.
v10: resend patchv9 without RFC tag, and a new mail Message-Id,
(previous non-RFC attempt did not show up on the patchwork queue?)
Full revision history below:
v2 changes:
- incorporate David Miller editorial comments: sparc specific
fields moved from iommu-common into sparc's iommu_64.h
- make the npools value an input parameter, for the case when
the iommu map size is not very large
- cookie_to_index mapping, and optimizations for span-boundary
check, for use case such as LDC.
v3: eliminate iommu_sparc, rearrange the ->demap indirection to
be invoked under the pool lock.
v4: David Miller review changes:
- s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
- page_table_map_base and page_table_shift are unsigned long, not u32.
v5: removed ->cookie_to_index and ->demap indirection from the
iommu_tbl_ops The caller needs to call these functions as needed,
before invoking the generic arena allocator functions.
Added the "skip_span_boundary" argument to iommu_tbl_pool_init() for
those callers like LDC which do no care about span boundary checks.
v6: removed iommu_tbl_ops, and instead pass the ->flush_all as
an indirection to iommu_tbl_pool_init(); only invoke ->flush_all
when there is no large_pool, based on the assumption that large-pool
usage is infrequently encountered
v7: moved pool_hash initialization to lib/iommu-common.c and cleaned up
code duplication from sun4v/sun4u/ldc.
v8: Addresses BenH comments with one exception: I've left the
IOMMU_POOL_HASH as is, so that powerpc can tailor it to their
convenience. Discard trylock for simple spin_lock to acquire pool
v9: Addresses latest BenH comments: need_flush checks, add support
for dma mask and align_order.
v10: resend without RFC tag, and new mail Message-Id.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
sparc: Make LDC use common iommu poll management functions
Note that this conversion is only being done to consolidate the
code and ensure that the common code provides the sufficient
abstraction. It is not expected to result in any noticeable
performance improvement, as there is typically one ldc_iommu
per vnet_port, and each one has 8k entries, with a typical
request for 1-4 pages. Thus LDC uses npools == 1.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: David S. Miller <davem@davemloft.net>
sparc: Make sparc64 use scalable lib/iommu-common.c functions
In iperf experiments running linux as the Tx side (TCP client) with
10 threads results in a severe performance drop when TSO is disabled,
indicating a weakness in the software that can be avoided by using
the scalable IOMMU arena DMA allocation.
Baseline numbers before this patch:
with default settings (TSO enabled) : 9-9.5 Gbps
Disable TSO using ethtool- drops badly: 2-3 Gbps.
After this patch, iperf client with 10 threads, can give a
throughput of at least 8.5 Gbps, even when TSO is disabled.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Break up monolithic iommu table/lock into finer graularity pools and lock
Investigation of multithreaded iperf experiments on an ethernet
interface show the iommu->lock as the hottest lock identified by
lockstat, with something of the order of 21M contentions out of
27M acquisitions, and an average wait time of 26 us for the lock.
This is not efficient. A more scalable design is to follow the ppc
model, where the iommu_map_table has multiple pools, each stretching
over a segment of the map, and with a separate lock for each pool.
This model allows for better parallelization of the iommu map search.
This patch adds the iommu range alloc/free function infrastructure.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Len Brown [Thu, 2 Apr 2015 01:02:57 +0000 (21:02 -0400)]
tools/power turbostat: correct dumped pkg-cstate-limit value
HSW expanded MSR_PKG_CST_CONFIG_CONTROL.Package-C-State-Limit,
from bits[2:0] used by previous implementations, to [3:0].
The value 1000b is unlimited, and is used by BDW and SKL too.
Andrey Semin [Fri, 5 Dec 2014 05:07:00 +0000 (00:07 -0500)]
tools/power turbostat: correct DRAM RAPL units on recent Xeon processors
While not yet documented in the Software Developer's Manual,
the data-sheet for modern Xeon states that DRAM RAPL ENERGY units
are fixed at 15.3 uJ, rather than being discovered via MSR.
Before this patch, DRAM energy on these products is over-stated by turbostat
because the RAPL units are 4x larger.
Thomas D [Mon, 5 Jan 2015 20:37:23 +0000 (21:37 +0100)]
tools/power turbostat: Use $(CURDIR) instead of $(PWD) and add support for O= option in Makefile
Since commit ee0778a30153
("tools/power: turbostat: make Makefile a bit more capable")
turbostat's Makefile is using
[...]
BUILD_OUTPUT := $(PWD)
[...]
which obviously causes trouble when building "turbostat" with
make -C /usr/src/linux/tools/power/x86/turbostat ARCH=x86 turbostat
because GNU make does not update nor guarantee that $PWD is set.
This patch changes the Makefile to use $CURDIR instead, which GNU make
guarantees to set and update (i.e. when using "make -C ...") and also
adds support for the O= option (see "make help" in your root of your
kernel source tree for more details).
Link: https://bugs.gentoo.org/show_bug.cgi?id=533918 Fixes: ee0778a30153 ("tools/power: turbostat: make Makefile a bit more capable") Signed-off-by: Thomas D. <whissi@whissi.de> Cc: Mark Asselstine <mark.asselstine@windriver.com> Signed-off-by: Len Brown <len.brown@intel.com>
Len Brown [Tue, 24 Mar 2015 20:37:35 +0000 (16:37 -0400)]
tools/power turbostat: modprobe msr, if needed
Some distros (Ubuntu) ship the msr driver as a module.
If turbosat is run as root on those systems, and discovers
that there is no /dev/cpu/cpu0/msr, it will now "modprobe msr"
for the user.
If not root, the modprobe attempt will fail, and turbostat will exit as before:
turbostat: no /dev/cpu/0/msr, Try "# modprobe msr" : No such file or directory
Merge branch 'x86-pmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull PMEM driver from Ingo Molnar:
"This is the initial support for the pmem block device driver:
persistent non-volatile memory space mapped into the system's physical
memory space as large physical memory regions.
The driver is based on Intel code, written by Ross Zwisler, with fixes
by Boaz Harrosh, integrated with x86 e820 memory resource management
and tidied up by Christoph Hellwig.
Note that there were two other separate pmem driver submissions to
lkml: but apparently all parties (Ross Zwisler, Boaz Harrosh) are
reasonably happy with this initial version.
This version enables minimal support that enables persistent memory
devices out in the wild to work as block devices, identified through a
magic (non-standard) e820 flag and auto-discovered if
CONFIG_X86_PMEM_LEGACY=y, or added explicitly through manipulating the
memory maps via the "memmap=..." boot option with the new, special '!'
modifier character.
Limitations: this is a regular block device, and since the pmem areas
are not struct page backed, they are invisible to the rest of the
system (other than the block IO device), so direct IO to/from pmem
areas, direct mmap() or XIP is not possible yet. The page cache will
also shadow and double buffer pmem contents, etc.
Initial support is for x86"
* 'x86-pmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
drivers/block/pmem: Fix 32-bit build warning in pmem_alloc()
drivers/block/pmem: Add a driver for persistent memory
x86/mm: Add support for the non-standard protected e820 type
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"This tree includes:
- an FPU related crash fix
- a ptrace fix (with matching testcase in tools/testing/selftests/)
- an x86 Kconfig DMA-config defaults tweak to better avoid
non-working drivers"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
config: Enable NEED_DMA_MAP_STATE by default when SWIOTLB is selected
x86/fpu: Load xsave pointer *after* initialization
x86/ptrace: Fix the TIF_FORCED_TF logic in handle_signal()
x86, selftests: Add single_step_syscall test
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
"This update has mostly fixes, but also other bits:
- perf tooling fixes
- PMU driver fixes
- Intel Broadwell PMU driver HW-enablement for LBR callstacks
- a late coming 'perf kmem' tool update that enables it to also
analyze page allocation data. Note, this comes with MM tracepoint
changes that we believe to not break anything: because it changes
the formerly opaque 'struct page *' field that uniquely identifies
pages to 'pfn' which identifies pages uniquely too, but isn't as
opaque and can be used for other purposes as well"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/pt: Fix and clean up error handling in pt_event_add()
perf/x86/intel: Add Broadwell support for the LBR callstack
perf/x86/intel/rapl: Fix energy counter measurements but supporing per domain energy units
perf/x86/intel: Fix Core2,Atom,NHM,WSM cycles:pp events
perf/x86: Fix hw_perf_event::flags collision
perf probe: Fix segfault when probe with lazy_line to file
perf probe: Find compilation directory path for lazy matching
perf probe: Set retprobe flag when probe in address-based alternative mode
perf kmem: Analyze page allocator events also
tracing, mm: Record pfn instead of pointer to struct page
Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Ingo Molnar:
"Two fixes: an smp-call fix and a lockdep fix"
* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smp: Fix smp_call_function_single_async() locking
lockdep: Make print_lock() robust against concurrent release
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull usernamespace mount fixes from Eric Biederman:
"Way back in October Andrey Vagin reported that umount(MNT_DETACH)
could be used to defeat MNT_LOCKED. As I worked to fix this I
discovered that combined with mount propagation and an appropriate
selection of shared subtrees a reference to a directory on an
unmounted filesystem is not necessary.
That MNT_DETACH is allowed in user namespace in a form that can break
MNT_LOCKED comes from my early misunderstanding what MNT_DETACH does.
To avoid breaking existing userspace the conflict between MNT_DETACH
and MNT_LOCKED is fixed by leaving mounts that are locked to their
parents in the mount hash table until the last reference goes away.
While investigating this issue I also found an issue with
__detach_mounts. The code was unnecessarily and incorrectly
triggering mount propagation. Resulting in too many mounts going away
when a directory is deleted, and too many cpu cycles are burned while
doing that.
Looking some more I realized that __detach_mounts by only keeping
mounts connected that were MNT_LOCKED it had the potential to still
leak information so I tweaked the code to keep everything locked
together that possibly could be.
This code was almost ready last cycle but Al invented fs_pin which
slightly simplifies this code but required rewrites and retesting, and
I have not been in top form for a while so it took me a while to get
all of that done. Similiarly this pull request is late because I have
been feeling absolutely miserable all week.
The issue of being able to escape a bind mount has not yet been
addressed, as the fixes are not yet mature"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
mnt: Update detach_mounts to leave mounts connected
mnt: Fix the error check in __detach_mounts
mnt: Honor MNT_LOCKED when detaching mounts
fs_pin: Allow for the possibility that m_list or s_list go unused.
mnt: Factor umount_mnt from umount_tree
mnt: Factor out unhash_mnt from detach_mnt and umount_tree
mnt: Fail collect_mounts when applied to unmounted mounts
mnt: Don't propagate unmounts to locked mounts
mnt: On an unmount propagate clearing of MNT_LOCKED
mnt: Delay removal from the mount hash.
mnt: Add MNT_UMOUNT flag
mnt: In umount_tree reuse mnt_list instead of mnt_hash
mnt: Don't propagate umounts in __detach_mounts
mnt: Improve the umount_tree flags
mnt: Use hlist_move_list in namespace_unlock
Merge tag 'for-f2fs-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"New features:
- in-memory extent_cache
- fs_shutdown to test power-off-recovery
- use inline_data to store symlink path
- show f2fs as a non-misc filesystem
Major fixes:
- avoid CPU stalls on sync_dirty_dir_inodes
- fix some power-off-recovery procedure
- fix handling of broken symlink correctly
- fix missing dot and dotdot made by sudden power cuts
- handle wrong data index during roll-forward recovery
- preallocate data blocks for direct_io
... and a bunch of minor bug fixes and cleanups"
* tag 'for-f2fs-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (71 commits)
f2fs: pass checkpoint reason on roll-forward recovery
f2fs: avoid abnormal behavior on broken symlink
f2fs: flush symlink path to avoid broken symlink after POR
f2fs: change 0 to false for bool type
f2fs: do not recover wrong data index
f2fs: do not increase link count during recovery
f2fs: assign parent's i_mode for empty dir
f2fs: add F2FS_INLINE_DOTS to recover missing dot dentries
f2fs: fix mismatching lock and unlock pages for roll-forward recovery
f2fs: fix sparse warnings
f2fs: limit b_size of mapped bh in f2fs_map_bh
f2fs: persist system.advise into on-disk inode
f2fs: avoid NULL pointer dereference in f2fs_xattr_advise_get
f2fs: preallocate fallocated blocks for direct IO
f2fs: enable inline data by default
f2fs: preserve extent info for extent cache
f2fs: initialize extent tree with on-disk extent info of inode
f2fs: introduce __{find,grab}_extent_tree
f2fs: split set_data_blkaddr from f2fs_update_extent_cache
f2fs: enable fast symlink by utilizing inline data
...
Merge tag 'docs-for-linus' of git://git.lwn.net/linux-2.6
Pull documentation updates from Jonathan Corbet:
"Numerous fixes, the overdue removal of the i2o docs, some new Chinese
translations, and, hopefully, the README fix that will end the flow of
identical patches to that file"
* tag 'docs-for-linus' of git://git.lwn.net/linux-2.6: (34 commits)
Documentation/memcg: update memcg/kmem status
Documentation: blackfin: Makefile: Typo building issue
Documentation/vm/pagemap.txt: correct location of page-types tool
Documentation/memory-barriers.txt: typo fix
doc: Add guest_nice column to example output of `cat /proc/stat'
Documentation/kernel-parameters: Move "eagerfpu" to its right place
Documentation: gpio: Update ACPI part of the document to mention _DSD
docs/completion.txt: Various tweaks and corrections
doc: completion: context, scope and language fixes
Documentation:Update Documentation/zh_CN/arm64/memory.txt
Documentation:Update Documentation/zh_CN/arm64/booting.txt
Documentation: Chinese translation of arm64/legacy_instructions.txt
DocBook media: fix broken EIA hyperlink
Documentation: tweak the maintainers entry
README: Change gzip/bzip2 to xz compression format
README: Update version number reference
doc:pci: Fix typo in Documentation/PCI
Documentation: drm: Use '->' when describing access through pointers.
Documentation: Remove mentioning of block barriers
Documentation/email-clients.txt: Fix one grammar mistake, add extra info about TB
...
Bluetooth: hidp: Fix regression with older userspace and flags validation
While it is not used by newer userspace anymore, the older userspace was
utilizing HIDP_VIRTUAL_CABLE_UNPLUG and HIDP_BOOT_PROTOCOL_MODE flags
when adding a new HIDP connection.
The flags validation is important, but we can not break older userspace
and with that allow providing these flags even if newer userspace does
not use them anymore.
config: Enable NEED_DMA_MAP_STATE by default when SWIOTLB is selected
A huge amount of NIC drivers use the DMA API, however if
compiled under 32-bit an very important part of the DMA API can
be ommitted leading to the drivers not working at all
(especially if used with 'swiotlb=force iommu=soft').
As Prashant Sreedharan explains it: "the driver [tg3] uses
DEFINE_DMA_UNMAP_ADDR(), dma_unmap_addr_set() to keep a copy of
the dma "mapping" and dma_unmap_addr() to get the "mapping"
value. On most of the platforms this is a no-op, but ... with
"iommu=soft and swiotlb=force" this house keeping is required,
... otherwise we pass 0 while calling pci_unmap_/pci_dma_sync_
instead of the DMA address."
As such enable this even when using 32-bit kernels.
Reported-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Prashant Sreedharan <prashant@broadcom.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Chan <mchan@broadcom.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: boris.ostrovsky@oracle.com Cc: cascardo@linux.vnet.ibm.com Cc: david.vrabel@citrix.com Cc: sanjeevb@broadcom.com Cc: siva.kallam@broadcom.com Cc: vyasevich@gmail.com Cc: xen-devel@lists.xensource.com Link: http://lkml.kernel.org/r/20150417190448.GA9462@l.oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'devicetree-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/glikely/linux
Pull devicetree changes from Grant Likely:
"Here are the devicetree changes queued up for v4.1. Nothing really
exciting here. Rob has another few commits for big-endian attached
UARTs, but those will be sent in a separate merge request since they
haven't been as thoroughly tested as this batch.
Here are the highlights:
- lots of unittest cleanup from Frank Rowand
- bugfixes and updates to the of_graph code
- tighten up of_get_mac_address() code
- documentation updates"
* tag 'devicetree-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/glikely/linux:
of/unittest: Fix of_platform_depopulate test case
of/unittest: early return from test skips tests
of/unittest: breadcrumbs to reduce pain of future maintainers
of/unittest: reduce checkpatch noise - line after declarations
of/unittest: typo in error string
of/unittest: add const where needed
of_net: factor out repetitive code from of_get_mac_address()
drivers/of: Add empty ranges quirk for PA-Semi
of: Allow selection of OF_DYNAMIC and OF_OVERLAY if OF_UNITTEST
of: Empty node & property flag accessors when !OF
of: Explicitly include linux/types.h in of_graph.h
dt-bindings: brcm: rationalize Broadcom documentation naming
of/unittest: replace 'selftest' with 'unittest'
Documentation: rename of_selftest.txt to of_unittest.txt
Documentation: update the of_selftest.txt
dt: OF_UNITTEST make dependency broken
MAINTAINERS: Pantelis Antoniou device tree overlay maintainer
of: Add of_graph_get_port_by_id function
of: Add for_each_endpoint_of_node helper macro
of: Decrement refcount of previous endpoint in of_graph_get_next_endpoint
Merge tag 'gpio-v4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO updates from Linus Walleij:
"This is the bulk of GPIO changes for the v4.1 development cycle:
- A new GPIO hogging mechanism has been added. This can be used on
boards that want to drive some GPIO line high, low, or set it as
input on boot and then never touch it again. For some embedded
systems this is bliss and simplifies things to a great extent.
- Some API cleanup and closure: gpiod_get_array() and
gpiod_put_array() has been added to get and put GPIOs in bulk as
was possible with the non-descriptor API.
- Encapsulate cross-calls to the pin control subsystem in
<linux/gpio/driver.h>. Now this should be the only header any GPIO
driver needs to include or something is wrong. Cleanups
restricting drivers to this include are welcomed if tested.
- Sort the GPIO Kconfig and split it into submenus, as it was
becoming and unstructured, illogical and unnavigatable mess. I
hope this is easier to follow. Menus that require a certain
subsystem like I2C can now be hidden nicely for example, still
working on others.
- New drivers:
- New driver for the Altera Soft GPIO.
- The F7188x driver now handles the F71869 and F71869A variants.
- The MIPS Loongson driver has been moved to drivers/gpio for
consolidation and cleanup.
- Cleanups:
- The MAX732x is converted to use the GPIOLIB_IRQCHIP
infrastructure.
- The PCF857x is converted to use the GPIOLIB_IRQCHIP
infrastructure.
- Radical cleanup of the OMAP driver.
- Misc:
- Enable the DWAPB GPIO for all architectures. This is a "hard
IP" block from Synopsys which has started to turn up in so
diverse architectures as X86 Quark, ARC and a slew of ARM
systems. So even though it's not an expander, it's generic
enough to be available for all.
- We add a mock GPIO on Crystalcove PMIC after a long discussion
with Daniel Vetter et al, tracing back to the shootout at the
kernel summit where DRM drivers and sub-componentization was
discussed. In this case a mock GPIO is assumed to be the best
compromise gaining some reuse of infrastructure without making
DRM drivers overly complex at the same time. Let's see"
* tag 'gpio-v4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (62 commits)
Revert "gpio: sch: use uapi/linux/pci_ids.h directly"
gpio: dwapb: remove dependencies
gpio: dwapb: enable for ARC
gpio: removing kfree remove functionality
gpio: mvebu: Fix mask/unmask managment per irq chip type
gpio: split GPIO drivers in submenus
gpio: move MFD GPIO drivers under their own comment
gpio: move BCM Kona Kconfig option
gpio: arrange SPI Kconfig symbols alphabetically
gpio: arrange PCI GPIO controllers alphabetically
gpio: arrange I2C Kconfig symbols alphabetically
gpio: arrange Kconfig symbols alphabetically
gpio: ich: Implement get_direction function
gpio: use (!foo) instead of (foo == NULL)
gpio: arizona: drop owner assignment from platform_drivers
gpio: max7300: remove 'ret' variable
gpio: use devm_kzalloc
gpio: sch: use uapi/linux/pci_ids.h directly
gpio: x-gene: fix devm_ioremap_resource() check
gpio: loongson: Add Loongson-3A/3B GPIO driver support
...
Merge branch 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Two buildfixes for I2C based on your tree from the day before
yesterday. Sadly, these build errors never reached me while the
patches were sitting in -next. Need to fix that"
* 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: core: Export bus recovery functions
i2c: jz4780: Fix build for m68k and sparc64
Merge tag 'dm-4.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- the most extensive changes this cycle are the DM core improvements to
add full blk-mq support to request-based DM.
- disabled by default but user can opt-in with CONFIG_DM_MQ_DEFAULT
- depends on some blk-mq changes from Jens' for-4.1/core branch so
that explains why this pull is built on linux-block.git
- update DM to use name_to_dev_t() rather than open-coding a less
capable device parser.
- includes a couple small improvements to name_to_dev_t() that offer
stricter constraints that DM's code provided.
- improvements to the dm-cache "mq" cache replacement policy.
- a DM crypt crypt_ctr() error path fix and an async crypto deadlock
fix
- a small efficiency improvement for DM crypt decryption by leveraging
immutable biovecs
- add error handling modes for corrupted blocks to DM verity
- a new "log-writes" DM target from Josef Bacik that is meant for file
system developers to test file system integrity at particular points
in the life of a file system
- a few DM log userspace cleanups and fixes
- a few Documentation fixes (for thin, cache, crypt and switch)
* tag 'dm-4.1-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (34 commits)
dm crypt: fix missing error code return from crypt_ctr error path
dm crypt: fix deadlock when async crypto algorithm returns -EBUSY
dm crypt: leverage immutable biovecs when decrypting on read
dm crypt: update URLs to new cryptsetup project page
dm: add log writes target
dm table: use bool function return values of true/false not 1/0
dm verity: add error handling modes for corrupted blocks
dm thin: remove stale 'trim' message documentation
dm delay: use msecs_to_jiffies for time conversion
dm log userspace base: fix compile warning
dm log userspace transfer: match wait_for_completion_timeout return type
dm table: fall back to getting device using name_to_dev_t()
init: stricter checking of major:minor root= values
init: export name_to_dev_t and mark name argument as const
dm: add 'use_blk_mq' module param and expose in per-device ro sysfs attr
dm: optimize dm_mq_queue_rq to _not_ use kthread if using pure blk-mq
dm: add full blk-mq support to request-based DM
dm: impose configurable deadline for dm_request_fn's merge heuristic
dm sysfs: introduce ability to add writable attributes
dm: don't start current request if it would've merged with the previous
...
perf/x86/intel/pt: Fix and clean up error handling in pt_event_add()
Dan Carpenter reported that pt_event_add() has buggy
error handling logic: it returns 0 instead of -EBUSY when
it fails to start a newly added event.
Furthermore, the control flow in this function is messy,
with cleanup labels mixed with direct returns.
Fix the bug and clean up the code by converting it to
a straight fast path for the regular non-failing case,
plus a clear sequence of cascading goto labels to do
all cleanup.
NOTE: I materially changed the existing clean up logic in the
pt_event_start() failure case to use the direct
perf_aux_output_end() path, not pt_event_del(), because
perf_aux_output_end() is enough here.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Julia Lawall <julia.lawall@lip6.fr> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150416103830.GB7847@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
1) Fix verifier memory corruption and other bugs in BPF layer, from
Alexei Starovoitov.
2) Add a conservative fix for doing BPF properly in the BPF classifier
of the packet scheduler on ingress. Also from Alexei.
3) The SKB scrubber should not clear out the packet MARK and security
label, from Herbert Xu.
4) Fix oops on rmmod in stmmac driver, from Bryan O'Donoghue.
5) Pause handling is not correct in the stmmac driver because it
doesn't take into consideration the RX and TX fifo sizes. From
Vince Bridgers.
6) Failure path missing unlock in FOU driver, from Wang Cong.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
net: dsa: use DEVICE_ATTR_RW to declare temp1_max
netns: remove BUG_ONs from net_generic()
IB/ipoib: Fix ndo_get_iflink
sfc: Fix memcpy() with const destination compiler warning.
altera tse: Fix network-delays and -retransmissions after high throughput.
net: remove unused 'dev' argument from netif_needs_gso()
act_mirred: Fix bogus header when redirecting from VLAN
inet_diag: fix access to tcp cc information
tcp: tcp_get_info() should fetch socket fields once
net: dsa: mv88e6xxx: Add missing initialization in mv88e6xxx_set_port_state()
skbuff: Do not scrub skb mark within the same name space
Revert "net: Reset secmark when scrubbing packet"
bpf: fix two bugs in verification logic when accessing 'ctx' pointer
bpf: fix bpf helpers to use skb->mac_header relative offsets
stmmac: Configure Flow Control to work correctly based on rxfifo size
stmmac: Enable unicast pause frame detect in GMAC Register 6
stmmac: Read tx-fifo-depth and rx-fifo-depth from the devicetree
stmmac: Add defines and documentation for enabling flow control
stmmac: Add properties for transmit and receive fifo sizes
stmmac: fix oops on rmmod after assigning ip addr
...
Pull sparc updates from David Miller:
"The PowerPC folks have a really nice scalable IOMMU pool allocator
that we wanted to make use of for sparc. So here we have a series
that abstracts out their code into a common layer that anyone can make
use of.
Sparc is converted, and the PowerPC folks have reviewed and ACK'd this
series and plan to convert PowerPC over as well"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
iommu-common: Fix PARISC compile-time warnings
sparc: Make LDC use common iommu poll management functions
sparc: Make sparc64 use scalable lib/iommu-common.c functions
sparc: Break up monolithic iommu table/lock into finer graularity pools and lock