Rik van Riel [Sat, 27 Oct 2012 16:12:11 +0000 (12:12 -0400)]
x86/mm: Completely drop the TLB flush from ptep_set_access_flags()
Intel has an architectural guarantee that the TLB entry causing
a page fault gets invalidated automatically. This means
we should be able to drop the local TLB invalidation.
Because of the way other areas of the page fault code work,
chances are good that all x86 CPUs do this. However, if
someone somewhere has an x86 CPU that does not invalidate
the TLB entry causing a page fault, this one-liner should
be easy to revert - or a CPU model specific quirk could
be added to retain this optimization on most CPUs.
Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Linus Torvalds <torvalds@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michel Lespinasse <walken@google.com>
[ Applied changelog massage and moved this last in the series,
to create bisection distance. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Mon, 22 Oct 2012 18:15:40 +0000 (20:15 +0200)]
sched, numa, mm: Implement slow start for working set sampling
Add a 1 second delay before starting to scan the working set of
a task and starting to balance it amongst nodes.
[ note that before the constant per task WSS sampling rate patch
the initial scan would happen much later still, in effect that
patch caused this regression. ]
The theory is that short-run tasks benefit very little from NUMA
placement: they come and go, and they had better stick to the node
they were started on. As tasks mature and rebalance to other CPUs
and nodes, their NUMA placement has to change too, and it starts
to matter more and more.
In practice this change fixes an observable kbuild regression:
# [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]
!NUMA:
45.291088843 seconds time elapsed ( +- 0.40% )
45.154231752 seconds time elapsed ( +- 0.36% )
+NUMA, no slow start:
46.172308123 seconds time elapsed ( +- 0.30% )
46.343168745 seconds time elapsed ( +- 0.25% )
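As a rough reading of the figures above, the size of the regression can be estimated by comparing the mean of the two runs in each configuration. The snippet is plain arithmetic on the numbers quoted in this changelog; the variable names are illustrative only.

```python
# Elapsed times in seconds, copied from the perf stat output quoted above.
no_numa = [45.291088843, 45.154231752]
numa_no_slow_start = [46.172308123, 46.343168745]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


baseline = mean(no_numa)
regressed = mean(numa_no_slow_start)

# Relative slowdown of "+NUMA, no slow start" versus "!NUMA", in percent.
slowdown_pct = (regressed - baseline) / baseline * 100.0

print(f"baseline {baseline:.3f}s, regressed {regressed:.3f}s, "
      f"slowdown {slowdown_pct:.2f}%")
```

The slowdown comes out at a little over 2%, which is well outside the +- 0.25% to 0.40% run-to-run variation reported by perf stat.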
The implementation is simple and straightforward, most of the patch
deals with adding the /proc/sys/kernel/sched_numa_scan_delay_ms tunable
knob.
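The effect of the delay can be illustrated with a toy model. This is a sketch, not the kernel code: it assumes the delay is counted against the task's lifetime, and the 100 ms scan period and the function name are hypothetical choices made for the example.

```python
SCAN_DELAY_MS = 1000  # models the 1 second default described above


def scans_performed(lifetime_ms, scan_period_ms=100,
                    scan_delay_ms=SCAN_DELAY_MS):
    """Count the working-set scans a task would trigger in this toy model.

    No scan happens before the task has lived for scan_delay_ms; after
    that, one scan happens immediately and then one per scan period.
    """
    if lifetime_ms < scan_delay_ms:
        return 0
    return 1 + (lifetime_ms - scan_delay_ms) // scan_period_ms


# A short-lived compiler process is never scanned, a long-running one is.
print(scans_performed(300))    # short-run task
print(scans_performed(5000))   # long-running task
```

Under these assumptions the many short-lived processes of a kernel build never pay for a scan, which matches the motivation given in the changelog.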
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Link: http://lkml.kernel.org/n/tip-vn7p3ynbwqt3qqewhdlvjltc@git.kernel.org
[ Wrote the changelog, ran measurements, tuned the default. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Sun, 14 Oct 2012 14:59:13 +0000 (16:59 +0200)]
sched, numa, mm: Implement constant, per task Working Set Sampling (WSS) rate
Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.
That method has various (obvious) disadvantages:
- it samples the working set at dissimilar rates,
giving some tasks a sampling quality advantage
over others.
- creates performance problems for tasks with very
large working sets
- over-samples processes with large address spaces but
which only very rarely execute
Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advance it at a constant rate (in a manner proportional
to the CPU cycles executed). If the offset reaches the last mapped
address of the mm then it starts over at the first
address.
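The rotating offset can be sketched as follows. This is a simplified model under stated assumptions, not the patch itself: the address space is a sorted list of non-overlapping (start, end) ranges, and scan_step() and its parameters are hypothetical names.

```python
def scan_step(vmas, offset, budget):
    """Mark up to `budget` bytes of mapped memory, starting at `offset`.

    vmas   -- sorted, non-overlapping (start, end) ranges, end exclusive
    offset -- where the previous scan stopped
    Returns (marked_ranges, new_offset).
    """
    total = sum(end - start for start, end in vmas)
    budget = min(budget, total)  # never mark the same byte twice per step
    marked = []
    while budget > 0:
        # First mapping that still has bytes at or after the offset.
        vma = next(((s, e) for s, e in vmas if e > offset), None)
        if vma is None:
            # Past the last mapped address: start over at the first one.
            offset = vmas[0][0]
            continue
        start = max(offset, vma[0])
        end = min(vma[1], start + budget)
        marked.append((start, end))
        budget -= end - start
        offset = end
    return marked, offset


# Two mappings with a hole in between; each step scans a fixed amount.
mappings = [(10, 30), (80, 90)]
position = 0
for _ in range(3):
    ranges, position = scan_step(mappings, position, 10)
    print(ranges, position)
```

Because every step marks the same amount of memory regardless of how large the address space is, a task with a huge mapping is not penalised more per step than a small one.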
The per-task nature of the working set sampling functionality
in this tree allows such constant rate, per task,
execution-weight proportional sampling of the working set,
with an adaptive sampling interval/frequency that goes from
once per 100 msecs up to just once per 1.6 seconds.
The current sampling volume is 256 MB per interval.
As tasks mature and converge their working set, so does the
sampling rate slow down to just a trickle, 256 MB per 1.6
seconds of CPU time executed.
This, beyond being adaptive, also rate-limits rarely
executing tasks and does not over-sample on overloaded
systems.
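The numbers above can be turned into a small worked example. The 100 ms and 1.6 second bounds and the 256 MB volume come from this changelog; the doubling step and the reset on non-convergence are assumptions made for illustration, since the text does not say how the interval adapts between the two bounds.

```python
SCAN_PERIOD_MIN_MS = 100    # fastest sampling interval quoted above
SCAN_PERIOD_MAX_MS = 1600   # slowest sampling interval quoted above
SCAN_SIZE_MB = 256          # sampling volume per interval quoted above


def next_period(period_ms, converged):
    """Assumed policy: back off by doubling while the working set is
    stable, and return to the fastest rate when it is not."""
    if not converged:
        return SCAN_PERIOD_MIN_MS
    return min(period_ms * 2, SCAN_PERIOD_MAX_MS)


def sampling_rate_mb_per_sec(period_ms):
    """Sampled volume per second of CPU time executed by the task."""
    return SCAN_SIZE_MB * 1000.0 / period_ms


period = SCAN_PERIOD_MIN_MS
for _ in range(6):
    print(period, "ms ->", sampling_rate_mb_per_sec(period), "MB/s")
    period = next_period(period, converged=True)
```

With these bounds the sampling rate ranges from 2560 MB to 160 MB per second of CPU time executed, a factor of 16 between a fresh task and a converged one.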
[ In AutoNUMA speak, this patch deals with the effective sampling
rate of the 'hinting page fault'. AutoNUMA's scanning is
currently rate-limited, but it is also fundamentally
single-threaded, executing in the knuma_scand kernel thread,
so the limit in AutoNUMA is global and does not scale up with
the number of CPUs, nor does it scan tasks in an execution
proportional manner.
So the idea of rate-limiting the scanning was first implemented
in the AutoNUMA tree via a global rate limit. This patch goes
beyond that by implementing an execution rate proportional
working set sampling rate that is not implemented via a single
global scanning daemon. ]
[ Dan Carpenter pointed out a possible NULL pointer dereference in the
first version of this patch. ]
Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com> Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Link: http://lkml.kernel.org/n/tip-wt5b48o2226ec63784i58s3j@git.kernel.org
[ Wrote changelog and fixed bug. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rik van Riel [Thu, 18 Oct 2012 21:19:28 +0000 (17:19 -0400)]
sched, numa, mm: Add credits for NUMA placement
The NUMA placement code has been rewritten several times, but
the basic ideas took a lot of work to develop. The people who
put in the work deserve credit for it. Thanks Andrea & Peter :)
[ The Documentation/scheduler/numa-problem.txt file should
probably be rewritten once we figure out the final details of
what the NUMA code needs to do, and why. ]
Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: aarcange@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20121018171928.24d06af4@cuia.bos.redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
----
This is against tip.git numa/core
Peter Zijlstra [Tue, 9 Oct 2012 11:46:22 +0000 (13:46 +0200)]
sched, numa, mm: Add fault driven placement and migration policy
As per the problem/design document Documentation/scheduler/numa-problem.txt
implement 3ac & 4.
( A pure 3a was found too unstable, I did briefly try 3bc
but found no significant improvement. )
Implement a per-task memory placement scheme relying on a regular
PROT_NONE 'migration' fault to scan the memory space of the process
and use a two stage migration scheme to reduce the influence of
unlikely usage relations.
It relies on the assumption that the compute part is tied to a
particular task and builds a task<->page relation set to model the
compute<->data relation.
In the previous patch we made memory migrate towards where the task
is running, here we select the node on which most memory is located
as the preferred node to run on.
This creates a feed-back control loop between trying to schedule a
task on a node and migrating memory towards the node the task is
scheduled on.
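The two halves of the feedback loop can be sketched in a few lines. This is a toy model: reading "two stage" as "migrate only on the second consecutive fault from the same node" is an assumption, and Page, should_migrate() and preferred_node() are hypothetical names.

```python
from collections import Counter


class Page:
    def __init__(self, nid):
        self.nid = nid        # node the page currently lives on
        self.last_nid = None  # node that took the previous hinting fault


def should_migrate(page, faulting_nid):
    """Two-stage filter (assumed semantics): a single stray access from
    another node is ignored; a repeated one triggers migration."""
    previous = page.last_nid
    page.last_nid = faulting_nid
    return page.nid != faulting_nid and previous == faulting_nid


def preferred_node(pages):
    """Node holding most of the task's memory; lowest id wins a tie."""
    counts = Counter(page.nid for page in pages)
    if not counts:
        return None
    return max(sorted(counts), key=counts.__getitem__)


page = Page(nid=0)
print(should_migrate(page, 1))  # first remote fault: recorded only
print(should_migrate(page, 1))  # second fault from node 1: migrate
print(preferred_node([Page(0), Page(1), Page(1)]))
```

Memory follows the task through should_migrate(), and the task follows its memory through preferred_node(), which is the control loop described above.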
Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Suggested-by: Rik van Riel <riel@redhat.com> Fixes-by: David Rientjes <rientjes@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-8ejt0ioj62k5ruf5zd2ix9zu@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Mon, 13 Aug 2012 13:22:20 +0000 (15:22 +0200)]
sched, numa, mm: Introduce last_nid in the pageframe
Introduce a per-page last_nid field, fold this into the struct
page::flags field whenever possible.
The unlikely/rare 32bit NUMA configs will likely grow the page-frame.
Completely dropping 32bit support for CONFIG_SCHED_NUMA would simplify
things, but it would also remove the warning if we grow enough 64bit
only page-flags to push the last-nid out.
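Folding a small field into a flags word is ordinary bit packing. The sketch below uses a hypothetical layout: the shift, the field width and the bit counts passed to fits_in_flags() are made-up values for illustration, not the kernel's actual page-flags layout.

```python
LAST_NID_WIDTH = 6   # hypothetical: enough bits for 64 nodes
LAST_NID_SHIFT = 40  # hypothetical position inside the flags word
LAST_NID_MASK = (1 << LAST_NID_WIDTH) - 1


def set_last_nid(flags, nid):
    """Return flags with the last_nid field replaced by nid."""
    if not 0 <= nid <= LAST_NID_MASK:
        raise ValueError("node id does not fit in the last_nid field")
    flags &= ~(LAST_NID_MASK << LAST_NID_SHIFT)
    return flags | (nid << LAST_NID_SHIFT)


def get_last_nid(flags):
    """Extract the last_nid field from flags."""
    return (flags >> LAST_NID_SHIFT) & LAST_NID_MASK


def fits_in_flags(used_bits, field_width, word_bits):
    """True if the field can share the flags word with the bits in use."""
    return used_bits + field_width <= word_bits


flags = set_last_nid(0xFF, 5)
print(hex(flags), get_last_nid(flags))
print(fits_in_flags(30, LAST_NID_WIDTH, 32))  # cramped 32-bit case
```

When fits_in_flags() is false the field needs its own storage, which is the case where the changelog expects the page-frame to grow.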
Suggested-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Link: http://lkml.kernel.org/n/tip-0uois4f9skfw9mwyk1yoy0jq@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Sat, 3 Mar 2012 15:56:25 +0000 (16:56 +0100)]
sched, numa, mm: Implement home-node awareness
Implement home node preference in the scheduler's load-balancer.
This is done in four pieces:
- task_numa_hot(); make it harder to migrate tasks away from their
home-node, controlled using the NUMA_HOT feature flag.
- select_task_rq_fair(); prefer placing the task in its home-node,
controlled using the NUMA_TTWU_BIAS feature flag. Disabled by
default because we found this to be far too aggressive.
- load_balance(); during the regular pull load-balance pass, try
pulling tasks that are on the wrong node first with a preference
of moving them nearer to their home-node through task_numa_hot(),
controlled through the NUMA_PULL feature flag.
- load_balance(); when the balancer finds no imbalance, introduce
some such that it still prefers to move tasks towards their
home-node, using active load-balance if needed, controlled through
the NUMA_PULL_BIAS feature flag.
In particular, only introduce this BIAS if the system is otherwise
properly (weight) balanced and we either have an offnode or !numa
task to trade for it.
In order to easily find off-node tasks, split the per-cpu task list
into two parts.
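The split task list can be pictured with a small model. This is only a sketch: RunQueue, Task and pull_candidates() are hypothetical names, and keeping tasks without a home node on the on-node list is an assumption.

```python
class Task:
    def __init__(self, name, home_node):
        self.name = name
        self.home_node = home_node  # None models a task with no home node


class RunQueue:
    """Per-CPU queue that keeps off-node tasks on a separate list."""

    def __init__(self, node):
        self.node = node
        self.onnode = []   # tasks at home here, or with no home node
        self.offnode = []  # tasks running away from their home node

    def enqueue(self, task):
        if task.home_node is not None and task.home_node != self.node:
            self.offnode.append(task)
        else:
            self.onnode.append(task)

    def pull_candidates(self):
        """Order in which a balancer would consider tasks for migration:
        tasks on the wrong node are offered first."""
        return self.offnode + self.onnode


rq = RunQueue(node=0)
rq.enqueue(Task("a", home_node=0))
rq.enqueue(Task("b", home_node=1))
rq.enqueue(Task("c", home_node=None))
print([task.name for task in rq.pull_candidates()])
```

Keeping the two groups apart means the balancer does not have to walk every task on the queue to find one that would benefit from being moved.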
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-g0gzzzjrvrzxl6kgwnil1e97@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Sat, 3 Mar 2012 16:05:16 +0000 (17:05 +0100)]
sched, numa, mm: Introduce tsk_home_node()
Introduce the home-node concept for tasks. In order to keep memory
locality we need to have something to stay local to, so we define the
home-node of a task as the node we prefer to allocate memory from and
prefer to execute on.
These are not hard guarantees, merely soft preferences. This allows for
optimal resource usage: we can run a task away from the home-node, since
the remote memory hit -- while expensive -- is less expensive than not
running at all, or very little, due to severe cpu overload.
Similarly, we can allocate memory from another node if our home-node
is depleted, again, some memory is better than no memory.
This patch merely introduces the basic infrastructure, all policy
comes later.
NOTE: we introduce the concept of EMBEDDED_NUMA, these are
architectures where the memory access cost doesn't depend on the cpu
but purely on the physical address -- embedded boards with cheap
(slow) and expensive (fast) memory banks.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-ii8j8cp87cgctecfqp2ib6rn@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Tue, 17 Jul 2012 20:54:51 +0000 (22:54 +0200)]
mm/mpol: Use special PROT_NONE to migrate pages
Combine our previous PROT_NONE, mpol_misplaced and
migrate_misplaced_page() pieces into an effective migrate on fault
scheme.
Note that (on x86) we rely on PROT_NONE pages being !present and avoid
the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves
the page-migration performance.
Suggested-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Paul Turner <pjt@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Link: http://lkml.kernel.org/n/tip-e98gyl8kr9jzooh2s4piuils@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
numa, mm: Support NUMA hinting page faults from gup/gup_fast
Introduce FOLL_NUMA to tell follow_page to check
pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
so because it always invokes handle_mm_fault and retries the
follow_page later.
KVM secondary MMU page faults will trigger the NUMA hinting page
faults through gup_fast -> get_user_pages -> follow_page ->
handle_mm_fault.
Other follow_page callers like KSM should not use FOLL_NUMA, or they
would fail to get the pages if they use follow_page instead of
get_user_pages.
[ This patch was picked up from the AutoNUMA tree. ]
Originally-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com>
[ ported to this tree. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
Lee Schermerhorn [Thu, 12 Jan 2012 11:37:17 +0000 (12:37 +0100)]
mm/mpol: Add MPOL_MF_LAZY
This patch adds another mbind() flag to request "lazy migration". The
flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
pages are marked PROT_NONE. The pages will be migrated in the fault
path on "first touch", if the policy dictates at that time.
"Lazy Migration" will allow testing of migrate-on-fault via mbind().
Also allows applications to specify that only subsequently touched
pages be migrated to obey new policy, instead of all pages in range.
This can be useful for multi-threaded applications working on a
large shared data area that is initialized by an initial thread
resulting in all pages on one [or a few, if overflowed] nodes.
After PROT_NONE, the pages in regions assigned to the worker threads
will be automatically migrated local to the threads on 1st touch.
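The first-touch behaviour can be modelled in a few lines. This is a toy sketch, not the mbind() implementation: it assumes a policy that places a page on the node of the CPU touching it, and LazyPage, mark_lazy() and touch() are hypothetical names.

```python
class LazyPage:
    def __init__(self, nid):
        self.nid = nid
        self.prot_none = False  # True while the page awaits a first touch


def mark_lazy(pages):
    """Model the lazy request: nothing moves yet, pages are only marked."""
    for page in pages:
        page.prot_none = True


def touch(page, cpu_nid):
    """Model the fault on first touch. Returns True if the page migrated."""
    if not page.prot_none:
        return False
    page.prot_none = False
    if page.nid == cpu_nid:
        return False
    page.nid = cpu_nid
    return True


# An initial thread on node 0 populated the area; workers run on 0 and 1.
area = [LazyPage(0) for _ in range(4)]
mark_lazy(area)
moved = [touch(area[0], 0), touch(area[1], 0),
         touch(area[2], 1), touch(area[3], 1)]
print(moved, [page.nid for page in area])
```

Only the pages that a remote worker actually touches are moved, which is the difference from migrating every page in the range up front.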
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org>
[ nearly complete rewrite.. ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-7rsodo9x8zvm5awru5o7zo0y@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Tue, 17 Jul 2012 16:25:14 +0000 (18:25 +0200)]
mm/mpol: Create special PROT_NONE infrastructure
In order to facilitate a lazy -- fault driven -- migration of pages,
create a special transient PROT_NONE variant, we can then use the
'spurious' protection faults to drive our migrations from.
Pages that already had an effective PROT_NONE mapping will not
be detected to generate these 'spurious' faults for the simple reason
that we cannot distinguish them on their protection bits, see
pte_numa().
This isn't a problem since PROT_NONE (and possibly PROT_WRITE with
dirty tracking) aren't used or are rare enough for us to not care
about their placement.
Suggested-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Paul Turner <pjt@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Link: http://lkml.kernel.org/n/tip-0g5k80y4df8l83lha9j75xph@git.kernel.org
[ fixed various cross-arch and THP/!THP details ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
Lee Schermerhorn [Wed, 11 Jan 2012 14:48:13 +0000 (15:48 +0100)]
mm/mpol: Check for misplaced page
This patch provides a new function to test whether a page resides
on a node that is appropriate for the mempolicy for the vma and
address where the page is supposed to be mapped. This involves
looking up the node where the page belongs. So, the function
returns that node so that it may be used to allocate the page
without consulting the policy again.
A subsequent patch will call this function from the fault path.
Because of this, I don't want to go ahead and allocate the page, e.g.,
via alloc_page_vma() only to have to free it if it has the correct
policy. So, I just mimic the alloc_page_vma() node computation
logic--sort of.
Note: we could use this function to implement a MPOL_MF_STRICT
behavior when migrating pages to match mbind() mempolicy--e.g.,
to ensure that pages in an interleaved range are reinterleaved
rather than left where they are when they reside on any node in
the interleave nodemask.
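The shape of such a check can be sketched as below. This is a simplified model with hypothetical names: only two policies are handled, the policy is passed as a plain string, and -1 is used here as the "not misplaced" value.

```python
NOT_MISPLACED = -1


def misplaced_target(page_nid, policy, nodes):
    """Return the node the page should live on, or NOT_MISPLACED.

    policy -- "preferred" (nodes[0] is the wanted node) or
              "bind" (any node in `nodes` is acceptable)
    """
    if policy == "preferred":
        target = nodes[0]
    elif policy == "bind":
        if page_nid in nodes:
            return NOT_MISPLACED
        target = min(nodes)  # arbitrary pick for the sketch
    else:
        raise ValueError("policy not modelled in this sketch")
    return NOT_MISPLACED if page_nid == target else target


print(misplaced_target(0, "preferred", [1]))  # should move to node 1
print(misplaced_target(1, "bind", [0, 1]))    # already acceptable
```

Returning the target node together with the verdict lets the caller allocate the destination page without evaluating the policy a second time, as the changelog notes.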
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org>
[ Added MPOL_F_LAZY to trigger migrate-on-fault;
simplified code now that we don't have to bother
with special crap for interleaved ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-z3mgep4tgrc08o07vl1ahb2m@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
Lee Schermerhorn [Mon, 16 Jan 2012 13:43:29 +0000 (14:43 +0100)]
mm/mpol: Add MPOL_MF_NOOP
This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy
to mbind(). When the NOOP policy is used with the MOVE and LAZY
flags, mbind() will map the pages PROT_NONE so that they will be
migrated on the next touch.
This allows an application to prepare for a new phase of operation
where different regions of shared storage will be assigned to
worker threads, w/o changing policy. Note that we could just use
"default" policy in this case. However, this also allows an
application to request that pages be migrated, only if necessary,
to follow any arbitrary policy that might currently apply to a
range of pages, without knowing the policy, or without specifying
multiple mbind()s for ranges with different policies.
[ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ]
Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@kernel.org>
mm/pgprot: Move the pgprot_modify() fallback definition to mm.h
pgprot_modify() is available on x86, but on other architectures it only
gets defined in mm/mprotect.c - breaking the build if anything outside
of mprotect.c tries to make use of this function.
Move it to the generic pgprot area in mm.h, so that an upcoming patch
can make use of it.
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rik van Riel <riel@redhat.com> Cc: Paul Turner <pjt@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-nfvarGMj9gjavowroorkizb4@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rik van Riel [Tue, 9 Oct 2012 13:31:59 +0000 (15:31 +0200)]
mm: Only flush the TLB when clearing an accessible pte
If ptep_clear_flush() is called to clear a page table entry that is
not accessible by the CPU anyway, e.g. a _PAGE_PROTNONE page table entry,
there is no need to flush the TLB on remote CPUs.
Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-vm3rkzevahelwhejx5uwm8ex@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rik van Riel [Tue, 9 Oct 2012 13:31:12 +0000 (15:31 +0200)]
x86/mm: Introduce pte_accessible()
We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that
the pte is associated with a page.
However, for TLB flushing purposes, we would like to know whether the pte
points to an actually accessible page. This allows us to skip remote TLB
flushes for pages that are not actually accessible.
Fill in this method for x86 and provide a safe (but slower) method
on other architectures.
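The distinction between the two predicates can be shown with plain bit tests. This is a sketch: the bit positions are illustrative, and the always-true fallback is an assumption about what "safe (but slower)" means here.

```python
PAGE_PRESENT = 1 << 0   # hardware can translate through this entry
PAGE_PROTNONE = 1 << 8  # illustrative position for the software bit


def pte_present(pte):
    """A page is associated with the entry, accessible or not."""
    return bool(pte & (PAGE_PRESENT | PAGE_PROTNONE))


def pte_accessible(pte):
    """The CPU could have cached a translation for this entry."""
    return bool(pte & PAGE_PRESENT)


def pte_accessible_fallback(pte):
    """Assumed generic fallback: claim every entry is accessible, so the
    caller always flushes. Correct everywhere, but never skips a flush."""
    return True


def needs_remote_flush(old_pte, accessible=pte_accessible):
    """Only entries the hardware could have cached need a remote flush."""
    return accessible(old_pte)


print(needs_remote_flush(PAGE_PRESENT))   # normal mapping
print(needs_remote_flush(PAGE_PROTNONE))  # PROT_NONE mapping
```

An entry with only the PROTNONE bit set is present but not accessible, so no CPU can hold a TLB entry for it and the remote flush can be skipped.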
Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Fixed-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-66p11te4uj23gevgh4j987ip@git.kernel.org
[ Added Linus's review fixes. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijlstra [Fri, 12 Oct 2012 10:13:10 +0000 (12:13 +0200)]
sched, numa, mm: Describe the NUMA scheduling problem formally
This is probably a first: formal description of a complex high-level
computing problem, within the kernel source.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Mike Galbraith <efault@gmx.de>
Rik van Riel <riel@redhat.com> Link: http://lkml.kernel.org/n/tip-mmnlpupoetcatimvjEld16Pb@git.kernel.org
[ Next step: generate the kernel source from such formal descriptions and retire to a tropical island! ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rik van Riel [Sat, 27 Oct 2012 16:12:11 +0000 (12:12 -0400)]
mm/generic: Only flush the local TLB in ptep_set_access_flags()
The function ptep_set_access_flags() is only ever used to upgrade
access permissions to a page - i.e. to make them less restrictive.
That means the only negative side effect of not flushing remote
TLBs in this function is that other CPUs may incur spurious page
faults, if they happen to access the same address, and still have
a PTE with the old permissions cached in their TLB caches.
Having another CPU maybe incur a spurious page fault is faster
than always incurring the cost of a remote TLB flush, so replace
the remote TLB flush with a purely local one.
This should be safe on every architecture that correctly
implements flush_tlb_fix_spurious_fault() to actually invalidate
the local TLB entry that caused a page fault, as well as on
architectures where the hardware invalidates TLB entries that
cause page faults.
In the unlikely event that you are hitting what appears to be
an infinite loop of page faults, and 'git bisect' took you to
this changeset, your architecture needs to implement
flush_tlb_fix_spurious_fault() to actually flush the TLB entry.
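The trade-off can be illustrated with a toy model of two CPUs, each with its own TLB. This is a sketch under simplifying assumptions: a page table maps an address to a single "writable" flag, and CPU and write_access() are hypothetical names.

```python
class CPU:
    def __init__(self):
        self.tlb = {}  # addr -> cached "writable" flag


def write_access(cpu, page_table, addr):
    """Write to addr on this CPU; return the number of faults taken."""
    faults = 0
    while True:
        if addr not in cpu.tlb:
            cpu.tlb[addr] = page_table[addr]  # TLB miss: walk page table
        if cpu.tlb[addr]:
            return faults
        faults += 1
        # The fault handler upgrades the permission (a no-op if another
        # CPU already did so) and invalidates the local entry only.
        page_table[addr] = True
        del cpu.tlb[addr]


page_table = {0x1000: False}
cpu0, cpu1 = CPU(), CPU()
cpu0.tlb[0x1000] = False  # both CPUs cached the read-only entry
cpu1.tlb[0x1000] = False

print(write_access(cpu0, page_table, 0x1000))  # real fault on cpu0
print(write_access(cpu1, page_table, 0x1000))  # spurious fault on cpu1
print(write_access(cpu1, page_table, 0x1000))  # no further faults
```

In this model the second CPU pays for one spurious fault instead of every upgrade paying for a remote flush. If the stale entry were never invalidated on fault, the loop would not terminate, which is the infinite-loop symptom described above.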
Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michel Lespinasse <walken@google.com>
[ Changelog massage. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull networking fixes from David Miller:
"This is what we usually expect at this stage of the game, lots of
little things, mostly in drivers. With the occasional 'oops didn't
mean to do that' kind of regressions in the core code."
1) Uninitialized data in __ip_vs_get_timeouts(), from Arnd Bergmann
2) Reject invalid ACK sequences in Fast Open sockets, from Jerry Chu.
3) Lost error code on return from _rtl_usb_receive(), from Christian
Lamparter.
4) Fix reset resume on USB rt2x00, from Stanislaw Gruszka.
5) Release resources on error in pch_gbe driver, from Veaceslav Falico.
6) Default hop limit not set correctly in ip6_template_metrics[], fix
from Li RongQing.
7) Gianfar PTP code requests wrong kind of resource during probe, fix
from Wei Yang.
8) Fix VHOST net driver on big-endian, from Michael S Tsirkin.
9) Mellanox driver bug fixes from Jack Morgenstein, Or Gerlitz, Moni
Shoua, Dotan Barak, and Uri Habusha.
10) usbnet leaks memory on TX path, fix from Hemant Kumar.
11) Use socket state test, rather than presence of FIN bit packet, to
determine FIONREAD/SIOCINQ value. Fix from Eric Dumazet.
12) Fix cxgb4 build failure, from Vipul Pandya.
13) Provide a SYN_DATA_ACKED state to complement SYN_FASTOPEN in socket
info dumps. From Yuchung Cheng.
14) Fix leak of security path in kfree_skb_partial(). Fix from Eric
Dumazet.
15) Handle RX FIFO overflows more resiliently in pch_gbe driver, from
Veaceslav Falico.
16) Fix MAINTAINERS file pattern for networking drivers, from Jean
Delvare.
17) Add iPhone5 IDs to IPHETH driver, from Jay Purohit.
18) VLAN device type change restriction is too strict, and should not
trigger for the automatically generated vlan0 device. Fix from Jiri
Pirko.
19) Make PMTU/redirect flushing work properly again in ipv4, from
Steffen Klassert.
20) Fix memory corruptions by using kfree_rcu() in netlink_release().
From Eric Dumazet.
21) More qmi_wwan device IDs, from Bjørn Mork.
22) Fix unintentional change of SNAT/DNAT hooks in generic NAT
infrastructure, from Elison Niven.
23) Fix 3.6.x regression in xt_TEE netfilter module, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (57 commits)
tilegx: fix some issues in the SW TSO support
qmi_wwan/cdc_ether: move Novatel 551 and E362 to qmi_wwan
net: usb: Fix memory leak on Tx data path
net/mlx4_core: Unmap UAR also in the case of error flow
net/mlx4_en: Don't use vlan tag value as an indication for vlan presence
net/mlx4_en: Fix double-release-range in tx-rings
bas_gigaset: fix pre_reset handling
vhost: fix mergeable bufs on BE hosts
gianfar_ptp: use iomem, not ioports resource tree in probe
ipv6: Set default hoplimit as zero.
NET_VENDOR_TI: make available for am33xx as well
pch_gbe: fix error handling in pch_gbe_up()
b43: Fix oops on unload when firmware not found
mwifiex: clean up scan state on error
mwifiex: return -EBUSY if specific scan request cannot be honored
brcmfmac: fix potential NULL dereference
Revert "ath9k_hw: Updated AR9003 tx gain table for 5GHz"
ath9k_htc: Add PID/VID for a Ubiquiti WiFiStation
rt2x00: usb: fix reset resume
rtlwifi: pass rx setup error code to caller
...
Linus Torvalds [Fri, 26 Oct 2012 21:59:01 +0000 (14:59 -0700)]
Merge branch 'fixes' of git://git.infradead.org/users/vkoul/slave-dma
Pull slave-dmaengine fixes from Vinod Koul:
"Three fixes for slave dmaengine.
Two are for typo omissions in sirf dmaengine driver and the last one
is for the imx driver fixing a missing unlock"
* 'fixes' of git://git.infradead.org/users/vkoul/slave-dma:
dmaengine: sirf: fix a typo in moving running dma_desc to active queue
dmaengine: sirf: fix a typo in dma_prep_interleaved
dmaengine: imx-dma: fix missing unlock on error in imxdma_xfer_desc()
Linus Torvalds [Fri, 26 Oct 2012 21:23:35 +0000 (14:23 -0700)]
Merge tag 'pm+acpi-for-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI fixes from Rafael J Wysocki:
- Fix for a memory leak in acpi_bind_one() from Jesper Juhl.
- Fix for an error code path memory leak in pm_genpd_attach_cpuidle()
from Jonghwan Choi.
- Fix for smp_processor_id() usage in preemptible code in powernow-k8
from Andreas Herrmann.
- Fix for a suspend-related memory leak in cpufreq stats from Xiaobing
Tu.
- Freezer fix for failure to clear PF_NOFREEZE along with PF_KTHREAD in
flush_old_exec() from Oleg Nesterov.
- acpi_processor_notify() fix from Alan Cox.
* tag 'pm+acpi-for-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: missing break
freezer: exec should clear PF_NOFREEZE along with PF_KTHREAD
Fix memory leak in cpufreq stats.
cpufreq / powernow-k8: Remove usage of smp_processor_id() in preemptible code
PM / Domains: Fix memory leak on error path in pm_genpd_attach_cpuidle
ACPI: Fix memory leak in acpi_bind_one()
Linus Torvalds [Fri, 26 Oct 2012 20:46:41 +0000 (13:46 -0700)]
Merge tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband
Pull infiniband fixes from Roland Dreier:
"Small batch of fixes for 3.7:
- Fix crash in error path in cxgb4
- Fix build error on 32 bits in mlx4
- Fix SR-IOV bugs in mlx4"
* tag 'rdma-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband:
mlx4_core: Perform correct resource cleanup if mlx4_QUERY_ADAPTER() fails
mlx4_core: Remove annoying debug messages from SR-IOV flow
RDMA/cxgb4: Don't free chunk that we have failed to allocate
IB/mlx4: Synchronize cleanup of MCGs in MCG paravirtualization
IB/mlx4: Fix QP1 P_Key processing in the Primary Physical Function (PPF)
IB/mlx4: Fix build error on platforms where UL is not 64 bits
Linus Torvalds [Fri, 26 Oct 2012 17:26:36 +0000 (10:26 -0700)]
Merge tag 'usb-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg Kroah-Hartman:
"Here are a bunch of USB fixes for the 3.7-rc tree.
There's a lot of small USB serial driver fixes, and one larger one
(the mos7840 driver changes are mostly just moving code around to fix
problems.) Thanks to Johan Hovold for finding the problems and fixing
them all up.
Other than those, there is the usual new device ids, xhci bugfixes,
and gadget driver fixes, nothing out of the ordinary.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'usb-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (49 commits)
xhci: trivial: Remove assigned but unused ep_ctx.
xhci: trivial: Remove assigned but unused slot_ctx.
xhci: Fix missing break in xhci_evaluate_context_result.
xhci: Fix potential NULL ptr deref in command cancellation.
ehci: Add yet-another Lucid nohandoff pci quirk
ehci: fix Lucid nohandoff pci quirk to be more generic with BIOS versions
USB: mos7840: fix port_probe flow
USB: mos7840: fix port-data memory leak
USB: mos7840: remove invalid disconnect handling
USB: mos7840: remove NULL-urb submission
USB: qcserial: fix interface-data memory leak in error path
USB: option: fix interface-data memory leak in error path
USB: ipw: fix interface-data memory leak in error path
USB: mos7840: fix port-device leak in error path
USB: mos7840: fix urb leak at release
USB: sierra: fix port-data memory leak
USB: sierra: fix memory leak in probe error path
USB: sierra: fix memory leak in attach error path
USB: usb-wwan: fix multiple memory leaks in error paths
USB: keyspan: fix NULL-pointer dereferences and memory leaks
...
Linus Torvalds [Fri, 26 Oct 2012 17:25:31 +0000 (10:25 -0700)]
Merge tag 'staging-3.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging driver fixes from Greg Kroah-Hartman:
"Here are some staging driver fixes for your 3.7-rc tree.
Nothing major here, a number of iio driver fixups that were causing
problems, some comedi driver bugfixes, and a bunch of tidspbridge
warning squashing and other regressions fixed from the 3.6 release.
All have been in the linux-next releases for a bit.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'staging-3.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (32 commits)
staging: tidspbridge: delete unused mmu functions
staging: tidspbridge: ioremap physical address of the stack segment in shm
staging: tidspbridge: ioremap dsp sync addr
staging: tidspbridge: change type to __iomem for per and core addresses
staging: tidspbridge: drop const from custom mmu implementation
staging: tidspbridge: request the right irq for mmu
staging: ipack: add missing include (implicit declaration of function 'kfree')
staging: ramster: depends on NET
staging: omapdrm: fix allocation size for page addresses array
staging: zram: Fix handling of incompressible pages
Staging: android: binder: Allow using highmem for binder buffers
Staging: android: binder: Fix memory leak on thread/process exit
staging: comedi: ni_labpc: fix possible NULL deref during detach
staging: comedi: das08: fix possible NULL deref during detach
staging: comedi: amplc_pc263: fix possible NULL deref during detach
staging: comedi: amplc_pc236: fix possible NULL deref during detach
staging: comedi: amplc_pc236: fix invalid register access during detach
staging: comedi: amplc_dio200: fix possible NULL deref during detach
staging: comedi: 8255_pci: fix possible NULL deref during detach
staging: comedi: ni_daq_700: fix dio subdevice regression
...
Linus Torvalds [Fri, 26 Oct 2012 17:24:51 +0000 (10:24 -0700)]
Merge tag 'driver-core-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core fixes from Greg Kroah-Hartman:
"Here are a number of firmware core fixes for 3.7, and some other minor
fixes. And some documentation updates thrown in for good measure.
All have been in the linux-next tree for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'driver-core-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
Documentation:Chinese translation of Documentation/arm64/memory.txt
Documentation:Chinese translation of Documentation/arm64/booting.txt
Documentation:Chinese translation of Documentation/IRQ.txt
firmware loader: document kernel direct loading
sysfs: sysfs_pathname/sysfs_add_one: Use strlcat() instead of strcat()
dynamic_debug: Remove unnecessary __used
firmware loader: sync firmware cache by async_synchronize_full_domain
firmware loader: let direct loading back on 'firmware_buf'
firmware loader: fix one reqeust_firmware race
firmware loader: cancel uncache work before caching firmware
Linus Torvalds [Fri, 26 Oct 2012 17:24:19 +0000 (10:24 -0700)]
Merge tag 'char-misc-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg Kroah-Hartman:
"Here are some driver fixes for 3.7. They include extcon driver fixes,
a hyper-v bugfix, and two other minor driver fixes.
All of these have been in the linux-next releases for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>"
* tag 'char-misc-3.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
sonypi: suspend/resume callbacks should be conditionally compiled on CONFIG_PM_SLEEP
Drivers: hv: Cleanup error handling in vmbus_open()
extcon : register for cable interest by cable name
extcon: trivial: kfree missed from remove path
extcon: driver model release call not needed
extcon: MAX77693: Add platform data for MUIC device to initialize registers
extcon: max77693: Use max77693_update_reg for rmw operations
extcon: Fix kerneldoc for extcon_set_cable_state and extcon_set_cable_state_
extcon: adc-jack: Add missing MODULE_LICENSE
extcon: adc-jack: Fix checking return value of request_any_context_irq
extcon: Fix return value in extcon_register_interest()
extcon: unregister compat link on cleanup
extcon: Unregister compat class at module unload to fix oops
extcon: optimising the check_mutually_exclusive function
extcon: standard cable names definition and declaration changed
extcon-max8997: remove usage of ret in max8997_muic_handle_charger_type_detach
extcon: Remove duplicate inclusion of extcon.h header file
Linus Torvalds [Fri, 26 Oct 2012 17:05:07 +0000 (10:05 -0700)]
VFS: don't do protected {sym,hard}links by default
In commit 800179c9b8a1 ("This adds symlink and hardlink restrictions to
the Linux VFS"), the new link protections were enabled by default, in
the hope that no actual application would care, despite it being
technically against legacy UNIX (and documented POSIX) behavior.
However, it does turn out to break some applications. It's rare, and
it's unfortunate, but it's unacceptable to break existing systems, so
we'll have to default to legacy behavior.
In particular, it has broken the way AFD distributes files, see
http://www.dwd.de/AFD/
along with some legacy scripts.
Distributions can end up setting this at initrd time or in system
scripts: if you have security problems due to link attacks during your
early boot sequence, you have bigger problems than some kernel sysctl
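setting. To re-enable the link protections, set the fs.protected_symlinks
and fs.protected_hardlinks sysctls to 1.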
Alternatively, we may at some point introduce a kernel config option
that sets these kinds of "more secure but not traditional" behavioural
options automatically.
Reported-by: Nick Bowler <nbowler@elliptictech.com> Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de> Cc: Kees Cook <keescook@chromium.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org # v3.6 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 26 Oct 2012 17:03:22 +0000 (10:03 -0700)]
Merge tag 'sound-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Slightly a high amount of commits come from Adrian Knoth's HDSPM
driver fixes. Other than that, all small trivial fixes or quirks that
are pretty driver-specific."
* tag 'sound-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ASoC: wm8994: Only enable extra BCLK cycles when required
ALSA: als3000: check for the kzalloc return value
ALSA: sound/isa/opti9xx/miro.c: eliminate possible double free
ALSA: hda - Fix silent headphone output from Toshiba P200
ALSA: hdspm - Fix coding style in CTL_ELEM macros
ALSA: hdspm - Fix typo in kcontrol element on RME MADI cards
ALSA: hdspm - Fix sync_in detection on AES/AES32
ALSA: hdspm - Fix sync_in reporting on RME MADI cards
ALSA: hdspm - Also report autosync_sample_rate on MADI and MADIface
ALSA: hdspm - Fix reported autosync_sample_rate
ALSA: hdspm - Fix sync check reporting on all RME HDSPM cards
ALSA: hdspm - Report external rate in slave mode on PCI MADI
ALSA: hdspm - Allow DDS/Varispeed to be set from userspace
ALSA: hda - add dock support for Thinkpad T430
ASoC: ux500_msp_i2s: Fix devm_* and return code merge error
ASoC: Ux500: Dispose of device nodes correctly
Linus Torvalds [Fri, 26 Oct 2012 17:01:43 +0000 (10:01 -0700)]
Merge branch 'fixes_for_linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping
Pull DMA-mapping revert from Marek Szyprowski:
"Due to my mistake, my previous pull request (merged as commit cff7b8ba60e3: "Merge branch 'fixes_for_linus' ..") contained a patch
which is aimed at v3.8 and lacks its dependencies. This pull request
reverts it and fixes the build break on the ARM architecture."
* 'fixes_for_linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
Revert "ARM: dma-mapping: support debug_dma_mapping_error"
Linus Torvalds [Fri, 26 Oct 2012 16:35:46 +0000 (09:35 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
"This fixes a couple of nasty page table initialization bugs which were
causing kdump regressions. A clean rearchitecturing of the code is in
the works - meanwhile these are reverts that restore the
best-known-working state of the kernel.
There are also EFI fixes and other small fixes."
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, mm: Undo incorrect revert in arch/x86/mm/init.c
x86: efi: Turn off efi_enabled after setup on mixed fw/kernel
x86, mm: Find_early_table_space based on ranges that are actually being mapped
x86, mm: Use memblock memory loop instead of e820_RAM
x86, mm: Trim memory in memblock to be page aligned
x86/irq/ioapic: Check for valid irq_cfg pointer in smp_irq_move_cleanup_interrupt
x86/efi: Fix oops caused by incorrect set_memory_uc() usage
x86-64: Fix page table accounting
Revert "x86/mm: Fix the size calculation of mapping tables"
MAINTAINERS: Add EFI git repository location
Linus Torvalds [Fri, 26 Oct 2012 16:35:00 +0000 (09:35 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Most of the kernel diffstat relates to a group of Intel P6 and KNC
(Xeon-Phi Knights Corner) PMU driver fixes, neither of which is in
heavy use, so we took the fixes.
The rest is diverse smallish fixes to the tooling and kernel side."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Remove unused variable in nhmex_rbox_alter_er()
perf/x86: Enable overflow on Intel KNC with a custom knc_pmu_handle_irq()
perf/x86: Remove cpuc->enable check on Intl KNC event enable/disable
perf/x86: Make Intel KNC use full 40-bit width of counters
perf/x86/uncore: Handle pci_read_config_dword() errors
perf/x86: Remove P6 cpuc->enabled check
perf/x86: Update/fix generic events on P6 PMU
perf/x86: Fix P6 FP_ASSIST event constraint
perf, cpu hotplug: Use cached value of smp_processor_id()
perf, cpu hotplug: Run CPU_STARTING notifiers with irqs disabled
x86/perf: Fix virtualization sanity check
perf test: Fix exclude_guest parse events tests
perf tools: do not flush maps on COMM for perf report
perf help: Fix --help for builtins
perf trace: Check if sample raw_data field is set
perf trace: Validate syscall id before growing syscall table
Linus Torvalds [Fri, 26 Oct 2012 16:34:04 +0000 (09:34 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"This has our series of fixes for the next rc. The biggest batch is
from Jan Schmidt, fixing up some problems in our subvolume quota code
and fixing btrfs send/receive to work with the new extended inode
refs."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: do not bug when we fail to commit the transaction
Btrfs: fix memory leak when cloning root's node
Btrfs: Use btrfs_update_inode_fallback when creating a snapshot
Btrfs: Send: preserve ownership (uid and gid) also for symlinks.
Btrfs: fix deadlock caused by the nested chunk allocation
btrfs: Return EINVAL when length to trim is less than FSB
Btrfs: fix memory leak in btrfs_quota_enable()
Btrfs: send correct rdev and mode in btrfs-send
Btrfs: extended inode refs support for send mechanism
Btrfs: Fix wrong error handling code
Fix a sign bug causing invalid memory access in the ino_paths ioctl.
Btrfs: comment for loop in tree_mod_log_insert_move
Btrfs: fix extent buffer reference for tree mod log roots
Btrfs: determine level of old roots
Btrfs: tree mod log's old roots could still be part of the tree
Btrfs: fix a tree mod logging issue for root replacement operations
Btrfs: don't put removals from push_node_left into tree mod log twice
Chris Metcalf [Thu, 25 Oct 2012 07:25:20 +0000 (07:25 +0000)]
tilegx: fix some issues in the SW TSO support
This change correctly computes the header length and data length in
the fragments to avoid a bug where we would end up with extremely
slow performance. Also adopt use of skb_frag_size() accessor.
Signed-off-by: Chris Metcalf <cmetcalf@tilera.com> Cc: stable@vger.kernel.org [v3.6] Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Williams [Wed, 24 Oct 2012 12:10:34 +0000 (12:10 +0000)]
qmi_wwan/cdc_ether: move Novatel 551 and E362 to qmi_wwan
These devices provide QMI and ethernet functionality via a standard CDC
ethernet descriptor. But when driven by cdc_ether, the QMI
functionality is unavailable because only cdc_ether can claim the USB
interface. Thus blacklist the devices in cdc_ether and add their IDs to
qmi_wwan, which enables both QMI and ethernet simultaneously.
Signed-off-by: Dan Williams <dcbw@redhat.com> Cc: stable@vger.kernel.org Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: David S. Miller <davem@davemloft.net>
Hemant Kumar [Thu, 25 Oct 2012 18:17:54 +0000 (18:17 +0000)]
net: usb: Fix memory leak on Tx data path
The driver anchors the tx urbs and defers urb submission if
a transmit request arrives while the interface is suspended.
Anchoring an urb increments its reference count. These
deferred urbs are later retrieved by calling usb_get_from_anchor()
for submission during interface resume. usb_get_from_anchor()
unanchors the urb, but the urb reference count remains the same.
This causes the urb reference count to remain non-zero
after usb_free_urb() is called, so the urb never gets freed.
Hence call usb_put_urb() after anchoring the urb to properly
balance the reference count for these deferred urbs. Also,
unanchor these deferred urbs during disconnect, to free them
up.
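The reference-count imbalance described above can be traced with a toy model. This is a sketch in Python, not kernel code: the Urb class and the helper names are hypothetical stand-ins that only mimic the get/put behaviour the message attributes to anchoring, usb_get_from_anchor() and usb_free_urb(), and submission plus completion is assumed to be reference-neutral.

```python
class Urb:
    """Toy stand-in for a USB request block with a reference count."""

    def __init__(self):
        self.refs = 1        # the allocation holds the first reference
        self.anchored = False
        self.freed = False

    def get(self):
        self.refs += 1

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.freed = True


def anchor(urb):
    # Anchoring takes a reference on behalf of the anchor.
    urb.get()
    urb.anchored = True


def get_from_anchor(urb):
    # The caller receives a reference and the anchor's reference is dropped,
    # so the count is unchanged overall.
    urb.get()
    urb.anchored = False
    urb.put()
    return urb


def deferred_tx(fixed):
    """Walk one urb through suspend-time deferral and resume-time submission."""
    urb = Urb()
    anchor(urb)                  # interface suspended: defer the urb
    if fixed:
        urb.put()                # the fix: drop the extra reference after anchoring
    urb = get_from_anchor(urb)   # interface resumed: fetch the deferred urb
    urb.get()                    # submission (assumed to hold a reference...)
    urb.put()                    # ...that completion releases again)
    urb.put()                    # the final free drops the caller's reference
    return urb
```

In this model, deferred_tx(False) ends with one reference still outstanding, so the urb is never freed, while deferred_tx(True) reaches zero and frees it.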
Signed-off-by: Hemant Kumar <hemantk@codeaurora.org> Acked-by: Oliver Neukum <oneukum@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Dotan Barak [Thu, 25 Oct 2012 01:12:49 +0000 (01:12 +0000)]
net/mlx4_core: Unmap UAR also in the case of error flow
If a failure takes place during the EQ creation, we need to unmap the
UAR memory block too.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il> Signed-off-by: Uri Habusha <urih@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jack Morgenstein [Thu, 25 Oct 2012 01:12:47 +0000 (01:12 +0000)]
net/mlx4_en: Fix double-release-range in tx-rings
The QP range is reserved as a single block. However, when freeing the
en resources, the tx-ring QPs are released both in mlx4_en_destroy_tx_ring
(one at a time) and in mlx4_en_free_resources (as a block release).
Fix by eliminating the one-at-a-time release in mlx4_en_destroy_tx_ring.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tilman Schmidt [Wed, 24 Oct 2012 08:44:32 +0000 (08:44 +0000)]
bas_gigaset: fix pre_reset handling
The delayed work function int_in_work() may call usb_reset_device()
and thus, indirectly, the driver's pre_reset method. Trying to
cancel the work synchronously in that situation would deadlock.
Fix by avoiding cancel_work_sync() in the pre_reset method.
If the reset was NOT initiated by int_in_work() this might cause
int_in_work() to run after the post_reset method, with urb_int_in
already resubmitted, so handle that case gracefully.
Signed-off-by: Tilman Schmidt <tilman@imap.cc> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Fri, 26 Oct 2012 02:26:54 +0000 (19:26 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm radeon fixes from Dave Airlie:
"Just radeon fixes in this one:
- some new PCI IDs
- ATPX regression fix
- async VM regression fixes
- some module options fixes"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/radeon: fix ATPX regression in acpi rework
drm/radeon: fix ATPX function documentation
drm/radeon: move the retry to gem_object_create
drm/radeon: move size limits to gem_object_create.
drm/radeon: use vzalloc for gart pages
drm/radeon: fix and simplify pot argument checks v3
drm/radeon: fix header size estimation in VM code
drm/radeon: remove set_page check from VM code
drm/radeon: fix si_set_page v2
drm/radeon: fix cayman_vm_set_page v2
drm/radeon: fix PFP sync in vm_flush
drm/radeon: add error output if VM CS fails on cayman
drm/radeon: give each backlight a unique id
drm/radeon: fix sparse warning
drm/radeon: add some new SI PCI ids
Linus Torvalds [Fri, 26 Oct 2012 02:26:16 +0000 (19:26 -0700)]
Merge tag 'nfs-for-3.7-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS bugfixes from Trond Myklebust:
- Fix the NFSv2/v3 kernel statd protocol, which broke due to net
namespace related changes.
- Fix a number of races in the SUNRPC TCP disconnect/reconnect code.
* tag 'nfs-for-3.7-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
LOCKD: Clear ln->nsm_clnt only when ln->nsm_users is zero
LOCKD: fix races in nsm_client_get
SUNRPC: Get rid of the xs_error_report socket callback
SUNRPC: Prevent races in xs_abort_connection()
Revert "SUNRPC: Ensure we close the socket on EPIPE errors too..."
SUNRPC: Clear the connect flag when socket state is TCP_CLOSE_WAIT
Dave Airlie [Thu, 25 Oct 2012 01:36:05 +0000 (11:36 +1000)]
Merge branch 'drm-fixes-3.7' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
Alex writes:
"Fixes pull request for radeon. The main things here are
fixing an ATPX regression from the acpi rework, fixing some
fallout from the async VM work, and fixing some module options
that were broken in certain cases. Other than that, mainly
just bug fixes."
* 'drm-fixes-3.7' of git://people.freedesktop.org/~agd5f/linux:
drm/radeon: fix ATPX regression in acpi rework
drm/radeon: fix ATPX function documentation
drm/radeon: move the retry to gem_object_create
drm/radeon: move size limits to gem_object_create.
drm/radeon: use vzalloc for gart pages
drm/radeon: fix and simplify pot argument checks v3
drm/radeon: fix header size estimation in VM code
drm/radeon: remove set_page check from VM code
drm/radeon: fix si_set_page v2
drm/radeon: fix cayman_vm_set_page v2
drm/radeon: fix PFP sync in vm_flush
drm/radeon: add error output if VM CS fails on cayman
drm/radeon: give each backlight a unique id
drm/radeon: fix sparse warning
drm/radeon: add some new SI PCI ids
Linus Torvalds [Thu, 25 Oct 2012 23:05:57 +0000 (16:05 -0700)]
Merge branch 'akpm' (Andrew's fixes)
Merge misc fixes from Andrew Morton:
"18 total. 15 fixes and some updates to a device_cgroup patchset which
bring it up to date with the version which I should have merged in the
first place."
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (18 patches)
fs/compat_ioctl.c: VIDEO_SET_SPU_PALETTE missing error check
gen_init_cpio: avoid stack overflow when expanding
drivers/rtc/rtc-imxdi.c: add missing spin lock initialization
mm, numa: avoid setting zone_reclaim_mode unless a node is sufficiently distant
pidns: limit the nesting depth of pid namespaces
drivers/dma/dw_dmac: make driver's endianness configurable
mm/mmu_notifier: allocate mmu_notifier in advance
tools/testing/selftests/epoll/test_epoll.c: fix build
UAPI: fix tools/vm/page-types.c
mm/page_alloc.c:alloc_contig_range(): return early for err path
rbtree: include linux/compiler.h for definition of __always_inline
genalloc: stop crashing the system when destroying a pool
backlight: ili9320: add missing SPI dependency
device_cgroup: add proper checking when changing default behavior
device_cgroup: stop using simple_strtoul()
device_cgroup: rename deny_all to behavior
cgroup: fix invalid rcu dereference
mm: fix XFS oops due to dirty pages without buffers on s390
Jason Gerecke [Sun, 21 Oct 2012 07:38:03 +0000 (00:38 -0700)]
Input: wacom - handle split-sensor devices with internal hubs
Like our other pen-and-touch products, the Cintiq 24HD touch needs data
to be shared between its two sensors to facilitate proximity-based palm
rejection.
Unlike other tablets that report sensor data through separate interfaces
of the same USB device, the Cintiq 24HD touch has separate USB devices
that are connected to an internal USB hub.
This patch makes it possible to designate the USB VID/PID of the other
device so that the two may share data. To ensure we don't accidentally
link to a sensor from a physically separate device (if several have been
plugged in), we limit the search to siblings (i.e., devices directly
connected to the same hub).
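As a rough illustration of the sibling restriction, the following Python sketch searches only devices that share the same parent hub. The UsbDevice record, the find_sibling helper, and all VID/PID values are hypothetical and chosen for illustration; the driver's actual data structures are not shown in this message.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class UsbDevice:
    name: str
    vid: int
    pid: int
    parent: Optional[str]   # name of the hub this device is plugged into


def find_sibling(dev: UsbDevice, devices: Sequence[UsbDevice],
                 vid: int, pid: int) -> Optional[UsbDevice]:
    """Return the device with the given VID/PID on the same hub as dev."""
    if dev.parent is None:
        return None
    for other in devices:
        if other is dev:
            continue
        # Only siblings qualify: same parent hub and matching IDs.
        if other.parent == dev.parent and (other.vid, other.pid) == (vid, pid):
            return other
    return None


# Two physically separate tablets, each with its own internal hub.
# The IDs below are placeholders, not real product IDs.
PEN_PID, TOUCH_PID, VID = 0x0001, 0x0002, 0x1234
pen_a = UsbDevice("pen-a", VID, PEN_PID, "hub-a")
touch_a = UsbDevice("touch-a", VID, TOUCH_PID, "hub-a")
pen_b = UsbDevice("pen-b", VID, PEN_PID, "hub-b")
touch_b = UsbDevice("touch-b", VID, TOUCH_PID, "hub-b")
all_devices = [touch_b, pen_a, pen_b, touch_a]
```

Here find_sibling(pen_a, all_devices, VID, TOUCH_PID) picks touch_a even though touch_b appears first in the list, because touch_b hangs off a different hub.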
H. Peter Anvin [Wed, 24 Oct 2012 21:11:48 +0000 (14:11 -0700)]
Makefile: Documentation for external tool should be correct
If one includes documentation for an external tool, it should be
correct. This is not:
1. Overriding the input to rngd should typically be neither
necessary nor desired. This is especially so since newer
versions of rngd support a number of different *types* of sources.
2. The default kernel-exported device is called /dev/hwrng not
/dev/hwrandom nor /dev/hw_random (both of which were used in the
past; however, kernel and udev seem to have converged on
/dev/hwrng.)
Overall it is better if the documentation for rngd is kept with rngd
rather than in a kernel Makefile.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: David Howells <dhowells@redhat.com> Cc: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 25 Oct 2012 22:59:34 +0000 (15:59 -0700)]
Merge branch 'fixes' of git://git.linaro.org/people/rmk/linux-arm
Pull ARM fixes from Russell King:
"A random collection of various fixes, mainly from Arnd and a few other
people. Nothing really stands out here."
* 'fixes' of git://git.linaro.org/people/rmk/linux-arm:
ARM: drop experimental status for hotplug and Thumb2
ARM: 7560/1: SMP_TWD: use DIV_ROUND_CLOSEST() for periodic mode
ARM: 7559/1: smp: switch away from the idmap before updating init_mm.mm_count
ARM: 7556/1: perf: fix updated event period in response to PERF_EVENT_IOC_PERIOD
ARM: 7555/1: kexec: fix segment memory addresses check
ARM: warnings in arch/arm/include/asm/uaccess.h
ARM: binfmt_flat: unused variable 'persistent'
ARM: be really quiet when building with 'make -s'
ARM: pass -marm to gcc by default for both C and assembler
ARM: Xen: fix initial build problems
ARM: export default read_current_timer
ARM: Fix another build warning in arch/arm/mm/alignment.c
ARM: export set_irq_flags
ARM: kprobes: make more tests conditional
Linus Torvalds [Thu, 25 Oct 2012 22:57:48 +0000 (15:57 -0700)]
Merge branch 'fixes_for_linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping
Pull CMA and DMA-mapping fixes from Marek Szyprowski:
"This consists mainly of a set of one-liner fixes and cleanups for a
few minor issues identified in both Contiguous Memory Allocator code
and ARM DMA-mapping subsystem."
* 'fixes_for_linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
ARM: mm: Remove unused arm_vmregion priv field
ARM: dma-mapping: fix build warning in __dma_alloc()
ARM: dma-mapping: support debug_dma_mapping_error
mm: cma: alloc_contig_range: return early for err path
drivers: cma: Fix wrong CMA selected region size default value
drivers: dma-coherent: Fix typo in dma_mmap_from_coherent documentation
drivers: dma-contiguous: Don't redefine SZ_1M
fs/compat_ioctl.c: VIDEO_SET_SPU_PALETTE missing error check
The compat ioctl for VIDEO_SET_SPU_PALETTE was missing an error check
while converting ioctl arguments. This could lead to leaking kernel
stack contents into userspace.
Patch extracted from existing fix in grsecurity.
Signed-off-by: Kees Cook <keescook@chromium.org> Cc: David Miller <davem@davemloft.net> Cc: Brad Spengler <spender@grsecurity.net> Cc: PaX Team <pageexec@freemail.hu> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Thu, 25 Oct 2012 20:38:14 +0000 (13:38 -0700)]
gen_init_cpio: avoid stack overflow when expanding
Fix possible overflow of the buffer used for expanding environment
variables when building file list.
In the extremely unlikely case that an attacker has control over the
environment variables visible to gen_init_cpio and over the
contents of the file gen_init_cpio parses, and gen_init_cpio was built
without compiler hardening, the attacker can gain arbitrary execution
control via a stack buffer overflow.
David Rientjes [Thu, 25 Oct 2012 20:38:08 +0000 (13:38 -0700)]
mm, numa: avoid setting zone_reclaim_mode unless a node is sufficiently distant
Commit 957f822a0ab9 ("mm, numa: reclaim from all nodes within reclaim
distance") caused zone_reclaim_mode to be set for all systems where two
nodes are within RECLAIM_DISTANCE of each other. This is the opposite
of what we actually want: zone_reclaim_mode should be set if two nodes
are sufficiently distant.
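The inverted condition can be illustrated with a small Python sketch over a node distance matrix. The function names are hypothetical and the threshold value of 30 is an assumption made for the example; only the direction of the comparison is taken from the description above.

```python
RECLAIM_DISTANCE = 30   # assumed threshold, for illustration only


def pairs(distances):
    """Yield the distance between every ordered pair of distinct nodes."""
    n = len(distances)
    for a in range(n):
        for b in range(n):
            if a != b:
                yield distances[a][b]


def zone_reclaim_buggy(distances, threshold=RECLAIM_DISTANCE):
    # Regressed behaviour: enabled when two nodes are *within* the threshold.
    return any(d <= threshold for d in pairs(distances))


def zone_reclaim_fixed(distances, threshold=RECLAIM_DISTANCE):
    # Intended behaviour: enabled only when two nodes are sufficiently distant.
    return any(d > threshold for d in pairs(distances))


near = [[10, 20], [20, 10]]   # two nodes close to each other
far = [[10, 40], [40, 10]]    # two nodes far apart
```

With these inputs the buggy check enables zone reclaim for the near system and not for the far one, which is the opposite of the fixed check.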
Andrew Vagin [Thu, 25 Oct 2012 20:38:07 +0000 (13:38 -0700)]
pidns: limit the nesting depth of pid namespaces
'struct pid' is a "variable sized struct" - a header with an array of
upids at the end.
The size of the array depends on the level (depth) of pid namespace
nesting. Currently the nesting level is not limited, so 'struct pid' can
grow larger than one page.
It seems reasonable that it should be smaller than a page. MAX_PID_NS_LEVEL
is not calculated from PAGE_SIZE, because then it would depend on the
architecture and config options, and it would shrink whenever someone adds
new fields to struct pid or struct upid.
I suggest setting MAX_PID_NS_LEVEL = 32, because it leaves room to expand
"struct pid" and it is more than enough for all use cases known to me.
When someone finds a reasonable use case, we can add a config option or a
sysctl parameter.
In addition, it will reduce the effect of another problem, which occurs
when we have many nested namespaces and the oldest one starts dying.
zap_pid_ns_processes will be called for each namespace and find_vpid will
be called for each process in a namespace. find_vpid will be called
at least max_level^2 / 2 times. The reason is that when we find a
bit in the pidmap, we can't determine whether this pidns is the top one for
that process or not.
find_vpid is a heavy operation, so a fork bomb that creates many nested
namespaces can make a system inaccessible for a long time. For example, my
system becomes inaccessible for a few minutes with 4000 processes.
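The quadratic growth can be checked with a short Python calculation. The model is an assumption made for illustration: one process per nesting level, with the namespace at level k seeing the processes at levels k through max_level, and one find_vpid call per visible process when that namespace is torn down.

```python
def find_vpid_calls(max_level):
    """Count lookups when every nested namespace is torn down.

    Assumed model: one process per level; the namespace at level k
    sees (max_level - k + 1) processes and looks each of them up once.
    """
    return sum(max_level - k + 1 for k in range(1, max_level + 1))


def lower_bound(max_level):
    """The max_level^2 / 2 bound quoted in the commit message."""
    return max_level * max_level / 2
```

Under this model the count is max_level * (max_level + 1) / 2, which is slightly above the quoted bound: 528 lookups at a depth of 32, but about eight million at a depth of 4000.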
[akpm@linux-foundation.org: return -EINVAL in response to excessive nesting, not -ENOMEM] Signed-off-by: Andrew Vagin <avagin@openvz.org> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hein Tibosch [Thu, 25 Oct 2012 20:38:05 +0000 (13:38 -0700)]
drivers/dma/dw_dmac: make driver's endianness configurable
The dw_dmac driver was originally developed for avr32 to be used with the
Synopsys DesignWare AHB DMA controller. Starting with 2.6.38, access to
the device's I/O memory has been done with the little-endian readl/writel
functions(1).
This broke the driver on the avr32 platform, because it needs big
(native) endian accessors. This patch makes the endianness configurable
using 'DW_DMAC_BIG_ENDIAN_IO', which defaults to true for AVR32.
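To see why the choice of accessor matters, the Python sketch below decodes the same four register bytes both ways. The helper names and the byte values are hypothetical; the big_endian_io flag merely mirrors the idea of the DW_DMAC_BIG_ENDIAN_IO option and is not the driver's code.

```python
import struct


def read_le32(buf, offset=0):
    """Decode 4 bytes as a little-endian 32-bit value (readl-like)."""
    return struct.unpack_from("<I", buf, offset)[0]


def read_be32(buf, offset=0):
    """Decode 4 bytes as a big-endian 32-bit value."""
    return struct.unpack_from(">I", buf, offset)[0]


def make_reader(big_endian_io):
    """Pick the accessor once, as a build-time option would."""
    return read_be32 if big_endian_io else read_le32


# A register holding the value 1 on a big-endian device.
register_bytes = bytes([0x00, 0x00, 0x00, 0x01])
```

Decoding register_bytes with the big-endian reader yields 1, while the little-endian reader yields 0x01000000, which is how a driver using the wrong accessor ends up with nonsensical register values.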
I submitted this patch before(2) but then waited for Andy to finish other
changes to the same module(3).
Gavin Shan [Thu, 25 Oct 2012 20:38:01 +0000 (13:38 -0700)]
mm/mmu_notifier: allocate mmu_notifier in advance
When the mmu_notifier is allocated with GFP_KERNEL, swapping may start
if available memory is tight. Eventually, that leads to
a deadlock while the swap daemon swaps anonymous pages. It was caused by
commit e0f3c3f78da29b ("mm/mmu_notifier: init notifier if necessary").
=================================
[ INFO: inconsistent lock state ]
3.7.0-rc1+ #518 Not tainted
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/35 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&mapping->i_mmap_mutex){+.+.?.}, at: page_referenced+0x9c/0x2e0
{RECLAIM_FS-ON-W} state was registered at:
mark_held_locks+0x86/0x150
lockdep_trace_alloc+0x67/0xc0
kmem_cache_alloc_trace+0x33/0x230
do_mmu_notifier_register+0x87/0x180
mmu_notifier_register+0x13/0x20
kvm_dev_ioctl+0x428/0x510
do_vfs_ioctl+0x98/0x570
sys_ioctl+0x91/0xb0
system_call_fastpath+0x16/0x1b
irq event stamp: 825
hardirqs last enabled at (825): _raw_spin_unlock_irq+0x30/0x60
hardirqs last disabled at (824): _raw_spin_lock_irq+0x19/0x80
softirqs last enabled at (0): copy_process+0x630/0x17c0
softirqs last disabled at (0): (null)
...
Simply back out the above commit, which was a small performance
optimization.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Reported-by: Andrea Righi <andrea@betterlinux.com> Tested-by: Andrea Righi <andrea@betterlinux.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Sagi Grimberg <sagig@mellanox.co.il> Cc: Haggai Eran <haggaie@mellanox.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
tools/testing/selftests/epoll/test_epoll.c: fix build
Running "make selftests" in the tools directory on the latest Linus head
failed with references to undefined variables. The reference was to
'write_thread_data', which is the name of a struct that is being used, not
the variable itself. Change the reference so it points to the variable.
Signed-off-by: Daniel Hazelton <dshadowwolf@gmail.com> Cc: "Paton J. Lewis" <palewis@adobe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Thu, 25 Oct 2012 20:37:57 +0000 (13:37 -0700)]
UAPI: fix tools/vm/page-types.c
Fix tools/vm/page-types.c to use the UAPI variant of linux/kernel-page-flags.h
lest the following error appear:
In file included from page-types.c:38:0:
../../include/linux/kernel-page-flags.h:4:42: fatal error:
uapi/linux/kernel-page-flags.h: No such file or directory
Reported-by: Daniel Hazelton <dshadowwolf@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Fengguang Wu <fengguang.wu@intel.com> Tested-by: Daniel Hazelton <dshadowwolf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bob Liu [Thu, 25 Oct 2012 20:37:56 +0000 (13:37 -0700)]
mm/page_alloc.c:alloc_contig_range(): return early for err path
If start_isolate_page_range() failed, unset_migratetype_isolate() has been
done inside it.
Signed-off-by: Bob Liu <lliubbo@gmail.com> Cc: Ni zhan Chen <nizhan.chen@gmail.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Will Deacon [Thu, 25 Oct 2012 20:37:53 +0000 (13:37 -0700)]
rbtree: include linux/compiler.h for definition of __always_inline
rb_erase_augmented() is a static function annotated with
__always_inline. This causes a compile failure when attempting to use
the rbtree implementation as a library (e.g. kvm tool):
rbtree_augmented.h:125:24: error: expected `=', `,', `;', `asm' or `__attribute__' before `void'
Include linux/compiler.h in rbtree_augmented.h so that the __always_inline
macro is resolved correctly.
Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Pekka Enberg <penberg@kernel.org> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
genalloc: stop crashing the system when destroying a pool
The genalloc code uses the bitmap API from include/linux/bitmap.h and
lib/bitmap.c, which is based on long values. Both bitmap_set from
lib/bitmap.c and bitmap_set_ll, which is the lockless version from
genalloc.c, use BITMAP_LAST_WORD_MASK to set the first bits in a long in
the bitmap.
That macro uses (1 << bits) - 1, which is 0b111 if you are setting the
first three bits. This means that the API counts from the least significant
bit (LSB from now on) to the MSB. The LSB in the first long is therefore
bit 0. The same holds for the lookup functions.
The genalloc code uses longs for the bitmap, as it should. In
include/linux/genalloc.h, struct gen_pool_chunk has unsigned long
bits[0] as its last member. When allocating the struct, genalloc should
reserve enough space for the bitmap. This should be a proper number of
longs that can fit the amount of bits in the bitmap.
However, genalloc allocates an integer number of bytes that fit the
amount of bits, but may not be an integer amount of longs. 9 bytes, for
example, could be allocated for 70 bits.
This is a problem in itself if the Least Significant Bit in a long is in
the byte with the largest address, which happens in Big Endian machines.
This means genalloc is not allocating the byte in which it will try to
set or check for a bit.
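The 70-bit example can be worked through in Python. This is a sketch under the assumption of 64-bit longs; the function names are hypothetical, and only the arithmetic follows the description above.

```python
BITS_PER_LONG = 64                    # assumption: 64-bit longs
BYTES_PER_LONG = BITS_PER_LONG // 8


def size_rounded_to_bytes(nbits):
    """Bitmap size when rounding up to whole bytes (the problematic sizing)."""
    return (nbits + 7) // 8


def size_rounded_to_longs(nbits):
    """Bitmap size when rounding up to whole longs (what the bitmap API needs)."""
    nlongs = (nbits + BITS_PER_LONG - 1) // BITS_PER_LONG
    return nlongs * BYTES_PER_LONG


def byte_offset_of_bit(n, big_endian):
    """Byte address, relative to the bitmap start, that holds bit n."""
    word, bit = divmod(n, BITS_PER_LONG)
    byte_in_word = bit // 8
    if big_endian:
        # On big endian the least significant byte has the largest address.
        byte_in_word = BYTES_PER_LONG - 1 - byte_in_word
    return word * BYTES_PER_LONG + byte_in_word


allocated = size_rounded_to_bytes(70)   # 9 bytes
needed = size_rounded_to_longs(70)      # 16 bytes
```

For bit 64, the first bit of the second long, the little-endian offset is 8, which still lies inside the 9 allocated bytes, while the big-endian offset is 15, which lies outside them.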
This may end up in memory corruption, where genalloc will try to set
bits it has not allocated. In fact, genalloc may not set these bits at
all, because it may find them already set: they were never zeroed, since
they were never allocated. That is what causes a BUG when
gen_pool_destroy is called and checks for any set bits.
What really happens is that genalloc uses kmalloc_node with __GFP_ZERO
on gen_pool_add_virt. With SLAB and SLUB, this means the whole slab
will be cleared, not only the requested bytes. Since struct
gen_pool_chunk has a size that is a multiple of 8, and slab sizes are
multiples of 8, we get lucky and allocate and clear the right amount of
bytes.
However, this is not the case with SLOB or with older code that did a
memset after allocating instead of using __GFP_ZERO.
So a simple module like this one (running 3.6.0) will cause a crash when
rmmod'ed.
module_init(foo_init);
module_exit(foo_exit);
[root@phantom-lp2 foo]# zcat /proc/config.gz | grep SLOB
CONFIG_SLOB=y
[root@phantom-lp2 foo]# insmod ./foo.ko
[root@phantom-lp2 foo]# rmmod foo
------------[ cut here ]------------
kernel BUG at lib/genalloc.c:243!
cpu 0x4: Vector: 700 (Program Check) at [c0000000bb0e7960]
pc: c0000000003cb50c: .gen_pool_destroy+0xac/0x110
lr: c0000000003cb4fc: .gen_pool_destroy+0x9c/0x110
sp: c0000000bb0e7be0
msr: 8000000000029032
current = 0xc0000000bb0e0000
paca = 0xc000000006d30e00 softe: 0 irq_happened: 0x01
pid = 13044, comm = rmmod
kernel BUG at lib/genalloc.c:243!
[c0000000bb0e7ca0] d000000004b00020 .foo_exit+0x20/0x38 [foo]
[c0000000bb0e7d20] c0000000000dff98 .SyS_delete_module+0x1a8/0x290
[c0000000bb0e7e30] c0000000000097d4 syscall_exit+0x0/0x94
--- Exception: c00 (System Call) at 000000800753d1a0
SP (fffd0b0e640) is in userspace
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Benjamin Gaignard <benjamin.gaignard@stericsson.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jingoo Han [Thu, 25 Oct 2012 20:37:48 +0000 (13:37 -0700)]
backlight: ili9320: add missing SPI dependency
Add this missing SPI dependency and prevent the driver from building
without SPI, because functions of the spi driver are used in this
driver.
drivers/video/backlight/ili9320.c:51: undefined reference to `spi_sync'
Also, a prompt string for CONFIG_LCD_ILI9320 is added for explicit
selection.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Ben Dooks <ben-linux@fluff.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aristeu Rozanski [Thu, 25 Oct 2012 20:37:41 +0000 (13:37 -0700)]
device_cgroup: stop using simple_strtoul()
Convert the code to use kstrtou32() instead of simple_strtoul(), which is
deprecated. The variables are really u32, so use kstrtou32() instead of
kstrtoul().
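The practical difference is strictness: simple_strtoul() stops at the first invalid character and silently ignores the rest, while the kstrto*() family reports malformed or out-of-range input as an error. A minimal userspace analogue, built on strtoull() and using a hypothetical function name, might look like the sketch below. It is not the kernel implementation and does not reproduce every kstrtou32() detail (for instance, it does not accept a trailing newline).

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue of a strict u32 parser: returns 0 on
 * success, -EINVAL on malformed input, -ERANGE if the value overflows u32.
 */
static int parse_u32_strict(const char *s, int base, uint32_t *res)
{
    char *end;
    unsigned long long val;

    /* Reject empty strings, signs and leading whitespace up front. */
    if (!isxdigit((unsigned char)*s))
        return -EINVAL;

    errno = 0;
    val = strtoull(s, &end, base);
    if (end == s || *end != '\0')
        return -EINVAL;         /* no digits parsed, or trailing garbage */
    if (errno == ERANGE || val > UINT32_MAX)
        return -ERANGE;         /* does not fit in 32 bits */

    *res = (uint32_t)val;
    return 0;
}
```

With this sketch, "42" parses to 42, "12abc" is rejected with -EINVAL rather than being read as 12, and "4294967296" is rejected with -ERANGE.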
Signed-off-by: Aristeu Rozanski <aris@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: James Morris <jmorris@namei.org> Cc: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Aristeu Rozanski [Thu, 25 Oct 2012 20:37:38 +0000 (13:37 -0700)]
device_cgroup: rename deny_all to behavior
This was done in a v2 patch but v1 ended up being committed. The new
name is less confusing: the variable stores the default behavior when no
matching exception exists.
Signed-off-by: Aristeu Rozanski <aris@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: James Morris <jmorris@namei.org> Cc: Pavel Emelyanov <xemul@openvz.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Cc: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiri Slaby [Thu, 25 Oct 2012 20:37:34 +0000 (13:37 -0700)]
cgroup: fix invalid rcu dereference
Commit ad676077a2ae ("device_cgroup: convert device_cgroup internally to
policy + exceptions") removed rcu locks which are needed in
task_devcgroup called in this chain:
Jan Kara [Thu, 25 Oct 2012 20:37:31 +0000 (13:37 -0700)]
mm: fix XFS oops due to dirty pages without buffers on s390
On s390, any write to a page (even from the kernel itself) sets the
architecture-specific page dirty bit. Thus when a page is written to via a
buffered write, the HW dirty bit gets set, and when we later map and unmap
the page, page_remove_rmap() finds the dirty bit and calls set_page_dirty().
Dirtying of a page which shouldn't be dirty can cause all sorts of
problems to filesystems. The bug we observed in practice is that
buffers from the page get freed, so when the page gets later marked as
dirty and writeback writes it, XFS crashes due to an assertion
BUG_ON(!PagePrivate(page)) in page_buffers() called from
xfs_count_page_state().
A similar problem can also happen when a zero_user_segment() call from
xfs_vm_writepage() (or block_write_full_page() for that matter) sets the
hardware dirty bit during writeback, the buffers later get freed, and the
page is then unmapped.
Fix the issue by ignoring the s390 HW dirty bit for page cache pages of
mappings with mapping_cap_account_dirty(). This is safe because for
such mappings, when a page gets marked as writeable in the PTE it is also
marked dirty in do_wp_page() or do_page_fault(). When the dirty bit is
cleared by clear_page_dirty_for_io(), the page gets write-protected in
page_mkclean(). So a pagecache page is writeable if and only if it is
dirty.
Thanks to Hugh Dickins for pointing out mapping has to have
mapping_cap_account_dirty() for things to work and proposing a cleaned
up variant of the patch.
The patch has survived about two hours of running fsx-linux on tmpfs
while heavily swapping and several days of running on our build machines
where the original problem was triggered.
Signed-off-by: Jan Kara <jack@suse.cz> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: <stable@vger.kernel.org> [3.0+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sarah Sharp [Tue, 16 Oct 2012 20:26:22 +0000 (13:26 -0700)]
xhci: Fix missing break in xhci_evaluate_context_result.
Coverity complains that xhci_evaluate_context_result() is missing a
break statement after the COMP_EBADSLT switch case. It's not a big
deal, since we wanted to return the same error code as the case
statement below it does. The only effect would be that a Slot Disabled
error completion code would also print the warning message associated
with a Context State error code. No other bad behavior would result.
It's not worth backporting to stable kernels, since it only fixes an
issue with too much debugging.
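The effect of the missing break can be sketched with a small standalone switch. The enum names and messages below are hypothetical stand-ins, not the xHCI completion codes or driver messages; the function counts warnings so the fall-through is easy to observe.

```c
#include <stdio.h>

/* Hypothetical completion codes; names and values are illustrative only. */
enum comp_code { CODE_SLOT_DISABLED, CODE_CONTEXT_STATE, CODE_SUCCESS };

/*
 * Returns the number of warnings emitted for a completion code.
 * With with_break == 0 the first case falls through into the second,
 * so a slot-disabled code also prints the context-state warning.
 */
static int warnings_for(enum comp_code code, int with_break)
{
    int warnings = 0;

    switch (code) {
    case CODE_SLOT_DISABLED:
        fprintf(stderr, "slot not enabled\n");
        warnings++;
        if (with_break)
            break;
        /* fall through */
    case CODE_CONTEXT_STATE:
        fprintf(stderr, "bad context state\n");
        warnings++;
        break;
    case CODE_SUCCESS:
        break;
    }
    return warnings;
}
```

Calling warnings_for(CODE_SLOT_DISABLED, 0) emits two warnings, while warnings_for(CODE_SLOT_DISABLED, 1) emits only the one that belongs to that case.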
Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
Sarah Sharp [Tue, 16 Oct 2012 20:17:43 +0000 (13:17 -0700)]
xhci: Fix potential NULL ptr deref in command cancellation.
The command cancellation code doesn't check whether find_trb_seg()
failed to find the segment that contains the TRB to be canceled. This
could cause a NULL pointer dereference later in the function when next_trb
is called. It's unlikely to happen unless something is wrong with the
command ring pointers, so add some debugging in case it happens.
This patch should be backported to stable kernels as old as 3.0 that
contain commit b63f4053cc8aa22a98e3f9a97845afe6c15d0a0d "xHCI:
handle command after aborting the command ring".
Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com> Cc: stable@vger.kernel.org
Josef Bacik [Mon, 22 Oct 2012 19:43:12 +0000 (15:43 -0400)]
Btrfs: Use btrfs_update_inode_fallback when creating a snapshot
On a really full file system I was getting ENOSPC back from
btrfs_update_inode when trying to update the parent inode when creating a
snapshot. Just use the fallback method so we can update the inode and not
have to worry about having a delayed ref. Thanks,
Alex Lyakas [Wed, 17 Oct 2012 13:52:47 +0000 (13:52 +0000)]
Btrfs: Send: preserve ownership (uid and gid) also for symlinks.
This patch also requires a change in the user-space part of "receive".
We need to use "lchown" instead of "chown". We will do this in the
following patch.
Signed-off-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
It is because of the nested chunk allocation. When we wrote data into the
filesystem, we had to allocate a data chunk because there was no data chunk
in the filesystem. At the end of the data chunk allocation, we had to insert
the metadata of the data chunk into the extent tree, but there was no raid1
chunk, so we tried to lock the chunk allocation mutex to allocate a new
chunk. We already held that mutex, so a deadlock occurred.
By rights, we should have allocated the raid1 chunk when we added the second
device, because the profile of the seed filesystem is raid1 and we had two
devices. But in fact we didn't, because the last step of the first device
insertion didn't commit the transaction. So when we added the second device,
we didn't COW the tree, and just inserted the related metadata into the leaves
which were generated by the first device insertion, whose profile was dup.
Fix this problem by committing the transaction at the end of the first
device insertion.
Lukas Czerner [Tue, 16 Oct 2012 09:34:36 +0000 (09:34 +0000)]
btrfs: Return EINVAL when length to trim is less than FSB
Currently, if the len argument in btrfs_ioctl_fitrim() is smaller than
one FSB, we will continue and finally return 0 bytes discarded.
However, if the length to discard is smaller than a file system block,
we should really return EINVAL.
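A minimal sketch of the kind of check described, written as standalone C. The function name and the block_size parameter are hypothetical and stand in for whatever the ioctl handler actually uses to obtain the file system block size.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical range check: reject a trim request that is shorter than
 * one file system block. block_size stands in for the FSB size.
 */
static int check_trim_len(uint64_t len, uint64_t block_size)
{
    if (len < block_size)
        return -EINVAL;
    return 0;
}
```

With a 4096-byte block, a request of 4095 bytes is rejected with -EINVAL instead of succeeding with zero bytes discarded, while a request of 4096 bytes passes the check.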