Rainer Weikusat [Thu, 11 Feb 2016 19:37:27 +0000 (19:37 +0000)]
af_unix: Guard against other == sk in unix_dgram_sendmsg
The unix_dgram_sendmsg routine use the following test
if (unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
to determine if sk and other are in an n:1 association (either
established via connect or by using sendto to send messages to an
unrelated socket identified by address). This isn't correct as the
specified address could have been bound to the sending socket itself or
because this socket could have been connected to itself by the time of
the unix_peer_get but disconnected before the unix_state_lock(other). In
both cases, the if-block would be entered despite other == sk which
might either block the sender unintentionally or lead to trying to unlock
the same spin lock twice for a non-blocking send. Add a other != sk
check to guard against this.
Fixes: 7d267278a9ec ("unix: avoid use-after-free in ep_remove_wait_queue") Reported-By: Philipp Hahn <pmhahn@pmhahn.de> Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com> Tested-by: Philipp Hahn <pmhahn@pmhahn.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Change it such that err is only set if an error condition was detected.
Fixes: 3822b5c2fc62 ("af_unix: Revert 'lock_interruptible' in stream receive code") Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com> Signed-off-by: Rainer Weikusat <rweikusat@mobileactivedefense.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Michael McConville <mmcco@mykolab.com> Acked-by: Francois Romieu <romieu@fr.zoreil.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 5 Feb 2016 19:07:14 +0000 (14:07 -0500)]
net: dsa: mv88e6xxx: do not leave reserved VLANs
BRIDGE_VLAN_FILTERING automatically adds a newly bridged port to the
VLAN with the bridge's default_pvid.
The mv88e6xxx driver currently reserves VLANs 4000+ for unbridged ports
isolation. When a port joins a bridge, it leaves its reserved VLAN. When
a port leaves a bridge, it joins again its reserved VLAN.
But if the VLAN filtering is disabled, or if this hardware VLAN is
already in use, the bridged port ends up with no default VLAN, and the
communication with the CPU is thus broken.
To fix this, make a port join its reserved VLAN once on setup, never
leave it, and restore its PVID after another one was eventually used.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Fri, 5 Feb 2016 19:04:39 +0000 (14:04 -0500)]
net: dsa: mv88e6xxx: fix software VLAN deletion
The current bridge code calls switchdev_port_obj_del on a VLAN port even
if the corresponding switchdev_port_obj_add call returned -EOPNOTSUPP.
If the DSA driver doesn't return -EOPNOTSUPP for a software port VLAN in
its port_vlan_del function, the VLAN is not deleted. Unbridging the port
also generates a stack trace for the same reason.
This can be quickly tested on a VLAN filtering enabled system with:
To fix this, return -EOPNOTSUPP in _mv88e6xxx_port_vlan_del instead of
-ENOENT if the hardware VLAN doesn't exist or the port is not a member.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Commit c0eb454034aa ("hv_netvsc: Don't ask for additional head room in the
skb") got rid of needed_headroom setting for the driver. With the change I
hit the following issue trying to use ptkgen module:
Basically, we're calling skb_cow_head(skb, RNDIS_AND_PPI_SIZE) and crash on
if (skb_shared(skb))
BUG();
We probably need to restore needed_headroom setting (but shrunk to
RNDIS_AND_PPI_SIZE as we don't need more) to request the required headroom
space. In theory, it should not give us performance penalty.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 13 Feb 2016 11:02:25 +0000 (06:02 -0500)]
Merge branch 'mvneta-fixes'
Gregory CLEMENT says:
====================
mvneta fixes for SMP
Following this bug report:
http://thread.gmane.org/gmane.linux.ports.arm.kernel/468173 and the
suggestions from Russell King, I reviewed all the code involving
multi-CPU. It ended with this series of patches which should improve
the stability of the driver.
During my test I found another bug which is fixed by new patch (the
second one of this new version of the series)
The two first patches fix real bugs, the others fix potential issues
in the driver.
Changelog:
v1 -> v2
Fix spinlock comment. Pointed by David Miller
v2 -> v3
- Fix typos and mistake in the comments. Pointed by Sergei Shtylyov
- Add a new patch fixing the CPU choice in mvneta_percpu_elect
- Use lock in last patch to prevent remaining race condition. Pointed
by Jisheng
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Gregory CLEMENT [Thu, 4 Feb 2016 21:09:29 +0000 (22:09 +0100)]
net: mvneta: Fix race condition during stopping
When stopping the port, the CPU notifier are still there whereas the
mvneta_stop_dev function calls mvneta_percpu_disable() on each CPUs.
It was possible to have a new CPU coming at this point which could be
racy.
This patch adds a flag preventing executing the code notifier for a new
CPU when the port is stopping. It also uses the spinlock introduces
previously. To avoid the deadlock, the lock has been moved outside the
mvneta_percpu_elect function.
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gregory CLEMENT [Thu, 4 Feb 2016 21:09:28 +0000 (22:09 +0100)]
net: mvneta: The mvneta_percpu_elect function should be atomic
Electing a CPU must be done in an atomic way: it should be done after or
before the removal/insertion of a CPU and this function is not reentrant.
During the loop of mvneta_percpu_elect we associates the queues to the
CPUs, if there is a topology change during this loop, then the mapping
between the CPUs and the queues could be wrong. During this loop the
interrupt mask is also updating for each CPUs, It should not be changed
in the same time by other part of the driver.
This patch adds spinlock to create the needed critical sections.
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gregory CLEMENT [Thu, 4 Feb 2016 21:09:27 +0000 (22:09 +0100)]
net: mvneta: Modify the queue related fields from each cpu
In the MVNETA_INTR_* registers, the queues related fields are per cpu,
according to the datasheet (comment in [] are added by me):
"In a multi-CPU system, bits of RX[or TX] queues for which the access by
the reading[or writing] CPU is disabled are read as 0, and cannot be
cleared[or written]."
That means that each time we want to manipulate these bits we had to do
it on each cpu and not only on the current cpu.
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gregory CLEMENT [Thu, 4 Feb 2016 21:09:26 +0000 (22:09 +0100)]
net: mvneta: Remove unused code
Since the commit 2dcf75e2793c ("net: mvneta: Associate RX queues with
each CPU") all the percpu irq are used and disabled at initialization, so
there is no point to disable them first.
Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gregory CLEMENT [Thu, 4 Feb 2016 21:09:25 +0000 (22:09 +0100)]
net: mvneta: Use on_each_cpu when possible
Instead of using a for_each_* loop in which we just call the
smp_call_function_single macro, it is more simple to directly use the
on_each_cpu macro. Moreover, this macro ensures that the calls will be
done all at once.
Suggested-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gregory CLEMENT [Thu, 4 Feb 2016 21:09:24 +0000 (22:09 +0100)]
net: mvneta: Fix the CPU choice in mvneta_percpu_elect
When passing to the management of multiple RX queue, the
mvneta_percpu_elect function was broken. The use of the modulo can lead
to elect the wrong cpu. For example with rxq_def=2, if the CPU 2 goes
offline and then online, we ended with the third RX queue activated in
the same time on CPU 0 and CPU2, which lead to a kernel crash.
With this fix, we don't try to get "the closer" CPU if the default CPU is
gone, now we just use CPU 0 which always be there. Thanks to this, the
code becomes more readable, easier to maintain and more predicable.
Cc: stable@vger.kernel.org Fixes: 2dcf75e2793c ("net: mvneta: Associate RX queues with each CPU") Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Gregory CLEMENT [Thu, 4 Feb 2016 21:09:23 +0000 (22:09 +0100)]
net: mvneta: Fix for_each_present_cpu usage
This patch convert the for_each_present in on_each_cpu, instead of
applying on the present cpus it will be applied only on the online cpus.
This fix a bug reported on
http://thread.gmane.org/gmane.linux.ports.arm.kernel/468173.
Using the macro on_each_cpu (instead of a for_each_* loop) also ensures
that all the calls will be done all at once.
Fixes: f86428854480 ("net: mvneta: Statically assign queues to CPUs") Reported-by: Stefan Roese <stefan.roese@gmail.com> Suggested-by: Jisheng Zhang <jszhang@marvell.com> Suggested-by: Russell King <rmk+kernel@arm.linux.org.uk> Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com> Signed-off-by: David S. Miller <davem@davemloft.net>
transport->stream_enqueue may call copy_to_user so it should
not be called inside a prepare_to_wait. Narrow the scope of
the prepare_to_wait to avoid the bad call. This also applies
to vsock_stream_recvmsg as well.
Reported-by: Vinson Lee <vlee@freedesktop.org> Tested-by: Vinson Lee <vlee@freedesktop.org> Signed-off-by: Laura Abbott <labbott@fedoraproject.org> Signed-off-by: David S. Miller <davem@davemloft.net>
This function may also return -1 after calling mpp2_prs_tcam_port_map_get.
So that the function consistently returns meaningful error values on
failure, the -1 is changed to -EINVAL.
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The single call site of the containing function checks whether the
returned value is -1, so this check is changed as well. The single call
site of this call site, however, only checks whether the value is not 0,
so no further change was required.
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jay Vosburgh [Tue, 2 Feb 2016 21:35:56 +0000 (13:35 -0800)]
bonding: Fix ARP monitor validation
The current logic in bond_arp_rcv will accept an incoming ARP for
validation if (a) the receiving slave is either "active" (which includes
the currently active slave, or the current ARP slave) or, (b) there is a
currently active slave, and it has received an ARP since it became active.
For case (b), the receiving slave isn't the currently active slave, and is
receiving the original broadcast ARP request, not an ARP reply from the
target.
This logic can fail if there is no currently active slave. In
this situation, the ARP probe logic cycles through all slaves, assigning
each in turn as the "current_arp_slave" for one arp_interval, then setting
that one as "active," and sending an ARP probe from that slave. The
current logic expects the ARP reply to arrive on the sending
current_arp_slave, however, due to switch FDB updating delays, the reply
may be directed to another slave.
This can arise if the bonding slaves and switch are working, but
the ARP target is not responding. When the ARP target recovers, a
condition may result wherein the ARP target host replies faster than the
switch can update its forwarding table, causing each ARP reply to be sent
to the previous current_arp_slave. This will never pass the logic in
bond_arp_rcv, as neither of the above conditions (a) or (b) are met.
Some experimentation on a LAN shows ARP reply round trips in the
200 usec range, but my available switches never update their FDB in less
than 4000 usec.
This patch changes the logic in bond_arp_rcv to additionally
accept an ARP reply for validation on any slave if there is a current ARP
slave and it sent an ARP probe during the previous arp_interval.
Fixes: aeea64ac717a ("bonding: don't trust arp requests unless active slave really works") Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 11 Feb 2016 19:25:55 +0000 (11:25 -0800)]
Merge tag 'gpio-v4.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
- Probe errorpath fix for the Altera
- irqchip ofnode pointer added to the DaVinci driver
- controller instance number correction for DaVinci
* tag 'gpio-v4.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio: davinci: Fix the number of controllers allocated
gpio: davinci: Add the missing of-node pointer
gpio: gpio-altera: Remove gpiochip on probe failure.
Linus Torvalds [Thu, 11 Feb 2016 19:17:19 +0000 (11:17 -0800)]
Merge tag 'platform-drivers-x86-v4.5-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
Pull x86 platform driver fixes from Darren Hart:
"Just two small fixes for the 4.5-rc cycle:
intel_scu_ipcutil:
- underflow in scu_reg_access()
intel-hid:
- fix incorrect entries in intel_hid_keymap"
* tag 'platform-drivers-x86-v4.5-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86:
intel_scu_ipcutil: underflow in scu_reg_access()
intel-hid: fix incorrect entries in intel_hid_keymap
1) Fix BPF handling of branch offset adjustmnets on backjumps, from
Daniel Borkmann.
2) Make sure selinux knows about SOCK_DESTROY netlink messages, from
Lorenzo Colitti.
3) Fix openvswitch tunnel mtu regression, from David Wragg.
4) Fix ICMP handling of TCP sockets in syn_recv state, from Eric
Dumazet.
5) Fix SCTP user hmacid byte ordering bug, from Xin Long.
6) Fix recursive locking in ipv6 addrconf, from Subash Abhinov
Kasiviswanathan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
bpf: fix branch offset adjustment on backjumps after patching ctx expansion
vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices
geneve: Relax MTU constraints
vxlan: Relax MTU constraints
flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen
of: of_mdio: Add marvell, 88e1145 to whitelist of PHY compatibilities.
selinux: nlmsgtab: add SOCK_DESTROY to the netlink mapping tables
sctp: translate network order to host order when users get a hmacid
enic: increment devcmd2 result ring in case of timeout
tg3: Fix for tg3 transmit queue 0 timed out when too many gso_segs
net:Add sysctl_max_skb_frags
tcp: do not drop syn_recv on all icmp reports
ipv6: fix a lockdep splat
unix: correctly track in-flight fds in sending process user_struct
update be2net maintainers' email addresses
dwc_eth_qos: Reset hardware before PHY start
ipv6: addrconf: Fix recursive spin lock call
Linus Torvalds [Wed, 10 Feb 2016 23:11:08 +0000 (15:11 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma fixes from Doug Ledford:
"A few more minor fixes for rc3:
- One fix to ipoib
- One fix to core sysfs code
- Four patches that resolve an oops found in testing of ocrdma and a
couple other ocrdma issues"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
RDMA/ocrdma: Fixing ocrdma debugfs directory remove
RDMA/ocrdma: Fix pkey_index returned by driver in rq work completion
RDMA/ocrdma: populate max_sge_rd in device attributes
RDMA/ocrdma: Initialize stats resources in the driver before ib device registration.
IB/sysfs: remove unused va_list args
IB/IPoIB: Do not set skb truesize since using one linearskb
Daniel Borkmann [Wed, 10 Feb 2016 15:47:11 +0000 (16:47 +0100)]
bpf: fix branch offset adjustment on backjumps after patching ctx expansion
When ctx access is used, the kernel often needs to expand/rewrite
instructions, so after that patching, branch offsets have to be
adjusted for both forward and backward jumps in the new eBPF program,
but for backward jumps it fails to account the delta. Meaning, for
example, if the expansion happens exactly on the insn that sits at
the jump target, it doesn't fix up the back jump offset.
Analysis on what the check in adjust_branches() is currently doing:
/* adjust offset of jmps if necessary */
if (i < pos && i + insn->off + 1 > pos)
insn->off += delta;
else if (i > pos && i + insn->off + 1 < pos)
insn->off -= delta;
First case is if we cross pos-boundary and the jump instruction was
before pos. This is handeled correctly. I.e. if i == pos, then this
would mean our jump that we currently check was the patchlet itself
that we just injected. Since such patchlets are self-contained and
have no awareness of any insns before or after the patched one, the
delta is correctly not adjusted. Also, for the second condition in
case of i + insn->off + 1 == pos, means we jump to that newly patched
instruction, so no offset adjustment are needed. That part is correct.
Second interesting case is where we cross pos-boundary and the jump
instruction was after pos. Backward jump with i == pos would be
impossible and pose a bug somewhere in the patchlet, so the first
condition checking i > pos is okay only by itself. However, i +
insn->off + 1 < pos does not always work as intended to trigger the
adjustment. It works when jump targets would be far off where the
delta wouldn't matter. But, for example, where the fixed insn->off
before pointed to pos (target_Y), it now points to pos + delta, so
that additional room needs to be taken into account for the check.
This means that i) both tests here need to be adjusted into pos + delta,
and ii) for the second condition, the test needs to be <= as pos
itself can be a target in the backjump, too.
Fixes: 9bac3d6d548e ("bpf: allow extended BPF programs access skb fields") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 10 Feb 2016 20:04:59 +0000 (12:04 -0800)]
Merge branch 'for-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
Pull libata fixes from Tejun Heo:
- PORTS_IMPL workaround for very early ahci controllers is misbehaving
on new systems. Disabled on recent ahci versions.
- Old-style PIO state machine had a horrible locking problem. Don't
know how we've been getting away this far. Fixed.
- Other device specific updates.
* 'for-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata:
ahci: Intel DNV device IDs SATA
libata: fix sff host state machine locking while polling
libata-sff: use WARN instead of BUG on illegal host state machine state
libata: disable forced PORTS_IMPL for >= AHCI 1.3
libata: blacklist a Viking flash model for MWDMA corruption
drivers: ata: wake port before DMA stop for ALPM
Linus Torvalds [Wed, 10 Feb 2016 19:36:19 +0000 (11:36 -0800)]
Merge branch 'for-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- The destruction path of cgroup objects are asynchronous and
multi-staged and some of them ended up destroying parents before
children leading to failures in cpu and memory controllers. Ensure
that parents are always destroyed after children.
- cpuset mm node migration was performed synchronously while holding
threadgroup and cgroup mutexes and the recent threadgroup locking
update resulted in a possible deadlock. The migration is best effort
and shouldn't have been performed under those locks to begin with.
Made asynchronous.
- Minor documentation fix.
* 'for-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
Documentation: cgroup: Fix 'cgroup-legacy' -> 'cgroup-v1'
cgroup: make sure a parent css isn't freed before its children
cgroup: make sure a parent css isn't offlined before its children
cpuset: make mm migration asynchronous
Linus Torvalds [Wed, 10 Feb 2016 19:04:05 +0000 (11:04 -0800)]
Merge branch 'for-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
"Workqueue fixes for v4.5-rc3.
- Remove a spurious triggering of flush dependency warning.
- Officially break local execution guarantee of unbound work items
and add a debug feature to flush out usages which depend on it.
- Work around CPU -> NODE mapping becoming invalid on CPU offline.
The branch is young but pushing out early as stable kernels are being
affected"
* 'for-4.5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
workqueue: implement "workqueue.debug_force_rr_cpu" debug feature
workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs
Revert "workqueue: make sure delayed work run in local cpu"
workqueue: skip flush dependency checks for legacy workqueues
Tejun Heo [Wed, 3 Feb 2016 18:54:25 +0000 (13:54 -0500)]
workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookup
When looking up the pool_workqueue to use for an unbound workqueue,
workqueue assumes that the target CPU is always bound to a valid NUMA
node. However, currently, when a CPU goes offline, the mapping is
destroyed and cpu_to_node() returns NUMA_NO_NODE.
This has always been broken but hasn't triggered often enough before 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu").
After the commit, workqueue forcifully assigns the local CPU for
delayed work items without explicit target CPU to fix a different
issue. This widens the window where CPU can go offline while a
delayed work item is pending causing delayed work items dispatched
with target CPU set to an already offlined CPU. The resulting
NUMA_NO_NODE mapping makes workqueue try to queue the work item on a
NULL pool_workqueue and thus crash.
While 874bbfe600a6 has been reverted for a different reason making the
bug less visible again, it can still happen. Fix it by mapping
NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node().
This is a temporary workaround. The long term solution is keeping CPU
-> NODE mapping stable across CPU off/online cycles which is being
worked on.
David S. Miller [Wed, 10 Feb 2016 10:50:16 +0000 (05:50 -0500)]
Merge branch 'ovs-tunnel-mtu'
David Wragg says:
====================
Set a large MTU on ovs-created tunnel devices
Prior to 4.3, openvswitch tunnel vports (vxlan, gre and geneve) could
transmit vxlan packets of any size, constrained only by the ability to
send out the resulting packets. 4.3 introduced netdevs corresponding
to tunnel vports. These netdevs have an MTU, which limits the size of
a packet that can be successfully encapsulated. The default MTU
values are low (1500 or less), which is awkwardly small in the context
of physical networks supporting jumbo frames, and leads to a
conspicuous change in behaviour for userspace.
This patch series sets the MTU on openvswitch-created netdevs to be
the relevant maximum (i.e. the maximum IP packet size minus any
relevant overhead), effectively restoring the behaviour prior to 4.3.
Where relevant, the limits on MTU values that can be directly set on
the netdevs are also relaxed.
Changes in v2:
* Extend to all openvswitch tunnel types, i.e. gre and geneve as well
* Use IP_MAX_MTU
Changes in v3:
* Fix block comment style
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David Wragg [Wed, 10 Feb 2016 00:05:58 +0000 (00:05 +0000)]
vxlan, gre, geneve: Set a large MTU on ovs-created tunnel devices
Prior to 4.3, openvswitch tunnel vports (vxlan, gre and geneve) could
transmit vxlan packets of any size, constrained only by the ability to
send out the resulting packets. 4.3 introduced netdevs corresponding
to tunnel vports. These netdevs have an MTU, which limits the size of
a packet that can be successfully encapsulated. The default MTU
values are low (1500 or less), which is awkwardly small in the context
of physical networks supporting jumbo frames, and leads to a
conspicuous change in behaviour for userspace.
Instead, set the MTU on openvswitch-created netdevs to be the relevant
maximum (i.e. the maximum IP packet size minus any relevant overhead),
effectively restoring the behaviour prior to 4.3.
Signed-off-by: David Wragg <david@weave.works> Signed-off-by: David S. Miller <davem@davemloft.net>
David Wragg [Wed, 10 Feb 2016 00:05:57 +0000 (00:05 +0000)]
geneve: Relax MTU constraints
Allow the MTU of geneve devices to be set to large values, in order to
exploit underlying networks with larger frame sizes.
GENEVE does not have a fixed encapsulation overhead (an openvswitch
rule can add variable length options), so there is no relevant maximum
MTU to enforce. A maximum of IP_MAX_MTU is used instead.
Encapsulated packets that are too big for the underlying network will
get dropped on the floor.
Signed-off-by: David Wragg <david@weave.works> Signed-off-by: David S. Miller <davem@davemloft.net>
David Wragg [Wed, 10 Feb 2016 00:05:55 +0000 (00:05 +0000)]
vxlan: Relax MTU constraints
Allow the MTU of vxlan devices without an underlying device to be set
to larger values (up to a maximum based on IP packet limits and vxlan
overhead).
Previously, their MTUs could not be set to higher than the
conventional ethernet value of 1500. This is a very arbitrary value
in the context of vxlan, and prevented vxlan devices from being able
to take advantage of jumbo frames etc.
The default MTU remains 1500, for compatibility.
Signed-off-by: David Wragg <david@weave.works> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lokesh Vutla [Thu, 28 Jan 2016 13:38:51 +0000 (19:08 +0530)]
gpio: davinci: Fix the number of controllers allocated
Driver only needs to allocate for [ngpio / 32] controllers,
as each controller handles 32 gpios. But the current driver
allocates for ngpio of which the extra allocated are unused.
Fix it be registering only the required number of controllers.
Keerthy [Thu, 28 Jan 2016 13:38:50 +0000 (19:08 +0530)]
gpio: davinci: Add the missing of-node pointer
Currently the first parameter of irq_domain_add_legacy is NULL.
irq_find_host function returns NULL when we do not populate the of_node
and hence irq_of_parse_and_map call fails whenever we want to request a
gpio irq. This fixes the request_irq failures for gpio interrupts.
Linus Torvalds [Wed, 10 Feb 2016 00:40:59 +0000 (16:40 -0800)]
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module fixes from Rusty Russell:
"Fix for async_probe module param added in 4.3 (clearly not widely used
yet), and a much more interesting kallsyms race which has been around
approximately forever. This fix is more invasive, and will require
some care in backporting, but I hated all the bandaids I could think
of, so...
There are some more coming, which are only for breakages introduced
this cycle (livepatch), but wanted these in now"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
modules: fix longstanding /proc/kallsyms vs module insertion race.
module: wrapper for symbol name.
modules: fix modparam async_probe request
drivers/input/touchscreen/colibri-vf50-ts.c: In function ‘vf50_ts_probe’:
drivers/input/touchscreen/colibri-vf50-ts.c:302: error: implicit declaration of function ‘of_property_read_u32’
The adp5589 has row 5, don't skip it when creating the GPIO mapping.
Otherwise the pin gets reserved as used and it is not possible to use it as
a GPIO.
Philipp Zabel [Tue, 9 Feb 2016 17:32:42 +0000 (09:32 -0800)]
Input: edt-ft5x06 - fix setting gain, offset, and threshold via device tree
A recent patch broke parsing the gain, offset, and threshold parameters
from device tree. Instead of setting the cached values and writing them
to the correct registers during probe, it would write the values from DT
into the register address variables and never write them to the chip
during normal operation.
Fixes: 2e23b7a96372 ("Input: edt-ft5x06 - use generic properties API") Signed-off-by: Philipp Zabel <p.zabel@pengutronix.de> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Workqueue used to guarantee local execution for work items queued
without explicit target CPU. The guarantee is gone now which can
break some usages in subtle ways. To flush out those cases, this
patch implements a debug feature which forces round-robin CPU
selection for all such work items.
The debug feature defaults to off and can be enabled with a kernel
parameter. The default can be flipped with a debug config option.
If you hit this commit during bisection, please refer to 041bd12e272c
("Revert "workqueue: make sure delayed work run in local cpu"") for
more information and ping me.
Mike Galbraith [Tue, 9 Feb 2016 22:59:38 +0000 (17:59 -0500)]
workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs
WORK_CPU_UNBOUND work items queued to a bound workqueue always run
locally. This is a good thing normally, but not when the user has
asked us to keep unbound work away from certain CPUs. Round robin
these to wq_unbound_cpumask CPUs instead, as perturbation avoidance
trumps performance.
tj: Cosmetic and comment changes. WARN_ON_ONCE() dropped from empty
(wq_unbound_cpumask AND cpu_online_mask). If we want that, it
should be done when config changes.
Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
Workqueue used to implicity guarantee that work items queued without
explicit CPU specified are put on the local CPU. Recent changes in
timer broke the guarantee and led to vmstat breakage which was fixed
by 176bed1de5bf ("vmstat: explicitly schedule per-cpu work on the CPU
we need it to run on").
vmstat is the most likely to expose the issue and it's quite possible
that there are other similar problems which are a lot more difficult
to trigger. As a preventive measure, 874bbfe600a6 ("workqueue: make
sure delayed work run in local cpu") was applied to restore the local
CPU guarnatee. Unfortunately, the change exposed a bug in timer code
which got fixed by 22b886dd1018 ("timers: Use proper base migration in
add_timer_on()"). Due to code restructuring, the commit couldn't be
backported beyond certain point and stable kernels which only had 874bbfe600a6 started crashing.
The local CPU guarantee was accidental more than anything else and we
want to get rid of it anyway. As, with the vmstat case fixed, 874bbfe600a6 is causing more problems than it's fixing, it has been
decided to take the chance and officially break the guarantee by
reverting the commit. A debug feature will be added to force foreign
CPU assignment to expose cases relying on the guarantee and fixes for
the individual cases will be backported to stable as necessary.
Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu") Link: http://lkml.kernel.org/g/20160120211926.GJ10810@quack.suse.cz Cc: stable@vger.kernel.org Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Daniel Bilik <daniel.bilik@neosystem.cz> Cc: Jan Kara <jack@suse.cz> Cc: Shaohua Li <shli@fb.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Daniel Bilik <daniel.bilik@neosystem.cz> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Michal Hocko <mhocko@kernel.org>
Linus Torvalds [Tue, 9 Feb 2016 18:15:16 +0000 (10:15 -0800)]
Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
"This fixes the following issues:
API:
- Fix async algif_skcipher, it was broken by recent fixes.
- Fix potential race condition in algif_skcipher with ctx.
- Fix potential memory corruption in algif_skcipher.
- Add missing lock to crypto_user when doing an alg dump.
Drivers:
- marvell/cesa was testing the wrong variable for NULL after
allocation.
- Fix potential double-free in atmel-sha.
- Fix illegal call to sleepin function from atomic context in
atmel-sha"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: marvell/cesa - fix test in mv_cesa_dev_dma_init()
crypto: atmel-sha - remove calls of clk_prepare() from atomic contexts
crypto: atmel-sha - fix atmel_sha_remove()
crypto: algif_skcipher - Do not set MAY_BACKLOG on the async path
crypto: algif_skcipher - Do not dereference ctx without socket lock
crypto: algif_skcipher - Do not assume that req is unchanged
crypto: user - lock crypto_alg_list on alg dump
J. Bruce Fields [Wed, 10 Jul 2013 20:54:34 +0000 (16:54 -0400)]
scripts: add "prune-kernel" script to clean up old kernel images
Long ago, Dave Jones complained about CONFIG_LOCALVERSION_AUTO:
"I don't use the auto config, because I end up filling up /boot unless
I go through and clean them out by hand every time I install a new one
(which I do probably a dozen or so times a day). Is there some easy
way to prune old builds I'm missing?"
To which Bruce replied:
"I run this by hand every now and then. I'm probably doing it all wrong"
And if he is running it wrong, then so am I - because I've been using
this script ever since. It is true that CONFIG_LOCALVERSION_AUTO easily
ends up filling your /boot partition if you don't clean up old versions
regularly, and this script helps make that easier.
Checked with Bruce to see that it's fine to add this to the kernel
scripts. Maybe people will come up with enhancements, but more
importantly, this way I won't misplace this script whenever I install a
new machine and start doing custom kernels for it.
Alexander Duyck [Tue, 9 Feb 2016 10:49:54 +0000 (02:49 -0800)]
flow_dissector: Fix unaligned access in __skb_flow_dissector when used by eth_get_headlen
This patch fixes an issue with unaligned accesses when using
eth_get_headlen on a page that was DMA aligned instead of being IP aligned.
The fact is when trying to check the length we don't need to be looking at
the flow label so we can reorder the checks to first check if we are
supposed to gather the flow label and then make the call to actually get
it.
v2: Updated path so that either STOP_AT_FLOW_LABEL or KEY_FLOW_LABEL can
cause us to check for the flow label.
Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Aaro Koskinen [Wed, 3 Feb 2016 19:35:29 +0000 (21:35 +0200)]
of: of_mdio: Add marvell, 88e1145 to whitelist of PHY compatibilities.
Commit ae461131960b ("of: of_mdio: Add a whitelist of PHY
compatibilities.") missed one compatible string used in in-tree DTBs:
in OCTEON, for selected boards, the kernel DTB pruning code will overwrite
the DTB compatible string with "marvell,88e1145", which is missing
from the whitelist. Add it.
The patch fixes broken networking on EdgeRouter Lite.
Fixes: ae461131960b ("of: of_mdio: Add a whitelist of PHY compatibilities.") Signed-off-by: Aaro Koskinen <aaro.koskinen@iki.fi> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Wed, 3 Feb 2016 15:33:30 +0000 (23:33 +0800)]
sctp: translate network order to host order when users get a hmacid
Commit ed5a377d87dc ("sctp: translate host order to network order when
setting a hmacid") corrected the hmacid byte-order when setting a hmacid.
but the same issue also exists on getting a hmacid.
We fix it by changing hmacids to host order when users get them with
getsockopt.
Fixes: Commit ed5a377d87dc ("sctp: translate host order to network order when setting a hmacid") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sandeep Pillai [Wed, 3 Feb 2016 09:10:44 +0000 (14:40 +0530)]
enic: increment devcmd2 result ring in case of timeout
Firmware posts the devcmd result in result ring. In case of timeout, driver
does not increment the current result pointer and firmware could post the
result after timeout has occurred. During next devcmd, driver would be
reading the result of previous devcmd.
Fix this by incrementing result even in case of timeout.
Fixes: 373fb0873d43 ("enic: add devcmd2") Signed-off-by: Sandeep Pillai <sanpilla@cisco.com> Signed-off-by: Govindarajulu Varadarajan <_govind@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tg3: Fix for tg3 transmit queue 0 timed out when too many gso_segs
tg3_tso_bug() can hit a condition where the entire tx ring is not big
enough to segment the GSO packet. For example, if MSS is very small,
gso_segs can exceed the tx ring size. When we hit the condition, it
will cause tx timeout.
tg3_tso_bug() is called to handle TSO and DMA hardware bugs.
For TSO bugs, if tg3_tso_bug() cannot succeed, we have to drop the packet.
For DMA bugs, we can still fall back to linearize the SKB and let the
hardware transmit the TSO packet.
This patch adds a function tg3_tso_bug_gso_check() to check if there
are enough tx descriptors for GSO before calling tg3_tso_bug().
The caller will then handle the error appropriately - drop or
lineraize the SKB.
v2: Corrected patch description to avoid confusion.
Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Acked-by: Prashant Sreedharan <prashant@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Devices may have limits on the number of fragments in an skb they support.
Current codebase uses a constant as maximum for number of fragments one
skb can hold and use.
When enabling scatter/gather and running traffic with many small messages
the codebase uses the maximum number of fragments and may thereby violate
the max for certain devices.
The patch introduces a global variable as max number of fragments.
Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Reviewed-by: HÃ¥kon Bugge <haakon.bugge@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 3 Feb 2016 03:31:12 +0000 (19:31 -0800)]
tcp: do not drop syn_recv on all icmp reports
Petr Novopashenniy reported that ICMP redirects on SYN_RECV sockets
were leading to RST.
This is of course incorrect.
A specific list of ICMP messages should be able to drop a SYN_RECV.
For instance, a REDIRECT on SYN_RECV shall be ignored, as we do
not hold a dst per SYN_RECV pseudo request.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111751 Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Reported-by: Petr Novopashenniy <pety@rusnet.ru> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Mon, 8 Feb 2016 18:32:30 +0000 (10:32 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"KVM-ARM fixes, mostly coming from the PMU work"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
arm64: KVM: Fix guest dead loop when register accessor returns false
arm64: KVM: Fix comments of the CP handler
arm64: KVM: Fix wrong use of the CPSR MODE mask for 32bit guests
arm64: KVM: Obey RES0/1 reserved bits when setting CPTR_EL2
arm64: KVM: Fix AArch64 guest userspace exception injection
Linus Torvalds [Mon, 8 Feb 2016 18:20:06 +0000 (10:20 -0800)]
Merge tag 'regmap-fix-v4.5-big-endian' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap fix from Mark Brown:
"A single revert back to v4.4 endianness handling.
Commit 29bb45f25ff3 ("regmap-mmio: Use native endianness for
read/write") attempted to fix some long standing bugs in the MMIO
implementation for big endian systems caused by duplicate byte
swapping in both regmap and readl()/writel(). Sadly the fix makes
things worse rather than better, so revert it for now"
* tag 'regmap-fix-v4.5-big-endian' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: mmio: Revert to v4.4 endianness handling
Eric Dumazet [Wed, 3 Feb 2016 01:55:01 +0000 (17:55 -0800)]
ipv6: fix a lockdep splat
Silence lockdep false positive about rcu_dereference() being
used in the wrong context.
First one should use rcu_dereference_protected() as we own the spinlock.
Second one should be a normal assignation, as no barrier is needed.
Fixes: 18367681a10bd ("ipv6 flowlabel: Convert np->ipv6_fl_list to RCU.") Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
unix: correctly track in-flight fds in sending process user_struct
The commit referenced in the Fixes tag incorrectly accounted the number
of in-flight fds over a unix domain socket to the original opener
of the file-descriptor. This allows another process to arbitrary
deplete the original file-openers resource limit for the maximum of
open files. Instead the sending processes and its struct cred should
be credited.
To do so, we add a reference counted struct user_struct pointer to the
scm_fp_list and use it to account for the number of inflight unix fds.
Fixes: 712f4aad406bb1 ("unix: properly account for FDs passed over unix sockets") Reported-by: David Herrmann <dh.herrmann@gmail.com> Cc: David Herrmann <dh.herrmann@gmail.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Bonzini [Mon, 8 Feb 2016 15:20:51 +0000 (16:20 +0100)]
Merge tag 'kvm-arm-for-4.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
KVM/ARM fixes for v4.5-rc2
A few random fixes, mostly coming from the PMU work by Shannon:
- fix for injecting faults coming from the guest's userspace
- cleanup for our CPTR_EL2 accessors (reserved bits)
- fix for a bug impacting perf (user/kernel discrimination)
- fix for a 32bit sysreg handling bug
Linus Torvalds [Sun, 7 Feb 2016 23:23:20 +0000 (15:23 -0800)]
Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"The first real batch of fixes for this release cycle, so there are a
few more than usual.
Most of these are fixes and tweaks to board support (DT bugfixes,
etc). I've also picked up a couple of small cleanups that seemed
innocent enough that there was little reason to wait (const/
__initconst and Kconfig deps).
Quite a bit of the changes on OMAP were due to fixes to no longer
write to rodata from assembly when ARM_KERNMEM_PERMS was enabled, but
there were also other fixes.
Kirkwood had a bunch of gpio fixes for some boards. OMAP had RTC
fixes on OMAP5, and Nomadik had changes to MMC parameters in DT.
All in all, mostly the usual mix of various fixes"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (46 commits)
ARM: multi_v7_defconfig: enable DW_WATCHDOG
ARM: nomadik: fix up SD/MMC DT settings
ARM64: tegra: Add chosen node for tegra132 norrin
ARM: realview: use "depends on" instead of "if" after prompt
ARM: tango: use "depends on" instead of "if" after prompt
ARM: tango: use const and __initconst for smp_operations
ARM: realview: use const and __initconst for smp_operations
bus: uniphier-system-bus: revive tristate prompt
arm64: dts: Add missing DMA Abort interrupt to Juno
bus: vexpress-config: Add missing of_node_put
ARM: dts: am57xx: sbc-am57x: correct Eth PHY settings
ARM: dts: am57xx: cl-som-am57x: fix CPSW EMAC pinmux
ARM: dts: am57xx: sbc-am57x: fix UART3 pinmux
ARM: dts: am57xx: cl-som-am57x: update SPI Flash frequency
ARM: dts: am57xx: cl-som-am57x: set HOST mode for USB2
ARM: dts: am57xx: sbc-am57x: fix SB-SOM EEPROM I2C address
ARM: dts: LogicPD Torpedo: Revert Duplicative Entries
ARM: dts: am437x: pixcir_tangoc: use correct flags for irq types
ARM: dts: am4372: fix irq type for arm twd and global timer
ARM: dts: at91: sama5d4 xplained: fix phy0 IRQ type
...
Sathya Perla [Tue, 2 Feb 2016 13:10:10 +0000 (08:10 -0500)]
update be2net maintainers' email addresses
be2net maintainers' email addresses changed from avagotech.com to
broadcom.com starting today. While updating the list, I'm also adding
Somnath's name to the list.
Signed-off-by: Sathya Perla <sathya.perla@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 7 Feb 2016 06:14:46 +0000 (22:14 -0800)]
Merge tag 'usb-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some USB fixes for 4.5-rc3.
The usual, xhci fixes for reported issues, combined with some small
gadget driver fixes, and a MAINTAINERS file update. All have been in
linux-next with no reported issues"
* tag 'usb-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
xhci: harden xhci_find_next_ext_cap against device removal
xhci: Fix list corruption in urb dequeue at host removal
usb: host: xhci-plat: fix NULL pointer in probe for device tree case
usb: xhci-mtk: fix AHB bus hang up caused by roothubs polling
usb: xhci-mtk: fix bpkts value of LS/HS periodic eps not behind TT
usb: xhci: apply XHCI_PME_STUCK_QUIRK to Intel Broxton-M platforms
usb: xhci: set SSIC port unused only if xhci_suspend succeeds
usb: xhci: add a quirk bit for ssic port unused
usb: xhci: handle both SSIC ports in PME stuck quirk
usb: dwc3: gadget: set the OTG flag in dwc3 gadget driver.
Revert "xhci: don't finish a TD if we get a short-transfer event mid TD"
MAINTAINERS: fix my email address
usb: dwc2: Fix probe problem on bcm2835
Revert "usb: dwc2: Move reset into dwc2_get_hwparams()"
usb: musb: ux500: Fix NULL pointer dereference at system PM
usb: phy: mxs: declare variable with initialized value
usb: phy: msm: fix error handling in probe.
Linus Torvalds [Sun, 7 Feb 2016 06:13:16 +0000 (22:13 -0800)]
Merge tag 'staging-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging and IIO driver fixes from Greg KH:
"Here are some IIO and staging driver fixes for 4.5-rc3.
All of them, except one, are for IIO drivers, and one is for a speakup
driver fix caused by some earlier patches, to resolve a reported build
failure"
* tag 'staging-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
Staging: speakup: Fix allyesconfig build on mn10300
iio: dht11: Use boottime
iio: ade7753: avoid uninitialized data
iio: pressure: mpl115: fix temperature offset sign
iio: imu: Fix dependencies for !HAS_IOMEM archs
staging: iio: Fix dependencies for !HAS_IOMEM archs
iio: adc: Fix dependencies for !HAS_IOMEM archs
iio: inkern: fix a NULL dereference on error
iio:adc:ti_am335x_adc Fix buffered mode by identifying as software buffer.
iio: light: acpi-als: Report data as processed
iio: dac: mcp4725: set iio name property in sysfs
iio: add HAS_IOMEM dependency to VF610_ADC
iio: add IIO_TRIGGER dependency to STK8BA50
iio: proximity: lidar: correct return value
iio-light: Use a signed return type for ltr501_match_samp_freq()
Rabin Vincent [Tue, 2 Feb 2016 08:39:02 +0000 (09:39 +0100)]
dwc_eth_qos: Reset hardware before PHY start
The hardware reset is currently done after phy_start() is called,
leading to a race where we can lose the link status if the phy state
machine calls dwceqos_adjust_link() before we reset the MAC registers.
Acked-by: Lars Persson <larper@axis.com> Signed-off-by: Rabin Vincent <rabinv@axis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
A rcu stall with the following backtrace was seen on a system with
forwarding, optimistic_dad and use_optimistic set. To reproduce,
set these flags and allow ipv6 autoconf.
This occurs because the device write_lock is acquired while already
holding the read_lock. Back trace below -
v2: do addrconf_dad_kick inside read lock and then acquire write
lock for ipv6_ifa_notify as suggested by Eric
Fixes: 7fd2561e4ebdd ("net: ipv6: Add a sysctl to make optimistic
addresses useful candidates")
Cc: Eric Dumazet <edumazet@google.com> Cc: Erik Kline <ek@google.com> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Wed, 3 Feb 2016 13:39:27 +0000 (21:39 +0800)]
crypto: algif_skcipher - Do not set MAY_BACKLOG on the async path
The async path cannot use MAY_BACKLOG because it is not meant to
block, which is what MAY_BACKLOG does. On the other hand, both
the sync and async paths can make use of MAY_SLEEP.
Cc: stable@vger.kernel.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Herbert Xu [Wed, 3 Feb 2016 13:39:26 +0000 (21:39 +0800)]
crypto: algif_skcipher - Do not dereference ctx without socket lock
Any access to non-constant bits of the private context must be
done under the socket lock, in particular, this includes ctx->req.
This patch moves such accesses under the lock, and fetches the
tfm from the parent socket which is guaranteed to be constant,
rather than from ctx->req.
Cc: stable@vger.kernel.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Herbert Xu [Wed, 3 Feb 2016 13:39:24 +0000 (21:39 +0800)]
crypto: algif_skcipher - Do not assume that req is unchanged
The async path in algif_skcipher assumes that the crypto completion
function will be called with the original request. This is not
necessarily the case. In fact there is no need for this anyway
since we already embed information into the request with struct
skcipher_async_req.
This patch adds a pointer to that struct and then passes it as
the data to the callback function.
Cc: stable@vger.kernel.org Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Tadeusz Struk <tadeusz.struk@intel.com>
Mathias Krause [Mon, 1 Feb 2016 13:27:30 +0000 (14:27 +0100)]
crypto: user - lock crypto_alg_list on alg dump
We miss to take the crypto_alg_sem semaphore when traversing the
crypto_alg_list for CRYPTO_MSG_GETALG dumps. This allows a race with
crypto_unregister_alg() removing algorithms from the list while we're
still traversing it, thereby leading to a use-after-free as show below:
To trigger the race run the following loops simultaneously for a while:
$ while : ; do modprobe aesni-intel; rmmod aesni-intel; done
$ while : ; do crconf show all > /dev/null; done
Fix the race by taking the crypto_alg_sem read lock, thereby preventing
crypto_unregister_alg() from modifying the algorithm list during the
dump.
This bug has been detected by the PaX memory sanitize feature.
Cc: stable@vger.kernel.org Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: PaX Team <pageexec@freemail.hu> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Linus Torvalds [Sat, 6 Feb 2016 04:20:07 +0000 (20:20 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
"22 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT
radix-tree: fix oops after radix_tree_iter_retry
MAINTAINERS: trim the file triggers for ABI/API
dax: dirty inode only if required
thp: make deferred_split_scan() work again
mm: replace vma_lock_anon_vma with anon_vma_lock_read/write
ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup
um: asm/page.h: remove the pte_high member from struct pte_t
mm, hugetlb: don't require CMA for runtime gigantic pages
mm/hugetlb: fix gigantic page initialization/allocation
mm: downgrade VM_BUG in isolate_lru_page() to warning
mempolicy: do not try to queue pages from !vma_migratable()
mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress
vmstat: make vmstat_update deferrable
mm, vmstat: make quiet_vmstat lighter
mm/Kconfig: correct description of DEFERRED_STRUCT_PAGE_INIT
memblock: don't mark memblock_phys_mem_size() as __init
dump_stack: avoid potential deadlocks
mm: validate_mm browse_rb SMP race condition
m32r: fix build failure due to SMP and MMU
...
Linus Torvalds [Sat, 6 Feb 2016 03:52:57 +0000 (19:52 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
"We have a few wire protocol compatibility fixes, ports of a few recent
CRUSH mapping changes, and a couple error path fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: MOSDOpReply v7 encoding
libceph: advertise support for TUNABLES5
crush: decode and initialize chooseleaf_stable
crush: add chooseleaf_stable tunable
crush: ensure take bucket value is valid
crush: ensure bucket id is valid before indexing buckets array
ceph: fix snap context leak in error path
ceph: checking for IS_ERR instead of NULL
Linus Torvalds [Sat, 6 Feb 2016 03:38:15 +0000 (19:38 -0800)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Fixes all over the place:
- amdkfd: two static checker fixes
- mst: a bunch of static checker and spec/hw interaction fixes
- amdgpu: fix Iceland hw properly, and some fiji bugs, along with
some write-combining fixes.
- exynos: some regression fixes
- adv7511: fix some EDID reading issues"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (38 commits)
drm/dp/mst: deallocate payload on port destruction
drm/dp/mst: Reverse order of MST enable and clearing VC payload table.
drm/dp/mst: move GUID storage from mgr, port to only mst branch
drm/dp/mst: change MST detection scheme
drm/dp/mst: Calculate MST PBN with 31.32 fixed point
drm: Add drm_fixp_from_fraction and drm_fixp2int_ceil
drm/mst: Add range check for max_payloads during init
drm/mst: Don't ignore the MST PBN self-test result
drm: fix missing reference counting decrease
drm/amdgpu: disable uvd and vce clockgating on Fiji
drm/amdgpu: remove exp hardware support from iceland
drm/amdgpu: load MEC ucode manually on iceland
drm/amdgpu: don't load MEC2 on topaz
drm/amdgpu: drop topaz support from gmc8 module
drm/amdgpu: pull topaz gmc bits into gmc_v7
drm/amdgpu: The VI specific EXE bit should only apply to GMC v8.0 above
drm/amdgpu: iceland use CI based MC IP
drm/amdgpu: move gmc7 support out of CIK dependency
drm/amdgpu/gfx7: enable cp inst/reg error interrupts
drm/amdgpu/gfx8: enable cp inst/reg error interrupts
...
Linus Torvalds [Sat, 6 Feb 2016 02:11:23 +0000 (18:11 -0800)]
Merge tag 'pm+acpi-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI fixes from Rafael Wysocki:
"These are: a fix for a recently introduced false-positive warnings
about PM domain pointers being changed inappropriately (harmless but
annoying), an MCH size workaround quirk for one more platform, a
compiler warning fix (generic power domains framework), an ACPI LPSS
(Intel SoCs) driver fixup and a cleanup of the ACPI CPPC core code.
Specifics:
- PM core fix to avoid false-positive warnings generated when the
pm_domain field is cleared for a device that appears to be bound to
a driver (Rafael Wysocki).
- New MCH size workaround quirk for Intel Haswell-ULT (Josh Boyer).
- Fix for an "unused function" compiler warning in the generic power
domains framework (Ulf Hansson).
- Fixup for the ACPI driver for Intel SoCs (acpi-lpss) to set the PM
domain pointer of a device properly in one place that was
overlooked by a recent PM core update (Andy Shevchenko).
- Removal of a redundant function declaration in the ACPI CPPC core
code (Timur Tabi)"
* tag 'pm+acpi-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: Avoid false-positive warnings in dev_pm_domain_set()
PM / Domains: Silence compiler warning for an unused function
ACPI / CPPC: remove redundant mbox_send_message() declaration
ACPI / LPSS: set PM domain via helper setter
PNP: Add Haswell-ULT to Intel MCH size workaround
Jason Baron [Fri, 5 Feb 2016 23:37:04 +0000 (15:37 -0800)]
epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT
In the current implementation of the EPOLLEXCLUSIVE flag (added for
4.5-rc1), if epoll waiters create different POLL* sets and register them
as exclusive against the same target fd, the current implementation will
stop waking any further waiters once it finds the first idle waiter.
This means that waiters could miss wakeups in certain cases.
For example, when we wake up a pipe for reading we do:
wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM); So if
one epoll set or epfd is added to pipe p with POLLIN and a second set
epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the
wakeup since the current implementation will stop after it finds any
intersection of events with a waiter that is blocked in epoll_wait().
We could potentially address this by requiring all epoll waiters that
are added to p be required to pass the same set of POLL* events. IE the
first EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL*
flags to be used by any other epfds that are added as EPOLLEXCLUSIVE.
However, I think it might be somewhat confusing interface as we would
have to reference count the number of users for that set, and so
userspace would have to keep track of that count, or we would need a
more involved interface. It also adds some shared state that we'd have
store somewhere. I don't think anybody will want to bloat
__wait_queue_head for this.
I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE
such that it can only be specified with EPOLLIN and/or EPOLLOUT. So
that way if the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop
once we hit the first idle waiter that specifies the EPOLLIN bit, since
any remaining waiters that only have 'POLLOUT' set wouldn't need to be
woken. Likewise, we can do the same thing if 'POLLOUT' is in the wakeup
bit set and not 'POLLIN'. If both 'POLLOUT' and 'POLLIN' are set in the
wake bit set (there is at least one example of this I saw in fs/pipe.c),
then we just wake the entire exclusive list. Having both 'POLLOUT' and
'POLLIN' both set should not be on any performance critical path, so I
think that's ok (in fs/pipe.c its in pipe_release()). We also continue
to include EPOLLERR and EPOLLHUP by default in any exclusive set. Thus,
the user can specify EPOLLERR and/or EPOLLHUP but is not required to do
so.
Since epoll waiters may be interested in other events as well besides
EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by
doing a 'dup' call on the target fd and adding that as one normally
would with EPOLL_CTL_ADD. Since I think that the POLLIN and POLLOUT
events are what we are interest in balancing, I think that the 'dup'
thing could perhaps be added to only one of the waiter threads.
However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should be
sufficient for the majority of use-cases.
Since EPOLLEXCLUSIVE is intended to be used with a target fd shared
among multiple epfds, where between 1 and n of the epfds may receive an
event, it does not satisfy the semantics of EPOLLONESHOT where only 1
epfd would get an event. Thus, it is not allowed to be specified in
conjunction with EPOLLEXCLUSIVE.
EPOLL_CTL_MOD is also not allowed if the fd was previously added as
EPOLLEXCLUSIVE. It seems with the limited number of flags to not be as
interesting, but this could be relaxed at some further point.
Signed-off-by: Jason Baron <jbaron@akamai.com> Tested-by: Madars Vitolins <m@silodev.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Eric Wong <normalperson@yhbt.net> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Helper radix_tree_iter_retry() resets next_index to the current index.
In following radix_tree_next_slot current chunk size becomes zero. This
isn't checked and it tries to dereference null pointer in slot.
Tagged iterator is fine because retry happens only at slot 0 where tag
bitmask in iter->tags is filled with single bit.
Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup") Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ohad Ben-Cohen <ohad@wizery.com> Cc: Jeremiah Mahler <jmmahler@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit ea8f8fc8631 ("MAINTAINERS: add linux-api for review of API/ABI
changes") added file triggers for various paths that likely indicated
API/ABI changes. However, catching all changes in Documentation/ABI/
and include/uapi/ produces a large volume of mail to linux-api, rather
than only API/ABI changes. Drop those two entries, but leave
include/linux/syscalls.h and kernel/sys_ni.c to catch syscall-related
changes.
Dmitry Monakhov [Fri, 5 Feb 2016 23:36:55 +0000 (15:36 -0800)]
dax: dirty inode only if required
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to iterate over split_queue, not local empty list to get
anything split from the shrinker.
Fixes: e3ae19535c66 ("thp: limit number of object to scan on deferred_split_scan()") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: replace vma_lock_anon_vma with anon_vma_lock_read/write
Sequence vma_lock_anon_vma() - vma_unlock_anon_vma() isn't safe if
anon_vma appeared between lock and unlock. We have to check anon_vma
first or call anon_vma_prepare() to be sure that it's here. There are
only few users of these legacy helpers. Let's get rid of them.
This patch fixes anon_vma lock imbalance in validate_mm(). Write lock
isn't required here, read lock is enough.
And reorders expand_downwards/expand_upwards: security_mmap_addr() and
wrapping-around check don't have to be under anon vma lock.
Link: https://lkml.kernel.org/r/CACT4Y+Y908EjM2z=706dv4rV6dWtxTLK9nFg9_7DhRMLppBo2g@mail.gmail.com Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xuejiufei [Fri, 5 Feb 2016 23:36:47 +0000 (15:36 -0800)]
ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup
When recovery master down, dlm_do_local_recovery_cleanup() only remove
the $RECOVERY lock owned by dead node, but do not clear the refmap bit.
Which will make umount thread falling in dead loop migrating $RECOVERY
to the dead node.
Signed-off-by: xuejiufei <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nicolai Stange [Fri, 5 Feb 2016 23:36:44 +0000 (15:36 -0800)]
um: asm/page.h: remove the pte_high member from struct pte_t
Commit 16da306849d0 ("um: kill pfn_t") introduced a compile warning for
defconfig (SUBARCH=i386):
arch/um/kernel/skas/mmu.c:38:206:
warning: right shift count >= width of type [-Wshift-count-overflow]
Aforementioned patch changes the definition of the phys_to_pfn() macro
from
((pfn_t) ((p) >> PAGE_SHIFT))
to
((p) >> PAGE_SHIFT)
This effectively changes the phys_to_pfn() expansion's type from
unsigned long long to unsigned long.
Through the callchain init_stub_pte() => mk_pte(), the expansion of
phys_to_pfn() is (indirectly) fed into the 'phys' argument of the
pte_set_val(pte, phys, prot) macro, eventually leading to
(pte).pte_high = (phys) >> 32;
This results in the warning from above.
Since UML only deals with 32 bit addresses, the upper 32 bits from
'phys' used to be always zero anyway. Also, all page protection flags
defined by UML don't use any bits beyond bit 9. Since the contents of a
PTE are defined within architecture scope only, the ->pte_high member
can be safely removed.
Remove the ->pte_high member from struct pte_t.
Rename ->pte_low to ->pte.
Adapt the pte helper macros in arch/um/include/asm/page.h.
Noteworthy is the pte_copy() macro where a smp_wmb() gets dropped. This
write barrier doesn't seem to be paired with any read barrier though and
thus, was useless anyway.
Fixes: 16da306849d0 ("um: kill pfn_t") Signed-off-by: Nicolai Stange <nicstange@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Richard Weinberger <richard@nod.at> Cc: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Fri, 5 Feb 2016 23:36:41 +0000 (15:36 -0800)]
mm, hugetlb: don't require CMA for runtime gigantic pages
Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
at runtime") has added the runtime gigantic page allocation via
alloc_contig_range(), making this support available only when CONFIG_CMA
is enabled. Because it doesn't depend on MIGRATE_CMA pageblocks and the
associated infrastructure, it is possible with few simple adjustments to
require only CONFIG_MEMORY_ISOLATION instead of full CONFIG_CMA.
After this patch, alloc_contig_range() and related functions are
available and used for gigantic pages with just CONFIG_MEMORY_ISOLATION
enabled. Note CONFIG_CMA selects CONFIG_MEMORY_ISOLATION. This allows
supporting runtime gigantic pages without the CMA-specific checks in
page allocator fastpaths.
Attempting to preallocate 1G gigantic huge pages at boot time with
"hugepagesz=1G hugepages=1" on the kernel command line will prevent
booting with the following:
kernel BUG at mm/hugetlb.c:1218!
When mapcount accounting was reworked, the setting of
compound_mapcount_ptr in prep_compound_gigantic_page was overlooked. As
a result, the validation of mapcount in free_huge_page fails.
The "BUG_ON" checks in free_huge_page were also changed to
"VM_BUG_ON_PAGE" to assist with debugging.
Fixes: 53f9263baba69 ("mm: rework mapcount accounting to enable 4k mapping of THPs") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: David Rientjes <rientjes@google.com> Tested-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The only case when we can queue pages from such VMA is MPOL_MF_STRICT
plus MPOL_MF_MOVE or MPOL_MF_MOVE_ALL for VMA which has pages on LRU,
but gfp mask is not sutable for migaration (see mapping_gfp_mask() check
in vma_migratable()). That's looks like a bug to me.
Let's filter out non-migratable vma at start of queue_pages_test_walk()
and go to queue_pages_pte_range() only if MPOL_MF_MOVE or
MPOL_MF_MOVE_ALL flag is set.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Fri, 5 Feb 2016 23:36:30 +0000 (15:36 -0800)]
mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress
Jan Stancek has reported that system occasionally hanging after "oom01"
testcase from LTP triggers OOM. Guessing from a result that there is a
kworker thread doing memory allocation and the values between "Node 0
Normal free:" and "Node 0 Normal:" differs when hanging, vmstat is not
up-to-date for some reason.
According to commit 373ccbe59270 ("mm, vmstat: allow WQ concurrency to
discover memory reclaim doesn't make any progress"), it meant to force
the kworker thread to take a short sleep, but it by error used
schedule_timeout(1). We missed that schedule_timeout() in state
TASK_RUNNING doesn't do anything.
Fix it by using schedule_timeout_uninterruptible(1) which forces the
kworker thread to take a short sleep in order to make sure that vmstat
is up-to-date.
Fixes: 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Jan Stancek <jstancek@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Cristopher Lameter <clameter@sgi.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Arkadiusz Miskiewicz <arekm@maven.pl> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 5 Feb 2016 23:36:27 +0000 (15:36 -0800)]
vmstat: make vmstat_update deferrable
Commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again and
shut down on idle") made vmstat_shepherd deferrable. vmstat_update
itself is still useing standard timer which might interrupt idle task.
This is possible because "mm, vmstat: make quiet_vmstat lighter" removed
cancel_delayed_work from the quiet_vmstat.
Change vmstat_work to use DEFERRABLE_WORK to prevent from pointless
wakeups from the idle context.
Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is caused by commit 0eb77e988032 ("vmstat: make vmstat_updater
deferrable again and shut down on idle") which has placed quiet_vmstat
into cpu_idle_loop. The main reason here seems to be that the idle
entry has to get over all zones and perform atomic operations for each
vmstat entry even though there might be no per cpu diffs. This is a
pointless overhead for _each_ idle entry.
Make sure that quiet_vmstat is as light as possible.
First of all it doesn't make any sense to do any local sync if the
current cpu is already set in oncpu_stat_off because vmstat_update puts
itself there only if there is nothing to do.
Then we can check need_update which should be a cheap way to check for
potential per-cpu diffs and only then do refresh_cpu_vm_stats.
The original patch also did cancel_delayed_work which we are not doing
here. There are two reasons for that. Firstly cancel_delayed_work from
idle context will blow up on RT kernels (reported by Mike):
And secondly, even on !RT kernels it might add some non trivial overhead
which is not necessary. Even if the vmstat worker wakes up and preempts
idle then it will be most likely a single shot noop because the stats
were already synced and so it would end up on the oncpu_stat_off anyway.
We just need to teach both vmstat_shepherd and vmstat_update to stop
scheduling the worker if there is nothing to do.
[mgalbraith@suse.de: cancel pending work of the cpu_stat_off CPU] Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Gibson [Fri, 5 Feb 2016 23:36:19 +0000 (15:36 -0800)]
memblock: don't mark memblock_phys_mem_size() as __init
At the moment memblock_phys_mem_size() is marked as __init, and so is
discarded after boot. This is different from most of the memblock
functions which are marked __init_memblock, and are only discarded after
boot if memory hotplug is not configured.
To allow for upcoming code which will need memblock_phys_mem_size() in
the hotplug path, change it from __init to __init_memblock.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The mmap_sem for reading in validate_mm called from expand_stack is not
enough to prevent the argumented rbtree rb_subtree_gap information to
change from under us because expand_stack may be running from other
threads concurrently which will hold the mmap_sem for reading too.
The argumented rbtree is updated with vma_gap_update under the
page_table_lock so use it in browse_rb() too to avoid false positives.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sudip Mukherjee [Fri, 5 Feb 2016 23:36:10 +0000 (15:36 -0800)]
m32r: fix build failure due to SMP and MMU
One of the randconfig build failed with the error:
arch/m32r/kernel/smp.c: In function 'smp_flush_tlb_mm':
arch/m32r/kernel/smp.c:283:20: error: subscripted value is neither array nor pointer nor vector
mmc = &mm->context[cpu_id];
^
arch/m32r/kernel/smp.c: In function 'smp_flush_tlb_page':
arch/m32r/kernel/smp.c:353:20: error: subscripted value is neither array nor pointer nor vector
mmc = &mm->context[cpu_id];
^
arch/m32r/kernel/smp.c: In function 'smp_invalidate_interrupt':
arch/m32r/kernel/smp.c:479:41: error: subscripted value is neither array nor pointer nor vector
unsigned long *mmc = &flush_mm->context[cpu_id];
It turned out that CONFIG_SMP was defined but CONFIG_MMU was not
defined. But arch/m32r/include/asm/mmu.h only defines mm_context_t as
an array when both CONFIG_SMP and CONFIG_MMU are defined. And
arch/m32r/kernel/smp.c is always using context as an array. So without
MMU SMP can not work.