git.karo-electronics.de Git - linux-beck.git/log
linux-beck.git
9 years ago Merge branch 'net-y2038'
David S. Miller [Mon, 5 Oct 2015 10:16:49 +0000 (03:16 -0700)]
Merge branch 'net-y2038'

Arnd Bergmann says:

====================
net: assorted y2038 changes

This is a set of changes for network drivers and core code to
get rid of the use of time_t and derived data structures.

I have a longer set of patches that enables me to build kernels
with the time_t definition removed completely as a help to find
y2038 overflow issues. This is the subset for networking that
contains all code that has a reasonable way of fixing at the
moment and that is either commonly used (in one of the defconfigs)
or that blocks building a whole subsystem.

Most of the patches in this series should be noncontroversial,
but the last two that I marked [RFC] are a bit tricky and
need input from people that are more familiar with the code than
I am. All 12 patches are independent of one another and can
be applied in any order, so feel free to pick all that look
good.

Patches that are not included here are:

 - disabling less common device drivers that I don't have a fix
   for yet; this includes:
     drivers/net/ethernet/brocade/bna/bfa_ioc.c
     drivers/net/ethernet/qlogic/netxen/netxen_nic_hw.c
     drivers/net/ethernet/tile/tilegx.c
     drivers/net/hamradio/baycom_ser_fdx.c
     drivers/net/wireless/ath/ath10k/core.h
     drivers/net/wireless/ath/ath9k/
     drivers/net/wireless/atmel.c
     drivers/net/wireless/prism54/isl_38xx.c
     drivers/net/wireless/rt2x00/rt2x00debug.c
     drivers/net/wireless/rtlwifi/
     drivers/net/wireless/ti/wlcore/
     drivers/staging/ozwpan/
     net/atm/mpoa_caches.c
     net/atm/mpoa_proc.c
     net/dccp/probe.c
     net/ipv4/tcp_probe.c
     net/netfilter/nfnetlink_queue_core.c
     net/netfilter/xt_time.c
     net/openvswitch/flow.c
     net/sctp/probe.c
     net/sunrpc/auth_gss/
     net/sunrpc/svcauth_unix.c
     net/vmw_vsock/af_vsock.c
   We'll get there eventually, or we can add a dependency to ensure
   they are not built on 32-bit kernels that need to survive
   beyond 2038. Most of these should be really easy to fix.

 - recvmmsg/sendmmsg system calls: patches have been sent out
   as part of the syscall series, need a little more work and
   review

 - SIOCGSTAMP/SIOCGSTAMPNS ioctl calls: tricky, need to discuss
   with some folks at kernel summit

 - SO_RCVTIMEO/SO_SNDTIMEO/SO_TIMESTAMP/SO_TIMESTAMPNS socket
   opt: similar and related to the ioctl

 - mmapped packet socket: need to create v4 of the API, nontrivial

 - pktgen: sends 32-bit timestamps over network, need to find out
   if using unsigned stamps is good enough

 - af_rxrpc: similar to pktgen, uses 32-bit times for deadlines

 - ppp ioctl: patch is being worked on, nontrivial but doable
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
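
The overflow that the series above guards against can be reproduced in
user space. The following is a minimal Python sketch, assuming a time_t
that is stored as a signed 32-bit count of seconds since 1970; the helper
names as_time32 and to_utc are made up for this illustration and do not
exist in the kernel.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def as_time32(seconds: int) -> int:
    """Wrap a seconds count into a signed 32-bit value (assumed 32-bit time_t)."""
    return struct.unpack("<i", struct.pack("<I", seconds & 0xFFFFFFFF))[0]


def to_utc(seconds: int) -> datetime:
    """Convert seconds since the epoch (possibly negative) to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


LAST_GOOD = 2**31 - 1  # largest value a signed 32-bit counter can hold

print(to_utc(as_time32(LAST_GOOD)))      # 2038-01-19 03:14:07+00:00
print(to_utc(as_time32(LAST_GOOD + 1)))  # 1901-12-13 20:45:52+00:00
```

One second after the last representable moment the counter wraps to a
negative value, which is why 64-bit types such as timespec64 and ktime_t
are preferred in the patches that follow.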
9 years ago net: sctp: avoid incorrect time_t use
Arnd Bergmann [Wed, 30 Sep 2015 11:26:40 +0000 (13:26 +0200)]
net: sctp: avoid incorrect time_t use

We want to avoid using time_t in the kernel because of the y2038
overflow problem. The use in sctp is not for storing seconds at
all, but instead uses microseconds and is passed as 32-bit
on all machines.

This patch changes the type to u32, which better fits the use.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: linux-sctp@vger.kernel.org
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago ipv6: use ktime_t for internal timestamps
Arnd Bergmann [Wed, 30 Sep 2015 11:26:39 +0000 (13:26 +0200)]
ipv6: use ktime_t for internal timestamps

The ipv6 mip6 implementation is one of only a few users of the
skb_get_timestamp() function in the kernel, which is both unsafe
on 32-bit architectures because of the 2038 overflow, and slightly
less efficient than the skb_get_ktime() based approach.

This converts the function call and the mip6_report_rate_limiter
structure that stores the time stamp, eliminating all uses of
timeval in the ipv6 code.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago nfnetlink: use y2038 safe timestamp
Arnd Bergmann [Wed, 30 Sep 2015 11:26:38 +0000 (13:26 +0200)]
nfnetlink: use y2038 safe timestamp

The __build_packet_message function fills a nfulnl_msg_packet_timestamp
structure that uses 64-bit seconds and is therefore y2038 safe, but
it uses an intermediate 'struct timespec' which is not.

This trivially changes the code to use 'struct timespec64' instead,
to correct the result on 32-bit architectures.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
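
The problem described in the nfnetlink message above is that a 64-bit
field is filled through a narrower intermediate value. A rough Python
sketch of that effect follows; it assumes a wire layout of two big-endian
64-bit fields (seconds, microseconds), and the helpers wrap_s32,
pack_timestamp and unpack_sec are hypothetical names used only here.

```python
import struct

U64_MASK = 0xFFFFFFFFFFFFFFFF


def wrap_s32(value: int) -> int:
    """Truncate to a signed 32-bit integer (models a 32-bit tv_sec)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


def pack_timestamp(sec: int, usec: int) -> bytes:
    """Pack seconds and microseconds as two big-endian 64-bit fields."""
    return struct.pack(">QQ", sec & U64_MASK, usec & U64_MASK)


def unpack_sec(buf: bytes) -> int:
    """Read back the seconds field from a packed timestamp."""
    return struct.unpack(">QQ", buf)[0]


after_2038 = 2**31 + 100  # a moment shortly after the 2038 rollover

via_32bit = pack_timestamp(wrap_s32(after_2038), 0)  # narrow intermediate
via_64bit = pack_timestamp(after_2038, 0)            # 64-bit all the way

print(hex(unpack_sec(via_32bit)))  # 0xffffffff80000064
print(unpack_sec(via_64bit))       # 2147483748
```

The 64-bit wire format cannot repair a value that was already truncated,
so the intermediate type has to be widened as well.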
9 years ago atm: remove 'struct zatm_t_hist'
Arnd Bergmann [Wed, 30 Sep 2015 15:32:01 +0000 (17:32 +0200)]
atm: remove 'struct zatm_t_hist'

The zatm_t_hist structure is not used anywhere in the kernel, but is
exported to user space. As we are trying to eliminate uses of time_t
in the kernel for y2038 compatibility, the current definition triggers
checking tools because it contains 'struct timeval'.

As pointed out by Chas Williams, the only user of this structure was
the ZATM_GETHIST ioctl command that has been removed a long time ago,
and we can remove the structure as well without breaking any user
space.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Chas Williams <3chas3@gmail.com>
Cc: linux-atm-general@lists.sourceforge.net
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago mac80211: use ktime_get_seconds
Arnd Bergmann [Wed, 30 Sep 2015 11:26:36 +0000 (13:26 +0200)]
mac80211: use ktime_get_seconds

The mac80211 code uses ktime_get_ts to measure the connected time.
As this uses monotonic time, it is y2038 safe on 32-bit systems,
but we still want to deprecate the use of 'timespec' because most
other users are broken.

This changes the code to use ktime_get_seconds() instead, which
avoids the timespec structure and is slightly more efficient.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: linux-wireless@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
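
The idea in the mac80211 message above, measuring a duration as the
difference of two monotonic second counts, can be sketched in user space.
This is only an analogy in Python: the Station class and its clock
parameter are invented for the example, and time.monotonic stands in for
the kernel's monotonic clock.

```python
import time


class Station:
    """Tracks how long a (hypothetical) station has been connected."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the behaviour can be checked without sleeping.
        self._clock = clock
        self._connected_at = int(clock())

    def connected_seconds(self) -> int:
        """Whole seconds elapsed since the station object was created."""
        return int(self._clock()) - self._connected_at


sta = Station()
print(sta.connected_seconds() >= 0)  # True
```

Because a monotonic clock counts from an arbitrary starting point rather
than from 1970, the difference is unaffected by wall-clock changes.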
9 years ago mwifiex: avoid gettimeofday in ba_threshold setting
Arnd Bergmann [Wed, 30 Sep 2015 11:26:35 +0000 (13:26 +0200)]
mwifiex: avoid gettimeofday in ba_threshold setting

mwifiex_get_random_ba_threshold() uses a complex homegrown implementation
to generate a pseudo-random number from the current time as returned
from do_gettimeofday().

This currently requires two 32-bit divisions plus a couple of other
computations that are eventually discarded as only eight bits of
the microsecond portion are used at all.

We could replace this with a call to get_random_bytes(), but that
might drain the entropy pool too fast if this is called for each
packet.

Instead, this patch converts it to use ktime_get_ns(), which is a
bit faster than do_gettimeofday(), and then uses a similar algorithm
as before, but in a way that takes both the nanosecond and second
portion into account for a slightly-more-but-still-not-very-random
pseudorandom number.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Amitkumar Karwar <akarwar@marvell.com>
Cc: Nishant Sarmukadam <nishants@marvell.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: linux-wireless@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago mwifiex: use ktime_get_real for timestamping
Arnd Bergmann [Wed, 30 Sep 2015 11:26:34 +0000 (13:26 +0200)]
mwifiex: use ktime_get_real for timestamping

The mwifiex_11n_aggregate_pkt() function creates a ktime_t from
a timeval returned by do_gettimeofday, which is slow and causes
an overflow in 2038 on 32-bit architectures.

This solves both problems by using the appropriate ktime_get_real()
function.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Amitkumar Karwar <akarwar@marvell.com>
Cc: Nishant Sarmukadam <nishants@marvell.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: linux-wireless@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago net: igb: avoid using timespec
Arnd Bergmann [Wed, 30 Sep 2015 11:26:33 +0000 (13:26 +0200)]
net: igb: avoid using timespec

We want to deprecate the use of 'struct timespec' on 32-bit
architectures, as it will overflow in 2038. The igb
driver uses it to read the current time, and can simply
be changed to use ktime_get_real_ts64() instead.

Because of hardware limitations, there is still an overflow
in year 2106, which we cannot really avoid, but this documents
the overflow.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: intel-wired-lan@lists.osuosl.org
Reviewed-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
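
The year 2106 mentioned in the igb message above is where an unsigned
32-bit count of seconds since 1970 runs out. A short Python sketch,
assuming the hardware limitation is such a 32-bit unsigned seconds value
(the message does not spell this out); hw_seconds and register_to_utc are
illustrative names only.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hw_seconds(seconds: int) -> int:
    """Value an (assumed) 32-bit unsigned seconds register would hold."""
    return seconds & 0xFFFFFFFF


def register_to_utc(reg: int) -> datetime:
    """Interpret a register value as seconds since the epoch."""
    return EPOCH + timedelta(seconds=reg)


last = 2**32 - 1  # largest value an unsigned 32-bit register can hold

print(register_to_utc(hw_seconds(last)))      # 2106-02-07 06:28:15+00:00
print(register_to_utc(hw_seconds(last + 1)))  # 1970-01-01 00:00:00+00:00
```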
9 years ago net: stmmac: avoid using timespec
Arnd Bergmann [Wed, 30 Sep 2015 11:26:32 +0000 (13:26 +0200)]
net: stmmac: avoid using timespec

We want to deprecate the use of 'struct timespec' on 32-bit
architectures, as it will overflow in 2038. The stmmac
driver uses it to read the current time, and can simply
be changed to use ktime_get_real_ts64() instead.

Because of hardware limitations, there is still an overflow
in year 2106, which we cannot really avoid, but this documents
the overflow.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago net: fec: avoid timespec use
Arnd Bergmann [Wed, 30 Sep 2015 11:26:31 +0000 (13:26 +0200)]
net: fec: avoid timespec use

The fec_ptp_enable_pps uses an open-coded implementation of ns_to_timespec,
which will be removed eventually as it is not y2038-safe on 32-bit
architectures. Two more instances of the same code in this file were
already converted to use the safe ns_to_timespec64 in commit 6630514fcee
("ptp: fec: use helpers for converting ns to timespec"), this changes
the last one as well.

The seconds portion here is actually unused and we could just remove the
timespec variable, but using ns_to_timespec64 can still be better as the
implementation can be hand-optimized in the future.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Fugang Duan <b38611@freescale.com>
Cc: Luwei Zhou <b45643@freescale.com>
Cc: Frank Li <Frank.Li@freescale.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
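
The conversion that the fec message above refers to splits a nanosecond
count into whole seconds and a nanosecond remainder. A minimal Python
sketch of that arithmetic follows; it borrows the name ns_to_timespec64
for readability but is a user-space stand-in, not the kernel helper, and
it returns a plain tuple instead of a struct.

```python
NSEC_PER_SEC = 1_000_000_000


def ns_to_timespec64(nsec: int) -> tuple:
    """Split nanoseconds into (seconds, nanoseconds), with 0 <= nanoseconds < 1e9.

    Python's divmod floors toward negative infinity, so the remainder stays
    non-negative for negative inputs as well.
    """
    return divmod(nsec, NSEC_PER_SEC)


print(ns_to_timespec64(1_500_000_000))  # (1, 500000000)
print(ns_to_timespec64(-1))             # (-1, 999999999)
```

With a 64-bit seconds part the split itself never overflows, which is the
point of moving away from open-coded 32-bit variants.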
9 years ago Merge branch 'ipv4-multipath-hash'
David S. Miller [Mon, 5 Oct 2015 10:00:26 +0000 (03:00 -0700)]
Merge branch 'ipv4-multipath-hash'

Peter Nørlund says:

====================
ipv4: Hash-based multipath routing

When the routing cache was removed in 3.6, the IPv4 multipath algorithm changed
from more or less being destination-based into being quasi-random per-packet
scheduling. This increases the risk of out-of-order packets and makes it
impossible to use multipath together with anycast services.

This patch series replaces the old implementation with flow-based load
balancing based on a hash over the source and destination addresses.

Distribution of the hash is done with thresholds as described in RFC 2992.
This reduces the disruption when a path is added or removed and there are
more than two paths.

To further the chance of successful usage in conjunction with anycast, ICMP
error packets are hashed over the inner IP addresses. This ensures that PMTU
will work together with anycast or load-balancers such as IPVS.

Port numbers are not considered since fragments could cause problems with
anycast and IPVS. Relying on the DF-flag for TCP packets is also insufficient,
since ICMP inspection effectively extracts information from the opposite
flow which might have a different state of the DF-flag. This is also why the
RSS hash is not used. These are typically based on the NDIS RSS spec which
mandates TCP support.

Measurements of the additional overhead of a two-path multipath
(ip_mkroute_input excl. __mkroute_input) on a Xeon X3550 (4 cores, 2.66GHz):

Original per-packet: ~394 cycles/packet
L3 hash:              ~76 cycles/packet

Changes in v5:
- Fixed compilation error

Changes in v4:
- Functions take hash directly instead of func ptr
- Added inline hash function
- Added dummy macros to minimize ifdefs
- Use upper 31 bits of hash instead of lower

Changes in v3:
- Multipath algorithm is no longer configurable (always L3)
- Added random seed to hash
- Moved ICMP inspection to isolated function
- Ignore source quench packets (deprecated as per RFC 6633)

Changes in v2:
- Replaced 8-bit xor hash with 31-bit jenkins hash
- Don't scale weights (since 31-bit)
- Avoided unnecessary renaming of variables
- Rely on DF-bit instead of fragment offset when checking for fragmentation
- upper_bound is now inclusive to avoid overflow
- Use a callback to postpone extracting flow information until necessary
- Skipped ICMP inspection entirely with L4 hashing
- Handle newly added sysctl ignore_routes_with_linkdown
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
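
The threshold-based selection described in the cover letter above can be
sketched as follows. This is an illustrative Python model, not the kernel
code: zlib.crc32 is swapped in for the Jenkins hash named in the
changelog, the rounding used for the thresholds is an assumption, and
l3_hash, upper_bounds and select_path are hypothetical names.

```python
import ipaddress
import struct
import zlib


def l3_hash(src: str, dst: str, seed: int = 0) -> int:
    """31-bit hash over source and destination address (CRC32 as a stand-in)."""
    data = struct.pack(">III", seed,
                       int(ipaddress.IPv4Address(src)),
                       int(ipaddress.IPv4Address(dst)))
    return zlib.crc32(data) >> 1  # keep the upper 31 bits


def upper_bounds(weights: list) -> list:
    """Inclusive upper threshold per path, dividing the 31-bit space by weight."""
    total = sum(weights)
    bounds, acc = [], 0
    for w in weights:
        acc += w
        bounds.append(((acc << 31) + total // 2) // total - 1)
    return bounds


def select_path(h: int, bounds: list) -> int:
    """Pick the first path whose inclusive upper bound covers the hash."""
    for index, bound in enumerate(bounds):
        if h <= bound:
            return index
    return len(bounds) - 1  # not reached: the last bound is 2**31 - 1


bounds = upper_bounds([1, 1])
print(bounds)  # [1073741823, 2147483647]
print(select_path(l3_hash("192.0.2.1", "198.51.100.7"), bounds) in (0, 1))  # True
```

Every packet of a flow yields the same hash and therefore the same path,
which is what avoids the reordering seen with per-packet scheduling.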
9 years ago ipv4: ICMP packet inspection for multipath
Peter Nørlund [Wed, 30 Sep 2015 08:12:22 +0000 (10:12 +0200)]
ipv4: ICMP packet inspection for multipath

ICMP packets are inspected to let them route together with the flow they
belong to, minimizing the chance that a problematic path will affect flows
on other paths, and so that anycast environments can work with ECMP.

Signed-off-by: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
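
The ICMP inspection described above can be modelled like this. The Python
sketch assumes that an ICMP error carries the header of a packet that
travelled in the opposite direction, so the inner addresses are swapped
before hashing; packets are plain dicts, CRC32 again stands in for the
real hash, and addr_hash and multipath_hash are made-up names.

```python
import ipaddress
import struct
import zlib

# Destination unreachable, redirect, time exceeded, parameter problem.
# Source quench (type 4) is deliberately not inspected.
ICMP_ERROR_TYPES = {3, 5, 11, 12}


def addr_hash(src: str, dst: str) -> int:
    """31-bit stand-in hash over a (source, destination) address pair."""
    data = struct.pack(">II",
                       int(ipaddress.IPv4Address(src)),
                       int(ipaddress.IPv4Address(dst)))
    return zlib.crc32(data) >> 1


def multipath_hash(packet: dict) -> int:
    """Hash ICMP errors over the embedded header, everything else over the outer one."""
    inner = packet.get("icmp_inner")
    if packet.get("icmp_type") in ICMP_ERROR_TYPES and inner is not None:
        # The embedded packet went the other way, so swap its addresses.
        return addr_hash(inner["dst"], inner["src"])
    return addr_hash(packet["src"], packet["dst"])


client, server, router = "203.0.113.5", "192.0.2.10", "198.51.100.1"
data_packet = {"src": client, "dst": server}
icmp_error = {"src": router, "dst": server, "icmp_type": 3,
              "icmp_inner": {"src": server, "dst": client}}

print(multipath_hash(icmp_error) == multipath_hash(data_packet))  # True
```

With equal hashes, the error message follows the same path as the flow it
belongs to and so reaches the same anycast instance.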
9 years ago ipv4: L3 hash-based multipath
Peter Nørlund [Wed, 30 Sep 2015 08:12:21 +0000 (10:12 +0200)]
ipv4: L3 hash-based multipath

Replaces the per-packet multipath with a hash-based multipath using
source and destination address.

Signed-off-by: Peter Nørlund <pch@ordbogen.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago Merge branch 'tcp-listener-fixes-and-improvement'
David S. Miller [Mon, 5 Oct 2015 09:46:26 +0000 (02:46 -0700)]
Merge branch 'tcp-listener-fixes-and-improvement'

Eric Dumazet says:

====================
tcp: lockless listener fixes and improvement

This fixes issues with TCP FastOpen vs lockless listeners,
and SYNACK being attached to request sockets.

Then, the last patch brings a performance improvement for
syncookie generation and validation.

Tested under a 4.3 Mpps SYNFLOOD attack, the new perf profile looks
like:
    12.11%  [kernel]  [k] sha_transform
     5.83%  [kernel]  [k] tcp_conn_request
     4.59%  [kernel]  [k] __inet_lookup_listener
     4.11%  [kernel]  [k] ipt_do_table
     3.91%  [kernel]  [k] tcp_make_synack
     3.05%  [kernel]  [k] fib_table_lookup
     2.74%  [kernel]  [k] sock_wfree
     2.66%  [kernel]  [k] memcpy_erms
     2.12%  [kernel]  [k] tcp_v4_rcv
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago tcp: avoid two atomic ops for syncookies
Eric Dumazet [Mon, 5 Oct 2015 04:08:11 +0000 (21:08 -0700)]
tcp: avoid two atomic ops for syncookies

inet_reqsk_alloc() is used to allocate a temporary request
in order to generate a SYNACK with a cookie. Then later,
syncookie validation also uses a temporary request.

These paths already took a reference on the listener refcount, so
we can avoid a couple of atomic operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago net: use sk_fullsock() in __netdev_pick_tx()
Eric Dumazet [Mon, 5 Oct 2015 04:08:10 +0000 (21:08 -0700)]
net: use sk_fullsock() in __netdev_pick_tx()

SYN_RECV & TIMEWAIT sockets are not full blown; they do not have a
sk_dst_cache pointer.

Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
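
The guard that the sk_fullsock() fixes rely on can be pictured as a state
check placed before any access to fields that only full sockets have. The
Python model below is a sketch under assumptions: the state numbers are
illustrative, sockets are plain dicts, and dst_cache_or_none is a
hypothetical helper, not a kernel function.

```python
from enum import IntEnum


class TcpState(IntEnum):
    """A subset of TCP states; the numeric values are assumed for illustration."""
    ESTABLISHED = 1
    TIME_WAIT = 6
    LISTEN = 10
    NEW_SYN_RECV = 12


# Bitmask of states whose socket objects are reduced "mini" sockets.
NOT_FULL_MASK = (1 << TcpState.TIME_WAIT) | (1 << TcpState.NEW_SYN_RECV)


def sk_fullsock(state: int) -> bool:
    """True when the socket in this state is a full socket."""
    return bool((1 << state) & ~NOT_FULL_MASK)


def dst_cache_or_none(sock: dict):
    """Only touch sk_dst_cache when the socket is known to be full blown."""
    if not sk_fullsock(sock["state"]):
        return None
    return sock.get("sk_dst_cache")


print(sk_fullsock(TcpState.ESTABLISHED))   # True
print(sk_fullsock(TcpState.TIME_WAIT))     # False
print(sk_fullsock(TcpState.NEW_SYN_RECV))  # False
```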
9 years ago ipv6: inet6_sk() should use sk_fullsock()
Eric Dumazet [Mon, 5 Oct 2015 04:08:09 +0000 (21:08 -0700)]
ipv6: inet6_sk() should use sk_fullsock()

SYN_RECV & TIMEWAIT sockets are not full blown; they do not have a pinet6
pointer.

Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago inet: ip_skb_dst_mtu() should use sk_fullsock()
Eric Dumazet [Mon, 5 Oct 2015 04:08:08 +0000 (21:08 -0700)]
inet: ip_skb_dst_mtu() should use sk_fullsock()

SYN_RECV & TIMEWAIT sockets are not full blown, so
do not even try to call ip_sk_use_pmtu() on them.

Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago tcp: fix fastopen races vs lockless listener
Eric Dumazet [Mon, 5 Oct 2015 04:08:07 +0000 (21:08 -0700)]
tcp: fix fastopen races vs lockless listener

There are multiple races that need fixes:

1) skb_get() + queue skb + kfree_skb() is racy

An accept() can be done on another cpu, data consumed immediately.
tcp_recvmsg() uses __kfree_skb() as it is assumed all skb found in
socket receive queue are private.

Then the kfree_skb() in tcp_rcv_state_process() uses an already freed skb.

2) tcp_reqsk_record_syn() needs to be done before tcp_try_fastopen()
for the same reasons.

3) We want to send the SYNACK before queueing child into accept queue,
otherwise we might reintroduce the ooo issue fixed in
commit 7c85af881044 ("tcp: avoid reorders for TFO passive connections")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago Merge branch 'bridge-netlink'
David S. Miller [Sun, 4 Oct 2015 23:46:14 +0000 (16:46 -0700)]
Merge branch 'bridge-netlink'

Nikolay Aleksandrov says:

====================
bridge: complete netlink support

This set completes the bridge device's netlink support and makes it
possible to view and configure everything that can be configured via
sysfs. I have tested all of these (setting and getting). There're a few
longer line warnings about the br_get_size() ifla comments but I think we
should have them to know what has been accounted for. I have used the sysfs
interface as a guide of what and how to set. As usual I'll send the
corresponding iproute2 patches later.
The bridge port's netlink interface will be completed after this set gets
applied in some form.

This patch-set is on top of my last vlan cleanups set:
http://www.spinics.net/lists/netdev/msg346005.html
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: add support for default_pvid
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:47 +0000 (14:23 +0200)]
bridge: netlink: add support for default_pvid

Add IFLA_BR_VLAN_DEFAULT_PVID to allow setting/getting bridge's
default_pvid via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
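
Each of the IFLA_BR_* settings in this series travels as a netlink
attribute: a small header holding length and type, followed by the
payload and padding to a 4-byte boundary. The Python sketch below encodes
and decodes one such attribute. The attribute number used for the default
PVID is a placeholder rather than the value from the kernel headers, and
put_attr and parse_attrs are hypothetical helpers.

```python
import struct

NLA_HDRLEN = 4    # two 16-bit fields: length, then type
NLA_ALIGNTO = 4   # attributes are padded to 4-byte boundaries

HYPOTHETICAL_ATTR_DEFAULT_PVID = 39  # placeholder number, for illustration only


def nla_align(length: int) -> int:
    """Round a length up to the attribute alignment."""
    return (length + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def put_attr(attr_type: int, payload: bytes) -> bytes:
    """Encode one attribute; the length field covers header plus payload, not padding."""
    length = NLA_HDRLEN + len(payload)
    buf = struct.pack("=HH", length, attr_type) + payload
    return buf + b"\x00" * (nla_align(length) - length)


def parse_attrs(buf: bytes) -> list:
    """Decode a run of attributes into (type, payload) pairs."""
    attrs, offset = [], 0
    while offset + NLA_HDRLEN <= len(buf):
        length, attr_type = struct.unpack_from("=HH", buf, offset)
        if length < NLA_HDRLEN:
            break  # malformed attribute, stop parsing
        attrs.append((attr_type, buf[offset + NLA_HDRLEN:offset + length]))
        offset += nla_align(length)
    return attrs


pvid = struct.pack("=H", 1)  # a 16-bit VLAN id
message = put_attr(HYPOTHETICAL_ATTR_DEFAULT_PVID, pvid)
print(len(message))                         # 8
print(parse_attrs(message) == [(39, pvid)])  # True
```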
9 years ago bridge: netlink: add support for netfilter tables config
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:46 +0000 (14:23 +0200)]
bridge: netlink: add support for netfilter tables config

Add support to allow getting/setting netfilter tables settings.
Currently these are IFLA_BR_NF_CALL_IPTABLES, IFLA_BR_NF_CALL_IP6TABLES
and IFLA_BR_NF_CALL_ARPTABLES.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: add support for igmp's intervals
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:45 +0000 (14:23 +0200)]
bridge: netlink: add support for igmp's intervals

Add support to set/get all of the igmp's configurable intervals via
netlink. These currently are:
IFLA_BR_MCAST_LAST_MEMBER_INTVL
IFLA_BR_MCAST_MEMBERSHIP_INTVL
IFLA_BR_MCAST_QUERIER_INTVL
IFLA_BR_MCAST_QUERY_INTVL
IFLA_BR_MCAST_QUERY_RESPONSE_INTVL
IFLA_BR_MCAST_STARTUP_QUERY_INTVL

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: add support for multicast_startup_query_count
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:44 +0000 (14:23 +0200)]
bridge: netlink: add support for multicast_startup_query_count

Add IFLA_BR_MCAST_STARTUP_QUERY_CNT to allow setting/getting
br->multicast_startup_query_count via netlink. Also align the ifla
comments.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: add support for multicast_last_member_count
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:43 +0000 (14:23 +0200)]
bridge: netlink: add support for multicast_last_member_count

Add IFLA_BR_MCAST_LAST_MEMBER_CNT to allow setting/getting
br->multicast_last_member_count via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: add support for igmp's hash_max
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:42 +0000 (14:23 +0200)]
bridge: netlink: add support for igmp's hash_max

Add IFLA_BR_MCAST_HASH_MAX to allow setting/getting br->hash_max via
netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: add support for igmp's hash_elasticity
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:41 +0000 (14:23 +0200)]
bridge: netlink: add support for igmp's hash_elasticity

Add IFLA_BR_MCAST_HASH_ELASTICITY to allow setting/getting
br->hash_elasticity via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: add support for multicast_querier
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:40 +0000 (14:23 +0200)]
bridge: netlink: add support for multicast_querier

Add IFLA_BR_MCAST_QUERIER to allow setting/getting br->multicast_querier
via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: add support for multicast_query_use_ifaddr
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:39 +0000 (14:23 +0200)]
bridge: netlink: add support for multicast_query_use_ifaddr

Add IFLA_BR_MCAST_QUERY_USE_IFADDR to allow setting/getting
br->multicast_query_use_ifaddr via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: add support for multicast_snooping
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:38 +0000 (14:23 +0200)]
bridge: netlink: add support for multicast_snooping

Add IFLA_BR_MCAST_SNOOPING to allow enabling/disabling multicast
snooping via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: add support for multicast_router
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:37 +0000 (14:23 +0200)]
bridge: netlink: add support for multicast_router

Add IFLA_BR_MCAST_ROUTER to allow setting and retrieving
br->multicast_router when igmp snooping is enabled.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: add fdb flush
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:36 +0000 (14:23 +0200)]
bridge: netlink: add fdb flush

Simple attribute that flushes the bridge's fdb.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: add group_addr support
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:35 +0000 (14:23 +0200)]
bridge: netlink: add group_addr support

Add IFLA_BR_GROUP_ADDR attribute to allow setting and retrieving the
group_addr via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: export all timers
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:34 +0000 (14:23 +0200)]
bridge: netlink: export all timers

Export the following bridge timers (also exported via sysfs):
IFLA_BR_HELLO_TIMER, IFLA_BR_TCN_TIMER, IFLA_BR_TOPOLOGY_CHANGE_TIMER,
IFLA_BR_GC_TIMER via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: export topology_change and topology_change_detected
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:33 +0000 (14:23 +0200)]
bridge: netlink: export topology_change and topology_change_detected

Add IFLA_BR_TOPOLOGY_CHANGE and IFLA_BR_TOPOLOGY_CHANGE_DETECTED and
export them via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: export root path cost
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:32 +0000 (14:23 +0200)]
bridge: netlink: export root path cost

Add IFLA_BR_ROOT_PATH_COST and export it via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: export root port
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:31 +0000 (14:23 +0200)]
bridge: netlink: export root port

Add IFLA_BR_ROOT_PORT and export it via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: export bridge id
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:30 +0000 (14:23 +0200)]
bridge: netlink: export bridge id

Add IFLA_BR_BRIDGE_ID and export br->bridge_id via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: export root id
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:29 +0000 (14:23 +0200)]
bridge: netlink: export root id

Add IFLA_BR_ROOT_ID and export br->designated_root via netlink. For this
purpose add struct ifla_bridge_id that would represent struct bridge_id.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: netlink: add group_fwd_mask support
Nikolay Aleksandrov [Sun, 4 Oct 2015 12:23:28 +0000 (14:23 +0200)]
bridge: netlink: add group_fwd_mask support

Add IFLA_BR_GROUP_FWD_MASK attribute to allow setting and retrieving the
group_fwd_mask via netlink.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago Merge branch 'bridge-vlan'
David S. Miller [Sun, 4 Oct 2015 23:43:56 +0000 (16:43 -0700)]
Merge branch 'bridge-vlan'

Nikolay Aleksandrov says:

====================
bridge: vlan: cleanups & fixes (part 2)

This is the second follow-up set with one fix (patch 01) and more cleanups
(patches 02,03 and 04). These are minor compared to the previous ones and
should be the last before taking on the optimization changes on the
fast-path.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: vlan: use br_vlan_should_use to simplify __vlan_add/del
Nikolay Aleksandrov [Fri, 2 Oct 2015 13:05:13 +0000 (15:05 +0200)]
bridge: vlan: use br_vlan_should_use to simplify __vlan_add/del

The checks that lead to num_vlans change are always what
br_vlan_should_use checks for, namely if the vlan is only a context or
not and depending on that it's either not counted or counted
as a real/used vlan respectively.
Also give better explanation in br_vlan_should_use's comment.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: vlan: drop master_flags from __vlan_add
Nikolay Aleksandrov [Fri, 2 Oct 2015 13:05:12 +0000 (15:05 +0200)]
bridge: vlan: drop master_flags from __vlan_add

There's only one user now and we can include the flag directly.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago bridge: vlan: use br_vlan_(get|put)_master to deal with refcounts
Nikolay Aleksandrov [Fri, 2 Oct 2015 13:05:11 +0000 (15:05 +0200)]
bridge: vlan: use br_vlan_(get|put)_master to deal with refcounts

Introduce br_vlan_(get|put)_master which take a reference (or create the
master vlan first if it didn't exist) and drop a reference respectively.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
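
The get/put pairing described above follows a common reference-counting
pattern: get creates the object on first use and takes a reference, put
drops one. The Python sketch below models that pattern only. Removing the
master entry when the count reaches zero is an assumption of this sketch,
and the class and method names are invented rather than taken from the
bridge code.

```python
class MasterVlan:
    """A (hypothetical) master VLAN entry with a reference count."""

    def __init__(self, vid: int):
        self.vid = vid
        self.refcnt = 0


class Bridge:
    """Holds master VLAN entries keyed by VLAN id."""

    def __init__(self):
        self.masters = {}

    def get_master(self, vid: int) -> MasterVlan:
        """Take a reference, creating the master entry first if it is missing."""
        master = self.masters.get(vid)
        if master is None:
            master = MasterVlan(vid)
            self.masters[vid] = master
        master.refcnt += 1
        return master

    def put_master(self, master: MasterVlan) -> None:
        """Drop a reference; the entry is removed once nobody holds it (assumed)."""
        master.refcnt -= 1
        if master.refcnt == 0:
            del self.masters[master.vid]


br = Bridge()
first = br.get_master(10)
second = br.get_master(10)
print(first is second, first.refcnt)  # True 2
br.put_master(first)
br.put_master(second)
print(10 in br.masters)               # False
```

Keeping creation inside get and teardown inside put means callers never
have to manipulate the count directly.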
9 years ago bridge: vlan: use rcu list for the ordered vlan list
Nikolay Aleksandrov [Fri, 2 Oct 2015 13:05:10 +0000 (15:05 +0200)]
bridge: vlan: use rcu list for the ordered vlan list

When I did the conversion to rhashtable I missed the required locking of
one important user of the vlan list - br_get_link_af_size_filtered()
which is called:
br_ifinfo_notify() -> br_nlmsg_size() -> br_get_link_af_size_filtered()
and the notifications can be sent without holding rtnl. Before this
conversion the function relied on using rcu and since we already use rcu to
destroy the vlans, we can simply migrate the list to use the rcu helpers.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago tcp/dccp: add SLAB_DESTROY_BY_RCU flag for request sockets
Eric Dumazet [Sat, 3 Oct 2015 13:27:28 +0000 (06:27 -0700)]
tcp/dccp: add SLAB_DESTROY_BY_RCU flag for request sockets

Before letting request sockets be put in the regular TCP/DCCP
ehash table, we need to either:

- add the SLAB_DESTROY_BY_RCU flag to their kmem_cache, or
- add an RCU grace period before freeing them.

Since we carefully respected the SLAB_DESTROY_BY_RCU protocol,
as with ESTABLISHED and TIMEWAIT sockets, use it here.

Since req_prot_init() is only used by TCP and DCCP, I did not add
a new slab_flags field to their rsk_prot, but reuse prot->slab_flags.

Since all reqsk_alloc() users are correctly dealing with a failure,
add the __GFP_NOWARN flag to avoid traces under pressure.

Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next...
David S. Miller [Sat, 3 Oct 2015 12:16:50 +0000 (05:16 -0700)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2015-09-30

This series contains updates to i40e and i40evf only.

Vasily Averin provides a couple of rtnl lock/unlock fixes for both i40e
and i40evf.

Shannon provides several updates and fixes, first fixes up a type clash
in i40e_aq_rc_to_posix(), where the error codes are signed values, so we
need to treat them as such.  Then fixes up a padding issue where an
extra byte is added in i40e_aqc_get_cee_dcb_cfg_v1_resp to directly
acknowledge the padding.  Updated i40e to keep debugfs register read
and writes from accessing outside of the io-remapped space.  Added
support and device id for another 20 GbE device.

Jesse fixes the transmit hang workaround code for ARM that was causing
Tx hangs to still occur occasionally when there really was no hang.  Then
fixed the receive dropped counter to show up in netstat interface.
Refactor the interrupt enable function since it was always making the
caller add the base_vector from the VSI struct which is already passed
to the function.  Fix kbuild warnings found in 0day build infrastructure
by adding a harmless cast to a dev_info(), also fix 32 bit build
warnings found by sparse.

Greg fixed a configuration error that results if a port VLAN is set
for a VF before the VF driver is loaded, so that when the VF driver is
loaded the port VLAN is ignored.

Mitch fixes the use of QOS field consistently in
i40e_ndo_set_vf_port_vlan().  Modified the init timing of the driver
to increase stability on load/unload and SR-IOV enable/disable cycles.

Anjali updates i40e to not collect VEB stats if they are disabled in the
hardware for performance reasons.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge branch 'ravb-r8a7795'
David S. Miller [Sat, 3 Oct 2015 12:05:30 +0000 (05:05 -0700)]
Merge branch 'ravb-r8a7795'

Simon Horman says:

====================
ravb: Add support for r8a7795 SoC

please consider this series for net-next.
It enhances the ravb driver to support the r8a7795 SoC.

Changes:

* Dropped RFC prefix
* Details in changelog of individual patches

Base:

* net-next/master

Availability:

To aid review of this in conjunction with other EtherAVB changes
the following branches are available in my renesas tree on kernel.org.

* me/r8a7795-ravb-driver-v4: this series
* me/r8a7795-ravb-pfc-v2: r8a7795 sh-pfc update for EthernetAVB
* me/r8a7795-ravb-integration-v4: enable EthernetAVB on r8a7795
* me/r8a7795-ravb-driver-and-integration-v4.runtime:
      the above three branches with their runtime dependencies
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoravb: Add support for r8a7795 SoC
Kazuya Mizuguchi [Wed, 30 Sep 2015 06:15:55 +0000 (15:15 +0900)]
ravb: Add support for r8a7795 SoC

This patch supports the r8a7795 SoC by:
- Using two interrupts
  + One for E-MAC
  + One for everything else
  + Both can be handled by the existing common interrupt handler, which
    affords a simpler update to support the new SoC. In future some
    consideration may be given to implementing multiple interrupt handlers
- Limiting the phy speed to 100Mbit/s for the new SoC;
  at this time it is not clear how this restriction may be lifted
  but I hope it will be possible as more information comes to light

Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
[horms: reworked]
Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoravb: Document binding for r8a7795 SoC
Kazuya Mizuguchi [Wed, 30 Sep 2015 06:15:54 +0000 (15:15 +0900)]
ravb: Document binding for r8a7795 SoC

This patch updates the ravb binding to support the r8a7795 SoC by:
- Adding a compat string for the new hardware
- Adding 25 named interrupts to binding for the new SoC;
  older SoCs continue to use a single multiplexed interrupt

The example is also updated to reflect the r8a7795 as this is the
more complex case.

Based on work by Kazuya Mizuguchi and others.

Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Acked-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoravb: Provide dev parameter to DMA API
Kazuya Mizuguchi [Wed, 30 Sep 2015 06:15:53 +0000 (15:15 +0900)]
ravb: Provide dev parameter to DMA API

This patch is in preparation for using this driver on arm64 where the
implementation of __dma_alloc_coherent fails if a device parameter is not
provided.

Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: Masaru Nagai <masaru.nagai.vx@renesas.com>
[horms: squashed into a single patch]
Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agophylib: Add phy_set_max_speed helper
Simon Horman [Wed, 30 Sep 2015 06:15:52 +0000 (15:15 +0900)]
phylib: Add phy_set_max_speed helper

Add a helper to allow ethernet drivers to limit the speed of a phy
(that they are attached to).

This mainly involves factoring out the business-end of
of_set_phy_supported() and exporting a new symbol.

This code seems to be open coded in several places, in several different
variants.

It is envisaged that this will be used in situations where setting the
"max-speed" property in DT is not appropriate, e.g. because the maximum
speed is not a property of the phy hardware.
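
As a rough illustration of what such a helper does, the sketch below masks a
"supported features" word down to the link modes at or below a given maximum
speed. The FEAT_* bits and limit_max_speed() are hypothetical names for this
example; the real helper works on the phylib feature masks and is not
reproduced here:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits; the kernel uses its own feature masks. */
#define FEAT_10BT	(1u << 0)
#define FEAT_100BT	(1u << 1)
#define FEAT_1000BT	(1u << 2)
#define FEAT_PAUSE	(1u << 3)	/* unrelated to speed, must survive */
#define FEAT_SPEED_MASK	(FEAT_10BT | FEAT_100BT | FEAT_1000BT)

/*
 * Restrict a supported-features mask to link modes at or below
 * max_speed (in Mbit/s). Returns 0 on success, -1 for an unknown speed,
 * in which case the mask is left untouched.
 */
static int limit_max_speed(uint32_t *supported, unsigned int max_speed)
{
	uint32_t allowed;

	switch (max_speed) {
	case 1000:
		allowed = FEAT_10BT | FEAT_100BT | FEAT_1000BT;
		break;
	case 100:
		allowed = FEAT_10BT | FEAT_100BT;
		break;
	case 10:
		allowed = FEAT_10BT;
		break;
	default:
		return -1;
	}

	/* Drop every speed bit that is not allowed; keep non-speed bits. */
	*supported &= ~FEAT_SPEED_MASK | allowed;
	return 0;
}
```

A driver for a MAC that cannot run faster than 100Mbit/s would call this once
after attaching to the phy, regardless of what the phy itself advertises.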

Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge branch 'bpf-updates'
David S. Miller [Sat, 3 Oct 2015 12:02:50 +0000 (05:02 -0700)]
Merge branch 'bpf-updates'

Daniel Borkmann says:

====================
BPF updates

Some minor updates to {cls,act}_bpf to retrieve routing realms
and to make skb->priority writable.

Thanks!

v1 -> v2:
 - Dropped preclassify patch for now from the series as the
   rest is pretty much independent of it
 - Rest unchanged, only rebased and already posted Acked-by's kept
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agosched, bpf: make skb->priority writable
Daniel Borkmann [Tue, 29 Sep 2015 23:41:52 +0000 (01:41 +0200)]
sched, bpf: make skb->priority writable

{cls,act}_bpf can now set the skb->priority from an eBPF program based
on various criteria, so that for example classful qdiscs like multiq can
update the skb's priority during enqueue time and further push it down
into subsequent qdiscs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agosched, bpf: add helper for retrieving routing realms
Daniel Borkmann [Tue, 29 Sep 2015 23:41:51 +0000 (01:41 +0200)]
sched, bpf: add helper for retrieving routing realms

Using routing realms as part of the classifier is quite useful, it
can be viewed as a tag for one or multiple routing entries (think of
an analogy to net_cls cgroup for processes), set by user space routing
daemons or via iproute2 as an indicator for traffic classifiers and
later on processed in the eBPF program.

Unlike actions, the classifier can inspect device flags and enable
netif_keep_dst() if necessary. tc actions don't have that possibility,
but in case people know what they are doing, it can be used from there
as well (e.g. via devs that must keep dsts by design anyway).

If a realm is set, the handler returns the non-zero realm. User space
can set the full 32bit realm for the dst.
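
To make the "full 32bit realm" concrete, here is a small sketch of how a
program might split and test such a value. The packing (source realm in the
upper 16 bits, destination realm in the lower 16) is an assumption made for
this example, and realm_pack(), realm_from(), realm_to() and classify() are
hypothetical names, not the helper added by this patch:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed packing: source realm in bits 31..16, destination in 15..0. */
static uint32_t realm_pack(uint16_t from, uint16_t to)
{
	return ((uint32_t)from << 16) | to;
}

static uint16_t realm_from(uint32_t realm)
{
	return (uint16_t)(realm >> 16);
}

static uint16_t realm_to(uint32_t realm)
{
	return (uint16_t)(realm & 0xFFFF);
}

/*
 * Hypothetical classifier rule: match on the destination realm.
 * A realm of 0 is treated as "no realm set" and never matches.
 */
static int classify(uint32_t realm, uint16_t want_to)
{
	if (realm == 0)
		return 0;
	return realm_to(realm) == want_to;
}
```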

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoebpf: migrate bpf_prog's flags to bitfield
Daniel Borkmann [Tue, 29 Sep 2015 23:41:50 +0000 (01:41 +0200)]
ebpf: migrate bpf_prog's flags to bitfield

As we need to add further flags to the bpf_prog structure, let's migrate
both bools to a bitfield representation. The size of the base structure
(excluding insns) remains unchanged at 40 bytes.

Add also tags for the kmemchecker, so that it doesn't throw false
positives. Even in case gcc would generate suboptimal code, it's not
being accessed in performance critical paths.
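
The size argument can be checked with a reduced model. The two structures
below are hypothetical and much smaller than bpf_prog; they only show that
two one-byte bools and two one-bit fields in a 16-bit unit can occupy the
same space, while the bitfield leaves 14 spare bits for future flags.
Bit-field layout is implementation-defined, so this is expected to hold on
common GCC/Clang ABIs rather than everywhere:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Before: one byte per flag. Field names are illustrative only. */
struct prog_bools {
	uint16_t pages;
	bool jited;
	bool gpl_compatible;
	uint32_t len;
};

/* After: the flags share one 16-bit unit, leaving 14 spare bits. */
struct prog_bits {
	uint16_t pages;
	uint16_t jited:1,
		 gpl_compatible:1;
	uint32_t len;
};

/* Returns nonzero when both layouts have the same size on this ABI. */
static int layouts_match(void)
{
	return sizeof(struct prog_bools) == sizeof(struct prog_bits);
}

/* Flags are set and read exactly as before; only the storage changes. */
static void mark_jited(struct prog_bits *p)
{
	p->jited = 1;
}
```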

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge branch 'switchdev-obj'
David S. Miller [Sat, 3 Oct 2015 11:49:48 +0000 (04:49 -0700)]
Merge branch 'switchdev-obj'

Jiri Pirko says:

====================
switchdev: bring back switchdev_obj

The second version of the patch grew into a patchset. Basically, this patchset
brings back the object structure that disappeared with Vivien's recent patchset.
It also makes a few naming changes to bring things in line, and puts the
object id back into the object structure.
Thanks to Scott and Vivien for review and suggestions.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoswitchdev: push object ID back to object structure
Jiri Pirko [Thu, 1 Oct 2015 09:03:46 +0000 (11:03 +0200)]
switchdev: push object ID back to object structure

Suggested-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoswitchdev: bring back switchdev_obj and use it as a generic object param
Jiri Pirko [Thu, 1 Oct 2015 09:03:45 +0000 (11:03 +0200)]
switchdev: bring back switchdev_obj and use it as a generic object param

Replace "void *obj" with a generic structure. Introduce a couple of
helpers along with that.
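
The general pattern, a typed base structure embedded in each concrete object
instead of an untyped pointer, can be sketched in plain C as follows. All
names here (struct obj, struct obj_port_vlan, obj_container_of(),
vlan_count()) are hypothetical simplifications, not the switchdev
definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum obj_id { OBJ_ID_PORT_VLAN, OBJ_ID_PORT_FDB };	/* hypothetical IDs */

/* Generic base object passed through the common API. */
struct obj {
	enum obj_id id;
};

/* A concrete object embeds the base as a member. */
struct obj_port_vlan {
	struct obj obj;
	uint16_t vid_begin;
	uint16_t vid_end;
};

/* Recover the enclosing structure from a pointer to its member. */
#define obj_container_of(ptr, type, member) \
	((type *)((const char *)(ptr) - offsetof(type, member)))

/*
 * A driver callback takes the generic type and downcasts after
 * checking the ID. Returns the number of VLANs in the range, or -1
 * if the object is of another kind.
 */
static int vlan_count(const struct obj *o)
{
	const struct obj_port_vlan *vlan;

	if (o->id != OBJ_ID_PORT_VLAN)
		return -1;
	vlan = obj_container_of(o, const struct obj_port_vlan, obj);
	return vlan->vid_end - vlan->vid_begin + 1;
}
```

Compared with "void *obj", the compiler now rejects a caller that passes an
unrelated pointer, and the callee can verify which object it was given.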

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoswitchdev: rename switchdev_obj_fdb to switchdev_obj_port_fdb
Jiri Pirko [Thu, 1 Oct 2015 09:03:44 +0000 (11:03 +0200)]
switchdev: rename switchdev_obj_fdb to switchdev_obj_port_fdb

Make the struct name in sync with object id name.

Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoswitchdev: rename switchdev_obj_vlan to switchdev_obj_port_vlan
Jiri Pirko [Thu, 1 Oct 2015 09:03:43 +0000 (11:03 +0200)]
switchdev: rename switchdev_obj_vlan to switchdev_obj_port_vlan

Make the struct name in sync with object id name.

Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoswitchdev: rename SWITCHDEV_ATTR_* enum values to SWITCHDEV_ATTR_ID_*
Jiri Pirko [Thu, 1 Oct 2015 09:03:42 +0000 (11:03 +0200)]
switchdev: rename SWITCHDEV_ATTR_* enum values to SWITCHDEV_ATTR_ID_*

To be aligned with obj.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoswitchdev: rename SWITCHDEV_OBJ_* enum values to SWITCHDEV_OBJ_ID_*
Jiri Pirko [Thu, 1 Oct 2015 09:03:41 +0000 (11:03 +0200)]
switchdev: rename SWITCHDEV_OBJ_* enum values to SWITCHDEV_OBJ_ID_*

Suggested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge branch 'tcp-lockless-listener'
David S. Miller [Sat, 3 Oct 2015 11:32:52 +0000 (04:32 -0700)]
Merge branch 'tcp-lockless-listener'

Eric Dumazet says:

====================
tcp/dccp: lockless listener

TCP listener refactoring : this is becoming interesting !

This patch series takes the steps to use normal TCP/DCCP ehash
table to store SYN_RECV requests, instead of the private per-listener
hash table we had until now.

SYNACK skb are now attached to their syn_recv request socket,
so that we no longer heavily modify listener sk_wmem_alloc.

listener lock is no longer held in fast path, including
SYNCOOKIE mode.

During my tests, my server was able to process 3,500,000
SYN packets per second on one listener and still had available
cpu cycles.

That is about 2 to 3 orders of magnitude more than what we had with older kernels.

This effort started two years ago and I am pleased to reach expectations.

We'll probably extend SO_REUSEPORT to add proper cpu/numa affinities,
so that heavy duty TCP servers can get proper siloing thanks to multi-queues
NIC.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: do not lock listener to process SYN packets
Eric Dumazet [Fri, 2 Oct 2015 18:43:39 +0000 (11:43 -0700)]
tcp: do not lock listener to process SYN packets

Everything should now be ready to finally allow SYN
packets processing without holding listener lock.

Tested:

3.5 Mpps SYNFLOOD. Plenty of cpu cycles available.

Next bottleneck is the refcount taken on listener,
that could be avoided if we remove SLAB_DESTROY_BY_RCU
strict semantic for listeners, and use regular RCU.

    13.18%  [kernel]  [k] __inet_lookup_listener
     9.61%  [kernel]  [k] tcp_conn_request
     8.16%  [kernel]  [k] sha_transform
     5.30%  [kernel]  [k] inet_reqsk_alloc
     4.22%  [kernel]  [k] sock_put
     3.74%  [kernel]  [k] tcp_make_synack
     2.88%  [kernel]  [k] ipt_do_table
     2.56%  [kernel]  [k] memcpy_erms
     2.53%  [kernel]  [k] sock_wfree
     2.40%  [kernel]  [k] tcp_v4_rcv
     2.08%  [kernel]  [k] fib_table_lookup
     1.84%  [kernel]  [k] tcp_openreq_init_rwin

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp/dccp: add a reschedule point in inet_csk_listen_stop()
Eric Dumazet [Fri, 2 Oct 2015 18:43:38 +0000 (11:43 -0700)]
tcp/dccp: add a reschedule point in inet_csk_listen_stop()

If a listener with thousands of children in accept queue
is dismantled, it can take a while to close all of them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: remove max_qlen_log
Eric Dumazet [Fri, 2 Oct 2015 18:43:37 +0000 (11:43 -0700)]
tcp: remove max_qlen_log

This control variable was set at first listen(fd, backlog)
call, but not updated if application tried to increase or decrease
backlog. It made sense at the time listener had a non resizeable
hash table.

Also rounding to powers of two was not very friendly.
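
A short sketch of why the rounding was unfriendly. It assumes the old limit
was derived by rounding backlog + 1 up to the next power of two; that formula
and the names roundup_pow2() and rounded_limit() are assumptions made for
illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power of two >= n, for 1 <= n <= 2^31. */
static uint32_t roundup_pow2(uint32_t n)
{
	uint32_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Hypothetical model: queue limit derived by rounding backlog + 1 up. */
static uint32_t rounded_limit(uint32_t backlog)
{
	return roundup_pow2(backlog + 1);
}
```

Under this model a backlog of 1023 yields 1024, while a backlog of 1024
yields 2048, so the effective limit can be almost twice what the application
asked for.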

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp/dccp: remove struct listen_sock
Eric Dumazet [Fri, 2 Oct 2015 18:43:36 +0000 (11:43 -0700)]
tcp/dccp: remove struct listen_sock

It is enough to check listener sk_state, no need for an extra
condition.

max_qlen_log can be moved into struct request_sock_queue

We can remove syn_wait_lock and the alignment it enforced.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: attach SYNACK messages to request sockets instead of listener
Eric Dumazet [Fri, 2 Oct 2015 18:43:35 +0000 (11:43 -0700)]
tcp: attach SYNACK messages to request sockets instead of listener

If a listen backlog is very big (to avoid syncookies), then
the listener sk->sk_wmem_alloc is the main source of false
sharing, as we need to touch it twice per SYNACK re-transmit
and TX completion.

(One SYN packet takes listener lock once, but up to 6 SYNACK
are generated)

By attaching the skb to the request socket, we remove this
source of contention.

Tested:

 listen(fd, 10485760); // single listener (no SO_REUSEPORT)
 16 RX/TX queue NIC
 Sustain a SYNFLOOD attack of ~320,000 SYN per second,
 Sending ~1,400,000 SYNACK per second.
 Perf profiles now show listener spinlock being next bottleneck.

    20.29%  [kernel]  [k] queued_spin_lock_slowpath
    10.06%  [kernel]  [k] __inet_lookup_established
     5.12%  [kernel]  [k] reqsk_timer_handler
     3.22%  [kernel]  [k] get_next_timer_interrupt
     3.00%  [kernel]  [k] tcp_make_synack
     2.77%  [kernel]  [k] ipt_do_table
     2.70%  [kernel]  [k] run_timer_softirq
     2.50%  [kernel]  [k] ip_finish_output
     2.04%  [kernel]  [k] cascade

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoipv6: remove obsolete inet6 functions
Eric Dumazet [Fri, 2 Oct 2015 18:43:34 +0000 (11:43 -0700)]
ipv6: remove obsolete inet6 functions

inet6_csk_search_req() and inet6_csk_reqsk_queue_hash_add()
no longer exist.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp/dccp: shrink struct listen_sock
Eric Dumazet [Fri, 2 Oct 2015 18:43:33 +0000 (11:43 -0700)]
tcp/dccp: shrink struct listen_sock

We no longer use hash_rnd, nr_table_entries and syn_table[]

For a listener with a backlog of 10 million sockets, this
saves 80 MBytes of vmalloced memory.
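
The figure is consistent with one pointer per syn_table[] slot on a 64-bit
machine, as this small calculation shows. The 8-byte slot size is an
assumption, and syn_table_bytes() is a hypothetical helper:

```c
#include <assert.h>
#include <stdint.h>

/* Memory taken by a table of 'slots' entries of 'slot_size' bytes each. */
static uint64_t syn_table_bytes(uint64_t slots, uint64_t slot_size)
{
	return slots * slot_size;
}
```

With 10,000,000 slots of 8 bytes each this gives 80,000,000 bytes.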

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp/dccp: install syn_recv requests into ehash table
Eric Dumazet [Fri, 2 Oct 2015 18:43:32 +0000 (11:43 -0700)]
tcp/dccp: install syn_recv requests into ehash table

In this patch, we insert request sockets into TCP/DCCP
regular ehash table (where ESTABLISHED and TIMEWAIT sockets
are) instead of using the per listener hash table.

ACK packets find SYN_RECV pseudo sockets without having
to find and lock the listener.

In nominal conditions, this halves pressure on listener lock.

Note that this will allow for SO_REUSEPORT refinements,
so that we can select a listener using cpu/numa affinities instead
of the prior 'consistent hash', since only SYN packets will
apply this selection logic.

We will shrink listen_sock in the following patch to ease
code review.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp/dccp: remove inet_csk_reqsk_queue_added() timeout argument
Eric Dumazet [Fri, 2 Oct 2015 18:43:31 +0000 (11:43 -0700)]
tcp/dccp: remove inet_csk_reqsk_queue_added() timeout argument

This is no longer used.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: get_openreq[46]() changes
Eric Dumazet [Fri, 2 Oct 2015 18:43:30 +0000 (11:43 -0700)]
tcp: get_openreq[46]() changes

When request sockets are no longer in a per listener hash table
but on regular TCP ehash, we need to access listener uid
through req->rsk_listener

get_openreq6() also gets a const for its request socket argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: remove BUG_ON() in tcp_check_req()
Eric Dumazet [Fri, 2 Oct 2015 18:43:29 +0000 (11:43 -0700)]
tcp: remove BUG_ON() in tcp_check_req()

Once listener is lockless, its sk_state can change anytime.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: cleanup tcp_v[46]_inbound_md5_hash()
Eric Dumazet [Fri, 2 Oct 2015 18:43:28 +0000 (11:43 -0700)]
tcp: cleanup tcp_v[46]_inbound_md5_hash()

We'll soon have to call tcp_v[46]_inbound_md5_hash() twice.
Also add const attribute to the socket, as it might be the
unlocked listener for SYN packets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp/dccp: init sk_prot and call sk_node_init() in reqsk_alloc()
Eric Dumazet [Fri, 2 Oct 2015 18:43:27 +0000 (11:43 -0700)]
tcp/dccp: init sk_prot and call sk_node_init() in reqsk_alloc()

We plan to use generic functions to insert request sockets
into ehash table.

sk_prot needs to be set (to retrieve sk_prot->h.hashinfo)
sk_node needs to be cleared.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: call sk_mark_napi_id() on the child, not the listener
Eric Dumazet [Fri, 2 Oct 2015 18:43:26 +0000 (11:43 -0700)]
tcp: call sk_mark_napi_id() on the child, not the listener

This fixes a typo : We want to store the NAPI id on child socket.
Presumably nobody really uses busy polling, on short lived flows.

Fixes: 3d97379a67486 ("tcp: move sk_mark_napi_id() at the right place")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: move synflood_warned into struct request_sock_queue
Eric Dumazet [Fri, 2 Oct 2015 18:43:25 +0000 (11:43 -0700)]
tcp: move synflood_warned into struct request_sock_queue

long term plan is to remove struct listen_sock when its hash
table is no longer there.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: move qlen/young out of struct listen_sock
Eric Dumazet [Fri, 2 Oct 2015 18:43:24 +0000 (11:43 -0700)]
tcp: move qlen/young out of struct listen_sock

qlen_inc & young_inc were protected by listener lock,
while qlen_dec & young_dec were atomic fields.

Everything needs to be atomic for upcoming lockless listener.

Also move qlen/young in request_sock_queue as we'll get rid
of struct listen_sock eventually.
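
A minimal userspace sketch of counters that are safe to update without the
listener lock, using C11 atomics. The structure and function names are
hypothetical, and the meaning given to "young" (requests whose SYNACK has not
been retransmitted yet) is an assumption for this example:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical miniature of the queue counters. */
struct req_queue_counters {
	atomic_int qlen;	/* requests currently queued */
	atomic_int young;	/* assumed: requests with no SYNACK retransmit yet */
};

static void counters_init(struct req_queue_counters *q)
{
	atomic_init(&q->qlen, 0);
	atomic_init(&q->young, 0);
}

/* Both increments are atomic, so no external lock is required. */
static void queue_added(struct req_queue_counters *q)
{
	atomic_fetch_add(&q->young, 1);
	atomic_fetch_add(&q->qlen, 1);
}

static void queue_removed(struct req_queue_counters *q, int was_young)
{
	if (was_young)
		atomic_fetch_sub(&q->young, 1);
	atomic_fetch_sub(&q->qlen, 1);
}

static int queue_len(struct req_queue_counters *q)
{
	return atomic_load(&q->qlen);
}
```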

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agotcp: add a spinlock to protect struct request_sock_queue
Eric Dumazet [Fri, 2 Oct 2015 18:43:23 +0000 (11:43 -0700)]
tcp: add a spinlock to protect struct request_sock_queue

struct request_sock_queue fields are currently protected
by the listener 'lock' (not a real spinlock)

We need to add a private spinlock instead, so that softirq handlers
creating children do not have to worry about the backlog notion
that the listener 'lock' carries.
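
The idea of a queue that carries its own small lock, independent of its
owner's lock, can be sketched in userspace as below. This uses a C11
atomic_flag as a toy spinlock; struct req, struct req_queue and the queue_*()
helpers are hypothetical and far simpler than the kernel structures:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct req {
	struct req *next;
	int id;
};

struct req_queue {
	atomic_flag lock;	/* private lock, independent of the owner's lock */
	struct req *head;
	struct req *tail;
};

static void queue_init(struct req_queue *q)
{
	atomic_flag_clear(&q->lock);
	q->head = q->tail = NULL;
}

static void queue_lock(struct req_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;	/* spin until the holder releases the flag */
}

static void queue_unlock(struct req_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Append a request at the tail under the private lock. */
static void queue_add(struct req_queue *q, struct req *r)
{
	r->next = NULL;
	queue_lock(q);
	if (q->tail)
		q->tail->next = r;
	else
		q->head = r;
	q->tail = r;
	queue_unlock(q);
}

/* Remove and return the oldest request, or NULL if the queue is empty. */
static struct req *queue_remove(struct req_queue *q)
{
	struct req *r;

	queue_lock(q);
	r = q->head;
	if (r) {
		q->head = r->next;
		if (!q->head)
			q->tail = NULL;
	}
	queue_unlock(q);
	return r;
}
```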

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
David S. Miller [Fri, 2 Oct 2015 14:21:25 +0000 (07:21 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Conflicts:
net/dsa/slave.c

net/dsa/slave.c simply had overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoMerge tag 'mmc-v4.3-rc3' of git://git.linaro.org/people/ulf.hansson/mmc
Linus Torvalds [Fri, 2 Oct 2015 12:03:04 +0000 (08:03 -0400)]
Merge tag 'mmc-v4.3-rc3' of git://git.linaro.org/people/ulf.hansson/mmc

Pull MMC fixes from Ulf Hansson:
 "Here are some mmc fixes intended for v4.3 rc4:

  MMC core:
   - Allow users of mmc_of_parse() to succeed when CONFIG_GPIOLIB is
     unset
   - Prevent infinite loop of re-tuning for CRC-errors for CMD19 and
     CMD21

  MMC host:
   - pxamci: Fix issues with card detect
   - sunxi: Fix clk-delay settings"

* tag 'mmc-v4.3-rc3' of git://git.linaro.org/people/ulf.hansson/mmc:
  mmc: core: fix dead loop of mmc_retune
  mmc: pxamci: fix card detect with slot-gpio API
  mmc: sunxi: Fix clk-delay settings
  mmc: core: Don't return an error for CD/WP GPIOs when GPIOLIB is unset

9 years agoMerge git://git.infradead.org/intel-iommu
Linus Torvalds [Fri, 2 Oct 2015 11:59:29 +0000 (07:59 -0400)]
Merge git://git.infradead.org/intel-iommu

Pull IOVA fixes from David Woodhouse:
 "The main fix here is the first one, fixing the over-allocation of
  size-aligned requests.  The other patches simply make the existing
  IOVA code available to users other than the Intel VT-d driver, with no
  functional change.

  I concede the latter really *should* have been submitted during the
  merge window, but since it's basically risk-free and people are
  waiting to build on top of it and it's my fault I didn't get it in, I
  (and they) would be grateful if you'd take it"

* git://git.infradead.org/intel-iommu:
  iommu: Make the iova library a module
  iommu: iova: Export symbols
  iommu: iova: Move iova cache management to the iova library
  iommu/iova: Avoid over-allocating when size-aligned

9 years agoMerge branch 'akpm' (patches from Andrew)
Linus Torvalds [Fri, 2 Oct 2015 02:20:11 +0000 (22:20 -0400)]
Merge branch 'akpm' (patches from Andrew)

Merge misc fixes from Andrew Morton:
 "12 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  dmapool: fix overflow condition in pool_find_page()
  thermal: avoid division by zero in power allocator
  memcg: remove pcp_counter_lock
  kprobes: use _do_fork() in samples to make them work again
  drivers/input/joystick/Kconfig: zhenhua.c needs BITREVERSE
  memcg: make mem_cgroup_read_stat() unsigned
  memcg: fix dirty page migration
  dax: fix NULL pointer in __dax_pmd_fault()
  mm: hugetlbfs: skip shared VMAs when unmapping private pages to satisfy a fault
  mm/slab: fix unexpected index mapping result of kmalloc_size(INDEX_NODE+1)
  userfaultfd: remove kernel header include from uapi header
  arch/x86/include/asm/efi.h: fix build failure

9 years agoMerge tag 'pm+acpi-4.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael...
Linus Torvalds [Fri, 2 Oct 2015 02:06:40 +0000 (22:06 -0400)]
Merge tag 'pm+acpi-4.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management and ACPI fixes from Rafael Wysocki:
 "These are fixes mostly, for a few changes made in this cycle (the
  intel_idle driver, the OPP library, the ACPI EC driver, turbostat) and
  for some issues that have just been discovered (ACPI PCI IRQ
  management, PCI power management documentation, turbostat), with a
  couple of cleanups on top of them.

  Specifics:

   - intel_idle driver fixup for the recently added Skylake chips
     support (Len Brown).

   - Operating Performance Points (OPP) library fix related to the
     recently added support for new DT bindings and a fix for a typo in
     a comment (Viresh Kumar, Stephen Boyd).

   - ACPI EC driver fix for a recently introduced memory leak in an
     error code path (Lv Zheng).

   - ACPI PCI IRQ management fix for the issue where an ISA IRQ is
     shared with a PCI device which requires it to be configured in a
     different way and may cause an interrupt storm to happen as a
     result with an extra ACPI SCI IRQ handling simplification on top of
     it (Jiang Liu).

   - Update of the PCI power management documentation that became
     outdated and started to actively confuse the readers to make it
     actually reflect the code (Rafael J Wysocki).

   - turbostat fixes including an IVB Xeon regression fix (related to
     the --debug command line option), Skylake adjustment for the TSC
     running at a frequency that doesn't match the base one exactly, and
     a Knights Landing quirk to account for the fact that it only
     updates APERF and MPERF every 1024 clock cycles plus bumping up the
     turbostat version number (Len Brown, Hubert Chrzaniuk)"

* tag 'pm+acpi-4.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  tools/power turbosat: update version number
  tools/power turbostat: SKL: Adjust for TSC difference from base frequency
  tools/power turbostat: KNL workaround for %Busy and Avg_MHz
  tools/power turbostat: IVB Xeon: fix --debug regression
  ACPI / PCI: Remove duplicated penalty on SCI IRQ
  ACPI, PCI, irq: Do not share PCI IRQ with ISA IRQ
  ACPI / EC: Fix a memory leak issue in acpi_ec_query()
  PM / OPP: Fix typo modifcation -> modification
  PCI / PM: Update runtime PM documentation for PCI devices
  PM / OPP: of_property_count_u32_elems() can return errors
  intel_idle: Skylake Client Support - updated

9 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Fri, 2 Oct 2015 01:55:35 +0000 (21:55 -0400)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Fix regression in SKB partial checksum handling, from Pravin B
   Shelar.

2) Fix VLAN inside of VXLAN handling in i40e driver, from Jesse
   Brandeburg.

3) Cure softlockups during accept() in SCTP, from Karl Heiss.

4) MSG_PEEK should return multiple SKBs worth of data in AF_UNIX, from
   Aaron Conole.

5) IPV6 erroneously ignores output interface specifier in lookup key for
   route lookups, fix from David Ahern.

6) In Marvell DSA driver, forward unknown frames to CPU port, from
   Andrew Lunn.

7) Missing flow flag initializations in some code paths, from David
   Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net: Initialize flow flags in input path
  net: dsa: fix preparation of a port STP update
  testptp: Silence compiler warnings on ppc64
  net/mlx4: Handle return codes in mlx4_qp_attach_common
  dsa: mv88e6xxx: Enable forwarding for unknown to the CPU port
  skbuff: Fix skb checksum partial check.
  net: ipv6: Add RT6_LOOKUP_F_IFACE flag if oif is set
  net sysfs: Print link speed as signed integer
  bna: fix error handling
  af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag
  af_unix: Convert the unix_sk macro to an inline function for type safety
  net: sctp: Don't use 64 kilobyte lookup table for four elements
  l2tp: protect tunnel->del_work by ref_count
  net/ibm/emac: bump version numbers for correct work with ethtool
  sctp: Prevent soft lockup when sctp_accept() is called during a timeout event
  sctp: Whitespace fix
  i40e/i40evf: check for stopped admin queue
  i40e: fix VLAN inside VXLAN
  r8169: fix handling rtl_readphy result
  net: hisilicon: fix handling platform_get_irq result

9 years agodmapool: fix overflow condition in pool_find_page()
Robin Murphy [Thu, 1 Oct 2015 22:37:19 +0000 (15:37 -0700)]
dmapool: fix overflow condition in pool_find_page()

If a DMA pool lies at the very top of the dma_addr_t range (as may
happen with an IOMMU involved), the calculated end address of the pool
wraps around to zero, and page lookup always fails.

Tweak the relevant calculation to be overflow-proof.
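
The wrap-around can be reproduced with 32-bit addresses. The sketch below
contrasts an end-address comparison with an offset comparison; dma_addr32_t,
in_page_naive() and in_page_safe() are hypothetical names for illustration
and do not reproduce pool_find_page() itself:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t dma_addr32_t;	/* hypothetical 32-bit DMA address type */

/* Overflow-prone: base + size wraps to 0 when the page ends at 2^32. */
static bool in_page_naive(dma_addr32_t base, uint32_t size, dma_addr32_t a)
{
	return a >= base && a < (dma_addr32_t)(base + size);
}

/* Overflow-proof: compare the offset into the page instead of its end. */
static bool in_page_safe(dma_addr32_t base, uint32_t size, dma_addr32_t a)
{
	return a >= base && (dma_addr32_t)(a - base) < size;
}
```

For a 4 KiB page at 0xFFFFF000 the naive end address is 0, so no address can
ever compare below it; the offset form never computes the end and is
unaffected.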

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Sakari Ailus <sakari.ailus@iki.fi>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agothermal: avoid division by zero in power allocator
Andrea Arcangeli [Thu, 1 Oct 2015 22:37:16 +0000 (15:37 -0700)]
thermal: avoid division by zero in power allocator

During boot I get a div by zero Oops regression starting in v4.3-rc3.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Javi Merino <javi.merino@arm.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Eduardo Valentin <edubezval@gmail.com>
Cc: Daniel Kurtz <djkurtz@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agomemcg: remove pcp_counter_lock
Greg Thelen [Thu, 1 Oct 2015 22:37:13 +0000 (15:37 -0700)]
memcg: remove pcp_counter_lock

Commit 733a572e66d2 ("memcg: make mem_cgroup_read_{stat|event}() iterate
possible cpus instead of online") removed the last use of the per memcg
pcp_counter_lock but forgot to remove the variable.

Kill the vestigial variable.

Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years agokprobes: use _do_fork() in samples to make them work again
Petr Mladek [Thu, 1 Oct 2015 22:37:11 +0000 (15:37 -0700)]
kprobes: use _do_fork() in samples to make them work again

Commit 3033f14ab78c ("clone: support passing tls argument via C rather
than pt_regs magic") introduced _do_fork(), which allows passing the @tls
parameter.

The old do_fork() is defined only for architectures that are not ready
to use this way and do not define HAVE_COPY_THREAD_TLS.

Let's use _do_fork() in the kprobe examples to make them work again on
all architectures.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thiago Macieira <thiago.macieira@intel.com>
Cc: Jiri Kosina <jkosina@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years ago  drivers/input/joystick/Kconfig: zhenhua.c needs BITREVERSE
Andrew Morton [Thu, 1 Oct 2015 22:37:08 +0000 (15:37 -0700)]
drivers/input/joystick/Kconfig: zhenhua.c needs BITREVERSE

It uses bitrev8(), so it must ensure that lib/bitrev.o gets included in
vmlinux.

Cc: Fengguang Wu <fengguang.wu@gmail.com>
Cc: yalin wang <yalin.wang2010@gmail.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years ago  memcg: make mem_cgroup_read_stat() unsigned
Greg Thelen [Thu, 1 Oct 2015 22:37:05 +0000 (15:37 -0700)]
memcg: make mem_cgroup_read_stat() unsigned

mem_cgroup_read_stat() returns a page count by summing per cpu page
counters.  The summing is racy with respect to updates, so a transient
negative sum is possible.  Callers don't want negative values:

 - mem_cgroup_wb_stats() doesn't want negative nr_dirty or nr_writeback.
   This could confuse dirty throttling.

 - oom reports and memory.stat shouldn't show confusing negative usage.

 - tree_usage() already avoids negatives.

Avoid returning negative page counts from mem_cgroup_read_stat() and
convert it to unsigned.
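
The clamping described above can be sketched in userspace C.  This is a
minimal model, not the kernel code: read_stat_sketch() and the percpu[]
array are hypothetical stand-ins for mem_cgroup_read_stat() and the per
cpu page counters.

```c
#include <stddef.h>

/*
 * Hypothetical userspace model: percpu[] stands in for the per cpu
 * page counters.  Individual entries may be negative, and a racy
 * snapshot can make the sum transiently negative as well.
 */
static unsigned long read_stat_sketch(const long *percpu, size_t ncpus)
{
	long val = 0;

	for (size_t cpu = 0; cpu < ncpus; cpu++)
		val += percpu[cpu];

	/* Clamp a transient negative sum so callers never see one. */
	if (val < 0)
		val = 0;

	return (unsigned long)val;
}
```

Summing in a signed type and clamping before the conversion matters: a
negative sum converted directly to unsigned would appear as a huge count.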

[akpm@linux-foundation.org: fix old typo while we're in there]
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years ago  memcg: fix dirty page migration
Greg Thelen [Thu, 1 Oct 2015 22:37:02 +0000 (15:37 -0700)]
memcg: fix dirty page migration

The problem starts with a file backed dirty page which is charged to a
memcg.  Then page migration is used to move oldpage to newpage.

Migration:
 - copies the oldpage's data to newpage
 - clears oldpage.PG_dirty
 - sets newpage.PG_dirty
 - uncharges oldpage from memcg
 - charges newpage to memcg

Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
count.

However, because newpage is not yet charged, setting newpage.PG_dirty
does not increment the memcg's dirty page count.  After migration
completes newpage.PG_dirty is eventually cleared, often in
account_page_cleaned().  At this time newpage is charged to a memcg so
the memcg's dirty page count is decremented which causes underflow
because the count was not previously incremented by migration.  This
underflow causes balance_dirty_pages() to see a very large unsigned
number of dirty memcg pages which leads to aggressive throttling of
buffered writes by processes in non root memcg.
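
The accounting sequence above can be modelled in userspace C.  All names
here (memcg_sketch, page_sketch, migrate_then_clean) are hypothetical.
The "charge first" mode assumes the remedy is to associate newpage with
oldpage's memcg before the dirty flag is transferred; treat that as an
illustration of the ordering problem rather than the actual patch.

```c
#include <stdbool.h>
#include <stddef.h>

struct memcg_sketch { unsigned long nr_dirty; };
struct page_sketch { struct memcg_sketch *memcg; bool dirty; };

/* Dirty accounting only touches the counter if the page is charged. */
static void set_dirty(struct page_sketch *p)
{
	if (!p->dirty && p->memcg)
		p->memcg->nr_dirty++;
	p->dirty = true;
}

static void clear_dirty(struct page_sketch *p)
{
	if (p->dirty && p->memcg)
		p->memcg->nr_dirty--;	/* wraps if never incremented */
	p->dirty = false;
}

/*
 * Migrate one dirty page, clean it later, and return the memcg's
 * dirty count.  A correct sequence ends at 0.
 */
static unsigned long migrate_then_clean(bool charge_newpage_first)
{
	struct memcg_sketch cg = { 0 };
	struct page_sketch oldpage = { &cg, false };
	struct page_sketch newpage = { NULL, false };

	set_dirty(&oldpage);		/* nr_dirty == 1 */

	if (charge_newpage_first)	/* assumed fix: tie newpage to memcg early */
		newpage.memcg = oldpage.memcg;

	clear_dirty(&oldpage);		/* nr_dirty == 0 */
	set_dirty(&newpage);		/* counted only if newpage has a memcg */
	oldpage.memcg = NULL;		/* uncharge oldpage */
	newpage.memcg = &cg;		/* charge newpage */

	clear_dirty(&newpage);		/* writeback completes some time later */
	return cg.nr_dirty;
}
```

With charge_newpage_first false the final decrement has no matching
increment, so the unsigned counter wraps to its maximum value, which is
the "very large unsigned number" that balance_dirty_pages() then sees.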

This issue:
 - can harm performance of non root memcg buffered writes.
 - can report too small (even negative) values in
   memory.stat[(total_)dirty] counters of all memcg, including the root.

To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
page_memcg() and set_page_memcg() helpers.

Test:
    0) setup and enter limited memcg
    mkdir /sys/fs/cgroup/test
    echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
    echo $$ > /sys/fs/cgroup/test/cgroup.procs

    1) buffered writes baseline
    dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
    sync
    grep ^dirty /sys/fs/cgroup/test/memory.stat

    2) buffered writes with compaction antagonist to induce migration
    yes 1 > /proc/sys/vm/compact_memory &
    rm -rf /data/tmp/foo
    dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
    kill %
    sync
    grep ^dirty /sys/fs/cgroup/test/memory.stat

    3) buffered writes without antagonist, should match baseline
    rm -rf /data/tmp/foo
    dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
    sync
    grep ^dirty /sys/fs/cgroup/test/memory.stat

                       (speed, dirty residue)
             unpatched                       patched
    1) 841 MB/s 0 dirty pages          886 MB/s 0 dirty pages
    2) 611 MB/s -33427456 dirty pages  793 MB/s 0 dirty pages
    3) 114 MB/s -33427456 dirty pages  891 MB/s 0 dirty pages

    Notice that unpatched baseline performance (1) fell after
    migration (3): 841 -> 114 MB/s.  In the patched kernel, post
    migration performance matches baseline.

Fixes: c4843a7593a9 ("memcg: add per cgroup dirty page accounting")
Signed-off-by: Greg Thelen <gthelen@google.com>
Reported-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org> [4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years ago  dax: fix NULL pointer in __dax_pmd_fault()
Ross Zwisler [Thu, 1 Oct 2015 22:36:59 +0000 (15:36 -0700)]
dax: fix NULL pointer in __dax_pmd_fault()

Commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for
DAX") moved some code in __dax_pmd_fault() that was responsible for
zeroing newly allocated PMD pages.  The new location didn't properly set
up 'kaddr', so running this code resulted in a NULL pointer BUG.

Fix this by getting the correct 'kaddr' via bdev_direct_access().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years ago  mm: hugetlbfs: skip shared VMAs when unmapping private pages to satisfy a fault
Mel Gorman [Thu, 1 Oct 2015 22:36:57 +0000 (15:36 -0700)]
mm: hugetlbfs: skip shared VMAs when unmapping private pages to satisfy a fault

SunDong reported the following on

  https://bugzilla.kernel.org/show_bug.cgi?id=103841

  I think I find a linux bug, I have the test cases is constructed. I
  can stable recurring problems in fedora22(4.0.4) kernel version,
  arch for x86_64.  I construct transparent huge page, when the parent
  and child process with MAP_SHARE, MAP_PRIVATE way to access the same
  huge page area, it has the opportunity to lead to huge page copy on
  write failure, and then it will munmap the child corresponding mmap
  area, but then the child mmap area with VM_MAYSHARE attributes, child
  process munmap this area can trigger VM_BUG_ON in set_vma_resv_flags
  functions (vma->vm_flags & VM_MAYSHARE).

There were a number of problems with the report (e.g. it's hugetlbfs that
triggers this, not transparent huge pages) but it was fundamentally
correct in that a VM_BUG_ON in set_vma_resv_flags() can be triggered that
looks like this:

 vma ffff8804651fd0d0 start 00007fc474e00000 end 00007fc475e00000
 next ffff8804651fd018 prev ffff8804651fd188 mm ffff88046b1b1800
 prot 8000000000000027 anon_vma           (null) vm_ops ffffffff8182a7a0
 pgoff 0 file ffff88106bdb9800 private_data           (null)
 flags: 0x84400fb(read|write|shared|mayread|maywrite|mayexec|mayshare|dontexpand|hugetlb)
 ------------
 kernel BUG at mm/hugetlb.c:462!
 SMP
 Modules linked in: xt_pkttype xt_LOG xt_limit [..]
 CPU: 38 PID: 26839 Comm: map Not tainted 4.0.4-default #1
 Hardware name: Dell Inc. PowerEdge R810/0TT6JF, BIOS 2.7.4 04/26/2012
 set_vma_resv_flags+0x2d/0x30

The VM_BUG_ON is correct because private and shared mappings have
different reservation accounting but the warning clearly shows that the
VMA is shared.

When a private COW fails to allocate a new page then only the process
that created the VMA gets the page -- all the children unmap the page.
If the children access that data in the future then they get killed.

The problem is that the same file is mapped shared and private.  During
the COW, the allocation fails, the VMAs are traversed to unmap the other
private pages but a shared VMA is found and the bug is triggered.  This
patch identifies such VMAs and skips them.
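
The skip can be sketched in userspace C.  This is a simplified model
under stated assumptions: vma_sketch, VM_MAYSHARE_SKETCH and
unmap_private_sketch() are hypothetical names, the flag value is only
illustrative, and the array stands in for the real walk over the VMAs
that map the file.

```c
#include <stddef.h>

/* Illustrative flag bit only; not taken from the commit message. */
#define VM_MAYSHARE_SKETCH 0x80UL

struct vma_sketch {
	unsigned long vm_flags;
	int unmapped;		/* set to 1 once the page is unmapped here */
};

/*
 * After a failed private COW allocation, unmap the page from every
 * other private mapping.  The faulting VMA keeps the page, and shared
 * VMAs are skipped because they use different reservation accounting.
 * Returns the number of VMAs that were unmapped.
 */
static size_t unmap_private_sketch(struct vma_sketch *vmas, size_t n,
				   size_t faulting)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (i == faulting)
			continue;	/* the faulting VMA keeps the page */
		if (vmas[i].vm_flags & VM_MAYSHARE_SKETCH)
			continue;	/* shared mapping: leave it alone */
		vmas[i].unmapped = 1;
		count++;
	}
	return count;
}
```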

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: SunDong <sund_sky@126.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years ago  mm/slab: fix unexpected index mapping result of kmalloc_size(INDEX_NODE+1)
Joonsoo Kim [Thu, 1 Oct 2015 22:36:54 +0000 (15:36 -0700)]
mm/slab: fix unexpected index mapping result of kmalloc_size(INDEX_NODE+1)

Commit description is copied from the original post of this bug:

  http://comments.gmane.org/gmane.linux.kernel.mm/135349

Kernels after v3.9 use kmalloc_size(INDEX_NODE + 1) to get the next
cache size larger than the one that index INDEX_NODE maps to.  In
kernels 3.9 and earlier we used malloc_sizes[INDEX_L3 + 1].cs_size.

However, kmalloc_size(INDEX_NODE + 1) does not always return the size
we expect, and the wrong result leads to a BUG().

The mapping table in the latest kernel looks like this:
    index = {0,  1,  2,   3,   4,   5,   6,   n}
    size  = {0,  96, 192, 8,   16,  32,  64,  2^n}
The mapping table before 3.10 looks like this:
    index = {0,  1,  2,   3,   4,   5,   6,   n}
    size  = {32, 64, 96,  128, 192, 256, 512, 2^(n+3)}

The problem on my mips64 machine is as follows:

(1) With DEBUG_SLAB && DEBUG_PAGEALLOC && DEBUG_LOCK_ALLOC &&
    DEBUG_SPINLOCK configured, sizeof(struct kmem_cache_node) is 150,
    so the macro INDEX_NODE evaluates to 2:
    #define INDEX_NODE kmalloc_index(sizeof(struct kmem_cache_node))

(2) The result of kmalloc_size(INDEX_NODE + 1) is then 8.

(3) The "if(size >= kmalloc_size(INDEX_NODE + 1)" test then leads to
    "size = PAGE_SIZE".

(4) The "if ((size >= (PAGE_SIZE >> 3))" test is then satisfied and
    "flags |= CFLGS_OFF_SLAB" is executed.

(5) The "if (flags & CFLGS_OFF_SLAB)" test is satisfied, so we reach
    "cachep->slabp_cache = kmalloc_slab(slab_size, 0u)", whose result
    may be NULL during kernel bootup.

(6) Finally, "BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache));" triggers
    the BUG (possibly only mips64 has this problem).

This patch fixes the problem of kmalloc_size(INDEX_NODE + 1) and removes
the BUG by adding a 'size >= 256' check to guarantee that all necessary
small sized slabs are initialized regardless of the order of slab sizes
in the mapping table.
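
The index arithmetic from steps (1) and (2) can be reproduced with a
small userspace sketch.  It follows the "latest kernel" mapping table
quoted above and assumes a minimum cache size of 8 bytes;
kmalloc_size_sketch() and kmalloc_index_sketch() are hypothetical
stand-ins, not the kernel's definitions.

```c
#include <stddef.h>

/*
 * Size that a cache index maps to: indices 1 and 2 are the odd
 * 96- and 192-byte caches, every other index n maps to 2^n.
 */
static size_t kmalloc_size_sketch(unsigned int n)
{
	if (n == 0)
		return 0;
	if (n == 1)
		return 96;
	if (n == 2)
		return 192;
	return (size_t)1 << n;
}

/* Index of the smallest cache that fits size (8-byte minimum assumed). */
static unsigned int kmalloc_index_sketch(size_t size)
{
	unsigned int n = 3;

	if (size == 0)
		return 0;
	if (size > 64 && size <= 96)
		return 1;
	if (size > 128 && size <= 192)
		return 2;
	/* Bounded so the shift below never reaches the type width. */
	while (n < sizeof(size_t) * 8 - 1 && ((size_t)1 << n) < size)
		n++;
	return n;
}
```

For a 150-byte structure the index is 2 (the 192-byte cache), but index
2 + 1 maps to 8 bytes, not to the next larger cache.  "Index + 1" is
therefore not a safe way to ask for the next larger size.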

Fixes: e33660165c90 ("slab: Use common kmalloc_index/kmalloc_size...")
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Liuhailong <liu.hailong6@zte.com.cn>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years ago  userfaultfd: remove kernel header include from uapi header
Andre Przywara [Thu, 1 Oct 2015 22:36:51 +0000 (15:36 -0700)]
userfaultfd: remove kernel header include from uapi header

As include/uapi/linux/userfaultfd.h is a user visible header file, it
should not include kernel-exclusive header files.

As a result, building the userfaultfd test program from the selftests
directory fails, since the header contains a reference to
linux/compiler.h.  As it turns out, that include is not really needed
there, so we can simply remove it to fix the issue.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years ago  arch/x86/include/asm/efi.h: fix build failure
Andrey Ryabinin [Thu, 1 Oct 2015 22:36:48 +0000 (15:36 -0700)]
arch/x86/include/asm/efi.h: fix build failure

With KMEMCHECK=y, KASAN=n:

  arch/x86/platform/efi/efi.c:673:3: error: implicit declaration of function `memcpy' [-Werror=implicit-function-declaration]
  arch/x86/platform/efi/efi_64.c:139:2: error: implicit declaration of function `memcpy' [-Werror=implicit-function-declaration]
  arch/x86/include/asm/desc.h:121:2: error: implicit declaration of function `memcpy' [-Werror=implicit-function-declaration]

Don't #undef memcpy if KASAN=n.
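
A minimal sketch of the guarded #undef follows.  It assumes that in the
failing configuration memcpy() is provided only as a macro, so an
unconditional #undef leaves no declaration behind; the message does not
spell this out.  copy_bytes_sketch() and copy_int_sketch() are
hypothetical names, and CONFIG_KASAN is simply left undefined here.

```c
#include <stddef.h>

/* Hypothetical out-of-line copy routine standing in for an arch helper. */
static void *copy_bytes_sketch(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

/* Assumption: in this configuration memcpy() exists only as a macro. */
#define memcpy(dst, src, n) copy_bytes_sketch((dst), (src), (n))

/*
 * Drop the macro only when KASAN is enabled.  Dropping it
 * unconditionally would leave memcpy() undeclared in this file.
 */
#ifdef CONFIG_KASAN
#undef memcpy
#endif

/* Uses memcpy(); compiles because the macro is still in effect. */
static int copy_int_sketch(int value)
{
	int out = 0;

	memcpy(&out, &value, sizeof(out));
	return out;
}
```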

Fixes: 769a8089c1fd ("x86, efi, kasan: #undef memset/memcpy/memmove per arch")
Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Reported-by: Ingo Molnar <mingo@kernel.org>
Reported-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>