Jiri Pirko [Fri, 15 May 2015 11:27:32 +0000 (13:27 +0200)]
flow_dissector: remove bogus return in tipc section
Fixes: 06635a35d13d42b9 ("flow_dissect: use programable dissector in skb_flow_dissect and friends") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
hv_netvsc: change member name of struct netvsc_stats
Currently the struct netvsc_stats has a member s_sync
of type u64_stats_sync.
This definition will break kernel build as the macro
netdev_alloc_pcpu_stats requires this member name to be syncp.
(see netdev_alloc_pcpu_stats definition in ./include/linux/netdevice.h)
This patch changes netvsc_stats's member name from s_sync to syncp to fix
the build break.
Signed-off-by: Simon Xiao <sixiao@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
switchdev: add support for fdb add/del/dump via switchdev_port_obj ops.
- introduce port fdb obj and generic switchdev_port_fdb_add/del/dump()
- use switchdev_port_fdb_add/del/dump in rocker/team/bonding ndo ops.
- add support for fdb obj in switchdev_port_obj_add/del/dump()
- switch rocker to implement fdb ops via switchdev_ops
v3: updated to sync with named union changes.
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 18 May 2015 02:45:49 +0000 (22:45 -0400)]
Merge branch 'tcp_mem_pressure'
Eric Dumazet says:
====================
tcp: better handling of memory pressure
When testing commit 790ba4566c1a ("tcp: set SOCK_NOSPACE under memory
pressure") using edge triggered epoll applications, I found various
issues under memory pressure and thousands of active sockets.
This patch series is a first round to solve these issues, in send
and receive paths. There are probably other fixes needed, but
with this series, my tests now all succeed.
v2: fix typo in "allow one skb to be received per socket under memory pressure",
as spotted by Jason Baron.
====================
Acked-by: Jason Baron <jbaron@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 15 May 2015 19:39:30 +0000 (12:39 -0700)]
tcp: halves tcp_mem[] limits
Allowing tcp to use ~19% of physical memory is way too much,
and allowed bugs to be hidden. Add to this that some drivers use a full
page per incoming frame, so real cost can be twice the advertized one.
Reduce tcp_mem by 50 % as a first step to sanity.
tcp_mem[0,1,2] defaults are now 4.68%, 6.25%, 9.37% of physical memory.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 15 May 2015 19:39:28 +0000 (12:39 -0700)]
tcp: fix behavior for epoll edge trigger
Under memory pressure, tcp_sendmsg() can fail to queue a packet
while no packet is present in write queue. If we return -EAGAIN
with no packet in write queue, no ACK packet will ever come
to raise EPOLLOUT.
We need to allow one skb per TCP socket, and make sure that
tcp sockets can release their forward allocations under pressure.
This is a followup to commit 790ba4566c1a ("tcp: set SOCK_NOSPACE
under memory pressure")
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 15 May 2015 19:39:25 +0000 (12:39 -0700)]
net: fix sk_mem_reclaim_partial()
sk_mem_reclaim_partial() goal is to ensure each socket has
one SK_MEM_QUANTUM forward allocation. This is needed both for
performance and better handling of memory pressure situations in
follow up patches.
SK_MEM_QUANTUM is currently a page, but might be reduced to 4096 bytes
as some arches have 64KB pages.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Sun, 17 May 2015 23:44:02 +0000 (19:44 -0400)]
net-packet: fix null pointer exception in rollover mode
Rollover can be enabled as flag or mode. Allocate state in both cases.
This solves a NULL pointer exception in fanout_demux_rollover on
referencing po->rollover if using mode rollover.
Also make sure that in rollover mode each silo is tried (contrary
to rollover flag, where the main socket is excluded after an initial
try_self).
Tested:
Passes tools/testing/net/psock_fanout.c, which tests both modes and
flag. My previous tests were limited to bench_rollover, which only
stresses the flag. The test now completes safely. it still gives an
error for mode rollover, because it does not expect the new headroom
(ROOM_NORMAL) requirement. I will send a separate patch to the test.
Fixes: 0648ab70afe6 ("packet: rollover prepare: per-socket state") Signed-off-by: Willem de Bruijn <willemb@google.com>
----
I should have run this test and caught this before submission, of
course. Apologies for the oversight. Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 15 May 2015 12:48:07 +0000 (05:48 -0700)]
net: fix two sparse errors
First one in __skb_checksum_validate_complete() fixes the following
(and other callers)
make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/tcp_ipv4.o
CHECK net/ipv4/tcp_ipv4.c
include/linux/skbuff.h:3052:24: warning: incorrect type in return expression (different base types)
include/linux/skbuff.h:3052:24: expected restricted __sum16
include/linux/skbuff.h:3052:24: got int
Second is fixing gso_make_checksum() :
CHECK net/ipv4/gre_offload.c
include/linux/skbuff.h:3360:14: warning: incorrect type in assignment (different base types)
include/linux/skbuff.h:3360:14: expected unsigned short [unsigned] [usertype] csum
include/linux/skbuff.h:3360:14: got restricted __sum16
include/linux/skbuff.h:3365:16: warning: incorrect type in return expression (different base types)
include/linux/skbuff.h:3365:16: expected restricted __sum16
include/linux/skbuff.h:3365:16: got unsigned short [unsigned] [usertype] csum
Fixes: 5a21232983aa7 ("net: Support for csum_bad in skbuff") Fixes: 7e2b10c1e52ca ("net: Support for multiple checksums with gso") Signed-off-by: Eric Dumazet <edumazet@google.com> CC: Tom Herbert <tom@herbertland.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 15 May 2015 16:07:31 +0000 (09:07 -0700)]
netfilter: synproxy: fix sparse errors
Fix verbose sparse errors :
make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/netfilter/ipt_SYNPROXY.o
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 15 May 2015 15:58:45 +0000 (08:58 -0700)]
ipip: fix one sparse error
make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/ipip.o
CHECK net/ipv4/ipip.c
net/ipv4/ipip.c:254:27: warning: incorrect type in assignment (different base types)
net/ipv4/ipip.c:254:27: expected restricted __be32 [addressable] [usertype] o_key
net/ipv4/ipip.c:254:27: got restricted __be16 [addressable] [usertype] i_flags
Fixes: 3b7b514f44bf ("ipip: fix a regression in ioctl") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 15 May 2015 15:52:19 +0000 (08:52 -0700)]
net: fix sparse error in csum_replace4()
make C=2 CF=-D__CHECK_ENDIAN__ net/ipv4/netfilter/nf_nat_l3proto_ipv4.o
CHECK net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
include/net/checksum.h:125:64: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:64: expected restricted __wsum [usertype] addend
include/net/checksum.h:125:64: got restricted __be32 [usertype] from
include/net/checksum.h:125:71: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:71: expected restricted __wsum [usertype] addend
include/net/checksum.h:125:71: got restricted __be32 [usertype] to
include/net/checksum.h:125:64: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:64: expected restricted __wsum [usertype] addend
include/net/checksum.h:125:64: got restricted __be32 [usertype] from
include/net/checksum.h:125:71: warning: incorrect type in argument 2 (different base types)
include/net/checksum.h:125:71: expected restricted __wsum [usertype] addend
include/net/checksum.h:125:71: got restricted __be32 [usertype] to
Fixes: 4565af0d406b ("net: optimise csum_replace4()") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Fri, 15 May 2015 04:53:21 +0000 (12:53 +0800)]
rocker: fix a neigh entry leak issue
Once we get a neighbour through looking up arp cache or creating a
new one in rocker_port_ipv4_resolve(), the neighbour's refcount is
already taken. But as we don't put the refcount again after it's
used, this makes the neighbour entry leaked.
Suggested-by: Eric Dumazet <edumazet@google.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
The following series of patches includes functional updates and changes
to the driver.
- Add additional statistics to be collected and reported
- Use the netif_* functions for issuing some debug and informational
driver messages
- Rx path SKB allocation cleanup/simplification
- Remove stand-alone phylib driver and incorporate function into the nic
driver
- Simplify device tree support while maintaining backwards compatibility
- Fix the flow control negotiation logic to properly configure flow
control
- Remove the checking and setting of the device dma_mask field
This patch series is based on net-next.
Changes in v2:
- Change from using the netif_msg_*/netdev_* combination for issuing
messages to the more concise netif_*
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 14 May 2015 16:44:33 +0000 (11:44 -0500)]
amd-xgbe: Remove manual check and set of dma_mask pointer
The underlying device support will set the device dma_mask pointer
if DMA is set up properly for the device. Remove the check for and
assignment of dma_mask when it is null. Instead, just error out if
the dma_set_mask_and_coherent function fails because dma_mask is null.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 14 May 2015 16:44:27 +0000 (11:44 -0500)]
amd-xgbe: Fix flow control setting logic
The flow control negotiation logic is flawed and does not properly
advertise and process auto-negotiation of the flow control settings.
Update the flow control support to properly set the flow control
auto-negotiation settings and process the results approrpriately.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 14 May 2015 16:44:21 +0000 (11:44 -0500)]
amd-xgbe: Support defining PHY resources in ETH device node
Simplify the device tree support of the amd-xgbe driver by defining
the PHY-related resources within the ethernet device node. The support
provides backwards compatibility with the original way.
Update the driver version to 1.0.2.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 14 May 2015 16:44:15 +0000 (11:44 -0500)]
amd-xgbe: Move the PHY support into amd-xgbe
The AMD XGBE device is intended to work with a specific integrated PHY
and that PHY is not meant to be a standalone PHY for use by other
devices. As such this patch removes the phylib driver and implements
the PHY support in the amd-xgbe driver (the majority of the logic from
the phylib driver is moved into the amd-xgbe driver).
Update the driver version to 1.0.1.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 14 May 2015 16:44:09 +0000 (11:44 -0500)]
amd-xgbe: Rework the Rx path SKB allocation
Rework the SKB allocation so that all of the buffers of the first
descriptor are handled in the SKB allocation routine. After copying the
data in the header buffer (which can be just the header if split header
processing succeeded for header plus data if split header processing did
not succeed) into the SKB, check for remaining data in the receive
buffer. If there is data remaining in the receive buffer, add that as a
frag to the SKB. Once an SKB has been allocated, all other descriptors
are added as frags to the SKB.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 14 May 2015 16:44:03 +0000 (11:44 -0500)]
amd-xgbe: Add netif_* message support to the driver
Add support for the network interface message level settings for
determining whether to issue some of the driver messages. Make
use of the netif_* interface where appropriate.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Thu, 14 May 2015 16:43:57 +0000 (11:43 -0500)]
amd-xgbe: Add additional stats to be reported via ethtool
Add additional/extended statistics beyond what is provided by the
hardware to be reported via ethtool. The new stats focus on the
calls into ndo_start_xmit and the napi_poll routine.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 15 May 2015 16:44:23 +0000 (12:44 -0400)]
Merge branch 'stmmac-platform-glue'
Joachim Eastwood says:
====================
convert stmmac glue layers into platform drivers
This patch set aims to convert the current dwmac glue layers into
proper platform drivers as request by Arnd[1]. These changes start
from patch 3 and onwards.
Overview:
Platform driver functions like probe and remove are exported from
the stmmac platform and then used in subsequent glue later
conversions. The conversion involes adding the platform driver
boiler plate code and adding it to the build system. The last patch
removes the driver from the stmmac platform code thus making it into
a library for common platform driver functions.
The two first patches adds glue layer for my platform. I chose to
first add old style glue layer and then convert it. The churn this
creates is just 3 lines.
I would be very nice if people could test this patch set on their
respective platform. My testing has been limited to compiling and
testing on my (LPC18xx) platform. Thanks!
Next I will look into cleaning up the stmmac platform code.
Joachim Eastwood [Thu, 14 May 2015 10:11:06 +0000 (12:11 +0200)]
stmmac: drop driver from stmmac platform code
The dwmac-generic replaces the driver inside the stmmac
platform code. This turns stmmac platform into a library
used by drivers for common platform driver functions.
Signed-off-by: Joachim Eastwood <manabian@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Joachim Eastwood [Thu, 14 May 2015 10:10:59 +0000 (12:10 +0200)]
stmmac: add a generic dwmac driver
Create a new driver around the generic device tree match strings
in the stmmac platform code. This driver is intended to be used
by all platforms that doesn't require any platform specific code
to function or is using platform data.
Signed-off-by: Joachim Eastwood <manabian@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Joachim Eastwood [Thu, 14 May 2015 10:10:58 +0000 (12:10 +0200)]
stmmac: prepare stmmac platform to support stand alone drivers
Prepare the stmmac platform code to support standalone drivers
by exporting the need functions and having of_match_device use
the match table reference already present in the driver struct.
This will allow us to reuse the platform driver functions from
this code easily in other stand alone platform drivers.
Signed-off-by: Joachim Eastwood <manabian@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Joachim Eastwood [Thu, 14 May 2015 10:10:56 +0000 (12:10 +0200)]
stmmac: add dwmac glue for NXP 18xx/43xx family
Add support for Ethernet on NXP LPC18xx and LPC43xx using the
dwmac driver. This glue is required to setup phy interface
mode, MII or RMII, on the SoC.
Signed-off-by: Joachim Eastwood <manabian@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
hv_netvsc: use per_cpu stats to calculate TX/RX data
Current code does not lock anything when calculating the TX and RX stats.
As a result, the RX and TX data reported by ifconfig are not accuracy in a
system with high network throughput and multiple CPUs (in my test,
RX/TX = 83% between 2 HyperV VM nodes which have 8 vCPUs and 40G Ethernet).
This patch fixed the above issue by using per_cpu stats.
netvsc_get_stats64() summarizes TX and RX data by iterating over all CPUs
to get their respective stats.
This v2 patch addressed David's comments on the cleanup path when
netdev_alloc_pcpu_stats() failed.
Signed-off-by: Simon Xiao <sixiao@microsoft.com> Reviewed-by: K. Y. Srinivasan <kys@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Holzheu [Thu, 14 May 2015 03:40:39 +0000 (20:40 -0700)]
test_bpf: fix sparse warnings
Fix several sparse warnings like:
lib/test_bpf.c:1824:25: sparse: constant 4294967295 is so big it is long
lib/test_bpf.c:1878:25: sparse: constant 0x0000ffffffff0000 is so big it is long
Fixes: cffc642d93f9 ("test_bpf: add 173 new testcases for eBPF") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Wed, 13 May 2015 22:36:28 +0000 (00:36 +0200)]
net: core: set qdisc pkt len before tc_classify
commit d2788d34885d4ce5ba ("net: sched: further simplify handle_ing")
removed the call to qdisc_enqueue_root().
However, after this removal we no longer set qdisc pkt length.
This breaks traffic policing on ingress.
This is the minimum fix: set qdisc pkt length before tc_classify.
Only setting the length does remove support for 'stab' on ingress, but
as Alexei pointed out:
"Though it was allowed to add qdisc_size_table to ingress, it's useless.
Nothing takes advantage of recomputed qdisc_pkt_len".
Jamal suggested to use qdisc_pkt_len_init(), but as Eric mentioned that
would result in qdisc_pkt_len_init to no longer get inlined due to the
additional 2nd call site.
ingress policing is rare and GRO doesn't really work that well with police
on ingress, as we see packets > mtu and drop skbs that -- without
aggregation -- would still have fitted the policier budget.
Thus to have reliable/smooth ingress policing GRO has to be turned off.
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Fixes: d2788d34885d ("net: sched: further simplify handle_ing") Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Wed, 13 May 2015 11:43:09 +0000 (13:43 +0200)]
netns: fix unbalanced spin_lock on error
Unlock was missing on error path.
Fixes: 95f38411df05 ("netns: use a spin_lock to protect nsid management") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 13 May 2015 11:12:43 +0000 (13:12 +0200)]
test_bpf: add tests related to BPF_MAXINSNS
Couple of torture test cases related to the bug fixed in 0b59d8806a31
("ARM: net: delegate filter to kernel interpreter when imm_offset()
return value can't fit into 12bits.").
I've added a helper to allocate and fill the insn space. Output on
x86_64 from my laptop:
test_bpf: #233 BPF_MAXINSNS: Maximum possible literals jited:0 7 PASS
test_bpf: #234 BPF_MAXINSNS: Single literal jited:0 8 PASS
test_bpf: #235 BPF_MAXINSNS: Run/add until end jited:0 11553 PASS
test_bpf: #236 BPF_MAXINSNS: Too many instructions PASS
test_bpf: #237 BPF_MAXINSNS: Very long jump jited:0 9 PASS
test_bpf: #238 BPF_MAXINSNS: Ctx heavy transformations jited:0 20329 20398 PASS
test_bpf: #239 BPF_MAXINSNS: Call heavy transformations jited:0 32178 32475 PASS
test_bpf: #240 BPF_MAXINSNS: Jump heavy test jited:0 10518 PASS
test_bpf: #233 BPF_MAXINSNS: Maximum possible literals jited:1 4 PASS
test_bpf: #234 BPF_MAXINSNS: Single literal jited:1 4 PASS
test_bpf: #235 BPF_MAXINSNS: Run/add until end jited:1 1625 PASS
test_bpf: #236 BPF_MAXINSNS: Too many instructions PASS
test_bpf: #237 BPF_MAXINSNS: Very long jump jited:1 8 PASS
test_bpf: #238 BPF_MAXINSNS: Ctx heavy transformations jited:1 3301 3174 PASS
test_bpf: #239 BPF_MAXINSNS: Call heavy transformations jited:1 24107 23491 PASS
test_bpf: #240 BPF_MAXINSNS: Jump heavy test jited:1 8651 PASS
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Nicolas Schichan <nschichan@freebox.fr> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 14 May 2015 21:26:56 +0000 (14:26 -0700)]
tcp: syncookies: extend validity range
Now we allow storing more request socks per listener, we might
hit syncookie mode less often and hit following bug in our stack :
When we send a burst of syncookies, then exit this mode,
tcp_synq_no_recent_overflow() can return false if the ACK packets coming
from clients are coming three seconds after the end of syncookie
episode.
This is a way too strong requirement and conflicts with rest of
syncookie code which allows ACK to be aged up to 2 minutes.
Perfectly valid ACK packets are dropped just because clients might be
in a crowded wifi environment or on another planet.
So let's fix this, and also change tcp_synq_overflow() to not
dirty a cache line for every syncookie we send, as we are under attack.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Florian Westphal <fw@strlen.de> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Thu, 14 May 2015 21:31:28 +0000 (14:31 -0700)]
ip_tunnel: Report Rx dropped in ip_tunnel_get_stats64
The rx_dropped stat wasn't being reported when ip_tunnel_get_stats64 was
called. This was leading to some confusing results in my debug as I was
seeing rx_errors increment but no other value which pointed me toward the
type of error being seen.
This change corrects that by using netdev_stats_to_stats64 to copy all
available dev stats instead of just the few that were hand picked.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Thu, 14 May 2015 19:25:02 +0000 (15:25 -0400)]
packet: fix warnings in rollover lock contention
Avoid two xchg calls whose return values were unused, causing a
warning on some architectures.
The relevant variable is a hint and read without mutual exclusion.
This fix makes all writers hold the receive_queue lock.
Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 14 May 2015 17:40:55 +0000 (13:40 -0400)]
Merge branch 'phy_turn_around'
Florian Fainelli says:
====================
net: phy: broken turn-around support
This is an attempt at solving the broken turn-around problem in a way that
is not specific to the mdio-gpio driver, since it affects different kinds of
platforms.
We cannot make that localized to PHY device drivers because probing the PHY
device which has a broken turn-around can fail as early as in get_phy_id(),
therefore we need a bit of help from Device Tree/platform_data.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Tue, 12 May 2015 17:33:25 +0000 (10:33 -0700)]
of: mdio: Add a "broken-turn-around" property
Some Ethernet PHY devices/switches may not properly release the MDIO bus
during turn-around time, and fail to drive it low, which can be seen by
some controllers as a read failure, while the data clocked in is still
correct.
Add a boolean property "broken-turn-around" which is parsed by the
generic MDIO bus probing code and will set the corresponding bit in the
MDIO bus phy_ignore_ta_mask bitmask for MDIO bus drivers to utilize that
information.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Tue, 12 May 2015 17:33:24 +0000 (10:33 -0700)]
net: phy: Add phy_ignore_ta_mask to account for broken turn-around
Some PHY devices/switches will not release the turn-around line as they
should do at the end of a MDIO transaction. To help with such
situations, allow MDIO bus drivers to be made aware of such
restrictions.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Wed, 13 May 2015 03:20:38 +0000 (11:20 +0800)]
tipc: use sock_create_kern interface to create kernel socket
After commit eeb1bd5c40ed ("net: Add a struct net parameter to
sock_create_kern"), we should use sock_create_kern() to create kernel
socket as the interface doesn't reference count struct net any more.
Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Brian Haley [Thu, 14 May 2015 17:20:15 +0000 (13:20 -0400)]
cls_flower: Fix compile error
Fix compile error in net/sched/cls_flower.c
net/sched/cls_flower.c: In function ‘fl_set_key’:
net/sched/cls_flower.c:240:3: error: implicit declaration of
function ‘tcf_change_indev’ [-Werror=implicit-function-declaration]
err = tcf_change_indev(net, tb[TCA_FLOWER_INDEV]);
Fixes: 77b9900ef53ae ("tc: introduce Flower classifier") Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 14 May 2015 16:24:46 +0000 (12:24 -0400)]
Merge branch 'tipc-next'
Jon Maloy says:
====================
tipc: some link layer improvements
We continue eliminating redundant complexity at the link layer, and
add a couple of improvements to the packet sending functionality.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Thu, 14 May 2015 14:46:18 +0000 (10:46 -0400)]
tipc: add packet sequence number at instant of transmission
Currently, the packet sequence number is updated and added to each
packet at the moment a packet is added to the link backlog queue.
This is wasteful, since it forces the code to traverse the send
packet list packet by packet when adding them to the backlog queue.
It would be better to just splice the whole packet list into the
backlog queue when that is the right action to do.
In this commit, we do this change. Also, since the sequence numbers
cannot now be assigned to the packets at the moment they are added
the backlog queue, we do instead calculate and add them at the moment
of transmission, when the backlog queue has to be traversed anyway.
We do this in the function tipc_link_push_packet().
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Thu, 14 May 2015 14:46:17 +0000 (10:46 -0400)]
tipc: improve link congestion algorithm
The link congestion algorithm used until now implies two problems.
- It is too generous towards lower-level messages in situations of high
load by giving "absolute" bandwidth guarantees to the different
priority levels. LOW traffic is guaranteed 10%, MEDIUM is guaranted
20%, HIGH is guaranteed 30%, and CRITICAL is guaranteed 40% of the
available bandwidth. But, in the absence of higher level traffic, the
ratio between two distinct levels becomes unreasonable. E.g. if there
is only LOW and MEDIUM traffic on a system, the former is guaranteed
1/3 of the bandwidth, and the latter 2/3. This again means that if
there is e.g. one LOW user and 10 MEDIUM users, the former will have
33.3% of the bandwidth, and the others will have to compete for the
remainder, i.e. each will end up with 6.7% of the capacity.
- Packets of type MSG_BUNDLER are created at SYSTEM importance level,
but only after the packets bundled into it have passed the congestion
test for their own respective levels. Since bundled packets don't
result in incrementing the level counter for their own importance,
only occasionally for the SYSTEM level counter, they do in practice
obtain SYSTEM level importance. Hence, the current implementation
provides a gap in the congestion algorithm that in the worst case
may lead to a link reset.
We now refine the congestion algorithm as follows:
- A message is accepted to the link backlog only if its own level
counter, and all superior level counters, permit it.
- The importance of a created bundle packet is set according to its
contents. A bundle packet created from messges at levels LOW to
CRITICAL is given importance level CRITICAL, while a bundle created
from a SYSTEM level message is given importance SYSTEM. In the latter
case only subsequent SYSTEM level messages are allowed to be bundled
into it.
This solves the first problem described above, by making the bandwidth
guarantee relative to the total number of users at all levels; only
the upper limit for each level remains absolute. In the example
described above, the single LOW user would use 1/11th of the bandwidth,
the same as each of the ten MEDIUM users, but he still has the same
guarantee against starvation as the latter ones.
The fix also solves the second problem. If the CRITICAL level is filled
up by bundle packets of that level, no lower level packets will be
accepted any more.
Suggested-by: Gergely Kiss <gergely.kiss@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Thu, 14 May 2015 14:46:16 +0000 (10:46 -0400)]
tipc: simplify link supervision checkpointing
We change the sequence number checkpointing that is performed
by the timer in order to discover if the peer is active. Currently,
we store a checkpoint of the next expected sequence number "rcv_nxt"
at each timer expiration, and compare it to the current expected
number at next timeout expiration. Instead, we now use the already
existing field "silent_intv_cnt" for this task. We step the counter
at each timeout expiration, and zero it at each valid received packet.
If no valid packet has been received from the peer after "abort_limit"
number of silent timer intervals, the link is declared faulty and reset.
We also remove the multiple instances of timer activation from inside
the FSM function "link_state_event()", and now do it at only one place;
at the end of the timer function itself.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Thu, 14 May 2015 14:46:14 +0000 (10:46 -0400)]
tipc: simplify packet sequence number handling
Although the sequence number in the TIPC protocol is 16 bits, we have
until now stored it internally as an unsigned 32 bits integer.
We got around this by always doing explicit modulo-65535 operations
whenever we need to access a sequence number.
We now make the incoming and outgoing sequence numbers to unsigned
16-bit integers, and remove the modulo operations where applicable.
We also move the arithmetic inline functions for 16 bit integers
to core.h, and the function buf_seqno() to msg.h, so they can easily
be accessed from anywhere in the code.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Thu, 14 May 2015 14:46:13 +0000 (10:46 -0400)]
tipc: simplify include dependencies
When we try to add new inline functions in the code, we sometimes
run into circular include dependencies.
The main problem is that the file core.h, which really should be at
the root of the dependency chain, instead is a leaf. I.e., core.h
includes a number of header files that themselves should be allowed
to include core.h. In reality this is unnecessary, because core.h does
not need to know the full signature of any of the structs it refers to,
only their type declaration.
In this commit, we remove all dependencies from core.h towards any
other tipc header file.
As a consequence of this change, we can now move the function
tipc_own_addr(net) from addr.c to addr.h, and make it inline.
There are no functional changes in this commit.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Thu, 14 May 2015 14:46:12 +0000 (10:46 -0400)]
tipc: simplify link timer handling
Prior to this commit, the link timer has been running at a "continuity
interval" of configured link tolerance/4. When a timer wakes up and
discovers that there has been no sign of life from the peer during the
previous interval, it divides its own timer interval by another factor
four, and starts sending one probe per new interval. When the configured
link tolerance time has passed without answer, i.e. after 16 unacked
probes, the link is declared faulty and reset.
This is unnecessary complex. It is sufficient to continue with the
original continuity interval, and instead reset the link after four
missed probe responses. This makes the timer handling in the link
simpler, and opens up for some planned later changes in this area.
This commit implements this change.
Reviewed-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Thu, 14 May 2015 14:46:11 +0000 (10:46 -0400)]
tipc: simplify resetting and disabling of bearers
Since commit 4b475e3f2f8e4e241de101c8240f1d74d0470494
("tipc: eliminate delayed link deletion at link failover") the extra
boolean parameter "shutting_down" is not any longer needed for the
functions bearer_disable() and tipc_link_delete_list().
Furhermore, the function tipc_link_reset_links(), called from
bearer_reset() is now unnecessary. We can just as well delete
all the links, as we do in bearer_disable(), and start over with
creating new links.
This commit introduces those changes.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 14 May 2015 16:21:42 +0000 (12:21 -0400)]
Merge branch 'be2net-next'
Venkat Duvvuru says:
====================
be2net: patch-set
The following patch set has one new feature addition and two fixes.
Patch 1 adds support for hwmon sysfs interface to display board temperature.
Board temperature display through ethtool statistics is removed.
Patch 2 reports "link down" in a few more error cases which are not handled
currently.
Patch 3 adds support for os2bmc. OS2BMC feature will allow the server to
communicate with the on-board BMC/idrac (Baseboard Management Controller)
over the LOM via standard Ethernet. More details are added in the commit log.
Please review.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Venkata Duvvuru [Wed, 13 May 2015 07:30:14 +0000 (13:00 +0530)]
be2net: Support for OS2BMC.
OS2BMC feature will allow the server to communicate with the on-board
BMC/idrac (Baseboard Management Controller) over the LOM via
standard Ethernet.
When OS2BMC feature is enabled, the LOM will filter traffic coming
from the host. If the destination MAC address matches the iDRAC MAC
address, it will forward the packet to the NC-SI side band interface
for iDRAC processing. Otherwise, it would send it out on the wire to
the external network. Broadcast and multicast packets are sent on the
side-band NC-SI channel and on the wire as well. Some of the packet
filters are not supported in the NIC and hence driver will identify
such packets and will hint the NIC to send those packets to the BMC.
This is done by duplicating packets on the management ring. Packets
are sent to the management ring, by setting mgmt bit in the wrb header.
The NIC will forward the packets on the management ring to the BMC
through the side-band NC-SI channel.
Please refer to this online document for more details,
http://www.dell.com/downloads/global/products/pedge/
os_to_bmc_passthrough_a_new_chapter_in_system_management.pdf
Signed-off-by: Venkat Duvvuru <VenkatKumar.Duvvuru@Emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Venkata Duvvuru [Wed, 13 May 2015 07:30:13 +0000 (13:00 +0530)]
be2net: Report a "link down" to the stack when a fatal error or fw reset happens.
When an error (related to HW or FW) is detected on a function, the driver
must pro-actively report a "link down" to the stack so that a possible
failover can be initiated. This is being done currently only for some
HW errors. This patch reports a "link down" even for fatal FW errors and
EEH errors.
Signed-off-by: Venkat Duvvuru <VenkatKumar.Duvvuru@Emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Venkata Duvvuru [Wed, 13 May 2015 07:30:12 +0000 (13:00 +0530)]
be2net: Export board temperature using hwmon-sysfs interface.
Ethtool statistics is not the right place to display board temperature.
This patch adds support to export die temperature of devices supported
by be2net driver via the sysfs hwmon interface.
Signed-off-by: Venkat Duvvuru <VenkatKumar.Duvvuru@Emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 14 May 2015 05:10:06 +0000 (01:10 -0400)]
Merge branch 'nf-ingress'
Pablo Neira Ayuso says:
====================
Netfilter ingress support (v4)
This is the v4 round of patches to add the Netfilter ingress hook, it basically
comes in two steps:
1) Add the CONFIG_NET_INGRESS switch to wrap the ingress static key around it.
The idea is to use the same global static key to avoid adding more code to
the hot path.
2) Add the Netfilter ingress hook after the tc ingress hook, under the global
ingress_needed static key. As I said, the netfilter ingress hook also has
its own static key, that is nested under the global static key. Please, see
patch 5/5 for performance numbers and more information.
I originally started this next round, as it was suggested, exploring the
independent static key for netfilter ingress just after tc ingress, but the
results that I gathered from that patch are not good for non-users:
this roughly means 500Kpps less performance wrt. the base numbers, so that's
the reason why I discarded that approach and I focused on this.
The idea of this patchset is to open the window to nf_tables, which comes with
features that will work out-of-the-box (once the boiler plate code to support
the 'netdev' table family is in place), to avoid repeating myself [1], the most
relevant features are:
1) Multi-dimensional key dictionary lookups.
2) Arbitrary stateful flow tables.
3) Transactions and good support for dynamic updates.
But there are also interest aspects to consider from userspace, such as the
ability to support new layer 2 protocols without kernel updates, a well-defined
netlink interface, userspace libraries and utilities for third party
applications, among others.
Pablo Neira [Wed, 13 May 2015 16:19:38 +0000 (18:19 +0200)]
netfilter: add netfilter ingress hook after handle_ing() under unique static key
This patch adds the Netfilter ingress hook just after the existing tc ingress
hook, that seems to be the consensus solution for this.
Note that the Netfilter hook resides under the global static key that enables
ingress filtering. Nonetheless, Netfilter still also has its own static key for
minimal impact on the existing handle_ing().
Note that the results are very similar before and after.
I can see gcc gets the code under the ingress static key out of the hot path.
Then, on that cold branch, it generates the code to accomodate the netfilter
ingress static key. My explanation for this is that this reduces the pressure
on the instruction cache for non-users as the new code is out of the hot path,
and it comes with minimal impact for tc ingress users.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira [Wed, 13 May 2015 16:19:37 +0000 (18:19 +0200)]
net: add CONFIG_NET_INGRESS to enable ingress filtering
This new config switch enables the ingress filtering infrastructure that is
controlled through the ingress_needed static key. This prepares the
introduction of the Netfilter ingress hook that resides under this unique
static key.
Note that CONFIG_SCH_INGRESS automatically selects this, that should be no
problem since this also depends on CONFIG_NET_CLS_ACT.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Tue, 12 May 2015 18:15:24 +0000 (21:15 +0300)]
net: macb: OR vs AND typos
The bitwise tests are always true here because it uses '|' where '&' is
intended.
Fixes: 98b5a0f4a228 ('net: macb: Add support for jumbo frames') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Wed, 13 May 2015 20:34:13 +0000 (13:34 -0700)]
net: Reserve skb headroom and set skb->dev even if using __alloc_skb
When I had inlined __alloc_rx_skb into __netdev_alloc_skb and
__napi_alloc_skb I had overlooked the fact that there was a return in the
__alloc_rx_skb. As a result we weren't reserving headroom or setting the
skb->dev in certain cases. This change corrects that by adding a couple of
jump labels to jump to depending on __alloc_skb either succeeding or failing.
Fixes: 9451980a6646 ("net: Use cached copy of pfmemalloc to avoid accessing page") Reported-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Tested-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
This 5-patch kernel series adds a netdev implementation of a GENEVE
tunnel driver, and the single iproute2 patch enables creation and
such for those netdevs. This makes use of the existing GENEVE
infrastructure already used by the OVS code. The net/ipv4/geneve.c
file is renamed as net/ipv4/geneve_core.c as part of these changes.
The overall structure of the GENEVE netdev driver is strongly
influenced by the VXLAN netdev driver. This is not surprising, as the
two drivers are intended to serve similar purposes. As development of
the GENEVE driver continues, it is likely that those similarities will
grow stronger. This will include both simple configuration options
(e.g. TOS and TTL settings) and new control plane support.
The current implementation is very simple, restricting itself to point
to point links over IPv4. This is due only to the simplicity of the
implementation, and no such limit is inherent to GENEVE in any way.
Support for IPv6 links and more sophisticated control plane options
are predictable enhancements.
Using the included iproute2 patch, a GENEVE tunnel is created thusly:
ip link add dev gnv0 type geneve remote 192.168.22.1 vni 1234
ip link set gnv0 up
ip addr add 10.1.1.1/24 dev gnv0
After a corresponding tunnel interface is created at the link partner,
traffic should proceed as expected.
Please let me know if anyone has problems...thanks!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
John W. Linville [Wed, 13 May 2015 16:57:30 +0000 (12:57 -0400)]
geneve: add initial netdev driver for GENEVE tunnels
This is an initial implementation of a netdev driver for GENEVE
tunnels. This implementation uses a fixed UDP port, and only supports
point-to-point links with specific partner endpoints. Only IPv4
links are supported at this time.
Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
John W. Linville [Wed, 13 May 2015 16:57:26 +0000 (12:57 -0400)]
geneve: remove MODULE_ALIAS_RTNL_LINK from net/ipv4/geneve.c
This file is essentially a library for implementing the geneve
encapsulation protocol. The file does not register any rtnl_link_ops,
so the MODULE_ALIAS_RTNL_LINK macro is inappropriate here.
Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira [Tue, 12 May 2015 18:28:07 +0000 (20:28 +0200)]
net: kill useless net_*_ingress_queue() definitions when NET_CLS_ACT is unset
This fixes 4577139b2dabf589 ("net: use jump label patching for ingress qdisc in
__netif_receive_skb_core").
The only client of this is sch_ingress and it depends on NET_CLS_ACT. So
there is no way these definition can be of any help.
Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
1. mitigate a case of lock contention
2. avoid exporting resource exhaustion to other sockets,
by migrating only to a victim socket that has ample room
3. avoid reordering of most flows on the socket,
by migrating first the flow responsible for load imbalance
4. help processes detect load imbalance,
by exporting rollover counters
Context: rollover implements flow migration in packet socket fanout
groups in case of extreme load imbalance. It is a specific
implementation of migration that minimizes reordering by selecting
the same victim socket when possible (and by selecting subsequent
victims in a round robin fashion, from which its name derives).
Changes:
v2 -> v3:
- statistics: replace unsigned long with __aligned_u64
v1 -> v2:
- huge flow detection: run lockless
- huge flow detection: replace stored index with random
- contention avoidance: test in packet_poll while lock held
- contention avoidance: clear pressure sooner
packet_poll and packet_recvmsg would clear only if the sock
is empty to avoid taking the necessary lock. But,
* packet_poll already holds this lock, so a lockless variant
__packet_rcv_has_room is cheap.
* packet_recvmsg is usually called only for non-ring sockets,
which also runs lockless.
- preparation: drop "single return" patch
packet_rcv_has_room is now a locked wrapper around
__packet_rcv_has_room, achieving the same (single footer).
The benchmark mentioned in the patches is at
https://github.com/wdebruij/kerneltools/blob/master/tests/bench_rollover.c
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Tue, 12 May 2015 15:56:49 +0000 (11:56 -0400)]
packet: rollover huge flows before small flows
Migrate flows from a socket to another socket in the fanout group not
only when the socket is full. Start migrating huge flows early, to
divert possible 4-tuple attacks without affecting normal traffic.
Introduce fanout_flow_is_huge(). This detects huge flows, which are
defined as taking up more than half the load. It does so cheaply, by
storing the rxhashes of the N most recent packets. If over half of
these are the same rxhash as the current packet, then drop it. This
only protects against 4-tuple attacks. N is chosen to fit all data in
a single cache line.
Tested:
Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.
Willem de Bruijn [Tue, 12 May 2015 15:56:48 +0000 (11:56 -0400)]
packet: rollover lock contention avoidance
Rollover has to call packet_rcv_has_room on sockets in the fanout
group to find a socket to migrate to. This operation is expensive
especially if the packet sockets use rings, when a lock has to be
acquired.
Avoid pounding on the lock by all sockets by temporarily marking a
socket as "under memory pressure" when such pressure is detected.
While set, only the socket owner may call packet_rcv_has_room on the
socket. Once it detects normal conditions, it clears the flag. The
socket is not used as a victim by any other socket in the meantime.
Under reasonably balanced load, each socket writer frequently calls
packet_rcv_has_room and clears its own pressure field. As a backup
for when the socket is rarely written to, also clear the flag on
reading (packet_recvmsg, packet_poll) if this can be done cheaply
(i.e., without calling packet_rcv_has_room). This is only for
edge cases.
Tested:
Ran bench_rollover: a process with 8 sockets in a single fanout
group, each pinned to a single cpu that receives one nic recv
interrupt. RPS and RFS are disabled. The benchmark uses packet
rx_ring, which has to take a lock when determining whether a
socket has room.
Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread
uniformly across the packet sockets (and inserted an iptables
rule to drop in PREROUTING to avoid protocol stack processing).
Without this patch, all sockets try to migrate traffic to
neighbors, causing lock contention when searching for a non-
empty neighbor. The lock is the top 9 entries.
Willem de Bruijn [Tue, 12 May 2015 15:56:47 +0000 (11:56 -0400)]
packet: rollover only to socket with headroom
Only migrate flows to sockets that have sufficient headroom, where
sufficient is defined as having at least 25% empty space.
The kernel has three different buffer types: a regular socket, a ring
with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The
latter two do not expose a read pointer to the kernel, so headroom is
not computed easily. All three needs a different implementation to
estimate free space.
Tested:
Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input.
bench_rollover has as many sockets as there are NIC receive queues
in the system. Each socket is owned by a process that is pinned to
one of the receive cpus. RFS is disabled. RPS is enabled with an
identity mapping (cpu x -> cpu x), to count drops with softnettop.
lpbb5:/export/hda3/willemb# ./bench_rollover -r -l 1000 -s
Press [Enter] to exit
Willem de Bruijn [Tue, 12 May 2015 15:56:45 +0000 (11:56 -0400)]
packet: rollover prepare: move code out of callsites
packet_rcv_fanout calls fanout_demux_rollover twice. Move all rollover
logic into the callee to simplify these callsites, especially with
upcoming changes.
The main differences between the two callsites is that the FLAG
variant tests whether the socket previously selected by another
mode (RR, RND, HASH, ..) has room before migrating flows, whereas the
rollover mode has no original socket to test.
Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 12 May 2015 13:31:48 +0000 (06:31 -0700)]
ipv4: __ip_local_out_sk() is static
__ip_local_out_sk() is only used from net/ipv4/ip_output.c
net/ipv4/ip_output.c:94:5: warning: symbol '__ip_local_out_sk' was not
declared. Should it be static?
Fixes: 7026b1ddb6b8 ("netfilter: Pass socket pointer down through okfn().") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 12 May 2015 13:22:56 +0000 (06:22 -0700)]
tcp/dccp: tw_timer_handler() is static
tw_timer_handler() is only used from net/ipv4/inet_timewait_sock.c
Fixes: 789f558cfb36 ("tcp/dccp: get rid of central timewait timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 13 May 2015 19:19:48 +0000 (15:19 -0400)]
Merge branch 'cls_flower'
Jiri Pirko says:
====================
introduce programable flow dissector and cls_flower
Per Davem's request, I prepared this patchset which introduces programmable
flow dissector. For current users of flow_keys, there is a wrapper
skb_flow_dissect_flow_keys which maintains the previous behaviour.
For purposes of cls_flower, couple of new dissection keys were introduced.
Note that this dissector can be also eventually used by openvswitch code.
Also, as a next step, I plan to get rid of *skb_flow_get_ports(export)
and *__skb_get_poff as their functionality can be now implemented by
skb_flow_dissect as well.
v2->v3:
- remove TCA_FLOWER_POLICE attr suggested by Jamal
v1->v2:
- move __skb_tx_hash rather to dev.c as suggested by Alex
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Tue, 12 May 2015 12:56:21 +0000 (14:56 +0200)]
tc: introduce Flower classifier
This patch introduces a flow-based filter. So far, the very essential
packet fields are supported.
This patch is only the first step. There is a lot of potential performance
improvements possible to implement. Also a lot of features are missing
now. They will be addressed in follow-up patches.
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>