John Fastabend [Fri, 3 Oct 2014 05:43:09 +0000 (22:43 -0700)]
net: sched: suspicious RCU usage in qdisc_watchdog
Suspicious RCU usage in qdisc_watchdog call needs to be done inside
rcu_read_lock/rcu_read_unlock. And then Qdisc destroy operations
need to ensure timer is cancelled before removing qdisc structure.
Fixes: b26b0d1e8b1 ("net: qdisc: use rcu prefix and silence sparse warnings") Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Commit f7f1de51edbd ("net: dsa: start and stop the PHY state machine")
add calls to phy_start() in dsa_slave_open() respectively phy_stop() in
dsa_slave_close().
We also call phy_start_aneg() in dsa_slave_create(), and this call is
messing up with the PHY state machine, since we basically start the
auto-negotiation, and later on restart it when calling phy_start().
phy_start() does not currently handle the PHY_FORCING or PHY_AN states
properly, but such a fix would be too invasive for this window.
Fixes: f7f1de51edbd ("net: dsa: start and stop the PHY state machine") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sébastien Barré [Thu, 2 Oct 2014 19:15:22 +0000 (21:15 +0200)]
Removed unused inet6 address state
the inet6 state INET6_IFADDR_STATE_UP only appeared in its definition.
Cc: Christoph Paasch <christoph.paasch@uclouvain.be> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
net: Cleanup skb cloning by adding SKB_FCLONE_FREE
SKB_FCLONE_UNAVAILABLE has overloaded meaning depending on type of skb.
1: If skb is allocated from head_cache, it indicates fclone is not available.
2: If skb is a companion fclone skb (allocated from fclone_cache), it indicates
it is available to be used.
To avoid confusion for case 2 above, this patch replaces
SKB_FCLONE_UNAVAILABLE with SKB_FCLONE_FREE where appropriate. For fclone
companion skbs, this indicates it is free for use.
SKB_FCLONE_UNAVAILABLE will now simply indicate skb is from head_cache and
cannot / will not have a companion fclone.
Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Generic UDP Encapsulation (GUE) is UDP encapsulation protocol which
encapsulates packets of various IP protocols. The GUE protocol is
described in http://tools.ietf.org/html/draft-herbert-gue-01.
The receive path of GUE is implemented in the FOU over UDP module (FOU).
This includes a UDP encap receive function for GUE as well as GUE
specific GRO functions. Management and configuration of GUE ports shares
most of the same code with FOU.
For the transmit path, the previous FOU support for IPIP, sit, and GRE
was simply extended for GUE (when GUE is enabled insert the GUE
header on transmit in addition to UDP header inserted for FOU).
Semantically GUE is the same as FOU in that the encapsulation (UDP
and GUE headers) that are inserted on transmission and removed on
reception so that IP packet is processed with the inner header.
This patch set includes:
- Some fixes to FOU, removal of IPv4,v6 specific GRO functions
- Support to configure a GUE receive port
- Implementation of GUE receive path (normal and GRO)
- Additions to ip_tunnel netlink to configure GUE
- GUE header inserion in ip_tunnel transmit path
v2:
- Include net/gue.h in patch set
Testing:
I ran performance numbers using netperf TCP_RR with 200 streams,
comparing encapsulation without GUE, encapsulation with GUE, and
encapsulation with FOU.
GRE
TCP_STREAM
IPv4, FOU, UDP checksum enabled
14.04% TX CPU utilization
13.17% RX CPU utilization
9211 Mbps
IPv4, GUE, UDP checksum enabled
14.99% TX CPU utilization
13.79% RX CPU utilization
9185 Mbps
IPv4, FOU, UDP checksum disabled
13.14% TX CPU utilization
23.18% RX CPU utilization
9277 Mbps
IPv4, GUE, UDP checksum disabled
13.66% TX CPU utilization
23.57% RX CPU utilization
9184 Mbps
TCP_RR
IPv4, FOU, UDP checksum enabled
94.2% CPU utilization
155/249/460 90/95/99% latencies
1.17018e+06 tps
IPv4, GUE, UDP checksum enabled
93.9% CPU utilization
158/253/472 90/95/99% latencies
1.15045e+06 tps
IPIP
TCP_STREAM
FOU, UDP checksum enabled
15.28% TX CPU utilization
13.92% RX CPU utilization
9342 Mbps
GUE, UDP checksum enabled
13.99% TX CPU utilization
13.34% RX CPU utilization
9210 Mbps
FOU, UDP checksum disabled
15.08% TX CPU utilization
24.64% RX CPU utilization
9226 Mbps
GUE, UDP checksum disabled
15.90% TX CPU utilization
24.77% RX CPU utilization
9197 Mbps
TCP_RR
FOU, UDP checksum enabled
94.23% CPU utilization
149/237/429 90/95/99% latencies
1.19553e+06 tps
GUE, UDP checksum enabled
93.75% CPU utilization
152/243/442 90/95/99% latencies
1.17027e+06 tps
SIT
TCP_STREAM
FOU, UDP checksum enabled
14.47% TX CPU utilization
14.58% RX CPU utilization
9106 Mbps
GUE, UDP checksum enabled
15.09% TX CPU utilization
14.84% RX CPU utilization
9080 Mbps
FOU, UDP checksum disabled
15.70% TX CPU utilization
27.93% RX CPU utilization
9097 Mbps
GUE, UDP checksum disabled
15.04% TX CPU utilization
27.54% RX CPU utilization
9073 Mbps
TCP_RR
FOU, UDP checksum enabled
96.9% CPU utilization
170/281/581 90/95/99% latencies
1.03372e+06 tps
GUE, UDP checksum enabled
97.16% CPU utilization
172/286/576 90/95/99% latencies
1.00469e+06 tps
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 3 Oct 2014 22:48:10 +0000 (15:48 -0700)]
ip_tunnel: Add GUE support
This patch allows configuring IPIP, sit, and GRE tunnels to use GUE.
This is very similar to fou excpet that we need to insert the GUE header
in addition to the UDP header on transmit.
Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 3 Oct 2014 22:48:09 +0000 (15:48 -0700)]
gue: Receive side for Generic UDP Encapsulation
This patch adds support receiving for GUE packets in the fou module. The
fou module now supports direct foo-over-udp (no encapsulation header)
and GUE. To support this a type parameter is added to the fou netlink
parameters.
For a GUE socket we define gue_udp_recv, gue_gro_receive, and
gue_gro_complete to handle the specifics of the GUE protocol. Most
of the code to manage and configure sockets is common with the fou.
Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 3 Oct 2014 22:48:08 +0000 (15:48 -0700)]
fou: eliminate IPv4,v6 specific GRO functions
This patch removes fou[46]_gro_receive and fou[46]_gro_complete
functions. The v4 or v6 variants were chosen for the UDP offloads
based on the address family of the socket this is not necessary
or correct. Alternatively, this patch adds is_ipv6 to napi_gro_skb.
This is set in udp6_gro_receive and unset in udp4_gro_receive. In
fou_gro_receive the value is used to select the correct inet_offloads
for the protocol of the outer IP header.
Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 3 Oct 2014 22:48:07 +0000 (15:48 -0700)]
ip_tunnel: Account for secondary encapsulation header in max_headroom
When adjusting max_header for the tunnel interface based on egress
device we need to account for any extra bytes in secondary encapsulation
(e.g. FOU).
Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Chen Gang [Thu, 2 Oct 2014 14:32:56 +0000 (22:32 +0800)]
drivers/net/irda/Kconfig: Let SH_IRDA depend on HAS_IOMEM
SH_IRDA needs HAS_IOMEM, so depend on it. The related error(with
allmodconfig under um):
CC [M] drivers/net/irda/sh_irda.o
drivers/net/irda/sh_irda.c: In function ‘sh_irda_probe’:
drivers/net/irda/sh_irda.c:776:2: error: implicit declaration of function ‘ioremap_nocache’ [-Werror=implicit-function-declaration]
self->membase = ioremap_nocache(res->start, resource_size(res));
^
drivers/net/irda/sh_irda.c:776:16: warning: assignment makes pointer from integer without a cast [enabled by default]
self->membase = ioremap_nocache(res->start, resource_size(res));
^
drivers/net/irda/sh_irda.c:821:2: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
iounmap(self->membase);
^
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Chen Gang [Thu, 2 Oct 2014 14:23:33 +0000 (22:23 +0800)]
drivers/net/ethernet/marvell/Kconfig: Let PXA168_ETH depend on HAS_IOMEM
PXA168_ETH need HAS_IOMEM, so depend on it, the related error (with
allmodconfig under um):
CC [M] drivers/net/ethernet/marvell/pxa168_eth.o
drivers/net/ethernet/marvell/pxa168_eth.c: In function ‘pxa168_eth_probe’:
drivers/net/ethernet/marvell/pxa168_eth.c:1605:2: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
iounmap(pep->base);
^
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Chen Gang [Thu, 2 Oct 2014 14:14:04 +0000 (22:14 +0800)]
drivers/net/dsa/Kconfig: Let NET_DSA_BCM_SF2 depend on HAS_IOMEM
NET_DSA_BCM_SF2 need HAS_IOMEM, so depend on it, the related error (with
allmodconfig under um):
CC [M] drivers/net/dsa/bcm_sf2.o
drivers/net/dsa/bcm_sf2.c: In function ‘bcm_sf2_sw_setup’:
drivers/net/dsa/bcm_sf2.c:487:3: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
iounmap(*base);
^
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Chen Gang [Thu, 2 Oct 2014 14:01:42 +0000 (22:01 +0800)]
drivers/net/can/Kconfig: Let CAN_AT91 depend on HAS_IOMEM
CAN_AT91 needs HAS_IOMEM, so depends on it. The related error (with
allmodconfig under um):
CC [M] drivers/net/can/at91_can.o
drivers/net/can/at91_can.c: In function ‘at91_can_probe’:
drivers/net/can/at91_can.c:1329:2: error: implicit declaration of function ‘ioremap_nocache’ [-Werror=implicit-function-declaration]
addr = ioremap_nocache(res->start, resource_size(res));
^
drivers/net/can/at91_can.c:1329:7: warning: assignment makes pointer from integer without a cast [enabled by default]
addr = ioremap_nocache(res->start, resource_size(res));
^
drivers/net/can/at91_can.c:1384:2: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
iounmap(addr);
^
Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 3 Oct 2014 22:43:50 +0000 (15:43 -0700)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2014-10-02
This series contains updates to fm10k, igb, ixgbe and i40e.
Alex provides two updates to the fm10k driver. First reduces the buffer
size to 2k for all page sizes, since most frames only have a 1500 MTU
so supporting a buffer size larger than this is somewhat wasteful.
Second fixes an issue where the number of transmit queues was not being
updated, so added the lines necessary to update the number of transmit
queues.
Rick Jones provides two patches to convert ixgbe, igb and i40e to use
dev_consume_skb_any().
Emil provides two patches for ixgbe, first cleans up a couple of wait
loops on auto-negotiation that were not needed. Second fixes an issue
reported by Fujitsu/Red Hat, which consolidates the logic behind the
dynamically setting of TXDCTL.WTHRESH depending on interrupt throttle
rate (ITR) setting regardless of BQL.
Ethan Zhao provides a cleanup patch for ixgbe where he noticed a
duplicate define.
Bernhard Kaindl provides a patch for igb to remove a source of latency
spikes by not calling code that uses mdelay() for feeding a PHY stat
while being called with a spinlock held.
Todd bumps the igb version based on the recent changes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 3 Oct 2014 22:42:37 +0000 (15:42 -0700)]
Merge branch 'mlx5-next'
Eli Cohen says:
====================
mlx5 update for 3.18
This series integrates a new mechanism for populating and extracting field values
used in the driver/firmware interaction around command mailboxes.
Changes from V1:
- Remove unused definition of memcpy_cpu_to_be32()
- Remove definitions of non_existent_*() and use BUILD_BUG_ON() instead.
- Added a patch one line patch to add support for ConnectX-4 devices.
Changes from V0:
- trimmed the auto-generated file to a minimum, as required by the reviewers.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eli Cohen [Thu, 2 Oct 2014 09:19:45 +0000 (12:19 +0300)]
net/mlx5_core: Identify resources by their type
This patch puts a common part as the first field of mlx5_core_qp. This field is
used to identify which resource generated an event. This is required since upcoming
new resource types such as DC targets are allocated for the same numerical space
as regular QPs and may generate the same events. By searching the resource in the
same table we can then look at the common field to identify the resource.
Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eli Cohen [Thu, 2 Oct 2014 09:19:43 +0000 (12:19 +0300)]
net/mlx5_core: Use hardware registers description header file
Add an auto generated header file that describes hardware registers along with
set of macros that set/get values. The macros do static checks to avoid
overflow, handle endianess, and overall provide a clean way to code commands.
Currently the header file is small and we will add structs as we make use of
the macros.
A few commands were removed from the commands enum since they are not supported
currently and will be added when support is available.
Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Rearrange struct mlx5_caps so it has a "gen" field to represent the current
capabilities configured for the device. Max capabilities can also be queried
from the device. Also update capabilities struct to contain more fields as per
the latest revision if firmware specification.
Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 3 Oct 2014 22:31:07 +0000 (15:31 -0700)]
qdisc: validate skb without holding lock
Validation of skb can be pretty expensive :
GSO segmentation and/or checksum computations.
We can do this without holding qdisc lock, so that other cpus
can queue additional packets.
Trick is that requeued packets were already validated, so we carry
a boolean so that sch_direct_xmit() can validate a fresh skb list,
or directly use an old one.
Tested on 40Gb NIC (8 TX queues) and 200 concurrent flows, 48 threads
host.
Turning TSO on or off had no effect on throughput, only few more cpu
cycles. Lock contention on qdisc lock disappeared.
Same if disabling TX checksum offload.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tobias Klauser [Thu, 2 Oct 2014 08:15:30 +0000 (10:15 +0200)]
net: ethernet: Remove superfluous ether_setup after alloc_etherdev
There is no need to call ether_setup after alloc_ethdev since it was
already called there.
Follow commits c706471b2601 ("net: axienet: remove unnecessary
ether_setup after alloc_etherdev") and 3c87dcbfb36c ("net: ll_temac:
Remove unnecessary ether_setup after alloc_etherdev") and fix the
pattern in all remaining ethernet drivers.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 3 Oct 2014 19:37:23 +0000 (12:37 -0700)]
Merge branch 'qdisc_bulk_dequeue'
Jesper Dangaard Brouer says:
====================
qdisc: bulk dequeue support
This patchset uses DaveM's recent API changes to dev_hard_start_xmit(),
from the qdisc layer, to implement dequeue bulking.
Patch01: "qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE"
- Implement basic qdisc dequeue bulking
- This time, 100% relying on BQL limits, no magic safe-guard constants
Patch02: "qdisc: dequeue bulking also pickup GSO/TSO packets"
- Extend bulking to bulk several GSO/TSO packets
- Seperate patch, as it introduce a small regression, see test section.
We do have a patch03, which exports a userspace tunable as a BQL
tunable, that can byte-cap or disable the bulking/bursting. But we
could not agree on it internally, thus not sending it now. We
basically strive to avoid adding any new userspace tunable.
Testing patch01:
================
Demonstrating the performance improvement of qdisc dequeue bulking, is
tricky because the effect only "kicks-in" once the qdisc system have a
backlog. Thus, for a backlog to form, we need either 1) to exceed wirespeed
of the link or 2) exceed the capability of the device driver.
For practical use-cases, the measureable effect of this will be a
reduction in CPU usage
01-TCP_STREAM:
--------------
Testing effect for TCP involves disabling TSO and GSO, because TCP
already benefit from bulking, via TSO and especially for GSO segmented
packets. This patch view TSO/GSO as a seperate kind of bulking, and
avoid further bulking of these packet types.
The measured perf diff benefit (at 10Gbit/s) for a single netperf
TCP_STREAM were 9.24% less CPU used on calls to _raw_spin_lock()
(mostly from sch_direct_xmit).
If my E5-2695v2(ES) CPU is tuned according to:
http://netoptimizer.blogspot.dk/2014/04/basic-tuning-for-network-overload.html
Then it is possible that a single netperf TCP_STREAM, with GSO and TSO
disabled, can utilize all bandwidth on a 10Gbit/s link. This will
then cause a standing backlog queue at the qdisc layer.
Trying to pressure the system some more CPU util wise, I'm starting
24x TCP_STREAMs and monitoring the overall CPU utilization. This
confirms bulking saves CPU cycles when it "kicks-in".
Tool mpstat, while stressing the system with netperf 24x TCP_STREAM, shows:
* Disabled bulking: sys:2.58% soft:8.50% idle:88.78%
* Enabled bulking: sys:2.43% soft:7.66% idle:89.79%
02-UDP_STREAM
-------------
The measured perf diff benefit for UDP_STREAM were 6.41% less CPU used
on calls to _raw_spin_lock(). 24x UDP_STREAM with packet size -m 1472 (to
avoid sending UDP/IP fragments).
03-trafgen driver test
----------------------
The performance of the 10Gbit/s ixgbe driver is limited due to
updating the HW ring-queue tail-pointer on every packet. As
previously demonstrated with pktgen.
Using trafgen to send RAW frames from userspace (via AF_PACKET), and
forcing it through qdisc path (with option --qdisc-path and -t0),
sending with 12 CPUs.
I can demonstrate this driver layer limitation:
* 12.8 Mpps with no qdisc bulking
* 14.8 Mpps with qdisc bulking (full 10G-wirespeed)
Testing patch02:
================
Testing Bulking several GSO/TSO packets:
Measuring HoL (Head-of-Line) blocking for TSO and GSO, with
netperf-wrapper. Bulking several TSO show no performance regressions
(requeues were in the area 32 requeues/sec for 10G while transmitting
approx 813Kpps).
Bulking several GSOs does show small regression or very small
improvement (requeues were in the area 8000 requeues/sec, for 10G
while transmitting approx 813Kpps).
Using ixgbe 10Gbit/s with GSO bulking, we can measure some additional
latency. Base-case, which is "normal" GSO bulking, sees varying
high-prio queue delay between 0.38ms to 0.47ms. Bulking several GSOs
together, result in a stable high-prio queue delay of 0.50ms.
Using igb at 100Mbit/s with GSO bulking, shows an improvement.
Base-case sees varying high-prio queue delay between 2.23ms to 2.35ms
diff of 0.12ms corrosponding to 1500 bytes at 100Mbit/s. Bulking
several GSOs together, result in a stable high-prio queue delay of
2.23ms.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
qdisc: dequeue bulking also pickup GSO/TSO packets
The TSO and GSO segmented packets already benefit from bulking
on their own.
The TSO packets have always taken advantage of the only updating
the tailptr once for a large packet.
The GSO segmented packets have recently taken advantage of
bulking xmit_more API, via merge commit 53fda7f7f9e8 ("Merge
branch 'xmit_list'"), specifically via commit 7f2e870f2a4 ("net:
Move main gso loop out of dev_hard_start_xmit() into helper.")
allowing qdisc requeue of remaining list. And via commit ce93718fb7cd ("net: Don't keep around original SKB when we
software segment GSO frames.").
This patch allow further bulking of TSO/GSO packets together,
when dequeueing from the qdisc.
Testing:
Measuring HoL (Head-of-Line) blocking for TSO and GSO, with
netperf-wrapper. Bulking several TSO show no performance regressions
(requeues were in the area 32 requeues/sec).
Bulking several GSOs does show small regression or very small
improvement (requeues were in the area 8000 requeues/sec).
Using ixgbe 10Gbit/s with GSO bulking, we can measure some additional
latency. Base-case, which is "normal" GSO bulking, sees varying
high-prio queue delay between 0.38ms to 0.47ms. Bulking several GSOs
together, result in a stable high-prio queue delay of 0.50ms.
Using igb at 100Mbit/s with GSO bulking, shows an improvement.
Base-case sees varying high-prio queue delay between 2.23ms to 2.35ms
Signed-off-by: David S. Miller <davem@davemloft.net>
qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE
Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
sending/processing an entire skb list.
This patch implements qdisc bulk dequeue, by allowing multiple packets
to be dequeued in dequeue_skb().
The optimization principle for this is two fold, (1) to amortize
locking cost and (2) avoid expensive tailptr update for notifying HW.
(1) Several packets are dequeued while holding the qdisc root_lock,
amortizing locking cost over several packet. The dequeued SKB list is
processed under the TXQ lock in dev_hard_start_xmit(), thus also
amortizing the cost of the TXQ lock.
(2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more
API to delay HW tailptr update, which also reduces the cost per
packet.
One restriction of the new API is that every SKB must belong to the
same TXQ. This patch takes the easy way out, by restricting bulk
dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the
qdisc only have attached a single TXQ.
Some detail about the flow; dev_hard_start_xmit() will process the skb
list, and transmit packets individually towards the driver (see
xmit_one()). In case the driver stops midway in the list, the
remaining skb list is returned by dev_hard_start_xmit(). In
sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
To avoid overshooting the HW limits, which results in requeuing, the
patch limits the amount of bytes dequeued, based on the drivers BQL
limits. In-effect bulking will only happen for BQL enabled drivers.
Small amounts for extra HoL blocking (2x MTU/0.24ms) were
measured at 100Mbit/s, with bulking 8 packets, but the
oscillating nature of the measurement indicate something, like
sched latency might be causing this effect. More comparisons
show, that this oscillation goes away occationally. Thus, we
disregard this artifact completely and remove any "magic" bulking
limit.
For now, as a conservative approach, stop bulking when seeing TSO and
segmented GSO packets. They already benefit from bulking on their own.
A followup patch add this, to allow easier bisect-ability for finding
regressions.
Jointed work with Hannes, Daniel and Florian.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Mark Einon [Tue, 30 Sep 2014 21:29:46 +0000 (22:29 +0100)]
et131x: Add PCIe gigabit ethernet driver et131x to drivers/net
This adds the ethernet driver for Agere et131x devices to
drivers/net/ethernet.
The driver being added has been in the staging tree for some time, and will be
removed from there in a seperate patch. This one merely disables the staging
version to prevent two instances being built.
Signed-off-by: Mark Einon <mark.einon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Rick Jones [Wed, 17 Sep 2014 03:56:20 +0000 (03:56 +0000)]
i40e/igb: Convert to dev_consume_skb_any()
Convert two more Intel NIC drivers to dev_consume_skb_any() to help
make dropped packet profiling sane.
Signed-off-by: Rick Jones <rick.jones2@hp.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Tested-by: Jim Young <jamesx.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Bernhard Kaindl [Wed, 17 Sep 2014 19:11:16 +0000 (19:11 +0000)]
igb: remove blocking phy read from inside spinlock
Remove a source of latency spikes (in my case up to 10ms) by not calling
code that uses mdelay() for feeding a phy statistic (rx errors for idle
symbols - not data -> idle_errors) while being called with a spinlock held.
As idle_errors isn't read, this patch only removes unused code and data.
Later, more complicated changes may be applied to address the spinlock and
allow for some PHY diagnostics by harvesting this PHY stats register fully.
This patch is designed to fix the issue and be safe for longterm/stable.
For the Intel e1000e driver, the same change was applied in 2008 with
commit 23033fad5be0 ("e1000e: remove phy read from inside spinlock").
The mdelay is triggered by HW/SW semaphores, thus it depends on the HW.
I've HW that triggers it even when idle. Others may trigger it only e.g.
when Ethernet ports aquire or loose the link or on ifconfig up / down.
We've noticed this first from delays in frame rx/tx due to the mdelay().
Example command for checking if the issue is triggered: cyclictest -Smp1
(Look for occasional "Max:" values > 4000 or use -b 4000 to stop if greater)
It was observed with I350 ports connected to other I350 ports, but not
if driver and EEPROM was modified to run the I350 in EEPROM-less mode.
phy_stats.idle_errors and .receive_errors (isn't touched) occupy 64 not
used bits in the adapter struct: Their allocation may be removed as well.
Cc: Carolyn Wyborny <carolyn.wyborny@intel.com> Cc: Todd Fujinaka <todd.fujinaka@intel.com> Fixes: 12dcd86b75d5 ("igb: fix stats handling") (this added the spin_lock) Signed-off-by: Bernhard Kaindl <bk-linux@use.startmail.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Emil Tantilov [Thu, 18 Sep 2014 08:05:02 +0000 (08:05 +0000)]
ixgbe: fix setting of TXDCTL.WTRHESH when ITR is set to 0 and no BQL
This patch consolidates the logic behind dynamically setting TXDCTL.WTHRESH
depending on interrupt throttle rate (ITR) setting regardless of BQL.
Previously TXDCTL.WTHRESH was dynamically being set only with BQL being
enabled, but we have to set it regardless of BQL when ITR is low to avoid
Tx stalls/hangs.
CC: John Greene <jogreene@redhat.com>
Reported by: Masayuki Gouji <gouji.masayuki@jp.fujitsu.com> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Emil Tantilov [Sat, 6 Sep 2014 07:50:27 +0000 (07:50 +0000)]
ixgbe: remove wait loop on autoneg for copper devices
This patch removes couple of wait loops on autoneg that are not needed.
During validation we noticed that the loops always time out, so there
should be no user impact.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Rick Jones [Fri, 12 Sep 2014 17:44:06 +0000 (17:44 +0000)]
ixgbe: Convert the normal transmit complete path to dev_consume_skb_any()
Convert the normal packet completion path to dev_consume_skb_any() so
packet drop profiling via dropwatch or perf top -G -e skb_kfree_skb
is not cluttered with false hits.
Compile tested only. There is a dev_kfree_skb_any() in the routine
ixgbe_ptp_tx_hwtstamp() in ixgbe_ptp.c that looks like a conversion
candidate but I wasn't familiar enough with the code to pull the
trigger.
Signed-off-by: Rick Jones <rick.jones2@hp.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Tue, 30 Sep 2014 22:49:22 +0000 (22:49 +0000)]
fm10k: Correctly set the number of Tx queues
The number of Tx queues was not being updated due to some issues when
generating the patches. This change makes sure to add the lines necessary
to update the number of Tx queues correctly.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Fri, 26 Sep 2014 06:33:49 +0000 (06:33 +0000)]
fm10k: Reduce buffer size when pages are larger than 4K
This change reduces the buffer size to 2K for all page sizes. The basic
idea is that since most frames only have a 1500 MTU supporting a buffer
size larger than this is somewhat wasteful. As such I have reduced the
size to 2K for all page sizes which will allow for more uses per page.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
1) Don't halt the firmware in r8152 driver, from Hayes Wang.
2) Handle full sized 802.1ad frames in bnx2 and tg3 drivers properly,
from Vlad Yasevich.
3) Don't sleep while holding tx_clean_lock in netxen driver, fix from
Manish Chopra.
4) Certain kinds of ipv6 routes can end up endlessly failing the route
validation test, causing it to be re-looked up over and over again.
This particularly kills input route caching in TCP sockets. Fix
from Hannes Frederic Sowa.
5) netvsc_start_xmit() has a use-after-free access to skb->len, fix
from K Y Srinivasan.
6) Fix matching of inverted containers in ematch module, from Ignacy
Gawędzki.
7) Aggregation of GRO frames via SKB ->frag_list for linear skbs isn't
handled properly, regression fix from Eric Dumazet.
8) Don't test return value of ipv4_neigh_lookup(), which returns an
error pointer, against NULL. From WANG Cong.
9) Fix an old regression where we mistakenly allow a double add of the
same tunnel. Fixes from Steffen Klassert.
10) macvtap device delete and open can run in parallel and corrupt lists
etc., fix from Vlad Yasevich.
11) Fix build error with IPV6=m NETFILTER_XT_TARGET_TPROXY=y, from Pablo
Neira Ayuso.
12) rhashtable_destroy() triggers lockdep splats, fix also from Pablo.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (32 commits)
bna: Update Maintainer Email
r8152: disable power cut for RTL8153
r8152: remove clearing bp
bnx2: Correctly receive full sized 802.1ad fragmes
tg3: Allow for recieve of full-size 8021AD frames
r8152: fix setting RTL8152_UNPLUG
netxen: Fix bug in Tx completion path.
netxen: Fix BUG "sleeping function called from invalid context"
ipv6: remove rt6i_genid
hyperv: Fix a bug in netvsc_start_xmit()
net: stmmac: fix stmmac_pci_probe failed when CONFIG_HAVE_CLK is selected
ematch: Fix matching of inverted containers.
gro: fix aggregation for skb using frag_list
neigh: check error pointer instead of NULL for ipv4_neigh_lookup()
ip6_gre: Return an error when adding an existing tunnel.
ip6_vti: Return an error when adding an existing tunnel.
ip6_tunnel: Return an error when adding an existing tunnel.
ip6gre: add a rtnl link alias for ip6gretap
net/mlx4_core: Allow not to specify probe_vf in SRIOV IB mode
r8152: fix the carrier off when autoresuming
...
Petri Gynther [Wed, 1 Oct 2014 18:58:02 +0000 (11:58 -0700)]
net: phy: add BCM7425 and BCM7429 PHYs
Signed-off-by: Petri Gynther <pgynther@google.com> Acked-by: Florian Fainelli <f.fainelli@gmai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petri Gynther [Wed, 1 Oct 2014 18:30:01 +0000 (11:30 -0700)]
net: bcmgenet: fix bcmgenet_put_tx_csum()
bcmgenet_put_tx_csum() needs to return skb pointer back to the caller
because it reallocates a new one in case of lack of skb headroom.
Signed-off-by: Petri Gynther <pgynther@google.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch demonstrates the effect of delaying update of HW tailptr.
(based on earlier patch by Jesper)
burst=1 is the default. It sends one packet with xmit_more=false
burst=2 sends one packet with xmit_more=true and
2nd copy of the same packet with xmit_more=false
burst=3 sends two copies of the same packet with xmit_more=true and
3rd copy with xmit_more=false
Performance with ixgbe (usec 30):
burst=1 tx:9.2 Mpps
burst=2 tx:13.5 Mpps
burst=3 tx:14.5 Mpps full 10G line rate
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for being able to propagate port states to e.g: notifiers
or other kernel parts, do not manipulate the port state directly, but
instead use a helper function which will allow us to do a bit more than
just setting the state.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tp could be freed in call_rcu callback too, the order is not guaranteed.
John Fastabend says:
====================
Its worth noting why this is safe. Any running schedulers will either
read the valid class field or it will be zeroed.
All schedulers today when the class is 0 do a lookup using the
same call used by the tcf_exts_bind(). So even if we have a running
classifier hit the null class pointer it will do a lookup and get
to the same result. This is particularly fragile at the moment because
the only way to verify this is to audit the schedulers call sites.
====================
Cc: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
struct list_head can not be simply copied and we should always init it.
Cc: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 2 Oct 2014 01:35:58 +0000 (21:35 -0400)]
Merge branch 'udp_gso'
Tom Herbert says:
====================
udp: Generalize GSO for UDP tunnels
This patch set generalizes the UDP tunnel segmentation functions so
that they can work with various protocol encapsulations. The primary
change is to set the inner_protocol field in the skbuff when creating
the encapsulated packet, and then in skb_udp_tunnel_segment this data
is used to determine the function for segmenting the encapsulated
packet. The inner_protocol field is overloaded to take either an
Ethertype or IP protocol.
The inner_protocol is set on transmit using skb_set_inner_ipproto or
skb_set_inner_protocol functions. VXLAN and IP tunnels (for fou GSO)
were modified to call these.
Notes:
- GSO for GRE/UDP where GRE checksum is enabled does not work.
Handling this will require some special case code.
- Software GSO now supports many varieties of encapsulation with
SKB_GSO_UDP_TUNNEL{_CSUM}. We still need a mechanism to query
for device support of particular combinations (I intend to
add ndo_gso_check for that).
- MPLS seems to be the only previous user of inner_protocol. I don't
believe these patches can affect that. For supporting GSO with
MPLS over UDP, the inner_protocol should be set using the
helper functions in this patch.
- GSO for L2TP/UDP should also be straightforward now.
v2:
- Respin for Eric's restructuring of skbuff.
Tested GRE, IPIP, and SIT over fou as well as VLXAN. This was
done using 200 TCP_STREAMs in netperf.
GRE
IPv4, FOU, UDP checksum enabled
TCP_STREAM TSO enabled on tun interface
14.04% TX CPU utilization
13.17% RX CPU utilization
9211 Mbps
TCP_STREAM TSO disabled on tun interface
27.82% TX CPU utilization
25.41% RX CPU utilization
9336 Mbps
IPv4, FOU, UDP checksum disabled
TCP_STREAM TSO enabled on tun interface
13.14% TX CPU utilization
23.18% RX CPU utilization
9277 Mbps
TCP_STREAM TSO disabled on tun interface
30.00% TX CPU utilization
31.28% RX CPU utilization
9327 Mbps
IPIP
FOU, UDP checksum enabled
TCP_STREAM TSO enabled on tun interface
15.28% TX CPU utilization
13.92% RX CPU utilization
9342 Mbps
TCP_STREAM TSO disabled on tun interface
27.82% TX CPU utilization
25.41% RX CPU utilization
9336 Mbps
FOU, UDP checksum disabled
TCP_STREAM TSO enabled on tun interface
15.08% TX CPU utilization
24.64% RX CPU utilization
9226 Mbps
TCP_STREAM TSO disabled on tun interface
30.00% TX CPU utilization
31.28% RX CPU utilization
9327 Mbps
SIT
FOU, UDP checksum enabled
TCP_STREAM TSO enabled on tun interface
14.47% TX CPU utilization
14.58% RX CPU utilization
9106 Mbps
TCP_STREAM TSO disabled on tun interface
31.82% TX CPU utilization
30.82% RX CPU utilization
9204 Mbps
FOU, UDP checksum disabled
TCP_STREAM TSO enabled on tun interface
15.70% TX CPU utilization
27.93% RX CPU utilization
9097 Mbps
TCP_STREAM TSO disabled on tun interface
33.48% TX CPU utilization
37.36% RX CPU utilization
9197 Mbps
VXLAN
TCP_STREAM TSO enabled on tun interface
16.42% TX CPU utilization
23.66% RX CPU utilization
9081 Mbps
TCP_STREAM TSO disabled on tun interface
30.32% TX CPU utilization
30.55% RX CPU utilization
9185 Mbps
Baseline (no encp, TSO and LRO enabled)
TCP_STREAM
11.85% TX CPU utilization
15.13% RX CPU utilization
9452 Mbps
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 30 Sep 2014 03:22:32 +0000 (20:22 -0700)]
gre: Set inner protocol in v4 and v6 GRE transmit
Call skb_set_inner_protocol to set inner Ethernet protocol to
protocol being encapsulation by GRE before tunnel_xmit. This is
needed for GSO if UDP encapsulation (fou) is being done.
Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 30 Sep 2014 03:22:29 +0000 (20:22 -0700)]
udp: Generalize skb_udp_segment
skb_udp_segment is the function called from udp4_ufo_fragment to
segment a UDP tunnel packet. This function currently assumes
segmentation is transparent Ethernet bridging (i.e. VXLAN
encapsulation). This patch generalizes the function to
operate on either Ethertype or IP protocol.
The inner_protocol field must be set to the protocol of the inner
header. This can now be either an Ethertype or an IP protocol
(in a union). A new flag in the skbuff indicates which type is
effective. skb_set_inner_protocol and skb_set_inner_ipproto
helper functions were added to set the inner_protocol. These
functions are called from the point where the tunnel encapsulation
is occuring.
When skb_udp_tunnel_segment is called, the function to segment the
inner packet is selected based on the inner IP or Ethertype. In the
case of an IP protocol encapsulation, the function is derived from
inet[6]_offloads. In the case of Ethertype, skb->protocol is
set to the inner_protocol and skb_mac_gso_segment is called. (GRE
currently does this, but it might be possible to lookup the protocol
in offload_base and call the appropriate segmenation function
directly).
Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 2 Oct 2014 01:30:46 +0000 (21:30 -0400)]
Merge branch 'bpf-next'
Alexei Starovoitov says:
====================
bpf: add search pruning optimization and tests
patch #1 commit log explains why eBPF verifier has to examine some
instructions multiple times and describes the search pruning optimization
that improves verification speed for branchy programs and allows more
complex programs to be verified successfully.
This patch completes the core verifier logic.
patch #2 adds more verifier tests related to branches and search pruning
I'm still working on Andy's 'bitmask for stack slots' suggestion. It will be
done on top of this patch.
The current verifier algorithm is brute force depth first search with
state pruning. If anyone can come up with another algorithm that demonstrates
better results, we'll replace the algorithm without affecting user space.
Note verifier doesn't guarantee that all possible valid programs are accepted.
Overly complex programs may still be rejected.
Verifier improvements/optimizations will guarantee that if a program
was passing verification in the past, it will still be passing.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
consider C program represented in eBPF:
int filter(int arg)
{
int a, b, c, *ptr;
if (arg == 1)
ptr = &a;
else if (arg == 2)
ptr = &b;
else
ptr = &c;
*ptr = 0;
return 0;
}
eBPF verifier has to follow all possible paths through the program
to recognize that '*ptr = 0' instruction would be safe to execute
in all situations.
It's doing it by picking a path towards the end and observes changes
to registers and stack at every insn until it reaches bpf_exit.
Then it comes back to one of the previous branches and goes towards
the end again with potentially different values in registers.
When program has a lot of branches, the number of possible combinations
of branches is huge, so verifer has a hard limit of walking no more
than 32k instructions. This limit can be reached and complex (but valid)
programs could be rejected. Therefore it's important to recognize equivalent
verifier states to prune this depth first search.
Basic idea can be illustrated by the program (where .. are some eBPF insns):
1: ..
2: if (rX == rY) goto 4
3: ..
4: ..
5: ..
6: bpf_exit
In the first pass towards bpf_exit the verifier will walk insns: 1, 2, 3, 4, 5, 6
Since insn#2 is a branch the verifier will remember its state in verifier stack
to come back to it later.
Since insn#4 is marked as 'branch target', the verifier will remember its state
in explored_states[4] linked list.
Once it reaches insn#6 successfully it will pop the state recorded at insn#2 and
will continue.
Without search pruning optimization verifier would have to walk 4, 5, 6 again,
effectively simulating execution of insns 1, 2, 4, 5, 6
With search pruning it will check whether state at #4 after jumping from #2
is equivalent to one recorded in explored_states[4] during first pass.
If there is an equivalent state, verifier can prune the search at #4 and declare
this path to be safe as well.
In other words two states at #4 are equivalent if execution of 1, 2, 3, 4 insns
and 1, 2, 4 insns produces equivalent registers and stack.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nimrod Andy [Tue, 30 Sep 2014 01:28:05 +0000 (09:28 +0800)]
net: fec: implement rx_copybreak to improve rx performance
- Copy short frames and keep the buffers mapped, re-allocate skb instead of
memory copy for long frames.
- Add support for setting/getting rx_copybreak using generic ethtool tunable
Changes V3:
* As Eric Dumazet's suggestion that removing the copybreak module parameter
and only keep the ethtool API support for rx_copybreak.
Changes V2:
* Implements rx_copybreak
* Rx_copybreak provides module parameter to change this value
* Add tunable_ops support for rx_copybreak
Signed-off-by: Fugang Duan <B38611@freescale.com> Signed-off-by: Frank Li <Frank.Li@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
hayeswang [Wed, 1 Oct 2014 05:25:10 +0000 (13:25 +0800)]
r8152: remove clearing bp
The xxx_clear_bp() is used to halt the firmware. It only necessary
for updating the new firmware. Besides, depend on the version of
the current firmware, it may have problem to halt the firmware
directly. Finally, halt the firmware would let the firmware code
useless, and the bugs which are fixed by the firmware would occur.
Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
bnx2: Correctly receive full sized 802.1ad fragmes
This driver, similar to tg3, has a check that will
cause full sized 802.1ad frames to be dropped. The
frame will be larger then the standard mtu due to the
presense of vlan header that has not been stripped.
The driver should not drop this frame and should process
it just like it does for 802.1q.
CC: Sony Chacko <sony.chacko@qlogic.com> CC: Dept-HSGLinuxNICDev@qlogic.com Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When receiving a vlan-tagged frame that still contains
a vlan header, the length of the packet will be greater
then MTU+ETH_HLEN since it will account of the extra
vlan header. TG3 checks this for the case for 802.1Q,
but not for 802.1ad. As a result, full sized 802.1ad
frames get dropped by the card.
Add a check for 802.1ad protocol when receving full
sized frames.
Suggested-by: Prashant Sreedharan <prashant@broadcom.com> CC: Prashant Sreedharan <prashant@broadcom.com> CC: Michael Chan <mchan@broadcom.com> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Francois Romieu <romieu@fr.zoreil.com> Cc: Hayes Wang <hayeswang@realtek.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: abort orphan sockets stalling on zero window probes
Currently we have two different policies for orphan sockets
that repeatedly stall on zero window ACKs. If a socket gets
a zero window ACK when it is transmitting data, the RTO is
used to probe the window. The socket is aborted after roughly
tcp_orphan_retries() retries (as in tcp_write_timeout()).
But if the socket was idle when it received the zero window ACK,
and later wants to send more data, we use the probe timer to
probe the window. If the receiver always returns zero window ACKs,
icsk_probes keeps getting reset in tcp_ack() and the orphan socket
can stall forever until the system reaches the orphan limit (as
commented in tcp_probe_timer()). This opens up a simple attack
to create lots of hanging orphan sockets to burn the memory
and the CPU, as demonstrated in the recent netdev post "TCP
connection will hang in FIN_WAIT1 after closing if zero window is
advertised." http://www.spinics.net/lists/netdev/msg296539.html
This patch follows the design in RTO-based probe: we abort an orphan
socket stalling on zero window when the probe timer reaches both
the maximum backoff and the maximum RTO. For example, an 100ms RTT
connection will timeout after roughly 153 seconds (0.3 + 0.6 +
.... + 76.8) if the receiver keeps the window shut. If the orphan
socket passes this check, but the system already has too many orphans
(as in tcp_out_of_resources()), we still abort it but we'll also
send an RST packet as the connection may still be active.
In addition, we change TCP_USER_TIMEOUT to cover (life or dead)
sockets stalled on zero-window probes. This changes the semantics
of TCP_USER_TIMEOUT slightly because it previously only applies
when the socket has pending transmission.
Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reported-by: Andrey Dmitrov <andrey.dmitrov@oktetlabs.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 1 Oct 2014 20:22:00 +0000 (13:22 -0700)]
Merge branch 'for-3.17' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfix from Bruce Fields:
"This fixes a data corruption bug introduced by the v3.16 xdr encoding
rewrite. I haven't managed to reproduce it myself yet, but it's
apparently not hard to hit given the right workload"
* 'for-3.17' of git://linux-nfs.org/~bfields/linux:
nfsd4: fix corruption of NFSv4 read data
Chun-Hao Lin [Wed, 1 Oct 2014 15:17:21 +0000 (23:17 +0800)]
r8169:call "rtl8168_driver_start" "rtl8168_driver_stop" only when hardware dash function is enabled
These two functions are used to inform dash firmware that driver is been
brought up or brought down. So call these two functions only when hardware dash
function is enabled.
Signed-off-by: Chun-Hao Lin <hau@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Chun-Hao Lin [Wed, 1 Oct 2014 15:17:20 +0000 (23:17 +0800)]
r8169:modify the behavior of function "rtl8168_oob_notify"
In function "rtl8168_oob_notify", using function "rtl_eri_write" to access
eri register 0xe8, instead of using MAC register "ERIDR" and "ERIAR" to
access it.
For using function "rtl_eri_write" in function "rtl8168_oob_notify", need to
move down "rtl8168_oob_notify" related functions under the function
"rtl_eri_write".
Signed-off-by: Chun-Hao Lin <hau@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Chun-Hao Lin [Wed, 1 Oct 2014 15:17:17 +0000 (23:17 +0800)]
r8169:for function "rtl_w1w0_phy" change its name and behavior
Change function name from "rtl_w1w0_phy" to "rtl_w0w1_phy".
And its behavior from "write ones then write zeros" to
"write zeros then write ones".
In Realtek internal driver, bitwise operations are almost "write zeros then
write ones". For easy to port hardware parameters from Realtek internal driver
to Linux kernal driver "r8169", we would like to change this function's
behavior and its name.
Signed-off-by: Chun-Hao Lin <hau@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Thierry Reding [Wed, 1 Oct 2014 11:59:00 +0000 (13:59 +0200)]
net: dsa: Fix build warning for !PM_SLEEP
The dsa_switch_suspend() and dsa_switch_resume() functions are only used
when PM_SLEEP is enabled, so they need #ifdef CONFIG_PM_SLEEP protection
to avoid a compiler warning.
Signed-off-by: Thierry Reding <treding@nvidia.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: ll_temac: Remove unnecessary ether_setup after alloc_etherdev
Calling ether_setup is redundant since alloc_etherdev calls it.
Signed-off-by: Subbaraya Sundeep Bhatta <sbhatta@xilinx.com> Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 1 Oct 2014 05:12:05 +0000 (22:12 -0700)]
ipv4: mentions skb_gro_postpull_rcsum() in inet_gro_receive()
Proper CHECKSUM_COMPLETE support needs to adjust skb->csum
when we remove one header. Its done using skb_gro_postpull_rcsum()
In the case of IPv4, we know that the adjustment is not really needed,
because the checksum over IPv4 header is 0. Lets add a comment to
ease code comprehension and avoid copy/paste errors.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
fm10k: using vmalloc requires including linux/vmalloc.h
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 3243acd37fd9
("ieee802154: add __init to lowpan_frags_sysctl_register")
added __init to lowpan_frags_ns_sysctl_register instead of
lowpan_frags_sysctl_register
Suggested-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 1 Oct 2014 02:52:08 +0000 (19:52 -0700)]
Merge branch 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm
Pull ARM fixes from Russell King:
"Some further ARM fixes:
- another build fix for the kprobes test code
- a fix for no kuser helpers for the set_tls code, which oopsed on
noMMU hardware
- a fix for alignment handler with neon opcodes being misinterpreted
- turning off the hardware access support, which is not implemented
- a build fix for the v7 coherency exiting code, which can be built
in non-v7 environments (but still only executed on v7 CPUs)"
* 'fixes' of git://ftp.arm.linux.org.uk/~rmk/linux-arm:
ARM: 8179/1: kprobes-test: Fix compile error "bad immediate value for offset"
ARM: 8178/1: fix set_tls for !CONFIG_KUSER_HELPERS
ARM: 8177/1: cacheflush: Fix v7_exit_coherency_flush exynos build breakage on ARMv6
ARM: 8165/1: alignment: don't break misaligned NEON load/store
ARM: 8164/1: mm: clear SCTLR.HA instead of setting it for LPAE
David S. Miller [Tue, 30 Sep 2014 21:10:47 +0000 (17:10 -0400)]
Merge branch 'sunvnet-jumbograms'
David L Stevens says:
====================
sunvnet: add jumbo frames support
This patch set updates the sunvnet driver to version 1.6 of the VIO protocol
to support per-port exchange of MTU information and allow non-standard MTU
sizes, including jumbo frames.
Using large MTUs shows a nearly 5X throughput improvement Linux-Solaris
and > 10X throughput improvement Linux-Linux.
Changes from v8:
-add a short timeout to free pending skbs if a new transmit doesn't
do it first per Dave Miller <davem@davemloft.net>
Changes from v7:
-handle skb allocation failures in vnet_skb_shape()
per Dave Miller <davem@davemloft.net>
Changes from v6:
-made kernel transmit path zero-copy to remove memory n^2 scaling issue
raised by Raghuram Kothakota <Raghuram.Kothakota@oracle.com>
Changes from v5:
- fixed comment per Sowmini Varadhan <sowmini.varadhan@oracle.com>
Changes from v4:
- changed VNET_MAXPACKET per David Laight <David.Laight@ACULAB.COM>
- added cookies to support non-contiguous buffers of max size
Changes from v3:
- added version functions per Dave Miller <davem@davemloft.net>
- moved rmtu to vnet_port per Dave Miller <davem@davemloft.net>
- explicitly set options bits and capability flags to 0 per
Raghuram Kothakota <Raghuram.Kothakota@oracle.com>
Changes from v2:
- make checkpatch clean
Changes from v1:
- fix brace formatting per Dave Miller <davem@davemloft.net>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David L Stevens [Mon, 29 Sep 2014 23:48:24 +0000 (19:48 -0400)]
sunvnet: generate ICMP PTMUD messages for smaller port MTUs
This patch sends ICMP and ICMPv6 messages for Path MTU Discovery when a remote
port MTU is smaller than the device MTU. This allows mixing newer VIO protocol
devices that support MTU negotiation with older devices that do not on the
same vswitch. It also allows Linux-Linux LDOMs to use 64K-1 data packets even
though Solaris vswitch is limited to <16K MTU.
Signed-off-by: David L Stevens <david.stevens@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David L Stevens [Mon, 29 Sep 2014 23:48:11 +0000 (19:48 -0400)]
sunvnet: make transmit path zero-copy in the kernel
This patch removes pre-allocated transmit buffers and instead directly maps
pending packets on demand. This saves O(n^2) maximum-sized transmit buffers,
for n hosts on a vswitch, as well as a copy to those buffers.
Single-stream TCP throughput linux-solaris dropped ~5% for 1500-byte MTU,
but linux-linux at 1500-bytes increased ~20%.
Signed-off-by: David L Stevens <david.stevens@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David L Stevens [Mon, 29 Sep 2014 23:47:59 +0000 (19:47 -0400)]
sunvnet: upgrade to VIO protocol version 1.6
This patch upgrades the sunvnet driver to support VIO protocol version 1.6.
In particular, it adds per-port MTU negotiation, allowing MTUs other than
ETH_FRAMELEN with ports using newer VIO protocol versions.
Signed-off-by: David L Stevens <david.stevens@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
arch/mips/net/bpf_jit.c: In function 'build_body':
arch/mips/net/bpf_jit.c:762:6: error: unused variable 'tmp'
cc1: all warnings being treated as errors
make[2]: *** [arch/mips/net/bpf_jit.o] Error 1
Seen when building mips:allmodconfig in -next since next-20140924.
Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 30 Sep 2014 20:37:13 +0000 (16:37 -0400)]
Merge branch 'pxa168_eth'
Antoine Tenart says:
====================
ARM: Berlin: Ethernet support
This series introduce support for the Ethernet controller on Berlin SoCs,
using the existing pxa168 Ethernet driver. In order to do this, DT
support is added to the driver alongside some other modifications and
fixes.
This has been tested on a Berlin BG2Q DMP board.
Changes since v5:
- fixed the build when building the driver as a module
Changes since v4:
- removed the phy-addr property and added a phy subnode
- added COMPILE_TEST for the pxa168_eth driver
Changes since v3:
- moved the addition of pxa168_eth_get_mac_address() to the patch
using it first
Changes since v2:
- reworked how the MAC address is configured
- made the clock anonymous
Changes since v1:
- removed custom Berlin Ethernet driver
- used the pxa168 Ethernet driver instead
- made modifications to the pxa168 driver (DT support, fixes)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Tue, 30 Sep 2014 14:28:16 +0000 (16:28 +0200)]
ARM: dts: berlin: enable the Ethernet port on the BG2Q DMP
This patch enables the Ethernet port on the Marvell Berlin2Q DMP board.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Tue, 30 Sep 2014 14:28:15 +0000 (16:28 +0200)]
ARM: dts: berlin: add the Ethernet node
This patch adds the Ethernet node, enabling the network unit on Berlin
BG2Q SoCs.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>