Sean Wang [Thu, 22 Sep 2016 08:36:15 +0000 (16:36 +0800)]
net: ethernet: mediatek: remove superfluous local variable for phy address
remove the unused variable for parsing PHY address
and the related logic for sanity test which would
be all already handled done when of_mdiobus_register
was called
Reported-by: Nelson Chang <nelson.chang@mediatek.com> Signed-off-by: Sean Wang <sean.wang@mediatek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 22 Sep 2016 12:21:28 +0000 (08:21 -0400)]
Merge branch 'mediatek-trgmii'
Sean Wang says:
====================
mediatek: add support for RGMII on GMAC0 through TRGMII hardware module
By default, GMAC0 is connected to built-in switch called
MT7530 through the proprietary interface called Turbo RGMII
(TRGMII). TRGMII also supports well for RGMII as generic external
PHY uses but requires some slight changes to the setup of TRGMII
and doesn't have well support on current driver.
So this patchset
1) provides the slight changes of the setup for RGMII can work
through TRGMII
2) adds additional setting "trgmii" as PHY_INTERFACE_MODE_TRGMII
about phy-mode on device tree to make GMAC0 distinguish which
mode it runs
3) changes dynamically source clock, TX/RX delay and interface
mode on TRGMII for adapting various link
Changes since v1:
- fixed the style of comment which doesn't have a space at
the beginning and end of comment lines
- add support for phy-mode "trgmii" as PHY_INTERFACE_MODE_TRGMII
into linux/phy.h
- enhance the Documentation about device tree binding for trgmii
which is applicable only for GMAC0 which uses fixed-link
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Wang [Thu, 22 Sep 2016 02:33:55 +0000 (10:33 +0800)]
net: ethernet: mediatek: add support for GMAC0 connecting with external PHY through TRGMII
Changing dynamically source clock, TX/RX delay and interface mode
used by TRGMII hardware module inside PHY capability polling routine
for adapting to the various speed of RGMII used by external PHY for
GMAC0.
Signed-off-by: Sean Wang <sean.wang@mediatek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Wang [Thu, 22 Sep 2016 02:33:54 +0000 (10:33 +0800)]
net: ethernet: mediatek: add extension of phy-mode for TRGMII
adds PHY-mode "trgmii" as an extension for the operation
mode of the PHY interface for PHY_INTERFACE_MODE_TRGMII.
and adds a variable trgmii inside mtk_mac as the indication
to make the difference between the MAC connected to internal
switch or connected to external PHY by the given configuration
on the board and then to perform the corresponding setup on
TRGMII hardware module.
Signed-off-by: Sean Wang <sean.wang@mediatek.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 22 Sep 2016 12:14:59 +0000 (08:14 -0400)]
Merge tag 'rxrpc-rewrite-20160922-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Preparation for slow-start algorithm [ver #2]
Here are some patches that prepare for improvements in ACK generation and
for the implementation of the slow-start part of the protocol:
(1) Stop storing the protocol header in the Tx socket buffers, but rather
generate it on the fly. This potentially saves a little space and
makes it easier to alter the header just before transmission (the
flags may get altered and the serial number has to be changed).
(2) Mask off the Tx buffer annotations and add a flag to record which ones
have already been resent.
(3) Track RTT on a per-peer basis for use in future changes. Tracepoints
are added to log this.
(4) Send PING ACKs in response to incoming calls to elicit a PING-RESPONSE
ACK from which RTT data can be calculated. The response also carries
other useful information.
(5) Expedite PING-RESPONSE ACK generation from sendmsg. If we're actively
using sendmsg, this allows us, under some circumstances, to avoid
having to rely on the background work item to run to generate this
ACK.
This requires ktime_sub_ms() to be added.
(6) Set the REQUEST-ACK flag on some DATA packets to elicit ACK-REQUESTED
ACKs from which RTT data can be calculated.
(7) Limit the use of pings and ACK requests for RTT determination.
Changes:
(V2) Don't use the C division operator for 64-bit division. One instance
should use do_div() and the other should be using nsecs_to_jiffies().
The last two patches got transposed, leading to an undefined symbol
in one of them.
Reported-by: kbuild test robot <lkp@intel.com>
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Wed, 21 Sep 2016 23:29:32 +0000 (00:29 +0100)]
rxrpc: Reduce the number of PING ACKs sent
We don't want to send a PING ACK for every new incoming call as that just
adds to the network traffic. Instead, we send a PING ACK to the first
three that we receive and then once per second thereafter.
This could probably be made adjustable in future.
Signed-off-by: David Howells <dhowells@redhat.com>
David Howells [Wed, 21 Sep 2016 23:29:31 +0000 (00:29 +0100)]
rxrpc: Reduce the number of ACK-Requests sent
Reduce the number of ACK-Requests we set on DATA packets that we're sending
to reduce network traffic. We set the flag on odd-numbered DATA packets to
start off the RTT cache until we have at least three entries in it and then
probe once per second thereafter to keep it topped up.
This could be made tunable in future.
Note that from this point, the RXRPC_REQUEST_ACK flag is set on DATA
packets as we transmit them and not stored statically in the sk_buff.
Signed-off-by: David Howells <dhowells@redhat.com>
David S. Miller [Thu, 22 Sep 2016 07:31:22 +0000 (03:31 -0400)]
Merge branch 'ftgmac100-ast2500-support'
Joel Stanley says:
====================
ftgmac100 support for ast2500
This series adds support to the ftgmac100 driver for the Aspeed ast2400 and
ast2500 SoCs. In particular, they ensure the driver works correctly on the
ast2500 where the MAC block has seen some changes in register layout.
They have been tested on ast2400 and ast2500 systems with the NCSI stack and
with a directly attached PHY.
V2 reworks the two patches relating to PHYSTS_CHG into the one patch that
disables the interrupt instead of playing with interrupt sensitivity. I kept
patch 4 'net/faraday: Clear stale interrupts' which was first introduced to
clear the stale PHYSTS_CHG interrupt, as it helps keep us safe from unhygienic
(vendor) bootloaders.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Joel Stanley [Wed, 21 Sep 2016 23:05:03 +0000 (08:35 +0930)]
net/faraday: Mask out PHYSTS_CHG interrupt
The PHYSTS_CHG (the ftgmac100's PHY IRQ) is telling the system to go
look at the PHY registers for a link status change.
The interrupt was causing issues on Aspeed SoC where some board designs
had an active high configuration, some active low, and in some cases
repurposed for other functions. When misconfigured Linux would chew 100%
of CPU cycles servicing interrupts:
While in the ftgmac100 IP can be configured for high, low and edge
sensitivity the current driver always polls the PHY, so we chose to mask
out the interrupt.
See https://patchwork.ozlabs.org/patch/672099/ for more discussion.
Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Joel Stanley [Wed, 21 Sep 2016 23:05:02 +0000 (08:35 +0930)]
net/faraday: Configure old MDIO interface on Aspeed SoCs
The Aspeed SoCs have a new MDIO interface as an option in the G4 and G5
SoCs. The old one is still available, so select it in order to remain
compatible with the ftgmac100 driver.
Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: David S. Miller <davem@davemloft.net>
There is stale interrupt (PHYSTS_CHG in ISR, bit#6 in 0x0) from
the bootloader (uboot) when enabling the MAC. The stale interrupts
aren't part of kernel and should be cleared.
This clears the stale interrupts in ISR (0x0) when enabling the MAC.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Joel Stanley [Wed, 21 Sep 2016 23:05:00 +0000 (08:35 +0930)]
net/faraday: Adapt for Aspeed SoCs
The RXDES and TXDES registers bits in the ftgmac100 indicates EDO{R,T}R
at bit position 15 for the Faraday Tech IP. However, the version of this
IP present in the Aspeed SoCs has these bits at position 30 in the
registers.
It appers that ast2400 SoCs support both positions, with the 15th bit
marked as reserved but still functional. In the ast2500 this bit is
reused for another function, so we need a work around.
This was confirmed with engineers from Aspeed that using bit 30 is
correct for both the ast2400 and ast2500 SoCs.
Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Jeffery [Wed, 21 Sep 2016 23:04:59 +0000 (08:34 +0930)]
net/faraday: Make EDO{R,T}R bits configurable
These bits are #defined at a fixed location. In order to support future
hardware that has chosen to move these bits around move the bits into a
member of the struct ftgmac100.
Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Jeffery [Wed, 21 Sep 2016 23:04:58 +0000 (08:34 +0930)]
net/faraday: Separate rx page storage from rxdesc
The ftgmac100 hardware revision in e.g. the Aspeed AST2500 no longer
reserves all bits in RXDES#2 but instead uses the bottom 16 bits to
store MAC frame metadata. Avoid corruption by shifting struct page
pointers out to their own member in struct ftgmac100.
Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Wed, 21 Sep 2016 23:29:31 +0000 (00:29 +0100)]
rxrpc: Obtain RTT data by requesting ACKs on DATA packets
In addition to sending a PING ACK to gain RTT data, we can set the
RXRPC_REQUEST_ACK flag on a DATA packet and get a REQUESTED-ACK ACK. The
ACK packet contains the serial number of the packet it is in response to,
so we can look through the Tx buffer for a matching DATA packet.
This requires that the data packets be stamped with the time of
transmission as a ktime rather than having the resend_at time in jiffies.
This further requires the resend code to do the resend determination in
ktimes and convert to jiffies to set the timer.
Signed-off-by: David Howells <dhowells@redhat.com>
David Howells [Wed, 21 Sep 2016 23:29:31 +0000 (00:29 +0100)]
rxrpc: Expedite ping response transmission
Expedite the transmission of a response to a PING ACK by sending it from
sendmsg if one is pending. We're most likely to see a PING ACK during the
client call Tx phase as the other side may use it to determine a number of
parameters, such as the client's receive window size, the RTT and whether
the client is doing slow start (similar to RFC5681).
If we don't expedite it, it's left to the background processing thread to
transmit.
Signed-off-by: David Howells <dhowells@redhat.com>
David Howells [Wed, 21 Sep 2016 23:29:31 +0000 (00:29 +0100)]
rxrpc: Send pings to get RTT data
Send a PING ACK packet to the peer when we get a new incoming call from a
peer we don't have a record for. The PING RESPONSE ACK packet will tell us
the following about the peer:
(1) its receive window size
(2) its MTU sizes
(3) its support for jumbo DATA packets
(4) if it supports slow start (similar to RFC 5681)
(5) an estimate of the RTT
This is necessary because the peer won't normally send us an ACK until it
gets to the Rx phase and we send it a packet, but we would like to know
some of this information before we start sending packets.
A pair of tracepoints are added so that RTT determination can be observed.
Signed-off-by: David Howells <dhowells@redhat.com>
David S. Miller [Thu, 22 Sep 2016 07:13:39 +0000 (03:13 -0400)]
Merge branch 'sctp-align'
Marcelo Ricardo Leitner says:
====================
Rename WORD_TRUNC/ROUND macros and use them
This patchset aims to rename these macros to a non-confusing name, as
reported by David Laight and David Miller, and to update all remaining
places to make use of it, which was 1 last remaining spot.
v3:
- Name it SCTP_PAD4 instead of SCTP_ALIGN4, as suggested by David Laight
v2:
- fixed 2nd patch summary
Details on the specific changelogs.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
To something more meaningful these days, specially because this is
working on packet headers or lengths and which are not tied to any CPU
arch but to the protocol itself.
So, WORD_TRUNC becomes SCTP_TRUNC4 and WORD_ROUND becomes SCTP_PAD4.
Reported-by: David Laight <David.Laight@ACULAB.COM> Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 22 Sep 2016 06:51:48 +0000 (02:51 -0400)]
Merge branch 'mlx5e-xdp'
Tariq Toukan says:
====================
mlx5e XDP support
This series adds XDP support in mlx5e driver.
This includes the use cases: XDP_DROP, XDP_PASS, and XDP_TX.
Single stream performance tests show 16.5 Mpps for XDP_DROP,
and 12.4 Mpps for XDP_TX, with nice scalability for multiple streams/rings.
This rate of XDP_DROP is lower than the 32 Mpps we got in previous
implementation, when Striding RQ was used.
We moved to non-Striding RQ, as some XDP_TX requirements (like headroom,
packet-per-page) cannot be satisfied with the current Striding RQ HW,
and we decided to fully support both DROP/TX.
Few directions are considered in order to enable the faster rate for XDP_DROP,
e.g a possibility for users to enable Striding RQ so they choose optimized
XDP_DROP on the price of partial XDP_TX functionality, or some HW changes.
Series generated against net-next commit: cf714ac147e0 'ipvlan: Fix dependency issue'
Thanks,
Tariq
V2:
* patch 8:
- when XDP_TX fails, call mlx5e_page_release and drop the packet.
- update xdp_tx counter within mlx5e_xmit_xdp_frame.
(mlx5e_xmit_xdp_frame return value becomes obsolete, change it to void)
- drop the packet for unknown XDP return code.
* patch 9:
- use a boolean for xdp_doorbell in SQ struct, instead of dragging it
throughout the functions calls.
- handle doorbell and counters within mlx5e_xmit_xdp_frame.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
*My xmitter was limited to 36.3Mpps, so it is the bottleneck.
It seems that receive side can handle more.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Adding support for XDP_TX forwarding from xdp program.
Using XDP, now user can loop packets out of the same port.
We create a dedicated TX SQ for each channel that will serve
XDP programs that return XDP_TX action to loop packets back to
the wire directly from the channel RQ RX path.
For that RX pages will now need to be mapped bi-directionally,
and on XDP_TX action we will sync the page back to device then
queue it into SQ for transmission. The XDP xmit frame function will
report back to the RX path if the page was consumed (transmitted), if so,
RX path will forget about that page as if it were released to the stack.
Later on, on XDP TX completion, the page will be released back to the
page cache.
For simplicity this patch will hit a doorbell on every XDP TX packet.
Next patch will introduce a xmit more like mechanism that will
queue up more than one packet into SQ w/o notifying the hardware,
once RX napi loop is done we will hit doorbell once for all XDP TX
packets form the previous loop. This should drastically improve
XDP TX performance.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/mlx5e: Have a clear separation between different SQ types
Make a clear separate between Regular SQ (TXQ) and ICO SQ creation,
destruction and union their mutual information structures.
Don't allocate redundant TXQ skb/wqe_info/dma_fifo arrays for ICO SQ.
And have a different SQ edge for ICO SQ than TXQ SQ, to be more
accurate.
In preparation for XDP TX support.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
When XDP is on we make sure to change channels RQs type to
MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
ensure "page per packet".
On XDP set, we fail if HW LRO is set and request from user to turn it
off. Since on ConnectX4-LX HW LRO is always on by default, this will be
annoying, but we prefer not to enforce LRO off from XDP set function.
Full channels reset (close/open) is required only when setting XDP
on/off.
When XDP set is called just to exchange programs, we will update
each RQ xdp program on the fly and for synchronization with current
data path RX activity of that RQ, we temporally disable that RQ and
ensure RX path is not running, quickly update and re-enable that RQ,
for that we do:
- rq.state = disabled
- napi_synnchronize
- xchg(rq->xdp_prg)
- rq.state = enabled
- napi_schedule // Just in case we've missed an IRQ
Packet rate performance testing was done with pktgen 64B packets and on
TX side and, TC drop action on RX side compared to XDP fast drop.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Comparison is done between:
1. Baseline, Before this patch with TC drop action
2. This patch with TC drop action
3. This patch with XDP RX fast drop
RX Cores Baseline(TC drop) TC drop XDP fast Drop
--------------------------------------------------------------
1 5.3Mpps 5.3Mpps 16.5Mpps
2 10.2Mpps 10.2Mpps 31.3Mpps
4 20.5Mpps 19.9Mpps 36.3Mpps*
*My xmitter was limited to 36.3Mpps, so it is the bottleneck.
It seems that receive side can handle more.
Signed-off-by: Rana Shahout <ranas@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Add two helper functions to allow dynamic changes of RQ type.
mlx5e_set_rq_priv_params and mlx5e_set_rq_type_params will be
used on netdev creation to determine the default RQ type.
This will be needed later for downstream patches of XDP support.
When enabling XDP we will dynamically move from striding RQ to
linked list RQ type.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Before this patch LRO size was 64K, now with build_skb requires
extra room, headroom + sizeof(skb_shared_info) added to the data
buffer will make wqe size or page_frag_size slightly larger than
64K which will demand order 5 page instead of order 4 in 4K page systems.
We take those extra bytes from hardware LRO data size in order to not
increase the required page order for when hardware LRO is enabled.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
We have two types of RX RQs, and they use two separate sets of
info arrays and structures in RX data path function. Today those
structures are mutually exclusive per RQ type, hence one kind is
allocated on RQ creation according to the RQ type.
For better cache locality and to minimalize the
sizeof(struct mlx5e_rq), in this patch we define them as a union.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
For non-striding RQ configuration before this patch we had a ring
with pre-allocated SKBs and mapped the SKB->data buffers for
device.
For robustness and better RX data buffers management, we allocate a
page per packet and build_skb around it.
This patch (which is a prerequisite for XDP) will actually reduce
performance for normal stack usage, because we are now hitting a bottleneck
in the page allocator. We use the page-cache to restore or even improve
performance in comparison to the old RX scheme.
Packet rate performance testing was done with pktgen 64B packets on xmit
side and TC ingress dropping action on RX side.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Comparison is done between:
1.Baseline, before 'net/mlx5e: Build RX SKB on demand'
2.Build SKB with RX page cache (This patch)
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 21 Sep 2016 05:45:58 +0000 (22:45 -0700)]
tcp: implement TSQ for retransmits
We saw sch_fq drops caused by the per flow limit of 100 packets and TCP
when dealing with large cwnd and bursts of retransmits.
Even after increasing the limit to 1000, and even after commit 10d3be569243 ("tcp-tso: do not split TSO packets at retransmit time"),
we can still have these drops.
Under certain conditions, TCP can spend a considerable amount of
time queuing thousands of skbs in a single tcp_xmit_retransmit_queue()
invocation, incurring latency spikes and stalls of other softirq
handlers.
This patch implements TSQ for retransmits, limiting number of packets
and giving more chance for scheduling packets in both ways.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 22 Sep 2016 06:20:16 +0000 (02:20 -0400)]
Merge branch 'mv88e6390-prep'
Andrew Lunn says:
====================
Preparation for mv88e6390
These two patches are a couple of preparation steps for supporting the
the MV88E6390 family of chips. This is a new generation from Marvell,
and will need more feature flags than are currently available in an
unsigned long. Expand to an unsigned long long. The MV88E6390 also
places its port registers somewhere else, so add a wrapper around port
register access.
v2:
Rework wrappers to use mv88e6xxx_{read|write}
Simpliy some (err < ) to (err)
Add Reviewed by tag.
Andrew Lunn [Tue, 20 Sep 2016 23:40:32 +0000 (01:40 +0200)]
net: dsa: mv88e6xxx: Convert flag bits to unsigned long long
We are soon going to run out of flag bits on 32bit systems. Convert to
unsigned long long.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Tue, 20 Sep 2016 23:40:31 +0000 (01:40 +0200)]
net: dsa: mv88e6xxx: Add helper for accessing port registers
There is a device coming soon which places its port registers
somewhere different to all other Marvell switches supported so far.
Add helper functions for reading/writing port registers, making it
easier to handle this new device.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Pitre [Tue, 20 Sep 2016 23:25:58 +0000 (19:25 -0400)]
ptp_clock: future-proofing drivers against PTP subsystem becoming optional
Drivers must be ready to accept NULL from ptp_clock_register() if the
PTP clock subsystem is configured out.
This patch documents that and ensures that all drivers cope well
with a NULL return.
Signed-off-by: Nicolas Pitre <nico@linaro.org> Reviewed-by: Eugenia Emantayev <eugenia@mellanox.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Acked-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Philippe Reynes [Tue, 20 Sep 2016 20:30:11 +0000 (22:30 +0200)]
net: ethernet: hisilicon: hns: use phydev from struct net_device
The private structure contain a pointer to phydev, but the structure
net_device already contain such pointer. So we can remove the pointer
phydev in the private structure, and update the driver to use the
one contained in struct net_device.
Signed-off-by: Philippe Reynes <tremyfr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
during merge for conflicts overlapping commits by
Commit b20b378d49926b82c0a131492fa8842156e0e8a9
("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Signed-off-by: Sean Wang <sean.wang@mediatek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 22 Sep 2016 05:40:07 +0000 (01:40 -0400)]
Merge branch 'cxgb4-tc-offload'
Rahul Lakkireddy says:
====================
cxgb4: add support for offloading TC u32 filters
This series of patches add support to offload TC u32 filters onto
Chelsio NICs.
Patch 1 moves current common filter code to separate files
in order to provide a common api for performing packet classification
and filtering in Chelsio NICs.
Patch 2 enables filters for normal NIC configuration and implements
common api for setting and deleting filters.
Patches 3-5 add support for TC u32 offload via ndo_setup_tc.
---
v3:
Based on review and suggestion from David Miller <davem@davemloft.net>
- Fixed all local variable declarations by placing them in longest line
first and shortest line last order.
v2:
Based on review and suggestions from Jiri Pirko <jiri@resnulli.us>:
- Replaced macros S and U with appropriate static helper functions.
- Moved completion code for set and delete filters to respective
functions cxgb4_set_filter() and cxgb4_del_filter(). Renamed the
original functions to __cxgb4_set_filter() and __cxgb4_del_filter()
in case synchronization is not required.
- Dropped debugfs patch.
- Merged code for inserting and deleting u32 filters into a single
patch.
- Reworked and fixed bugs with traversing the actions list.
- Removed all unnecessary extra ().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for dropping matched packets in hardware. Also add support
for re-directing matched packets to a specified port in hardware.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Add support for offloading u32 filter onto hardware. Links are stored
in a jump table to perform necessary jumps to match TCP/UDP header.
When inserting rules in the linked bucket, the TCP/UDP match fields
in the corresponding entry of the jump table are appended to the filter
rule before insertion. If a link is deleted, then all corresponding
filters associated with the link are also deleted. Also enable
hardware tc offload as a supported feature.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
cxgb4: add parser to translate u32 filters to internal spec
Parse information sent by u32 into internal filter specification.
Add support for parsing several fields in IPv4, IPv6, TCP, and UDP.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
cxgb4: add common api support for configuring filters
Enable filters for non-offload configuration and add common api support
for setting and deleting filters in LE-TCAM region of the hardware.
IPv4 filters occupy one slot. IPv6 filters occupy 4 slots and must
be on a 4-slot boundary. IPv4 filters can not occupy a slot belonging
to IPv6 and the vice-versa is also true.
Filters are set and deleted asynchronously. Use completion to wait
for reply from firmware in order to allow for synchronization if needed.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Move common filter code to separate files. Also fix the following
checkpatch checks.
CHECK: Comparison to NULL could be written "!f->l2t"
+ if (f->l2t == NULL) {
CHECK: spaces preferred around that '/' (ctx:VxV)
+ fwr->len16_pkd = htonl(FW_WR_LEN16_V(sizeof(*fwr)/16));
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: skbuff: Coding: Use eth_type_vlan() instead of open coding it
Fix 'skb_vlan_pop' to use eth_type_vlan instead of directly comparing
skb->protocol to ETH_P_8021Q or ETH_P_8021AD.
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Reviewed-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
net: skbuff: Remove errornous length validation in skb_vlan_pop()
In 93515d53b1
"net: move vlan pop/push functions into common code"
skb_vlan_pop was moved from its private location in openvswitch to
skbuff common code.
In case skb has non hw-accel vlan tag, the original 'pop_vlan()' assured
that skb->len is sufficient (if skb->len < VLAN_ETH_HLEN then pop was
considered a no-op).
This validation was moved as is into the new common 'skb_vlan_pop'.
Alas, in its original location (openvswitch), there was a guarantee that
'data' points to the mac_header, therefore the 'skb->len < VLAN_ETH_HLEN'
condition made sense.
However there's no such guarantee in the generic 'skb_vlan_pop'.
For short packets received in rx path going through 'skb_vlan_pop',
this causes 'skb_vlan_pop' to fail pop-ing a valid vlan hdr (in the non
hw-accel case) or to fail moving next tag into hw-accel tag.
Remove the 'skb->len < VLAN_ETH_HLEN' condition entirely:
It is superfluous since inner '__skb_vlan_pop' already verifies there
are VLAN_ETH_HLEN writable bytes at the mac_header.
Note this presents a slight change to skb_vlan_pop() users:
In case total length is smaller than VLAN_ETH_HLEN, skb_vlan_pop() now
returns an error, as opposed to previous "no-op" behavior.
Existing callers (e.g. tc act vlan, ovs) usually drop the packet if
'skb_vlan_pop' fails.
Fixes: 93515d53b1 ("net: move vlan pop/push functions into common code") Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Cc: Pravin Shelar <pshelar@ovn.org> Reviewed-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
TCA_VLAN_ACT_MODIFY allows one to change an existing tag.
It accepts same attributes as TCA_VLAN_ACT_PUSH (protocol, id,
priority).
If packet is vlan tagged, then the tag gets overwritten according to
user specified attributes.
For example, this allows user to replace a tag's vid while preserving
its priority bits (as opposed to "action vlan pop pipe action vlan push").
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
TCA_VLAN_ACT_MODIFY allows one to change an existing tag.
It accepts same attributes as TCA_VLAN_ACT_PUSH (protocol, id,
priority).
If packet is vlan tagged, then the tag gets overwritten according to
user specified attributes.
For example, this allows user to replace a tag's vid while preserving
its priority bits (as opposed to "action vlan pop pipe action vlan push").
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jack Morgenstein [Tue, 20 Sep 2016 11:39:42 +0000 (14:39 +0300)]
net/mlx4_core: Fix deadlock when switching between polling and event fw commands
When switching from polling-based fw commands to event-based fw
commands, there is a race condition which could cause a fw command
in another task to hang: that task will keep waiting for the polling
sempahore, but may never be able to acquire it. This is due to
mlx4_cmd_use_events, which "down"s the sempahore back to 0.
During driver initialization, this is not a problem, since no other
tasks which invoke FW commands are active.
However, there is a problem if the driver switches to polling mode
and then back to event mode during normal operation.
The "test_interrupts" feature does exactly that.
Running "ethtool -t <eth device> offline" causes the PF driver to
temporarily switch to polling mode, and then back to event mode.
(Note that for VF drivers, such switching is not performed).
Fix this by adding a read-write semaphore for protection when
switching between modes.
Fixes: 225c7b1feef1 ("IB/mlx4: Add a driver Mellanox ConnectX InfiniBand adapters") Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kamal Heib [Tue, 20 Sep 2016 11:39:40 +0000 (14:39 +0300)]
net/mlx4_en: Fix wrong indentation
Use tabs instead of spaces before if statement, no functional change.
Fixes: e7c1c2c46201 ("mlx4_en: Added self diagnostics test implementation") Signed-off-by: Kamal Heib <kamalh@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Wed, 21 Sep 2016 23:29:32 +0000 (00:29 +0100)]
rxrpc: Add re-sent Tx annotation
Add a Tx-phase annotation for packet buffers to indicate that a buffer has
already been retransmitted. This will be used by future congestion
management. Re-retransmissions of a packet don't affect the congestion
window managment in the same way as initial retransmissions.
Signed-off-by: David Howells <dhowells@redhat.com>
David Howells [Wed, 21 Sep 2016 23:29:31 +0000 (00:29 +0100)]
rxrpc: Don't store the rxrpc header in the Tx queue sk_buffs
Don't store the rxrpc protocol header in sk_buffs on the transmit queue,
but rather generate it on the fly and pass it to kernel_sendmsg() as a
separate iov. This reduces the amount of storage required.
Note that the security header is still stored in the sk_buff as it may get
encrypted along with the data (and doesn't change with each transmission).
Signed-off-by: David Howells <dhowells@redhat.com>
David S. Miller [Wed, 21 Sep 2016 23:50:15 +0000 (19:50 -0400)]
Merge branch 'bpf-hw-offload'
Jakub Kicinski says:
====================
BPF hardware offload (cls_bpf for now)
Rebased and improved.
v7:
- fix patch 4.
v6 (patch 8 only):
- explicitly check for registers >= MAX_BPF_REG;
- fix leaky error path.
v5:
- fix names of guard defines in bpf_verfier.h.
v4:
- rename parser -> analyzer;
- reorganize the analyzer patches a bit;
- use bitfield.h directly.
--- merge blurb:
In the last year a lot of progress have been made on offloading
simpler TC classifiers. There is also growing interest in using
BPF for generic high-speed packet processing in the kernel.
It seems beneficial to tie those two trends together and think
about hardware offloads of BPF programs. This patch set presents
such offload to Netronome smart NICs. cls_bpf is extended with
hardware offload capabilities and NFP driver gets a JIT translator
which in presence of capable firmware can be used to offload
the BPF program onto the card.
BPF JIT implementation is not 100% complete (e.g. missing instructions)
but it is functional. Encouragingly it should be possible to
offload most (if not all) advanced BPF features onto the NIC -
including packet modification, maps, tunnel encap/decap etc.
Example of basic tests I used:
__section_cls_entry
int cls_entry(struct __sk_buff *skb)
{
if (load_byte(skb, 0) != 0x0)
return 0;
if (load_byte(skb, 4) != 0x1)
return 0;
skb->mark = 0xcafe;
if (load_byte(skb, 50) != 0xff)
return 0;
return ~0U;
}
Above code can be compiled with Clang and loaded like this:
ethtool -K p1p1 hw-tc-offload on
tc qdisc add dev p1p1 ingress
tc filter add dev p1p1 parent ffff: bpf obj prog.o action drop
This set implements the basic transparent offload, the skip_{sw,hw}
flags and reporting statistics for cls_bpf.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 21 Sep 2016 10:44:07 +0000 (11:44 +0100)]
nfp: bpf: add offload of TC direct action mode
Add offload of TC in direct action mode. We just need
to provide appropriate checks in the verifier and
a new outro block to translate the exit codes to what
data path expects
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 21 Sep 2016 10:44:04 +0000 (11:44 +0100)]
nfp: bpf: add packet marking support
Add missing ABI defines and eBPF instructions to allow
mark to be passed on and extend prepend parsing on the
RX path to pick it up from packet metadata.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 21 Sep 2016 10:44:02 +0000 (11:44 +0100)]
net: cls_bpf: allow offloaded filters to update stats
Call into offloaded filters to update stats.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 21 Sep 2016 10:44:01 +0000 (11:44 +0100)]
nfp: bpf: add hardware bpf offload
Add hardware bpf offload on our smart NICs. Detect if
capable firmware is loaded and use it to load the code JITed
with just added translator onto programmable engines.
This commit only supports offloading cls_bpf in legacy mode
(non-direct action).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 21 Sep 2016 10:43:59 +0000 (11:43 +0100)]
bpf: recognize 64bit immediate loads as consts
When running as parser interpret BPF_LD | BPF_IMM | BPF_DW
instructions as loading CONST_IMM with the value stored
in imm. The verifier will continue not recognizing those
due to concerns about search space/program complexity
increase.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 21 Sep 2016 10:43:58 +0000 (11:43 +0100)]
bpf: enable non-core use of the verfier
Advanced JIT compilers and translators may want to use
eBPF verifier as a base for parsers or to perform custom
checks and validations.
Add ability for external users to invoke the verifier
and provide callbacks to be invoked for every intruction
checked. For now only add most basic callback for
per-instruction pre-interpretation checks is added. More
advanced users may also like to have per-instruction post
callback and state comparison callback.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 21 Sep 2016 10:43:57 +0000 (11:43 +0100)]
bpf: expose internal verfier structures
Move verifier's internal structures to a header file and
prefix their names with bpf_ to avoid potential namespace
conflicts. Those structures will soon be used by external
analyzers.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 21 Sep 2016 10:43:56 +0000 (11:43 +0100)]
bpf: don't (ab)use instructions to store state
Storing state in reserved fields of instructions makes
it impossible to run verifier on programs already
marked as read-only. Allocate and use an array of
per-instruction state instead.
While touching the error path rename and move existing
jump target.
Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 21 Sep 2016 10:43:55 +0000 (11:43 +0100)]
net: cls_bpf: add support for marking filters as hardware-only
Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_SW flag.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 21 Sep 2016 10:43:54 +0000 (11:43 +0100)]
net: cls_bpf: limit hardware offload by software-only flag
Add cls_bpf support for the TCA_CLS_FLAGS_SKIP_HW flag.
Unlike U32 and flower cls_bpf already has some netlink
flags defined. Create a new attribute to be able to use
the same flag values as the above.
Unlike U32 and flower reject unknown flags.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Wed, 21 Sep 2016 10:43:53 +0000 (11:43 +0100)]
net: cls_bpf: add hardware offload
This patch adds hardware offload capability to cls_bpf classifier,
similar to what have been done with U32 and flower.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 21 Sep 2016 05:01:09 +0000 (01:01 -0400)]
Merge branch 'mlxse-resource-query'
Jiri Pirko says:
====================
mlxsw: Replace Hw related const with resource query results
Nogah says:
Many of the ASIC's properties can be read from the HW with resources query.
This patchset adds new resources to the resource query and implement
using them, instead of the constants that we currently use.
Those resources are lag, kvd and router related.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
mlxsw: profile: Add KVD resources to profile config
Use resources from resource query to determine values for
the profile configuration.
Add KVD determined section sizes to the resources struct.
Change the profile struct and value to match this changes.
Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 21 Sep 2016 04:23:09 +0000 (00:23 -0400)]
Merge branch 'tcp-bbr'
Neal Cardwell says:
====================
tcp: BBR congestion control algorithm
This patch series implements a new TCP congestion control algorithm:
BBR (Bottleneck Bandwidth and RTT). A paper with a detailed
description of BBR will be published in ACM Queue, September-October
2016, as "BBR: Congestion-Based Congestion Control". BBR is widely
deployed in production at Google.
The patch series starts with a set of supporting infrastructure
changes, including a few that extend the congestion control
framework. The last patch adds BBR as a TCP congestion control
module. Please see individual patches for the details.
- v3 -> v4:
- Updated tcp_bbr.c in "tcp_bbr: add BBR congestion control"
to use const to qualify all the constant parameters.
Thanks to Stephen Hemminger.
- In "tcp_bbr: add BBR congestion control", remove the bbr_rate_kbps()
function, which had a 64-bit divide that would be problematic on some
architectures, and just use bbr_rate_bytes_per_sec() directly.
Thanks to Kenneth Klette Jonassen for suggesting this.
- In "tcp: switch back to proper tcp_skb_cb size check in tcp_init()",
switched from sizeof(skb->cb) to FIELD_SIZEOF.
Thanks to Lance Richardson for suggesting this.
- Updated "tcp_bbr: add BBR congestion control" commit message with
performance data, more details about deployment at Google, and
another reminder to use fq with BBR.
- Updated tcp_bbr.c in "tcp_bbr: add BBR congestion control"
to use MODULE_LICENSE("Dual BSD/GPL").
- v2 -> v3: fix another issue caught by build bots:
- adjust rate_sample struct initialization syntax to allow gcc-4.4 to compile
the "tcp: track data delivery rate for a TCP connection" patch; also
adjusted some similar syntax in "tcp_bbr: add BBR congestion control"
- v1 -> v2: fix issues caught by build bots:
- fix "tcp: export data delivery rate" to use rate64 instead of rate,
so there is a 64-bit numerator for the do_div call
- fix conflicting definitions for minmax caused by
"tcp: use windowed min filter library for TCP min_rtt estimation"
with a new commit:
tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict
- fix warning about the use of __packed in
"tcp: track data delivery rate for a TCP connection",
which involves the addition of a new commit:
tcp: switch back to proper tcp_skb_cb size check in tcp_init()
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This commit implements a new TCP congestion control algorithm: BBR
(Bottleneck Bandwidth and RTT). A detailed description of BBR will be
published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
"BBR: Congestion-Based Congestion Control".
BBR has significantly increased throughput and reduced latency for
connections on Google's internal backbone networks and google.com and
YouTube Web servers.
BBR requires only changes on the sender side, not in the network or
the receiver side. Thus it can be incrementally deployed on today's
Internet, or in datacenters.
The Internet has predominantly used loss-based congestion control
(largely Reno or CUBIC) since the 1980s, relying on packet loss as the
signal to slow down. While this worked well for many years, loss-based
congestion control is unfortunately out-dated in today's networks. On
today's Internet, loss-based congestion control causes the infamous
bufferbloat problem, often causing seconds of needless queuing delay,
since it fills the bloated buffers in many last-mile links. On today's
high-speed long-haul links using commodity switches with shallow
buffers, loss-based congestion control has abysmal throughput because
it over-reacts to losses caused by transient traffic bursts.
In 1981 Kleinrock and Gale showed that the optimal operating point for
a network maximizes delivered bandwidth while minimizing delay and
loss, not only for single connections but for the network as a
whole. Finding that optimal operating point has been elusive, since
any single network measurement is ambiguous: network measurements are
the result of both bandwidth and propagation delay, and those two
cannot be measured simultaneously.
While it is impossible to disambiguate any single bandwidth or RTT
measurement, a connection's behavior over time tells a clearer
story. BBR uses a measurement strategy designed to resolve this
ambiguity. It combines these measurements with a robust servo loop
using recent control systems advances to implement a distributed
congestion control algorithm that reacts to actual congestion, not
packet loss or transient queue delay, and is designed to converge with
high probability to a point near the optimal operating point.
In a nutshell, BBR creates an explicit model of the network pipe by
sequentially probing the bottleneck bandwidth and RTT. On the arrival
of each ACK, BBR derives the current delivery rate of the last round
trip, and feeds it through a windowed max-filter to estimate the
bottleneck bandwidth. Conversely it uses a windowed min-filter to
estimate the round trip propagation delay. The max-filtered bandwidth
and min-filtered RTT estimates form BBR's model of the network pipe.
Using its model, BBR sets control parameters to govern sending
behavior. The primary control is the pacing rate: BBR applies a gain
multiplier to transmit faster or slower than the observed bottleneck
bandwidth. The conventional congestion window (cwnd) is now the
secondary control; the cwnd is set to a small multiple of the
estimated BDP (bandwidth-delay product) in order to allow full
utilization and bandwidth probing while bounding the potential amount
of queue at the bottleneck.
When a BBR connection starts, it enters STARTUP mode and applies a
high gain to perform an exponential search to quickly probe the
bottleneck bandwidth (doubling its sending rate each round trip, like
slow start). However, instead of continuing until it fills up the
buffer (i.e. a loss), or until delay or ACK spacing reaches some
threshold (like Hystart), it uses its model of the pipe to estimate
when that pipe is full: it estimates the pipe is full when it notices
the estimated bandwidth has stopped growing. At that point it exits
STARTUP and enters DRAIN mode, where it reduces its pacing rate to
drain the queue it estimates it has created.
Then BBR enters steady state. In steady state, PROBE_BW mode cycles
between first pacing faster to probe for more bandwidth, then pacing
slower to drain any queue that created if no more bandwidth was
available, and then cruising at the estimated bandwidth to utilize the
pipe without creating excess queue. Occasionally, on an as-needed
basis, it sends significantly slower to probe for RTT (PROBE_RTT
mode).
BBR has been fully deployed on Google's wide-area backbone networks
and we're experimenting with BBR on Google.com and YouTube on a global
scale. Replacing CUBIC with BBR has resulted in significant
improvements in network latency and application (RPC, browser, and
video) metrics. For more details please refer to our upcoming ACM
Queue publication.
Example performance results, to illustrate the difference between BBR
and CUBIC:
Resilience to random loss (e.g. from shallow buffers):
Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).
Low latency with the bloated buffers common in today's last-mile links:
Consider a netperf TCP_STREAM test lasting 120 secs on an emulated
path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck
buffer. Both fully utilize the bottleneck bandwidth, but BBR
achieves this with a median RTT 25x lower (43 ms instead of 1.09
secs).
Our long-term goal is to improve the congestion control algorithms
used on the Internet. We are hopeful that BBR can help advance the
efforts toward this goal, and motivate the community to do further
research.
Test results, performance evaluations, feedback, and BBR-related
discussions are very welcome in the public e-mail list for BBR:
https://groups.google.com/forum/#!forum/bbr-dev
NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
enabled, since pacing is integral to the BBR design and
implementation. BBR without pacing would not function properly, and
may incur unnecessary high packet loss rates.
Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: new CC hook to set sending rate with rate_sample in any CA state
This commit introduces an optional new "omnipotent" hook,
cong_control(), for congestion control modules. The cong_control()
function is called at the end of processing an ACK (i.e., after
updating sequence numbers, the SACK scoreboard, and loss
detection). At that moment we have precise delivery rate information
the congestion control module can use to control the sending behavior
(using cwnd, TSO skb size, and pacing rate) in any CA state.
This function can also be used by a congestion control that prefers
not to use the default cwnd reduction approach (i.e., the PRR
algorithm) during CA_Recovery to control the cwnd and sending rate
during loss recovery.
We take advantage of the fact that recent changes defer the
retransmission or transmission of new data (e.g. by F-RTO) in recovery
until the new tcp_cong_control() function is run.
With this commit, we only run tcp_update_pacing_rate() if the
congestion control is not using this new API. New congestion controls
which use the new API do not want the TCP stack to run the default
pacing rate calculation and overwrite whatever pacing rate they have
chosen at initialization time.
Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: allow congestion control to expand send buffer differently
Currently the TCP send buffer expands to twice cwnd, in order to allow
limited transmits in the CA_Recovery state. This assumes that cwnd
does not increase in the CA_Recovery.
For some congestion control algorithms, like the upcoming BBR module,
if the losses in recovery do not indicate congestion then we may
continue to raise cwnd multiplicatively in recovery. In such cases the
current multiplier will falsely limit the sending rate, much as if it
were limited by the application.
This commit adds an optional congestion control callback to use a
different multiplier to expand the TCP send buffer. For congestion
control modules that do not specificy this callback, TCP continues to
use the previous default of 2.
Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: export tcp_tso_autosize() and parameterize minimum number of TSO segments
To allow congestion control modules to use the default TSO auto-sizing
algorithm as one of the ingredients in their own decision about TSO sizing:
1) Export tcp_tso_autosize() so that CC modules can use it.
2) Change tcp_tso_autosize() to allow callers to specify a minimum
number of segments per TSO skb, in case the congestion control
module has a different notion of the best floor for TSO skbs for
the connection right now. For very low-rate paths or policed
connections it can be appropriate to use smaller TSO skbs.
Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: allow congestion control module to request TSO skb segment count
Add the tso_segs_goal() function in tcp_congestion_ops to allow the
congestion control module to specify the number of segments that
should be in a TSO skb sent by tcp_write_xmit() and
tcp_xmit_retransmit_queue(). The congestion control module can either
request a particular number of segments in TSO skb that we transmit,
or return 0 if it doesn't care.
This allows the upcoming BBR congestion control module to select small
TSO skb sizes if the module detects that the bottleneck bandwidth is
very low, or that the connection is policed to a low rate.
Signed-off-by: Van Jacobson <vanj@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>