Saeed Mahameed [Thu, 17 Nov 2016 11:45:59 +0000 (13:45 +0200)]
net/mlx5: Set driver version infrastructure
Add driver_version capability bit is enabled, and set driver
version command in mlx5_ifc firmware header. The only purpose
of this command is to store a driver version/OS string in FW
to be reported and displayed in various management systems,
such as IPMI/BMC.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Huy Nguyen <huyn@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huy Nguyen [Thu, 17 Nov 2016 11:45:56 +0000 (13:45 +0200)]
net/mlx5: Port module event hardware structures
Add hardware structures and constants definitions needed for module
events support.
Signed-off-by: Huy Nguyen <huyn@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 18 Nov 2016 16:55:39 +0000 (11:55 -0500)]
Merge branch 'sfc-tso-v2'
Edward Cree says:
====================
sfc: Firmware-Assisted TSO version 2
The firmware on 8000 series SFC NICs supports a new TSO API ("FATSOv2"), and
7000 series NICs will also support this in an imminent release. This series
adds driver support for this TSO implementation.
The series also removes SWTSO, as it's now equivalent to GSO. This does not
actually remove very much code, because SWTSO was grotesquely intertwingled
with FATSOv1, which will also be removed once 7000 series supports FATSOv2.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Thu, 17 Nov 2016 10:52:36 +0000 (10:52 +0000)]
sfc: remove Software TSO
It gives no advantage over GSO now that xmit_more exists. If we find
ourselves unable to handle a TSO skb (because our TXQ doesn't have a
TSOv2 context and the NIC doesn't support TSOv1), hand it back to GSO.
Also do that if the TSO handler fails with EINVAL for any other reason.
As Falcon-architecture NICs don't support any firmware-assisted TSO,
they no longer advertise TSO feature flags at all.
Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Bert Kenward [Thu, 17 Nov 2016 10:51:54 +0000 (10:51 +0000)]
sfc: Firmware-Assisted TSO version 2
Add support for FATSOv2 to the driver. FATSOv2 offloads far more of the task
of TCP segmentation to the firmware, such that we now just pass a single
super-packet to the NIC. This means TSO has a great deal in common with a
normal DMA transmit, apart from adding a couple of option descriptors.
NIC-specific checks have been moved off the fast path and in to
initialisation where possible.
This also moves FATSOv1/SWTSO to a new file (tx_tso.c). The end of transmit
and some error handling is now outside TSO, since it is common with other
code.
Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexey Dobriyan [Thu, 17 Nov 2016 01:58:21 +0000 (04:58 +0300)]
netns: make struct pernet_operations::id unsigned int
Make struct pernet_operations::id unsigned.
There are 2 reasons to do so:
1)
This field is really an index into an zero based array and
thus is unsigned entity. Using negative value is out-of-bound
access by definition.
2)
On x86_64 unsigned 32-bit data which are mixed with pointers
via array indexing or offsets added or subtracted to pointers
are preffered to signed 32-bit data.
"int" being used as an array index needs to be sign-extended
to 64-bit before being used.
void f(long *p, int i)
{
g(p[i]);
}
roughly translates to
movsx rsi, esi
mov rdi, [rsi+...]
call g
MOVSX is 3 byte instruction which isn't necessary if the variable is
unsigned because x86_64 is zero extending by default.
Now, there is net_generic() function which, you guessed it right, uses
"int" as an array index:
static inline void *net_generic(const struct net *net, int id)
{
...
ptr = ng->ptr[id - 1];
...
}
And this function is used a lot, so those sign extensions add up.
Patch snipes ~1730 bytes on allyesconfig kernel (without all junk
messing with code generation):
Unfortunately some functions actually grow bigger.
This is a semmingly random artefact of code generation with register
allocator being used differently. gcc decides that some variable
needs to live in new r8+ registers and every access now requires REX
prefix. Or it is shifted into r12, so [r12+0] addressing mode has to be
used which is longer than [r8]
However, overall balance is in negative direction:
David S. Miller [Thu, 17 Nov 2016 18:35:19 +0000 (13:35 -0500)]
Merge branch 'rds-ha-failover-fixes'
Sowmini Varadhan says:
====================
RDS: TCP: HA/Failover fixes
This series contains a set of fixes for bugs exposed when
we ran the following in a loop between a test machine pair:
while (1); do
# modprobe rds-tcp on test nodes
# run rds-stress in bi-dir mode between test machine pair
# modprobe -r rds-tcp on test nodes
done
rds-stress in bi-dir mode will cause both nodes to initiate
RDS-TCP connections at almost the same instant, exposing the
bugs fixed in this series.
Without the fixes, rds-stress reports sporadic packet drops,
and packets arriving out of sequence. After the fixes,we have
been able to run the test overnight, without any issues.
Each patch has a detailed description of the root-cause fixed
by the patch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Wed, 16 Nov 2016 21:29:50 +0000 (13:29 -0800)]
RDS: TCP: Force every connection to be initiated by numerically smaller IP address
When 2 RDS peers initiate an RDS-TCP connection simultaneously,
there is a potential for "duelling syns" on either/both sides.
See commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()") for a description of this
condition, and the arbitration logic which ensures that the
numerically large IP address in the TCP connection is bound to the
RDS_TCP_PORT ("canonical ordering").
The rds_connection should not be marked as RDS_CONN_UP until the
arbitration logic has converged for the following reason. The sender
may start transmitting RDS datagrams as soon as RDS_CONN_UP is set,
and since the sender removes all datagrams from the rds_connection's
cp_retrans queue based on TCP acks. If the TCP ack was sent from
a tcp socket that got reset as part of duel aribitration (but
before data was delivered to the receivers RDS socket layer),
the sender may end up prematurely freeing the datagram, and
the datagram is no longer reliably deliverable.
This patch remedies that condition by making sure that, upon
receipt of 3WH completion state change notification of TCP_ESTABLISHED
in rds_tcp_state_change, we mark the rds_connection as RDS_CONN_UP
if, and only if, the IP addresses and ports for the connection are
canonically ordered. In all other cases, rds_tcp_state_change will
force an rds_conn_path_drop(), and rds_queue_reconnect() on
both peers will restart the connection to ensure canonical ordering.
A side-effect of enforcing this condition in rds_tcp_state_change()
is that rds_tcp_accept_one_path() can now be refactored for simplicity.
It is also no longer possible to encounter an RDS_CONN_UP connection in
the arbitration logic in rds_tcp_accept_one().
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Wed, 16 Nov 2016 21:29:49 +0000 (13:29 -0800)]
RDS: TCP: Track peer's connection generation number
The RDS transport has to be able to distinguish between
two types of failure events:
(a) when the transport fails (e.g., TCP connection reset)
but the RDS socket/connection layer on both sides stays
the same
(b) when the peer's RDS layer itself resets (e.g., due to module
reload or machine reboot at the peer)
In case (a) both sides must reconnect and continue the RDS messaging
without any message loss or disruption to the message sequence numbers,
and this is achieved by rds_send_path_reset().
In case (b) we should reset all rds_connection state to the
new incarnation of the peer. Examples of state that needs to
be reset are next expected rx sequence number from, or messages to be
retransmitted to, the new incarnation of the peer.
To achieve this, the RDS handshake probe added as part of
commit 5916e2c1554f ("RDS: TCP: Enable multipath RDS for TCP")
is enhanced so that sender and receiver of the RDS ping-probe
will add a generation number as part of the RDS_EXTHDR_GEN_NUM
extension header. Each peer stores local and remote generation
numbers as part of each rds_connection. Changes in generation
number will be detected via incoming handshake probe ping
request or response and will allow the receiver to reset rds_connection
state.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Wed, 16 Nov 2016 21:29:48 +0000 (13:29 -0800)]
RDS: TCP: set RDS_FLAG_RETRANSMITTED in cp_retrans list
As noted in rds_recv_incoming() sequence numbers on data packets
can decreas for the failover case, and the Rx path is equipped
to recover from this, if the RDS_FLAG_RETRANSMITTED is set
on the rds header of an incoming message with a suspect sequence
number.
The RDS_FLAG_RETRANSMITTED is predicated on the RDS_FLAG_RETRANSMITTED
flag in the rds_message, so make sure the flag is set on messages
queued for retransmission.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
LABBE Corentin [Wed, 16 Nov 2016 19:09:41 +0000 (20:09 +0100)]
net: stmmac: replace if (netif_msg_type) by their netif_xxx counterpart
As sugested by Joe Perches, we could replace all
if (netif_msg_type(priv)) dev_xxx(priv->devices, ...)
by the simpler macro netif_xxx(priv, hw, priv->dev, ...)
Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
LABBE Corentin [Wed, 16 Nov 2016 19:09:39 +0000 (20:09 +0100)]
net: stmmac: replace all pr_xxx by their netdev_xxx counterpart
The stmmac driver use lots of pr_xxx functions to print information.
This is bad since we cannot know which device logs the information.
(moreover if two stmmac device are present)
Furthermore, it seems that it assumes wrongly that all logs will always
be subsequent by using a dev_xxx then some indented pr_xxx like this:
kernel: sun7i-dwmac 1c50000.ethernet: no reset control found
kernel: Ring mode enabled
kernel: No HW DMA feature register supported
kernel: Normal descriptors
kernel: TX Checksum insertion supported
So this patch replace all pr_xxx by their netdev_xxx counterpart.
Excepts for some printing where netdev "cause" unpretty output like:
sun7i-dwmac 1c50000.ethernet (unnamed net_device) (uninitialized): no reset control found
In those case, I keep dev_xxx.
In the same time I remove some "stmmac:" print since
this will be a duplicate with that dev_xxx displays.
Signed-off-by: Corentin Labbe <clabbe.montjoie@gmail.com> Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 17 Nov 2016 17:48:30 +0000 (09:48 -0800)]
net_sched: sch_fq: use hash_ptr()
When I wrote sch_fq.c, hash_ptr() on 64bit arches was awful,
and I chose hash_32().
Linus Torvalds and George Spelvin fixed this issue, so we can
use hash_ptr() to get more entropy on 64bit arches with Terabytes
of memory, and avoid the cast games.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 16 Nov 2016 14:21:34 +0000 (06:21 -0800)]
net/mlx5e: remove napi_hash_del() calls
Calling napi_hash_del() after netif_napi_del() is pointless.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 16 Nov 2016 13:49:22 +0000 (05:49 -0800)]
net/mlx4_en: remove napi_hash_del() call
There is no need calling napi_hash_del()+synchronize_rcu() before
calling netif_napi_del()
netif_napi_del() does this already.
Using napi_hash_del() in a driver is useful only when dealing with
a batch of NAPI structures, so that a single synchronize_rcu() can
be used. mlx4_en_deactivate_cq() is deactivating a single NAPI.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 17 Nov 2016 04:29:05 +0000 (23:29 -0500)]
Merge branch 'mlxsw-i2c'
Jiri Pirko says:
====================
mlxsw: Introduce support for I2C bus
Vadim says:
This patchset adds I2C access support for SwitchX, SwitchX2, SwitchIB,
SwitchIB2 and Spectrum silicones.
It contains:
- Small changes in mlxsw core code, needed for I2C bus support;
- I2C driver, which obtains I2C input/output mailboxes setting and
provides command interface implementation.
- Minimal driver, which works on top of I2C driver and allows running
of mlxsw command interface over I2C bus;
Use case:
On system, which does not have PCI to ASIC (BMC), hwmon functionality
(sensors, pwm, tacho) will be available through I2C.
Vadim Pasternak [Wed, 16 Nov 2016 14:20:45 +0000 (15:20 +0100)]
mlxsw: Invoke driver's init/fini methods only if defined
We are going to add a minimal driver on top of the mlxsw core
infrastructure, which will be mainly used for hardware monitoring in
Baseboard management controller (BMC) installations.
Unlike the switch drivers (e.g., spectrum, switchx2), this driver does not
initialize the ASIC and therefore doesn't need to implement the init() and
fini() methods in its 'mlxsw_driver' struct.
Signed-off-by: Vadim Pasternak <vadimp@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vadim Pasternak [Wed, 16 Nov 2016 14:20:44 +0000 (15:20 +0100)]
mlxsw: Introduce support for I2C bus
Add I2C bus implementation for Mellanox Technologies Switch ASICs.
This includes command interface implementation using input / out mailboxes,
whose location is retrieved from the firmware during probe time.
Signed-off-by: Vadim Pasternak <vadimp@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vadim Pasternak [Wed, 16 Nov 2016 14:20:43 +0000 (15:20 +0100)]
mlxsw: Add bus capability flag
The mlxsw core infrastructure currently assumes that communication with
the ASIC is always possible using Ethernet management datagrams (EMADs),
but this is only possible when the PCI bus is used.
The bus capability flag is added to indicate EMAD support and make core
initialize EMAD communication only when it's set. Otherwise, register
access is done using command interface.
Signed-off-by: Vadim Pasternak <vadimp@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Julia Lawall [Wed, 16 Nov 2016 10:43:33 +0000 (11:43 +0100)]
net: netcp: replace IS_ERR_OR_NULL by IS_ERR
knav_queue_open always returns an ERR_PTR value, never NULL. This can be
confirmed by unfolding the function calls and conforms to the function's
documentation. Thus, replace IS_ERR_OR_NULL by IS_ERR in error checks.
The change is made using the following semantic patch:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression x;
statement S;
@@
x = knav_queue_open(...);
if (
- IS_ERR_OR_NULL
+ IS_ERR
(x)) S
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 15 Nov 2016 15:23:11 +0000 (23:23 +0800)]
sctp: use new rhlist interface on sctp transport rhashtable
Now sctp transport rhashtable uses hash(lport, dport, daddr) as the key
to hash a node to one chain. If in one host thousands of assocs connect
to one server with the same lport and different laddrs (although it's
not a normal case), all the transports would be hashed into the same
chain.
It may cause to keep returning -EBUSY when inserting a new node, as the
chain is too long and sctp inserts a transport node in a loop, which
could even lead to system hangs there.
The new rhlist interface works for this case that there are many nodes
with the same key in one chain. It puts them into a list then makes this
list be as a node of the chain.
This patch is to replace rhashtable_ interface with rhltable_ interface.
Since a chain would not be too long and it would not return -EBUSY with
this fix when inserting a node, the reinsert loop is also removed here.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Thu, 17 Nov 2016 02:13:08 +0000 (21:13 -0500)]
bnxt_en: Enhance autoneg support.
On some dual port NICs, the speed setting on one port can affect the
available speed on the other port. Add logic to detect these changes
and adjust the advertised speed settings when necessary.
Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 16 Nov 2016 22:54:50 +0000 (14:54 -0800)]
netpoll: more efficient locking
Callers of netpoll_poll_lock() own NAPI_STATE_SCHED
Callers of netpoll_poll_unlock() have BH blocked between
the NAPI_STATE_SCHED being cleared and poll_lock is released.
We can avoid the spinlock which has no contention, and use cmpxchg()
on poll_owner which we need to set anyway.
This removes a possible lockdep violation after the cited commit,
since sk_busy_loop() re-enables BH before calling busy_poll_stop()
Fixes: 217f69743681 ("net: busy-poll: allow preemption in sk_busy_loop()") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Rafal Ozieblo [Wed, 16 Nov 2016 10:02:34 +0000 (10:02 +0000)]
cadence: Add LSO support.
New Cadence GEM hardware support Large Segment Offload (LSO):
TCP segmentation offload (TSO) as well as UDP fragmentation
offload (UFO). Support for those features was added to the driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
The netdev->real_num_rx_queues setting is only available if CONFIG_SYSFS
is enabled, so we now get a build failure when that is turned off:
netronome/nfp/nfp_net_common.c: In function 'nfp_net_ring_swap_enable':
netronome/nfp/nfp_net_common.c:2489:18: error: 'struct net_device' has no member named 'real_num_rx_queues'; did you mean 'real_num_tx_queues'?
As far as I can tell, the check here is only used as an optimization that
we can skip in order to fix the compilation. If sysfs is disabled,
the following netif_set_real_num_rx_queues() has no effect.
Fixes: 164d1e9e5d52 ("nfp: add support for ethtool .set_channels") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 16 Nov 2016 14:01:47 +0000 (06:01 -0800)]
sfc: remove napi_hash_del() call
Calling napi_hash_del() after netif_napi_del() is pointless.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Edward Cree <ecree@solarflare.com> Cc: Bert Kenward <bkenward@solarflare.com> Acked-by: Bert Kenward <bkenward@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Lebrun [Wed, 16 Nov 2016 09:05:46 +0000 (10:05 +0100)]
lwtunnel: subtract tunnel headroom from mtu on output redirect
This patch changes the lwtunnel_headroom() function which is called
in ipv4_mtu() and ip6_mtu(), to also return the correct headroom
value when the lwtunnel state is OUTPUT_REDIRECT.
This patch enables e.g. SR-IPv6 encapsulations to work without
manually setting the route mtu.
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Wed, 16 Nov 2016 08:51:58 +0000 (09:51 +0100)]
mlxsw: spectrum_router: Adjust placement of FIB abort warning
The recent merge commit bb598c1b8c9b ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") would cause
the FIB abort warning to fire whenever we flush the FIB tables - either
during module removal or actual abort.
Move it back to its rightful location in the FIB abort function.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Wed, 16 Nov 2016 03:26:48 +0000 (04:26 +0100)]
net: dsa: mv88e6xxx: Respect SPEED_UNFORCED, don't set force bit
The SPEED_UNFORCED indicates the MAC & PHY should perform
auto-negotiation to determine a speed which works. If this is called
for, don't set the force bit. If it is set, the MAC actually does
10Gbps, why the internal PHYs don't support.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Tue, 15 Nov 2016 22:11:15 +0000 (16:11 -0600)]
amd-xgbe: Fix maximum GPIO value check
The GPIO support in the hardware allows for up to 16 GPIO pins, enumerated
from 0 to 15. The driver uses the wrong value (16) to validate the GPIO
pin range in the routines to set and clear the GPIO output pins. Update
the code to use the correct value (15).
Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Tue, 15 Nov 2016 22:11:05 +0000 (16:11 -0600)]
amd-xgbe: Fix possible uninitialized variable
The debugfs support in the driver uses a common routine to write the
debugfs values. In this routine, if the input file position is non-zero
then the write routine will not return an error and an output parameter
will not have been set. Because an error isn't returned an uninitialized
value will be written into a register.
Fix the common write routine to return an error if the input file position
is non-zero, which will propagate back to the caller.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 Nov 2016 18:44:02 +0000 (13:44 -0500)]
Merge branch 'nway-reset'
Florian Fainelli says:
====================
net: Implenent ethtool::nway_reset for a few drivers
This patch series depends on "net: phy: Centralize auto-negotation restart"
since it provides phy_ethtool_nway_reset as a helper function.
The drivers here already support PHYLIB, so there really is no reason why
restarting auto-negotiation would not be possible with these.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Tue, 15 Nov 2016 19:19:46 +0000 (11:19 -0800)]
net: ethoc: Implement ethtool::nway_reset
Implement ethtool::nway_reset using phy_ethtool_nway_reset. We are
already using dev->phydev all over the place so this comes for free.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
net: busy-poll: allow preemption and other optimizations
It is time to have preemption points in sk_busy_loop() and improve
its scalability.
Also napi_complete() and friends can tell drivers when it is safe to
not re-enable device interrupts, saving some overhead under
high busy polling.
mlx4 and bnx2x are changed accordingly, to show how this busy polling
status can be exploited by drivers.
Next steps will implement Zach Brown suggestion, where NAPI polling
would be enabled all the time for some chosen queues.
This is needed for efficient epoll() support anyway.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Nov 2016 18:15:15 +0000 (10:15 -0800)]
bnx2x: switch to napi_complete_done()
Switch from napi_complete() to napi_complete_done()
for better GRO support (gro_flush_timeout) and core NAPI
features.
Do not rearm interrupts if we are busy polling,
to reduce bus and interrupts overhead.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Adam Belay <abelay@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Yuval Mintz <Yuval.Mintz@cavium.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Nov 2016 18:15:14 +0000 (10:15 -0800)]
net/mlx4_en: use napi_complete_done() return value
Do not rearm interrupts if we are busy polling.
mlx4 uses separate CQ for TX and RX, so number of TX interrupts
does not change, unfortunately.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Adam Belay <abelay@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Yuval Mintz <Yuval.Mintz@cavium.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Nov 2016 18:15:13 +0000 (10:15 -0800)]
net: busy-poll: return busypolling status to drivers
NAPI drivers use napi_complete_done() or napi_complete() when
they drained RX ring and right before re-enabling device interrupts.
In busy polling, we can avoid interrupts being delivered since
we are polling RX ring in a controlled loop.
Drivers can chose to use napi_complete_done() return value
to reduce interrupts overhead while busy polling is active.
This is optional, legacy drivers should work fine even
if not updated.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Adam Belay <abelay@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Yuval Mintz <Yuval.Mintz@cavium.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Nov 2016 18:15:12 +0000 (10:15 -0800)]
net: busy-poll: remove need_resched() from sk_can_busy_loop()
Now sk_busy_loop() can schedule by itself, we can remove
need_resched() check from sk_can_busy_loop()
Also add a const to its struct sock parameter.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Adam Belay <abelay@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Yuval Mintz <Yuval.Mintz@cavium.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 15 Nov 2016 18:15:11 +0000 (10:15 -0800)]
net: busy-poll: allow preemption in sk_busy_loop()
After commit 4cd13c21b207 ("softirq: Let ksoftirqd do its job"),
sk_busy_loop() needs a bit of care :
softirqs might be delayed since we do not allow preemption yet.
This patch adds preemptiom points in sk_busy_loop(),
and makes sure no unnecessary cache line dirtying
or atomic operations are done while looping.
A new flag is added into napi->state : NAPI_STATE_IN_BUSY_POLL
This prevents napi_complete_done() from clearing NAPIF_STATE_SCHED,
so that sk_busy_loop() does not have to grab it again.
Similarly, netpoll_poll_lock() is done one time.
This gives about 10 to 20 % improvement in various busy polling
tests, especially when many threads are busy polling in
configurations with large number of NIC queues.
This should allow experimenting with bigger delays without
hurting overall latencies.
Tested:
On a 40Gb mlx4 NIC, 32 RX/TX queues.
echo 70 >/proc/sys/net/core/busy_read
for i in `seq 1 40`; do echo -n $i: ; ./super_netperf $i -H lpaa24 -t UDP_RR -- -N -n; done
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Adam Belay <abelay@google.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Yuval Mintz <Yuval.Mintz@cavium.com> Cc: Ariel Elior <ariel.elior@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Tue, 15 Nov 2016 19:00:04 +0000 (11:00 -0800)]
bpf: Fix compilation warning in __bpf_lru_list_rotate_inactive
gcc-6.2.1 gives the following warning:
kernel/bpf/bpf_lru_list.c: In function ‘__bpf_lru_list_rotate_inactive.isra.3’:
kernel/bpf/bpf_lru_list.c:201:28: warning: ‘next’ may be used uninitialized in this function [-Wmaybe-uninitialized]
The "next" is currently initialized in the while() loop which must have >=1
iterations.
This patch initializes next to get rid of the compiler warning.
Fixes: 3a08c2fd7634 ("bpf: LRU List") Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David Lebrun [Tue, 15 Nov 2016 15:14:04 +0000 (16:14 +0100)]
ipv6: sr: add option to control lwtunnel support
This patch adds a new option CONFIG_IPV6_SEG6_LWTUNNEL to enable/disable
support of encapsulation with the lightweight tunnels. When this option
is enabled, CONFIG_LWTUNNEL is automatically selected.
Fix commit 6c8702c60b88 ("ipv6: sr: add support for SRH encapsulation and injection with lwtunnels")
Without a proper option to control lwtunnel support for SR-IPv6, if
CONFIG_LWTUNNEL=n then the IPv6 initialization fails as a consequence
of seg6_iptunnel_init() failure with EOPNOTSUPP:
NET: Registered protocol family 10
IPv6: Attempt to unregister permanent protocol 6
IPv6: Attempt to unregister permanent protocol 136
IPv6: Attempt to unregister permanent protocol 17
NET: Unregistered protocol family 10
Tested (compiling, booting, and loading ipv6 module when relevant)
with possible combinations of CONFIG_IPV6={y,m,n},
CONFIG_IPV6_SEG6_LWTUNNEL={y,n} and CONFIG_LWTUNNEL={y,n}.
Reported-by: Lorenzo Colitti <lorenzo@google.com> Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 Nov 2016 03:46:31 +0000 (22:46 -0500)]
Merge branch 'alx-multiqueue-support'
Tobias Regnery says:
====================
alx: add multi queue support
This patchset lays the groundwork for multi queue support in the alx driver
and enables multi queue support for the tx path by default. The hardware
supports up to 4 tx queues.
Benefits are better utilization of multi core cpus and the usage of the
msi-x support by default which splits the handling of rx / tx and misc
other interrupts.
The rx path is a little bit harder because apparently (based on the limited
information from the downstream driver) the hardware supports up to 8 rss
queues but only has one hardware descriptor ring on the rx side. So the rx
path will be part of another patchset.
Tested on my AR8161 ethernet adapter with different tests:
- there are no regressions observed during my daily usage
- iperf tcp and udp tests shows no performance regressions
- netperf TCP_RR and UDP_RR shows a slight performance increase of about
1-2% with this patchset applied
This work is based on the downstream driver at github.com/qca/alx
Changes in V2:
- drop unneeded casts in alx_alloc_rx_ring (Patch 1)
- add additional information about testing and benefit to the
changelog
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tobias Regnery [Tue, 15 Nov 2016 11:43:15 +0000 (12:43 +0100)]
alx: enable msi-x interrupts by default
Remove the module parameter to enable msi-x support and enable msi-x
interrupts unconditionally by default. This is a preparatory step to enable
multi queue support by default, because this is only working with msi-x
interrupts.
Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tobias Regnery [Tue, 15 Nov 2016 11:43:14 +0000 (12:43 +0100)]
alx: prepare tx path for multi queue support
This patch prepares the tx path to send data on multiple tx queues. It
introduces per queue register adresses and uses them in the alx_tx_queue
structs.
There are new helper functions for the queue mapping in the tx path.
Based on the downstream driver at github.com/qca/alx
Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tobias Regnery [Tue, 15 Nov 2016 11:43:13 +0000 (12:43 +0100)]
alx: prepare resource allocation for multi queue support
Allocate, initialise and free alx_tx_queue structs based on the number of
alx_napi structures. Also increase the size of the descriptor memory based
on the number of tx queues in use.
Based on the downstream driver at github.com/qca/alx
Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tobias Regnery [Tue, 15 Nov 2016 11:43:10 +0000 (12:43 +0100)]
alx: add ability to allocate and free alx_napi structures
Add new functions to allocate and free the alx_napi structures and use them
in __alx_open and __alx_stop. We only allocate one of these structures for
now, as the rest of the driver is not yet ready for multiple queues.
We switch over the setup of the interrupt mask and the call to netif_napi_add
to the new function because we must adjust these later on a per queue basis.
Based on the downstream driver at github.com/qca/alx
Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tobias Regnery [Tue, 15 Nov 2016 11:43:08 +0000 (12:43 +0100)]
alx: refactor descriptor allocation
Split the allocation of descriptor memory and the buffer allocation into a
tx and rx function. This is in preparation for multiple queues where we
need to iterate over the new functions.
While at it drop the unneeded casting on the rx side.
Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 Nov 2016 03:34:27 +0000 (22:34 -0500)]
Merge branch 'dpaa_eth-next'
Madalin Bucur says:
====================
dpaa_eth: Add the QorIQ DPAA Ethernet driver
This patch series adds the Ethernet driver for the Freescale
QorIQ Data Path Acceleration Architecture (DPAA).
This version includes changes following the feedback received
on previous versions from Eric Dumazet, Bob Cochran, Joe Perches,
Paul Bolle, Joakim Tjernlund, Scott Wood, David Miller - thank you.
Together with the driver a managed version of alloc_percpu
is provided that simplifies the release of per-CPU memory.
The Freescale DPAA architecture consists in a series of hardware
blocks that support the Ethernet connectivity. The Ethernet driver
depends upon the following drivers that are currently in the Linux
kernel:
- Peripheral Access Memory Unit (PAMU)
drivers/iommu/fsl_*
- Frame Manager (FMan) added in v4.4
drivers/net/ethernet/freescale/fman
- Queue Manager (QMan), Buffer Manager (BMan) added in v4.9-rc1
drivers/soc/fsl/qbman
where the acronyms used above (and in the code) are:
DPAA = Data Path Acceleration Architecture
FMan = DPAA Frame Manager
QMan = DPAA Queue Manager
BMan = DPAA Buffers Manager
QMI = QMan interface in FMan
BMI = BMan interface in FMan
FMan SP = FMan Storage Profiles
MURAM = Multi-user RAM in FMan
FQ = QMan Frame Queue
Rx Dfl FQ = default reception FQ
Rx Err FQ = Rx error frames FQ
Tx Cnf FQ = Tx confirmation FQ
Tx FQs = transmission frame queues
dtsec = datapath three speed Ethernet controller (10/100/1000 Mbps)
tgec = ten gigabit Ethernet controller (10 Gbps)
memac = multirate Ethernet MAC (10/100/1000/10000)
Changes from v7:
- remove the debug option to use a common buffer pool for all the
interfaces
Changed from v6:
- fixed an issue on an error path in dpaa_set_mac_address()
- removed NDO operation definitions that were not needed
- sorted the local variable declarations
- cleaned up a few checkpatch checks
- removed friendly network interface naming code
Changes from v5:
- adapt to the latest Q/BMan drivers API
- use build_skb() on Rx path instead of buffer pool refill path
- proper support for multiple buffer pools
- align function, variable names, code cleanup
- driver file structure cleanup
Changes from v4:
- addressed feedback from Scott Wood and Joe Perches
- fixed spelling
- fixed leak of uninitialized stack to userspace
- fix prints
- replace raw_cpu_ptr() with this_cpu_ptr()
- remove _s from the end of structure names
- remove underscores at start of functions, goto labels
- remove likely in error paths
- use container_of() instead of open casts
- remove priv from the driver name
- move return type on same line with function name
- drop DPA_READ_SKB_PTR/DPA_WRITE_SKB_PTR
Changes from v3:
- removed bogus delay and comment in .ndo_stop implementation
- addressed minor issues reported by David Miller
Changes from v2:
- removed debugfs, moved exports to ethtool statistics
- removed congestion groups Kconfig params
Changes from v1:
- bpool level Kconfig options removed
- print format using pr_fmt, cleaned up prints
- __hot/__cold removed
- gratuitous unlikely() removed
- code style aligned, consistent spacing for declarations
- comment formatting
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Madalin Bucur [Tue, 15 Nov 2016 08:41:04 +0000 (10:41 +0200)]
dpaa_eth: add ethtool statistics
Add a series of counters to be exported through ethtool:
- add detailed counters for reception errors;
- add detailed counters for QMan enqueue reject events;
- count the number of fragmented skbs received from the stack;
- count all frames received on the Tx confirmation path;
- add congestion group statistics;
- count the number of interrupts for each CPU.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Madalin Bucur [Tue, 15 Nov 2016 08:41:02 +0000 (10:41 +0200)]
dpaa_eth: add support for DPAA Ethernet
This introduces the Freescale Data Path Acceleration Architecture
(DPAA) Ethernet driver (dpaa_eth) that builds upon the DPAA QMan,
BMan, PAMU and FMan drivers to deliver Ethernet connectivity on
the Freescale DPAA QorIQ platforms.
Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrey Vagin [Tue, 15 Nov 2016 02:15:14 +0000 (18:15 -0800)]
tcp: allow to enable the repair mode for non-listening sockets
The repair mode is used to get and restore sequence numbers and
data from queues. It used to checkpoint/restore connections.
Currently the repair mode can be enabled for sockets in the established
and closed states, but for other states we have to dump the same socket
properties, so lets allow to enable repair mode for these sockets.
The repair mode reveals nothing more for sockets in other states.
Signed-off-by: Andrei Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 Nov 2016 03:24:42 +0000 (22:24 -0500)]
Merge branch 'liquidio-CN23XX-VF-support'
Raghu Vatsavayi says:
====================
liquidio CN23XX VF support
Following is the V6 patch series for adding VF support on
CN23XX devices. This version addressed:
1) Your concern for ordering of local variable declarations
from longest to shortest line.
2) Removed module parameters max_vfs, num_queues_per_{p,v}f.
3) Minor changes for fixing new checkpatch script related
errors on pre-existing driver.
4) Fixed compilation issues when CONFIG_PCI_IOV/CONFIG_PCI_ATS
options are disabled.
5) Modified qualifiers for printing mac addresses with pM format.
I will post remaining VF patches soon after this patchseries is
applied. Please apply patches in the following order as some of
the patches depend on earlier patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Mon, 14 Nov 2016 22:39:16 +0000 (16:39 -0600)]
amd-xgbe: Fix up some coccinelle identified warnings
Fix up some warnings that were identified by coccinelle:
Clean up an if/else block that can look confusing since the same statement
is executed in an "else if" check and the final "else" statement.
Change a variable from unsigned int to int since it is used in an if
statement checking the value to be less than 0.
Reported-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Mon, 14 Nov 2016 22:39:07 +0000 (16:39 -0600)]
amd-xgbe: Fix mask appliciation for Clause 37 register
The application of a mask to clear an area of a clause 37 register value
was not properly applied. Update the code to do the proper application
of the mask.
Reported-by: Marion & Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 16 Nov 2016 03:05:55 +0000 (22:05 -0500)]
Merge branch 'sun4i-emac-big-endian'
Michael Weiser says:
====================
sun4i-emac: Fixes for running a big-endian kernel on Cubieboard2
the following patches are what remains to be fixed in order to allow running a
big-endian kernel on the Cubieboard2.
The first patch fixes up endianness problems with DMA descriptors in
the stmmac driver preventing it from working correctly when runnning a
big-endian kernel.
The second patch adds the ability to enable diagnostic messages in the
sun4i-emac driver which were instrumental in finding the problem fixed
by patch number three: Endianness confusion caused by dual-purpose I/O
register usage in sun4i-emac.
All of these have been tested successfully on a Cubieboard2 DualCard.
Changes since v4:
- Rebased to current master
- Removed already applied patches to sunxi-mmc and sunxi-Kconfig
Changes since v3:
- Rebased sunxi-mmc patch against Ulf's mmc.git/next
- Changed Kconfig change to enable big-endian support only for sun7i
devices
Changes since v2:
- Fixed typo in stmmac patch causing a build failure
- Added sun4i-emac patches
Michael Weiser [Mon, 14 Nov 2016 17:58:07 +0000 (18:58 +0100)]
net: ethernet: sun4i-emac: Read rxhdr in CPU byte-order
The EMAC EMAC_RX_IO_DATA_REG data register is dual-purpose: On one hand
it is used to move actual packet data off the wire. This will be in
wire-format and accepted as such by higher layers such as IP. Therefore
it is correctly read as-is (i.e. raw) using readsl.
On the other hand it provides metadata about incoming transfers to the
driver such as length and checksum validation status. This data is
little-endian, always and it is interpreted by the driver. Therefore it
needs to be swapped to CPU endianness to make sense to the driver. This
is already done for the "receive header" but not rxhdr.
Read rxhdr using readl in order for sun4i-emac to work correctly when
running a big-endian kernel.
Signed-off-by: Michael Weiser <michael.weiser@gmx.de> Cc: Maxime Ripard <maxime.ripard@free-electrons.com> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Weiser [Mon, 14 Nov 2016 17:58:06 +0000 (18:58 +0100)]
net: ethernet: sun4i-emac: Allow to enable netif messages
sun4i-emac has the ability to print a number of diagnostic messages using
dev_dbg depending on message level settings implemented using netif_msg_*
macros. But there's no way to actually enable them.
Add the ability to switch diagnostic messages on using either a module
parameter debug or ethtool -s <netif> msglvl <flags>.
Signed-off-by: Michael Weiser <michael.weiser@gmx.de> Cc: Maxime Ripard <maxime.ripard@free-electrons.com> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Weiser [Mon, 14 Nov 2016 17:58:05 +0000 (18:58 +0100)]
net: ethernet: stmmac: change dma descriptors to __le32
The stmmac driver does not take into account the processor may be big
endian when writing the DMA descriptors. This causes the ethernet
interface not to be initialised correctly when running a big-endian
kernel. Change the descriptors for DMA to use __le32 and ensure they are
suitably swapped before writing. Tested successfully on the
Cubieboard2.
Signed-off-by: Michael Weiser <michael.weiser@gmx.de> Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com> Cc: Alexandre Torgue <alexandre.torgue@st.com> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Mon, 14 Nov 2016 15:42:01 +0000 (16:42 +0100)]
dctcp: update cwnd on congestion event
draft-ietf-tcpm-dctcp-02 says:
... when the sender receives an indication of congestion
(ECE), the sender SHOULD update cwnd as follows:
cwnd = cwnd * (1 - DCTCP.Alpha / 2)
So, lets do this and reduce cwnd more smoothly (and faster), as per
current congestion estimate.
Cc: Lawrence Brakmo <brakmo@fb.com> Cc: Andrew Shewmaker <agshew@gmail.com> Cc: Glenn Judd <glenn.judd@morganstanley.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Wed, 16 Nov 2016 02:21:09 +0000 (18:21 -0800)]
net: bcm63xx_enet: Fix build failure with phy_ethtool_nway_reset
Introduced a typo making the driver no longer build, *sigh*.
Fixes: 42469bf5d9bb ("net: bcm63xx_enet: Utilize phy_ethtool_nway_reset") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>