David S. Miller [Tue, 30 Sep 2014 20:37:13 +0000 (16:37 -0400)]
Merge branch 'pxa168_eth'
Antoine Tenart says:
====================
ARM: Berlin: Ethernet support
This series introduces support for the Ethernet controller on Berlin SoCs,
using the existing pxa168 Ethernet driver. In order to do this, DT
support is added to the driver alongside some other modifications and
fixes.
This has been tested on a Berlin BG2Q DMP board.
Changes since v5:
- fixed the build when building the driver as a module
Changes since v4:
- removed the phy-addr property and added a phy subnode
- added COMPILE_TEST for the pxa168_eth driver
Changes since v3:
- moved the addition of pxa168_eth_get_mac_address() to the patch
using it first
Changes since v2:
- reworked how the MAC address is configured
- made the clock anonymous
Changes since v1:
- removed custom Berlin Ethernet driver
- used the pxa168 Ethernet driver instead
- made modifications to the pxa168 driver (DT support, fixes)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Tue, 30 Sep 2014 14:28:16 +0000 (16:28 +0200)]
ARM: dts: berlin: enable the Ethernet port on the BG2Q DMP
This patch enables the Ethernet port on the Marvell Berlin2Q DMP board.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Tue, 30 Sep 2014 14:28:15 +0000 (16:28 +0200)]
ARM: dts: berlin: add the Ethernet node
This patch adds the Ethernet node, enabling the network unit on Berlin
BG2Q SoCs.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Tue, 30 Sep 2014 14:28:14 +0000 (16:28 +0200)]
net: pxa168_eth: allow to compile the pxa168_eth driver for tests
Add a dependency to COMPILE_TEST so that the driver can be compiled for
test purposes.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Tue, 30 Sep 2014 14:28:13 +0000 (16:28 +0200)]
net: pxa168_eth: allow Berlin SoCs to use the pxa168_eth driver
Berlin SoCs have an Ethernet controller compatible with the pxa168.
Allow these SoCs to use the pxa168_eth driver.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Tue, 30 Sep 2014 14:28:12 +0000 (16:28 +0200)]
net: pxa168_eth: rework the MAC address setup
This patch reworks the way the MAC address is retrieved. The MAC address
can now, in addition to being random, be set in the device tree or
retrieved from the Ethernet controller MAC address registers. The
probing function will try to get a MAC address in the following order:
- From the device tree.
- From the Ethernet controller MAC address registers.
- Generate a random one.
This patch also adds a function to read the MAC address from the
Ethernet Controller registers.
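The probing order above can be sketched as follows. This is an illustrative
model only, not the driver code: the function names and the way the two
candidate addresses are passed in are assumptions made for the example.

```python
import secrets

def is_valid_ether_addr(mac: bytes) -> bool:
    """True for a 6-byte unicast address that is not all zeros."""
    return len(mac) == 6 and any(mac) and not (mac[0] & 0x01)

def random_ether_addr() -> bytes:
    """Random unicast address with the locally administered bit set."""
    raw = bytearray(secrets.token_bytes(6))
    raw[0] = (raw[0] & 0xFE) | 0x02  # clear multicast bit, set local bit
    return bytes(raw)

def pick_mac_address(dt_mac, reg_mac):
    """Try the device tree, then the controller registers, then random.

    dt_mac and reg_mac are hypothetical inputs: the address found in the
    device tree (or None) and the one read back from the registers (or None).
    """
    for source, mac in (("device-tree", dt_mac), ("registers", reg_mac)):
        if mac is not None and is_valid_ether_addr(mac):
            return source, mac
    return "random", random_ether_addr()
```

With no device tree address and all-zero registers, the sketch falls through
to the random branch, which mirrors the last step of the list.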
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Tue, 30 Sep 2014 14:28:11 +0000 (16:28 +0200)]
net: pxa168_eth: set the mac address on the Ethernet controller
When changing the MAC address, in addition to updating the dev_addr in
the net_device structure, this patch also updates the MAC address
registers (high and low) of the Ethernet controller with the new MAC.
The address stored in these registers is used for IEEE 802.3x Ethernet
flow control, which is already enabled.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Tue, 30 Sep 2014 14:28:10 +0000 (16:28 +0200)]
net: pxa168_eth: fix Ethernet flow control status
IEEE 802.3x Ethernet flow control is disabled when bit (1 << 2) is set
in the port status register. Fix the flow control detection in the link
event handling function which was relying on the opposite assumption.
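A minimal sketch of the corrected test, assuming only what the message
states (bit 1 << 2 set means flow control is disabled). The constant and
function names are hypothetical, not taken from the driver.

```python
# Bit position comes from the commit message; the name is made up here.
FLOW_CTRL_DISABLED = 1 << 2

def flow_control_enabled(port_status: int) -> bool:
    """Flow control is on only when the 'disabled' bit is clear."""
    return (port_status & FLOW_CTRL_DISABLED) == 0
```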
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Tue, 30 Sep 2014 14:28:09 +0000 (16:28 +0200)]
Documentation: bindings: net: add the Marvell PXA168 Ethernet controller
This adds the binding documentation for the Marvell PXA168 Ethernet
controller, following its DT support.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Tue, 30 Sep 2014 14:28:08 +0000 (16:28 +0200)]
net: pxa168_eth: add device tree support
Add the device tree support to the pxa168_eth driver.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Antoine Ténart [Tue, 30 Sep 2014 14:28:07 +0000 (16:28 +0200)]
net: pxa168_eth: clean up
Clean up the pxa168_eth driver a bit before adding the device tree
support.
Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Jack Morgenstein [Tue, 30 Sep 2014 09:03:50 +0000 (12:03 +0300)]
net/mlx4_core: Deprecate error message at ConnectX-2 cards startup to debug
ConnectX2 HCAs have max_mtu=4k and max_vl=8 vls. However, if you specify
a 4K mtu, the max_vl supported for 4K is 4 vls. The driver at startup
attempts to set a 4K mtu using the max_vl value obtained from QUERY_PORT.
Since the max_vl value is 8 vls (which is supported up to 2K mtu size),
the first attempt to set the mtu/vl port value will fail, generating
the following error message in the log:
mlx4_core 0000:06:00.0: command 0xc failed: fw status = 0x40
The driver then tries again, using mtu=4k, vls=4, and this succeeds.
Since we do not want to have this error message always displayed at driver
start when there are ConnectX2 HCAs on the host, we deprecate the error
message for this specific command/input_modifier/opcode_modifier/fw-status
to be debug.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jack Morgenstein [Tue, 30 Sep 2014 09:03:49 +0000 (12:03 +0300)]
net/mlx4_core: Protect QUERY_PORT wrapper from untrusted guests
The function mlx4_QUERY_PORT_wrapper implements only the
QUERY_PORT "general" case (opcode modifier = 0).
Verify that the opcode modifier is zero, and also that the
input modifier contains only the port number in bits 0..7
(all other bits should be zero).
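The two checks described above amount to a pair of mask tests. The sketch
below is not the wrapper itself; the names are hypothetical and only the
conditions (opcode modifier zero, input modifier confined to bits 0..7) come
from the message.

```python
PORT_MASK = 0xFF  # bits 0..7 carry the port number

def query_port_request_ok(op_modifier: int, in_modifier: int) -> bool:
    """Accept only the 'general' QUERY_PORT form from a guest."""
    if op_modifier != 0:
        return False  # only opcode modifier 0 is implemented
    # Any bit set outside the port field makes the request invalid.
    return (in_modifier & ~PORT_MASK) == 0
```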
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Majd Dibbiny [Tue, 30 Sep 2014 09:03:48 +0000 (12:03 +0300)]
net/mlx4_core: New init and exit flow for mlx4_core
In the new flow, we separate the pci initialization and teardown
from the initialization and teardown of the other resources.
__mlx4_init_one handles the pci resources initialization. It then
calls mlx4_load_one to initialize the remainder of the resources.
When removing a device, mlx4_remove_one is invoked. However, now
mlx4_remove_one calls mlx4_unload_one to free all the resources except the pci
resources. When mlx4_unload_one returns, mlx4_remove_one then frees the
pci resources.
The above separation will allow us to implement 'reset flow' in the future.
It will also enable more EQs for VFs and is a pre-step to the modern API to
enable/disable SRIOV.
Also added nvfs; an integer array of size MLX4_MAX_PORTS + 1; to the mlx4_dev
struct. This new field is used to avoid parsing the num_vfs module parameter
each time the mlx4_restart_one is called.
Signed-off-by: Majd Dibbiny <majd@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jack Morgenstein [Tue, 30 Sep 2014 09:03:47 +0000 (12:03 +0300)]
net/mlx4_core: Don't disable SRIOV if there are active VFs
When unloading the host driver while there are VFs active on VMs,
the PF driver disabled sriov anyway, causing kernel crashes.
We now leave SRIOV enabled, to avoid that.
When the driver is reloaded, __mlx4_init_one is invoked on the PF.
It now checks to see if SRIOV is already enabled on the PF -- and
if so does not enable sriov again.
Signed-off-by: Tal Alon <talal@mellanox.com> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
netfilter: bridge: build br_nf_core only if required
Eric reports build failure with
CONFIG_BRIDGE_NETFILTER=n
We insist on building br_nf_core.o unconditionally, but we must only do so
if br_netfilter was enabled, else it fails to build due to
functions being defined to empty stubs (and some structure members
being defined out).
Also, BRIDGE_NETFILTER=y|m makes no sense when BRIDGE=n.
Fixes: 34666d467 (netfilter: bridge: move br_netfilter out of the core) Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 30 Sep 2014 05:30:50 +0000 (01:30 -0400)]
Merge branch 'am335x'
Markus Pargmann says:
====================
net: cpsw: Support for am335x chip MACIDs
This series adds support to the cpsw driver to read the MACIDs of the am335x
chip and use them as fallback. These addresses are only used if there are no
mac addresses in the devicetree, for example set by a bootloader.
====================
Acked-by: Mugunthan V N <mugunthanvnm@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Pargmann [Mon, 29 Sep 2014 06:53:19 +0000 (08:53 +0200)]
arm: dts: am33xx, Add syscon phandle to cpsw node
There are 2 MACIDs stored in the control module of the am33xx. These are
read by the cpsw driver if no valid MACID was found in the devicetree.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Reviewed-by: Wolfram Sang <wsa@the-dreams.de> Acked-by: Tony Lindgren <tony@atomide.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Pargmann [Mon, 29 Sep 2014 06:53:18 +0000 (08:53 +0200)]
am33xx: define syscon control module device node
Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Reviewed-by: Wolfram Sang <wsa@the-dreams.de> Acked-by: Tony Lindgren <tony@atomide.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Pargmann [Mon, 29 Sep 2014 06:53:17 +0000 (08:53 +0200)]
net: cpsw: Add am33xx MACID readout
This patch adds a function to get the MACIDs from the am33xx SoC
control module registers which hold unique vendor MACIDs. This is only
used if of_get_mac_address() fails to get a valid mac address.
Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Reviewed-by: Wolfram Sang <wsa@the-dreams.de> Tested-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Tony Lindgren <tony@atomide.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Gospodarek [Mon, 29 Sep 2014 02:34:37 +0000 (22:34 -0400)]
bonding: make global bonding stats more reliable
As the code stands today, bonding stats are based simply on the stats
from the member interfaces. If a member was to be removed from a bond,
the stats would instantly drop. This would be confusing to an admin
who would suddenly see interface stats drop while traffic is still
flowing.
In addition to preventing the stats drops mentioned above, new members
will now be added to the bond and only traffic received after the member
was added to the bond will be counted as part of bonding stats. Bonding
counters will also be updated when any slaves are dropped to make sure
the reported stats are reliable.
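One way to picture the behaviour described above is a running total that
only ever absorbs per-member deltas. The sketch below tracks a single
counter and uses invented names; it is a simplified model under those
assumptions, not the bonding implementation.

```python
class BondStats:
    """Running bond total built from per-member counter deltas."""

    def __init__(self):
        self.total = 0
        self.cached = {}  # member name -> counter value last folded in

    def add_member(self, name, current):
        # Baseline the new member so earlier traffic is not counted.
        self.cached[name] = current

    def fold(self, name, current):
        # Add only what the member has seen since the last fold.
        self.total += current - self.cached[name]
        self.cached[name] = current

    def remove_member(self, name, current):
        # Take the final delta, then forget the member. The total stays.
        self.fold(name, current)
        del self.cached[name]
```

Because the total is never recomputed from the current member set, removing
a member cannot make it drop.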
v2: Changes suggested by Nik to properly allocate/free stats memory.
v3: Properly destroy workqueue and fix netlink configuration path.
v4: Moved cached stats into bonding and slave structs as there does not
seem to be a complexity/performance benefit to using alloc'd memory vs
in-struct memory.
Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Sun, 28 Sep 2014 18:53:57 +0000 (11:53 -0700)]
net: sched: restrict use of qstats qlen
This removes the use of qstats->qlen variable from the classifiers
and makes it an explicit argument to gnet_stats_copy_queue().
The qlen represents the qdisc queue length and is packed into
the qstats at the last moment before passing to user space. By
handling it explicitly we avoid, in the percpu stats case, having
to figure out which per_cpu variable to put it in.
It would probably be best to remove it from qstats completely
but qstats is a user space ABI and can't be broken. A future
patch could make an internal only qstats structure that would
avoid having to allocate an additional u32 variable on the
Qdisc struct. This would make the qstats struct 128bits instead
of 128+32.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Sun, 28 Sep 2014 18:53:29 +0000 (11:53 -0700)]
net: sched: implement qstat helper routines
This adds helpers to manipulate qstats logic and replaces locations
that touch the counters directly. This simplifies future patches
to push qstats onto per cpu counters.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Sun, 28 Sep 2014 18:52:56 +0000 (11:52 -0700)]
net: sched: make bstats per cpu and estimator RCU safe
In order to run qdiscs without locking, statistics and estimators
need to be handled correctly.
To resolve bstats, make the statistics per cpu. Because this is
only needed for qdiscs that run without locks, which will not be
the case for most qdiscs in the near future, only create percpu
stats when qdiscs set the TCQ_F_CPUSTATS flag.
Next because estimators use the bstats to calculate packets per
second and bytes per second the estimator code paths are updated
to use the per cpu statistics.
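The per cpu idea can be modelled as one counter slot per CPU that is summed
on read. This is a plain Python illustration with invented names; it says
nothing about the kernel data structures beyond what the message describes.

```python
from dataclasses import dataclass

@dataclass
class BStats:
    nbytes: int = 0
    npackets: int = 0

class PerCpuBStats:
    """One slot per CPU; writers touch only their own slot."""

    def __init__(self, ncpus: int):
        self.slots = [BStats() for _ in range(ncpus)]

    def update(self, cpu: int, length: int) -> None:
        slot = self.slots[cpu]  # no shared counter, so no lock is modelled
        slot.nbytes += length
        slot.npackets += 1

    def read(self) -> BStats:
        # Readers (e.g. a rate estimator) sum all slots to get the totals.
        return BStats(
            nbytes=sum(s.nbytes for s in self.slots),
            npackets=sum(s.npackets for s in self.slots),
        )
```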
Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Braun [Thu, 25 Sep 2014 14:31:08 +0000 (16:31 +0200)]
macvlan: add source mode
This patch adds a new mode of operation to macvlan, called "source".
It allows one to set a list of allowed MAC addresses, which is
matched against the source MAC address of frames received on the
underlying interface.
This enables creating MAC-based VLAN associations, instead of the
standard port- or tag-based ones. The feature is useful for deploying
802.1x MAC-based behavior where drivers of the underlying interfaces
do not allow that.
Configuration is done through the netlink interface using e.g.:
ip link add link eth0 name macvlan0 type macvlan mode source
ip link add link eth0 name macvlan1 type macvlan mode source
ip link set link dev macvlan0 type macvlan macaddr add 00:11:11:11:11:11
ip link set link dev macvlan0 type macvlan macaddr add 00:22:22:22:22:22
ip link set link dev macvlan0 type macvlan macaddr add 00:33:33:33:33:33
ip link set link dev macvlan1 type macvlan macaddr add 00:33:33:33:33:33
ip link set link dev macvlan1 type macvlan macaddr add 00:44:44:44:44:44
This allows clients with MAC addresses 00:11:11:11:11:11 and
00:22:22:22:22:22 to be part of only the VLAN associated with the
macvlan0 interface, the client with MAC address 00:44:44:44:44:44 to
be part of only the VLAN associated with the macvlan1 interface, and
the client with MAC address 00:33:33:33:33:33 to be associated with
both VLANs.
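The matching in the example configuration can be modelled as a lookup from
source MAC address to the set of macvlan devices that list it. The class
below is a sketch with invented names, not the driver's hash table.

```python
from collections import defaultdict

class SourceModeTable:
    """Maps a source MAC address to the macvlan devices that allow it."""

    def __init__(self):
        self._by_mac = defaultdict(set)

    def add(self, dev: str, mac: str) -> None:
        self._by_mac[mac.lower()].add(dev)

    def receivers(self, src_mac: str) -> set:
        # A frame is delivered to every device listing its source address.
        return set(self._by_mac.get(src_mac.lower(), ()))

# Same entries as the ip link commands in the message.
table = SourceModeTable()
for dev, mac in [
    ("macvlan0", "00:11:11:11:11:11"),
    ("macvlan0", "00:22:22:22:22:22"),
    ("macvlan0", "00:33:33:33:33:33"),
    ("macvlan1", "00:33:33:33:33:33"),
    ("macvlan1", "00:44:44:44:44:44"),
]:
    table.add(dev, mac)
```

A source address that appears on no list yields an empty set, i.e. no
source-mode device receives the frame in this model.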
Based on work of Stefan Gula <steweg@gmail.com>
v8: last version of Stefan Gula for Kernel 3.2.1
v9: rework onto linux-next 2014-03-12 by Michael Braun
add MACADDR_SET command, enable to configure mac for source mode
while creating interface
v10:
- reduce indentation level
- rename source_list to source_entry
- use aligned 64bit ether address
- use hash_64 instead of addr[5]
v11:
- rebase for 3.14 / linux-next 20.04.2014
v12
- rebase for linux-next 2014-09-25
Signed-off-by: Michael Braun <michael-dev@fami-braun.de> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
pull request: netfilter/ipvs updates for net-next
The following patchset contains Netfilter/IPVS updates for net-next,
most relevantly they are:
1) Four patches to make the new nf_tables masquerading support
independent of the x_tables infrastructure. This also resolves a
compilation breakage if the masquerade target is disabled but the
nf_tables masq expression is enabled.
2) ipset updates via Jozsef Kadlecsik. This includes the addition of the
skbinfo extension that allows you to store packet metainformation in the
elements. This can be used to fetch and restore this to the packets through
the iptables SET target, patches from Anton Danilov.
3) Add the hash:mac set type to ipset, from Jozsef Kadlecsik.
4) Add simple weighted fail-over scheduler via Simon Horman. This provides
a fail-over IPVS scheduler (unlike existing load balancing schedulers).
Connections are directed to the appropriate server based solely on
highest weight value and server availability, patch from Kenny Mathis.
5) Support IPv6 real servers in IPv4 virtual-services and vice versa.
Simon Horman informs that the motivation for this is to allow more
flexibility in the choice of IP version offered by both virtual-servers
and real-servers as they no longer need to match: An IPv4 connection
from an end-user may be forwarded to a real-server using IPv6 and
vice versa. No ip_vs_sync support yet though. Patches from Alex Gartrell
and Julian Anastasov.
6) Add global generation ID to the nf_tables ruleset. When dumping from
several different object lists, we need a way to identify that an update
has occurred so userspace knows that it needs to refresh its lists. This
also includes a new command to obtain the 32-bit generation ID. The
least significant 16 bits of this ID are also exposed through the res_id field
in the nfnetlink header to quickly detect the interference and retry when
there is no risk of ID wraparound.
7) Move br_netfilter out of the bridge core. The br_netfilter code is
built in the bridge core by default. This causes various problems for
people that don't want this: Jesper reported a performance drop due
to the unconditional hook registration, and I remember having read complaints
on netdev from people regarding the unexpected behaviour of our bridging
stack when br_netfilter is enabled (fragmentation handling, layer 3 and
upper inspection). People that still need this should easily undo the
damage by modprobing the new br_netfilter module.
8) Dump the set policy nf_tables that allows set parameterization. So
userspace can keep user-defined preferences when saving the ruleset.
From Arturo Borrero.
9) Use __seq_open_private() helper function to reduce boiler plate code
in x_tables, From Rob Jones.
10) Safer default behaviour in case that you forget to load the protocol
tracker. Daniel Borkmann and Florian Westphal detected that if your
ruleset is stateful, you allow traffic to at least one single SCTP port
and the SCTP protocol tracker is not loaded, then any SCTP traffic may
pass through unfiltered. After this patch, the connection tracking
classifies SCTP/DCCP/UDPlite/GRE packets as invalid if your kernel has
been compiled with support for these modules.
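Item 6 above (generation ID) can be illustrated with a small model of how
userspace might use the 16 bits exposed in res_id. The helper names are
invented for this sketch and the retry policy is an assumption; only the
32-bit ID and its low 16 bits come from the text.

```python
def res_id_for(gen_id: int) -> int:
    """Low 16 bits of the 32-bit generation ID, as carried in res_id."""
    return gen_id & 0xFFFF

def dump_is_consistent(res_ids) -> bool:
    """True if every message of a multi-part dump saw the same generation."""
    return len(set(res_ids)) <= 1

# 65536 updates later the low 16 bits repeat, which is the wraparound
# risk the text mentions.
assert res_id_for(0x00010000) == res_id_for(0x00000000)
```

If the check fails, userspace would discard what it collected and dump
again.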
====================
Trivially resolved conflict in include/linux/skbuff.h, Eric moved some
netfilter skbuff members around, and the netfilter tree adjusted the
ifdef guards for the bridging info pointer.
Signed-off-by: David S. Miller <davem@davemloft.net>
Suggested by Stephen. Also drop inline keyword and let compiler decide.
gcc 4.7.3 decides to no longer inline tcp_ecn_check_ce, so split it up.
The actual evaluation is not inlined anymore while the ECN_OK test is.
Suggested-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
After Octavian Purdila's tcp ipv4/ipv6 unification work this helper only
has a single callsite.
While at it, convert name to lowercase, suggested by Stephen.
Suggested-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 29 Sep 2014 18:36:33 +0000 (14:36 -0400)]
Merge branch 'arcnet-EAE'
Michael Grzeschik says:
====================
ARCNET: add support for EAE multi interface card
This series adds support for the PLX Bridge based multi interface
pci cards and adds support to change device address on com200xx chips
during runtime.
This series is based on v3.17-rc7.
It is fixed for build against com20020_cs.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ARCNET: add support for multi interfaces on com20020
The com20020-pci driver is currently designed to instance
one netdev with one pci device. This patch adds support to
instance many cards with one pci device, depending on the device
data in the private data.
Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
ARCNET: return IRQ_NONE if the interface isn't running
The interrupt handler needs to return IRQ_NONE in case
two devices are used with the shared interrupt handler.
Otherwise it could steal interrupts from the other
interface.
Signed-off-by: Michael Grzeschik <m.grzeschik@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 29 Sep 2014 05:18:47 +0000 (22:18 -0700)]
net: reorganize sk_buff for faster __copy_skb_header()
With proliferation of bit fields in sk_buff, __copy_skb_header() became
quite expensive, showing as the most expensive function in a GSO
workload.
__copy_skb_header() performance is also critical for non GSO TCP
operations, as it is used from skb_clone()
This patch carefully moves all the fields that were not copied in a
separate zone : cloned, nohdr, fclone, peeked, head_frag, xmit_more
Then I moved all other fields and all other copied fields in a section
delimited by headers_start[0]/headers_end[0] section so that we
can use a single memcpy() call, inlined by compiler using long
word load/stores.
I also tried to make all copies in the natural orders of sk_buff,
to help hardware prefetching.
I made sure sk_buff size did not change.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
netfilter: conntrack: disable generic tracking for known protocols
Given following iptables ruleset:
-P FORWARD DROP
-A FORWARD -m sctp --dport 9 -j ACCEPT
-A FORWARD -p tcp --dport 80 -j ACCEPT
-A FORWARD -p tcp -m conntrack -m state ESTABLISHED,RELATED -j ACCEPT
One would assume that this allows SCTP on port 9 and TCP on port 80.
Unfortunately, if the SCTP conntrack module is not loaded, this allows
*all* SCTP communication to pass through, i.e. -p sctp -j ACCEPT,
which we think is a security issue.
This is because on the first SCTP packet on port 9, we create a dummy
"generic l4" conntrack entry without any port information (since
conntrack doesn't know how to extract this information).
All subsequent packets that are unknown will then be in established
state since they will fallback to proto_generic and will match the
'generic' entry.
Our originally proposed version [1] completely disabled generic protocol
tracking, but Jozsef suggests to not track protocols for which a more
suitable helper is available, hence we now mitigate the issue for in
tree known ct protocol helpers only, so that at least NAT and direction
information will still be preserved for others.
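The effect described above can be reproduced with a toy tracker whose key
has no port field. This is a deliberately simplified model (it treats any
packet matching an existing entry as established and ignores direction
handling and timeouts); all names are invented.

```python
def generic_key(pkt):
    """Generic l4 key: addresses and protocol only, no ports."""
    return (pkt["src"], pkt["dst"], pkt["proto"])

class GenericTracker:
    def __init__(self):
        self.entries = set()

    def state(self, pkt) -> str:
        src, dst, proto = generic_key(pkt)
        if (src, dst, proto) in self.entries or (dst, src, proto) in self.entries:
            return "ESTABLISHED"
        return "NEW"

    def confirm(self, pkt) -> None:
        # Called once the ruleset has accepted the first packet.
        self.entries.add(generic_key(pkt))

tracker = GenericTracker()
allowed = {"src": "10.0.0.1", "dst": "10.0.0.2", "proto": "sctp", "dport": 9}
other = {"src": "10.0.0.1", "dst": "10.0.0.2", "proto": "sctp", "dport": 22}
first_state = tracker.state(allowed)   # accepted by the --dport 9 rule
tracker.confirm(allowed)
later_state = tracker.state(other)     # different port, same generic entry
```

In this model the port 22 packet matches the entry created for port 9, so an
ESTABLISHED rule would accept it.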
This patch series adds support for the Qualcomm QCA7000 Homeplug GreenPHY.
The QCA7000 is a serial-to-powerline bridge with two interfaces: UART and SPI.
These patches handle only the latter, with an Ethernet over SPI protocol
driver.
This driver is based on the Qualcomm code [1], but contains a lot of changes
since last year:
* devicetree support
* DebugFS support
* ethtool support
* better error handling
* performance improvements
* code cleanup
* some bugfixes
The code has been tested only on Freescale i.MX28 boards, but should work
on other platforms.
[1] - https://github.com/IoE/qca7000
Changes in V3:
- Use ether_addr_copy instead of memcpy
- Remove qcaspi_set_mac_address
- Improve DT parsing
- replace OF_GPIO dependency with OF
- fix compile error caused by SET_ETHTOOL_OPS
- fix possible endless loop when spi read fails
- fix DT documentation
- fix coding style
- fix sparse warnings
Changes in V2:
- replace in DT the SPI intr GPIO with pure interrupt
- make legacy mode a boolean DT property and remove it as module parameter
- make burst length a module parameter instead of DT property
- make pluggable a module parameter instead of DT property
- improve DT documentation
- replace debugFS register dump with ethtool function
- replace debugFS stats with ethtool function
- implement function to get ring parameter via ethtool
- implement function to set TX ring count via ethtool
- fix TX ring state in debugFS
- optimize tx ring flush
- add byte limit for TX ring to avoid bufferbloat
- fix TX queue full and write buffer miss counter
- fix SPI clk speed module parameter
- fix possible packet loss
- fix possible race during transmit
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 29 Sep 2014 04:13:17 +0000 (00:13 -0400)]
Merge branch 'dctcp'
Daniel Borkmann says:
====================
net: tcp: DCTCP congestion control algorithm
This patch series adds support for the DataCenter TCP (DCTCP) congestion
control algorithm. Please see individual patches for the details.
The last patch adds DCTCP as a congestion control module, and previous
ones add needed infrastructure to extend the congestion control framework.
Joint work between Florian Westphal, Daniel Borkmann and Glenn Judd.
v3 -> v2:
- No changes anywhere, just a resend as requested by Dave
- Added Stephen's ACK
v1 -> v2:
- Rebased to latest net-next
- Addressed Eric's feedback, thanks!
- Update stale comment wrt. DCTCP ECN usage
- Don't call INET_ECN_xmit for every packet
- Add dctcp ss/inetdiag support to expose internal stats to userspace
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 26 Sep 2014 20:37:36 +0000 (22:37 +0200)]
net: tcp: add DCTCP congestion control algorithm
This work adds the DataCenter TCP (DCTCP) congestion control
algorithm [1], which has been first published at SIGCOMM 2010 [2],
resp. follow-up analysis at SIGMETRICS 2011 [3] (and also, more
recently as an informational IETF draft available at [4]).
DCTCP is an enhancement to the TCP congestion control algorithm for
data center networks. Typical data center workloads are, e.g.,
i) partition/aggregate (queries; bursty, delay sensitive), ii) short
messages e.g. 50KB-1MB (for coordination and control state; delay
sensitive), and iii) large flows e.g. 1MB-100MB (data update;
throughput sensitive). DCTCP has therefore been designed for such
environments to provide/achieve the following three requirements:
* High burst tolerance (incast due to partition/aggregate)
* Low latency (short flows, queries)
* High throughput (continuous data updates, large file
transfers) with commodity, shallow buffered switches
The basic idea of its design consists of two fundamentals: i) on the
switch side, packets are being marked when its internal queue
length > threshold K (K is chosen so that a large enough headroom
for marked traffic is still available in the switch queue); ii) the
sender/host side maintains a moving average of the fraction of marked
packets, so each RTT, F is being updated as follows:
F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
alpha := (1 - g) * alpha + g * F, where g is a smoothing constant
The resulting alpha (iow: probability that switch queue is congested)
is then being used in order to adaptively decrease the congestion
window W:
W := (1 - (alpha / 2)) * W
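The two update rules above translate directly into floating-point
arithmetic. The sketch below is for illustration only: the function names
are invented, and the default g = 1/16 is an assumption of this example
rather than something the text states.

```python
def update_alpha(alpha: float, marked_acks: int, total_acks: int,
                 g: float = 1 / 16) -> float:
    """alpha := (1 - g) * alpha + g * F, with F = marked / total."""
    f = marked_acks / total_acks if total_acks else 0.0
    return (1 - g) * alpha + g * f

def reduce_cwnd(cwnd: float, alpha: float) -> float:
    """W := (1 - alpha / 2) * W."""
    return cwnd * (1 - alpha / 2)
```

With alpha = 1 (every ACK marked for a long time) the window is halved, as
with a classic ECN response. With alpha near 0 it is barely reduced.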
The means for receiving marked packets resp. marking them on switch
side in DCTCP is the use of ECN.
RFC3168 describes a mechanism for using Explicit Congestion Notification
from the switch for early detection of congestion, rather than waiting
for segment loss to occur.
However, this method only detects the presence of congestion, not
the *extent*. In the presence of mild congestion, it reduces the TCP
congestion window too aggressively and unnecessarily affects the
throughput of long flows [4].
DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
processing to estimate the fraction of bytes that encounter congestion,
rather than simply detecting that some congestion has occurred. DCTCP
then scales the TCP congestion window based on this estimate [4],
thus it can derive multibit feedback from the information present in
the single-bit sequence of marks in its control law. And thus act in
*proportion* to the extent of congestion, not its *presence*.
Switches therefore set the Congestion Experienced (CE) codepoint in
packets when internal queue lengths exceed threshold K. As a result,
DCTCP delivers the same or better throughput than normal TCP, while
using 90% less buffer space.
It was found in [2] that DCTCP enables the applications to handle 10x
the current background traffic, without impacting foreground traffic.
Moreover, a 10x increase in foreground traffic did not cause any
timeouts, and thus largely eliminates TCP incast collapse problems.
The algorithm itself has already seen deployments in large production
data centers since then.
We did a long-term stress-test and analysis in a data center, short
summary of our TCP incast tests with iperf compared to cubic:
This test measured DCTCP throughput and latency and compared it with
CUBIC throughput and latency for an incast scenario. In this test, 19
senders sent at maximum rate to a single receiver. The receiver simply
ran iperf -s.
The senders ran iperf -c <receiver> -t 30. All senders started
simultaneously (using local clocks synchronized by ntp).
This test was repeated multiple times. Below shows the results from a
single test. Other tests are similar. (DCTCP results were extremely
consistent, CUBIC results show some variance induced by the TCP timeouts
that CUBIC encountered.)
For this test, we report statistics on the number of TCP timeouts,
flow throughput, and traffic latency.
1) Timeouts (total over all flows, and per flow summaries):
         CUBIC     DCTCP
Total    3227      25
Mean     169.842   1.316
Median   183       1
Max      207       5
Min      123       0
Stddev   28.991    1.600
Timeout data is taken by measuring the net change in netstat -s
"other TCP timeouts" reported. As a result, the timeout measurements
above are not restricted to the test traffic, and we believe that it
is likely that all of the "DCTCP timeouts" are actually timeouts for
non-test traffic. We report them nevertheless. CUBIC will also include
some non-test timeouts, but they are dwarfed by bona fide test traffic
timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
TCP timeouts. DCTCP reduces timeouts by at least two orders of
magnitude and may well have eliminated them in this scenario.
2) Throughput (per flow in Mbps):
           CUBIC     DCTCP
Mean       521.684   521.895
Median     464       523
Max        776       527
Min        403       519
Stddev     105.891   2.601
Fairness   0.962     0.999
Throughput data was simply the average throughput for each flow
reported by iperf. By avoiding TCP timeouts, DCTCP is able to
achieve much better per-flow results. In CUBIC, many flows
experience TCP timeouts which makes flow throughput unpredictable and
unfair. DCTCP, on the other hand, provides very clean predictable
throughput without incurring TCP timeouts. Thus, the standard deviation
of CUBIC throughput is dramatically higher than the standard deviation
of DCTCP throughput.
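The text does not say how the "Fairness" row was computed. Assuming it
is Jain's fairness index, a common choice for per-flow throughput, it
could be derived from the iperf per-flow averages as in the sketch
below. The function name and the sample throughput lists are
hypothetical; they are not the measured per-flow values.

```python
def jain_fairness(throughputs):
    """Jain's index: (sum x)^2 / (n * sum x^2); 1.0 means equal shares."""
    n = len(throughputs)
    if n == 0:
        raise ValueError("need at least one flow")
    total = sum(throughputs)
    squares = sum(x * x for x in throughputs)
    if squares == 0:
        raise ValueError("all throughputs are zero")
    return (total * total) / (n * squares)


# Hypothetical per-flow throughputs in Mbps for 19 senders.
equal_shares = [520.0] * 19
uneven_shares = [776.0] * 9 + [403.0] * 10
print(jain_fairness(equal_shares))
print(jain_fairness(uneven_shares))
```

The index falls toward 1/n as a single flow takes all of the bandwidth,
so values close to 1 indicate that flows received nearly equal shares.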
Mean throughput is nearly identical because even though CUBIC flows
suffer TCP timeouts, other flows will step in and fill the unused
bandwidth. Note that this test is something of a best case scenario
for incast under CUBIC: it allows other flows to fill in for flows
experiencing a timeout. Under situations where the receiver is issuing
requests and then waiting for all flows to complete, flows cannot fill
in for timed out flows and throughput will drop dramatically.
3) Latency (in ms):
             CUBIC      DCTCP
  Mean       4.0088     0.04219
  Median     4.055      0.0395
  Max        4.2        0.085
  Min        3.32       0.028
  Stddev     0.1666     0.01064
Latency for each protocol was computed by running "ping -i 0.2
<receiver>" from a single sender to the receiver during the incast
test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure
that traffic traversed the DCTCP queue and was not dropped when the
queue size was greater than the marking threshold. The summary
statistics above are over all ping metrics measured between the single
sender, receiver pair.
The latency results for this test show a dramatic difference between
CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer,
which incurs the maximum queue latency (more buffer memory leads
to higher latency). DCTCP, on the other hand, deliberately attempts to
keep queue occupancy low. The result is a two orders of magnitude
reduction of latency with DCTCP - even with a switch with relatively
little RAM. Switches with larger amounts of RAM will incur increasing
amounts of latency for CUBIC, but not for DCTCP.
4) Convergence and stability test:
This test measured the time that DCTCP took to fairly redistribute
bandwidth when a new flow commences. It also measured DCTCP's ability
to remain stable at a fair bandwidth distribution. DCTCP is compared
with CUBIC for this test.
At the commencement of this test, a single flow is sending at maximum
rate (near 10 Gbps) to a single receiver. One second after that first
flow commences, a new flow from a distinct server begins sending to
the same receiver as the first flow. After the second flow has sent
data for 10 seconds, the second flow is terminated. The first flow
sends for an additional second. Ideally, the bandwidth would be evenly
shared as soon as the second flow starts, and recover as soon as it
stops.
The results of this test are shown below. Note that the flow bandwidth
for the two flows was measured near the same time, but not
simultaneously.
DCTCP performs nearly perfectly within the measurement limitations
of this test: bandwidth is quickly distributed fairly between the two
flows, remains stable throughout the duration of the test, and
recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
fairly, and has trouble remaining stable.
This test demonstrates the importance of ECT on SYN and SYN-ACK packets
by measuring the connection probability in the presence of competing
flows for a DCTCP connection attempt *without* ECT in the SYN packet.
The test was repeated five times for each number of competing flows.
  Competing Flows                  1 |    2 |    4 |    8 |   16
                                 ------------------------------
  Mean Connection Probability      1 | 0.67 | 0.45 | 0.28 |    0
  Median Connection Probability    1 | 0.65 | 0.45 | 0.25 |    0
As the number of competing flows moves beyond 1, the connection
probability drops rapidly.
Enabling DCTCP with this patch requires the following steps:
DCTCP must be running both on the sender and receiver side in your
data center, i.e.:
sysctl -w net.ipv4.tcp_congestion_control=dctcp
Also, ECN functionality must be enabled on all switches in your
data center for DCTCP to work. The default ECN marking threshold (K)
heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at
1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).
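To make the heuristic K > 1/7 * C * RTT concrete, the sketch below
evaluates the lower bound for a given link speed and round-trip time.
The 100 microsecond RTT is an assumption chosen only for illustration
(the text does not state one), the function names are hypothetical, and
the 1500-byte packet size is inferred from "20 packets (30KB)".

```python
def min_marking_threshold_bytes(link_bps: float, rtt_s: float) -> float:
    """Lower bound on K in bytes, from K > C * RTT / 7."""
    bandwidth_delay_bytes = link_bps * rtt_s / 8.0
    return bandwidth_delay_bytes / 7.0


def min_marking_threshold_packets(link_bps: float, rtt_s: float,
                                  packet_bytes: int = 1500) -> float:
    """Same bound expressed in packets of packet_bytes each."""
    return min_marking_threshold_bytes(link_bps, rtt_s) / packet_bytes


ASSUMED_RTT_S = 100e-6  # illustrative value, not taken from the tests above

for gbps in (1, 10):
    bound = min_marking_threshold_packets(gbps * 1e9, ASSUMED_RTT_S)
    print(f"{gbps} Gbps: K must exceed about {bound:.1f} packets")
```

The formula only gives a lower bound that scales with the RTT; the
thresholds quoted above are recommended operating points, not the bound
itself.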
In above tests, for each switch port, traffic was segregated into two
queues. For any packet with a DSCP of 0x01 - or equivalently a TOS of
0x04 - the packet was placed into the DCTCP queue. All other packets
were placed into the default drop-tail queue. For the DCTCP queue,
RED/ECN marking was enabled, here, with a marking threshold of 75 KB.
For more details, we refer you to section 3 of the paper [2].
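The relationship between the DSCP and TOS values used above follows
from the layout of the IP TOS byte: the upper six bits carry the DSCP
and the lower two bits carry the ECN field. The sketch below shows why
a DSCP of 0x01 equals a TOS of 0x04, and why the "ping -Q 0x6" used in
the latency test selects the DCTCP queue with ECT(0) set. The helper
names are hypothetical.

```python
# ECN codepoints (lower two bits of the TOS / traffic class byte).
ECN_NOT_ECT = 0b00
ECN_ECT1 = 0b01
ECN_ECT0 = 0b10
ECN_CE = 0b11


def make_tos(dscp: int, ecn: int) -> int:
    """Combine a 6-bit DSCP and a 2-bit ECN codepoint into a TOS byte."""
    if not 0 <= dscp < 64:
        raise ValueError("DSCP must fit in 6 bits")
    if not 0 <= ecn < 4:
        raise ValueError("ECN must fit in 2 bits")
    return (dscp << 2) | ecn


def split_tos(tos: int) -> tuple:
    """Return (dscp, ecn) for a TOS byte."""
    return tos >> 2, tos & 0b11


print(hex(make_tos(0x01, ECN_NOT_ECT)))  # DSCP 1 without ECN bits
print(split_tos(0x06))                   # the value passed to ping -Q
```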
There are no code changes required to applications running in user
space. DCTCP has been implemented in full *isolation* of the rest of
the TCP code as its own congestion control module, so that it can run
without a need to expose code to the core of the TCP stack, and thus
nothing changes for non-DCTCP users.
Changes in the CA framework code are minimal, and the DCTCP algorithm
operates on mechanisms that are already available in most silicon.
The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
the paper, but we leave the option that it can be chosen carefully
to a different value by the user.
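As a rough illustration of what the gain controls: in the DCTCP paper,
the sender keeps a running estimate alpha of the fraction of marked
data, updates it once per window as alpha = (1 - g) * alpha + g * F,
and reduces cwnd by a factor of alpha / 2 on congestion. The sketch
below uses floating point for readability, with shift_g = 4
corresponding to g = 1/16. The function names are hypothetical and this
is not the module's code, which may use a different (fixed-point)
representation.

```python
def update_alpha(alpha: float, marked_fraction: float, shift_g: int = 4) -> float:
    """One EWMA step: alpha = (1 - g) * alpha + g * F, with g = 2**-shift_g."""
    g = 1.0 / (1 << shift_g)
    return (1.0 - g) * alpha + g * marked_fraction


def reduced_cwnd(cwnd: int, alpha: float) -> int:
    """Window after a congestion reaction: cwnd * (1 - alpha / 2)."""
    return int(cwnd * (1.0 - alpha / 2.0))


# Hypothetical run: every window is fully marked (F = 1.0).
alpha = 0.0
for window in range(1, 6):
    alpha = update_alpha(alpha, 1.0)
    print(window, round(alpha, 4), reduced_cwnd(100, alpha))
```

A smaller gain makes alpha react more slowly to changes in the marking
rate; with alpha near 1 the reduction approaches the usual halving.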
In case DCTCP is being used and ECN support on the peer side is off,
DCTCP falls back after 3WHS to operate in normal TCP Reno mode.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
net: tcp: more detailed ACK events and events for CE marked packets
DataCenter TCP (DCTCP) determines cwnd growth based on ECN information
and ACK properties, e.g. an ACK that updates the window is treated
differently than a DUPACK.
DCTCP also needs to know whether an ACK was a delayed ACK. Furthermore,
DCTCP implements a CE state machine that keeps track of CE markings
of incoming packets.
Therefore, extend the congestion control framework to provide these
event types, so that DCTCP can be properly implemented as a normal
congestion algorithm module outside of the core stack.
Joint work with Daniel Borkmann and Glenn Judd.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
net: tcp: split ack slow/fast events from cwnd_event
The congestion control ops "cwnd_event" currently supports
CA_EVENT_FAST_ACK and CA_EVENT_SLOW_ACK events (among others).
Both FAST and SLOW_ACK are only used by Westwood congestion
control algorithm.
This removes both flags from cwnd_event and adds a new
in_ack_event callback for this. The goal is to be able to
provide more detailed information about ACKs, such as whether
ECE flag was set, or whether the ACK resulted in a window
update.
It is required for DataCenter TCP (DCTCP) congestion control
algorithm as it makes a different choice depending on ECE being
set or not.
Joint work with Daniel Borkmann and Glenn Judd.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 26 Sep 2014 20:37:33 +0000 (22:37 +0200)]
net: tcp: add flag for ca to indicate that ECN is required
This patch adds a flag to TCP congestion algorithms that allows
for requesting to mark IPv4/IPv6 sockets with transport as ECN
capable, that is, ECT(0), when required by a congestion algorithm.
It is currently used and needed in DataCenter TCP (DCTCP), as it
requires both peers to assert ECT on all IP packets sent - it
uses ECN feedback (i.e. CE, Congestion Encountered information)
from switches inside the data center to derive feedback to the
end hosts.
Therefore, simply add a new flag to icsk_ca_ops. Note that DCTCP's
algorithm/behaviour slightly diverges from RFC3168, therefore this
is only (!) enabled iff the assigned congestion control ops module
has requested this. By that, we can tightly couple this logic really
only to the provided congestion control ops.
Joint work with Florian Westphal and Glenn Judd.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
net: tcp: assign tcp cong_ops when tcp sk is created
Split assignment and initialization from one into two functions.
This is required by followup patches that add Datacenter TCP
(DCTCP) congestion control algorithm - we need to be able to
determine if the connection is moderated by DCTCP before the
3WHS has finished.
As we walk the available congestion control list during the
assignment, we are always guaranteed to have Reno present as
it's fixed compiled-in. Therefore, since we're doing the
early assignment, we don't have a real use for the Reno alias
tcp_init_congestion_ops anymore and can thus remove it.
Actual usage of the congestion control operations happens
after the 3WHS has finished. In some cases, however,
get_info() can be accessed via diag if implemented, so we
need to zero out the private area for those modules.
Joint work with Daniel Borkmann and Glenn Judd.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
The TX completion was running from another cpu, with high interrupts
rate.
Note that I am using barrier() as a soft hint, as mb() here could be
too costly.
[1] This was a netperf TCP_STREAM with TSO disabled, but GSO enabled.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 25 Sep 2014 19:06:05 +0000 (12:06 -0700)]
net_sched: fix another regression in cls_tcindex
Clearly the following change is not expected:
- if (!cp.perfect && !cp.h)
- cp.alloc_hash = cp.hash;
+ if (!cp->perfect && cp->h)
+ cp->alloc_hash = cp->hash;
Fixes: commit 331b72922c5f58d48fd ("net: sched: RCU cls_tcindex") Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 25 Sep 2014 19:06:04 +0000 (12:06 -0700)]
net_sched: fix errno in tcindex_set_parms()
When kmemdup() fails, we should return -ENOMEM.
Cc: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 28 Sep 2014 21:32:16 +0000 (17:32 -0400)]
Merge branch 'cxgb4-next'
Hariprasad Shenai says:
====================
cxgb4: Use new BAR2 GTS for T5, adds adaptive rx and few Device ID's
This patch series adds support to use new BAR2 GTS for T5 adapter.
Adds support for adaptive rx. Remove redundant variable from a macro of
cxgb4vf driver. Adds Device ID for new adapters.
This patch series is created against the 'net-next' tree
and includes patches for the cxgb4 and cxgb4vf drivers.
We have included all the maintainers of respective drivers. Kindly review the
change and let us know in case of any review comments.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Thu, 25 Sep 2014 17:26:37 +0000 (10:26 -0700)]
net_sched: remove the first parameter from tcf_exts_destroy()
Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jamal Hadi Salim <hadi@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 28 Sep 2014 21:22:21 +0000 (17:22 -0400)]
Merge branch 'defxx-next'
Maciej W. Rozycki says:
====================
defxx: DEFEA fixes and updates
I have finally got my hands on an EISA variation of the board (DEC
FDDIcontroller/EISA aka DEFEA) and was able to do some testing. Here are
initial updates to the driver that address problems I encountered so far.
More to come later on as I get back to the system that I have in a remote
location -- I need to double-check MMIO support and see what might have
been causing spurious interrupts I saw with the 8259A PIC the board's
interrupt line has been routed to.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Use the slot-specific I/O range for decoding accesses to PDQ ASIC
registers (IOCS0) and the discrete Burst Holdoff register (IOCS1) as per
the "HD64981F EISA Slave Interface Controller (ESIC)" datasheet. Use
disjoint decode ranges now that the assignment of chip selects is known.
Update the span of the port I/O resource requested accordingly.
Signed-off-by: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net>
1) Remove useless hash_resize_mutex in xfrm_hash_resize().
This mutex is used only there, but xfrm_hash_resize()
can't be called concurrently at all. From Ying Xue.
2) Extend policy hashing to prefixed policies based on
prefix length thresholds. From Christophe Gouault.
3) Make the policy hash table thresholds configurable
via netlink. From Christophe Gouault.
4) Remove the maximum authentication length for AH.
This was needed to limit stack usage. We switched
already to allocate space, so no need to keep the
limit. From Herbert Xu.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 28 Sep 2014 21:14:15 +0000 (17:14 -0400)]
Merge branch 'dsa_eee'
Florian Fainelli says:
====================
net: dsa: EEE and other PM features
This patch set allows DSA switch drivers to enable/disable/query EEE on a
per-port level, as well as control precisely which switch ports are
enabled/disabled.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: bcm_sf2: add support for controlling EEE
When EEE is enabled, negotiate this feature with the PHY and make sure
that the capability checking, local EEE advertisement, link partner EEE
advertisement and auto-negotiation resolution returned by phy_init_eee()
is positive, and enable EEE at the switch level.
While querying the current EEE settings, verify the low-power indication
and indicate its status.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: allow switches driver to implement get/set EEE
Allow switch drivers to query and enable/disable EEE on a per-port
basis by implementing the ethtool_{get,set}_eee settings and delegating
these operations to the switch driver.
set_eee() will need to coordinate with the PHY driver to make sure that
EEE is enabled, the link-partner supports it and the auto-negotiation
result is satisfactory.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The SF2 switch driver is already architected around per-port
enable/disable callbacks, so we just need a slight update to our
existing bcm_sf2_port_setup() resp. bcm_sf2_port_disable() functions to
be suitable as callbacks for port_enable/port_disable.
We need to shuffle the code that does the per-port VLAN
configuration/isolation a little, since ports can now be brought up/down
separately, so we need to make sure that the IMP (CPU, management) port is
always included in that specific port setup.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: bcm_sf2: disable RGMII interface(s) when link is down
When the link is down, disable the RGMII interface to conserve as much
power as possible. We re-enable the RGMII interface whenever the link is
detected.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Whenever a per-port network device is used/unused, invoke the switch
driver port_enable/port_disable callbacks to allow saving as much power
as possible by disabling unused parts of the switch (RX/TX logic, memory
arrays, PHYs...). We supply a PHY device argument to make sure the
switch driver can act on the PHY device if needed (like putting/taking
the PHY out of deep low power mode).
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
dsa_slave_open() should start the PHY library state machine for its PHY
interface, and dsa_slave_close() should stop the PHY library state
machine accordingly.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Peter Pan(潘卫平) [Wed, 24 Sep 2014 14:17:02 +0000 (22:17 +0800)]
tcp: use tcp_flags in tcp_data_queue()
This patch is a cleanup which follows the idea in commit e11ecddf5128 (tcp: use
TCP_SKB_CB(skb)->tcp_flags in input path),
and it may reduce register pressure since skb->cb[] access is fast,
because skb is probably in a register.
v2: remove variable th
v3: reword the changelog
Signed-off-by: Weiping Pan <panweiping3@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 24 Sep 2014 11:11:22 +0000 (04:11 -0700)]
tcp: change tcp_skb_pcount() location
Our goal is to access no more than one cache line per skb in
a write or receive queue when doing the various walks.
After recent TCP_SKB_CB() reorganizations, it is almost done.
Last part is tcp_skb_pcount() which currently uses
skb_shinfo(skb)->gso_segs, which is a terrible choice, because it needs
3 cache lines in current kernel (skb->head, skb->end, and
shinfo->gso_segs are all in 3 different cache lines, far from skb->cb)
This very simple patch reuses space currently taken by tcp_tw_isn
only in input path, as tcp_skb_pcount is only needed for skb stored in
write queue.
This considerably speeds up tcp_ack(), granted we avoid shinfo->tx_flags
to get SKBTX_ACK_TSTAMP, which seems possible.
This also speeds up all sack processing in general.
This speeds up tcp_sendmsg() because it no longer has to access/dirty
shinfo.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
TCP had the assumption that IPCB and IP6CB are first members of skb->cb[]
This is fine, except that IPCB/IP6CB are used in TCP for a very short time
in input path.
What really matters for TCP stack is to get skb->next,
TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq in the same cache line.
skbs that are immediately consumed do not care, because the whole skb->cb[] is
hot in the CPU cache, while skbs that sit in socket write or receive queues
do not need TCP_SKB_CB(skb)->header at all.
This patch set implements the prereq for IPv4, IPv6, and TCP to make this
possible. This makes TCP more efficient.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 27 Sep 2014 16:50:57 +0000 (09:50 -0700)]
tcp: better TCP_SKB_CB layout to reduce cache line misses
TCP maintains lists of skb in write queue, and in receive queues
(in order and out of order queues)
Scanning these lists both in input and output path usually requires
access to skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq
These fields are currently in two different cache lines, meaning we
waste a lot of memory bandwidth when these queues are big and flows
have either packet drops or packet reorders.
We can move TCP_SKB_CB(skb)->header at the end of TCP_SKB_CB, because
this header is not used in fast path. This allows TCP to search much faster
in the skb lists.
Even with regular flows, we save one cache line miss in fast path.
Thanks to Christoph Paasch for noticing we need to cleanup
skb->cb[] (IPCB/IP6CB) before entering IP stack in tx path,
and that I forgot IPCB use in tcp_v4_hnd_req() and tcp_v4_save_options().
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 27 Sep 2014 16:50:55 +0000 (09:50 -0700)]
ipv4: rename ip_options_echo to __ip_options_echo()
ip_options_echo() assumes struct ip_options is provided in &IPCB(skb)->opt
Let's break this assumption, but provide a helper to not change all call points.
ip_send_unicast_reply() gets a new struct ip_options pointer.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
bcmgenet_wol_resume() is only used in bcmgenet_resume(), which is only
defined when CONFIG_PM_SLEEP is enabled. This leads to the following
compile warning when building with !CONFIG_PM_SLEEP:
drivers/net/ethernet/broadcom/genet/bcmgenet.c:1967:12: warning: ‘bcmgenet_wol_resume’ defined but not used [-Wunused-function]
Since bcmgenet_resume() is the only user of bcmgenet_wol_resume(), fix
this by directly inlining the function there.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 26 Sep 2014 20:23:12 +0000 (16:23 -0400)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2014-09-23
This patch series adds support for the FM10000 Ethernet switch host
interface. The Intel FM10000 Ethernet Switch is a 48-port Ethernet switch
supporting both Ethernet ports and PCI Express host interfaces. The fm10k
driver provides support for the host interface portion of the switch, both
PF and VF.
As the host interfaces are directly connected to the switch this results in
some significant differences versus a standard network driver. For example
there is no PHY or MII on the device. Since packets are delivered directly
from the switch to the host interface these are unnecessary. Otherwise most
of the functionality is very similar to our other network drivers such as
ixgbe or igb. For example we support all the standard network offloads,
jumbo frames, SR-IOV (64 VFS), PTP, and some VXLAN and NVGRE offloads.
v2: converted dev_consume_skb_any() to dev_kfree_skb_any()
fix up PTP code based on feedback from the community
v3: converted the use of smp_mb__before_clear_bit() to smp_mb__before_atomic()
added vmalloc header to patch 15
added prefetch header to patch 16
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
csum_partial() is a generic function which is not optimised for small fixed
length calculations, and its use requires storing "from" and "to" values in
memory while we already have them available in registers. This also has impact,
especially on RISC processors. In the same spirit as the change done by
Eric Dumazet on csum_replace2(), this patch rewrites inet_proto_csum_replace4()
taking into account RFC1624.
I spotted during a NATted TCP transfer that csum_partial() is one of the top 5
consuming functions (around 8%), and the second user of csum_partial() is
inet_proto_csum_replace4().
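The RFC 1624 approach referred to above updates a 16-bit one's
complement checksum incrementally as HC' = ~(~HC + ~m + m') instead of
summing the data again. The sketch below applies that equation to a
32-bit field and compares the result with a full recomputation. It only
illustrates the arithmetic and is not the kernel implementation; the
function names and the sample header bytes are hypothetical.

```python
def fold16(total: int) -> int:
    """Fold carries back in until the sum fits in 16 bits."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def full_checksum(data: bytes) -> int:
    """One's complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(int.from_bytes(data[i:i + 2], "big")
                for i in range(0, len(data), 2))
    return ~fold16(total) & 0xFFFF


def replace4(check: int, old: int, new: int) -> int:
    """RFC 1624 eqn. 3 for a 32-bit field: HC' = ~(~HC + ~m + m')."""
    total = ~check & 0xFFFF
    for word in (old >> 16, old & 0xFFFF):
        total += ~word & 0xFFFF   # subtract the old 16-bit halves
    for word in (new >> 16, new & 0xFFFF):
        total += word             # add the new 16-bit halves
    return ~fold16(total) & 0xFFFF


# Hypothetical 12-byte block with a 32-bit address in the middle.
old_addr, new_addr = 0xC0A80001, 0x0A000001
prefix, suffix = bytes.fromhex("45000054"), bytes.fromhex("40010000")
before = prefix + old_addr.to_bytes(4, "big") + suffix
after = prefix + new_addr.to_bytes(4, "big") + suffix

check = full_checksum(before)
print(hex(replace4(check, old_addr, new_addr)), hex(full_checksum(after)))
```

The incremental update touches only the old and new values, which is
why it avoids a generic call over memory for a small fixed-length
change.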
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
csum_partial() is a generic function which is not optimised for small fixed
length calculations, and its use requires storing "from" and "to" values in
memory while we already have them available in registers. This also has impact,
especially on RISC processors. In the same spirit as the change done by
Eric Dumazet on csum_replace2(), this patch rewrites inet_proto_csum_replace4()
taking into account RFC1624.
I spotted during a NATted TCP transfer that csum_partial() is one of the top 5
consuming functions (around 8%), and the second user of csum_partial() is
inet_proto_csum_replace4().
I have proposed the same modification to inet_proto_csum_replace4() in another
patch.
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 26 Sep 2014 20:05:25 +0000 (16:05 -0400)]
Merge branch 'fec'
Fugang Duan says:
====================
net: fec: Code cleanup
This patch series does several things:
- Fixing multiqueue issue.
- Removing the unnecessary errata workaround.
- Aligning the data buffer dma map/unmap size.
- Freeing resource after probe failed.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>