git.karo-electronics.de Git - linux-beck.git/log

switchdev: use gpl variant of symbol export

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tipc: sparse: fix htons conversion warnings

Commit d0f91938bede ("tipc: add ip/udp media type") introduced
some new sparse warnings. Clean them up.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net_sched: destroy proto tp when all filters are gone

Kernel automatically creates a tp for each
(kind, protocol, priority) tuple, which has handle 0,
when we add a new filter, but it still is left there
after we remove our own, unless we don't specify the
handle (literally means all the filters under
the tuple). For example this one is left:

  # tc filter show dev eth0
  filter parent 8001: protocol arp pref 49152 basic

The user-space is hard to clean up these for kernel
because filters like u32 are organized in a complex way.
So kernel is responsible to remove it after all filters
are gone.  Each type of filter has its own way to
store the filters, so each type has to provide its
way to check if all filters are gone.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim<jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/macb: Update DT bindings documentation

Add missing "cdns,at91sam9260-macb", "atmel,sama5d3-gem" and
"atmel,sama5d4-gem" compatible strings.

Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Reviewed-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethernet: codespell comment spelling fixes

To test a checkpatch spelling patch, I ran codespell against
drivers/net/ethernet/.

$ git ls-files drivers/net/ethernet/ | \
  while read file ; do \
    codespell -w $file; \
  done

I removed a false positive in e1000_hw.h

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: make reset control an optional requirement

Not having a reset control line to the ethernet controller should not be a
hard failure. Instead, add support for deferred probing and just print out
a debug statement.

Signed-off-by: Dinh Nguyen <dinguyen@opensource.altera.com>
Cc: Vince Bridgers <vbridger@opensource.altera.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mpls-next'

Eric W. Biederman says:

====================
mpls: Minor fixes and cleanups

This is a bunch of small changes that have come out of the discussions
of the mpls code and the automated tests that people run against things.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

neigh: Use neigh table index for neigh_packet_xmit

Remove a little bit of unnecessary work when transmitting a packet with
neigh_packet_xmit. Use the neighbour table index not the address family
as a parameter.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mpls: Fix the openvswitch select of NET_MPLS_GSO

Fix the OPENVSWITCH Kconfig option and old Kconfigs by having
OPENVSWITCH select both NET_MPLS_GSO and MPLSO.

A Kbuild test robot reported that when NET_MPLS_GSO is selected by
OPENVSWITCH the generated .config is broken because MPLS is not
selected.

Cc: Simon Horman <horms@verge.net.au>
Fixes: cec9166ca4e mpls: Refactor how the mpls module is built
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

mpls: Correct the ttl decrement.

According to RFC3032 section 2.4.2 packets with an outgoing
ttl of 0 MUST NOT be forwarded. According to section 2.4.1
an outgoing TTL of 0 comes from an incomming TTL <= 1.

Therefore any packets that is received with a ttl <= 1 should
not have it's ttl decremented and forwarded.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mpls: Better error code for unsupported option.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mpls: Cleanup the rcu usage in the code.

Sparse was generating a lot of warnings mostly from missing annotations
in the code. Add missing annotations and in a few cases tweak the code
for performance by moving work before loops.

This also fixes a problematic ommision of rcu_assign_pointer and
rcu_dereference.

Hopefully with complete rcu annotations any new rcu errors will stick
out like a sore thumb.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mpls: Fix the kzalloc argument order in mpls_rt_alloc

*Blink* I got the argument order wrong to kzalloc and the
code was working properly when tested. *Blink*

Fix that.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

neterion: remove reference to ifconfig

Remove reference to obsolete ifconfig command.
MTU can be changed with ip command instead.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2015-03-07

This series contains updates to i40e and i40evf only.

Most notably, Greg provides the patch to remove the dreaded configfs
changes in the driver.

Shannon cleans up a sparse warning by simply straighting out the code
so it is less convoluted.  Fixes an issue where the vector allocation
was trying too hard to save vectors for VMDq, to the point of not giving
the PF enough when in a tight situation, such as an NPAR partition.
Changed the driver to make sure that the PF will get all the queues and
vectors it wants to fill out its destiny.  Cleans up reporting to only
print the port and VEB stats if it is the first partition of a
multiplexed port.

Catherine cleans up some duplicated code by simply removing the duplicate
code.

Kamil cleans up the driver by removing an un-needed endian conversion
because it is already done by a register read function.

Jesse fixes a variable width of a datatype, where a u16 should have been
a u32.  Also cleans up debug_read_register() to resolve some sparse
warnings.  Updates the driver to use prefetch() to get the next Tx
descriptor, like in ixgbe, to improve performance.

Akeem moves around code to enable/disable loopback so that other non-SRIOV
supported driver functions can take advantage of the changes.

Anjali cleans up the logging for adding/deleting FD-SB filters, since
ethtool shows all the filters on an interface.  Updates the driver to
use l4_tunnel type generically to keep code flow simple.  Simplifies
the RSS code since the driver initializes the rss_size_max in sw_init.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6352: Add support for EEE

Enable EEE support for MV88E6352.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: Add EEE support

EEE configuration is similar for the various MV88E6xxx chips.
Add generic support for it.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bcmgenet: rework Rx queue init

In preparation for supporting multiple Rx queues:
1. Move the initialization of priv->num_rx_bds, priv->rx_bds, and
   priv->rx_cbs from bcmgenet_init_rx_ring() to bcmgenet_init_dma()
   since they are not specific to a single Rx queue. Mimics the Tx
   init model where priv->num_tx_bds, priv->tx_bds, and priv->tx_cbs
   are initialized in bcmgenet_init_dma().
2. Program DMA_MBUF_DONE_THRESH = 1 so that future Rx queues Q0-Q15
   will get per-packet Rx interrupt.
3. Group DMA_START_ADDR, RDMA_READ_PTR, RDMA_WRITE_PTR, and DMA_END_ADDR
   initialization together. Mimics the Tx init model.
4. There is 1-to-1 mapping between RxCBs and RxBDs.
   Precalculate RxCB->bd_addr so that it can be used in the future.

Signed-off-by: Petri Gynther <pgynther@google.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'macb-next'

Merge branch 'macb-next'

Boris Brezillon says:

====================
net/macb: merge at91_ether driver into macb driver

The rm9200 boards use the dedicated at91_ether driver instead of the
regular macb driver.

Both the macb and at91_ether drivers can be compiled as separated
modules.
Since the at91_ether driver uses code from the macb driver, at91_ether.ko
depends on macb.ko.

However the macb.ko module always fails to load on rm9200 boards: the
macb_probe() function expects a hclk clock which doesn't exist on rm9200.
Then the at91_ether.ko can't be loaded in turn due to unresolved
dependencies.

This series of patches fix this issue by merging at91_ether into macb.

Patch 1 is fixing a problem that might happen when enabling ARM
multi-platform suppot.

Changes since v3:
- move "net: macb: remove #if defined(CONFIG_ARCH_AT91) sections" patch
into this series to avoid dependency on other patch series.

Changes since v2:
- rebase after changed brought by commit "net: macb: remove #if
defined(CONFIG_ARCH_AT91) sections"

Changes since v1:
- rework probe functions to share common probing logic
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/macb: merge at91_ether driver into macb driver

macb and at91_ether drivers can be compiled as modules, but the at91_ether
driver use some functions and variables defined in the macb one, thus
creating a dependency on the macb driver.

Since these drivers are sharing the same logic we can easily merge
at91_ether into macb.

In order to factorize common probing logic we've added an ->init() function
to struct macb_config (the structure associated with the compatible
string), and moved macb specific init code from macb_probe to macb_init.

Signed-off-by: Cyrille Pitchen <cyrille.pitchen@atmel.com>
Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Tested-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/macb: unify clock management

Most of the functions from the Common Clk Framework handle NULL pointer as
input argument.

Since the TX clock is optional, we now set tx_clk to NULL value
instead of ERR_PTR(-ENOENT) when this clock is not available. This simplifies
the clock management and avoid the need to test tx_clk value.

Signed-off-by: Cyrille Pitchen <cyrille.pitchen@atmel.com>
Acked-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: macb: remove #if defined(CONFIG_ARCH_AT91) sections

With multi platform support those sections could lead to unexpected
behavior if both ARCH_AT91 and another ARM SoC using the MACB IP are
selected.
Add two new capabilities to encode the default MII mode and the presence
of a CLKEN bit in USRIO register.
Then define the appropriate config for IPs embedded in at91 SoCs.

Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Reviewed-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ARM: at91/dt: fix macb compatible strings

Some at91 SoCs embed a 10/100 Mbit Ethernet IP, that is based on the
at91sam9260 SoC.
Fix at91 DTs accordingly.

Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
Reviewed-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

i40e: Strip configfs code

The use of configfs is not allowed in network drivers. Strip the code that
uses it.

Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Jim Young <james.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf: Bump version

Bump i40e to 1.2.12 and i40evf to 1.2.6.

Change-ID: I641871da3a9abd396b28eda5744a4d68493c1400
Signed-off-by: Sravanthi Tangeda <sravanthi.tangeda@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf: enable prefetch of Tx descriptors during cleanup

Performance can be improved a bit by imitating ixgbe and using
prefetch to get us the next Tx descriptor.

Change-ID: Ice7ffd4cd0ce87c35295059bdb7972a7f53723aa
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Simplify code for rss_size_max config

We initialize the pf->rss_size_max in sw_init now
and hence this code can be simplified.

Change-ID: I1a7abc837604a40bc65e6c6b21190b909ed6bb21
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf: Simplify tunnel selection logic

Use l4_tunnel type generically to keep code flow simple.

Change-ID: Ic52287e3b1ca4204e6b6e13431890c1a6ae9c422
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: FD filters flush policy changes

Since GLQF_FDCNT_0 register now has the right offset, use it to simplify our
FD flush flow.
If the filter add error happens to be for SB we just auto disable SB.

If filter error happens to be for ATR, auto disable ATR and mark
the state to FD_FLUSH_REQUESTED. Which gets cleared when flush completes.

If we are entering flush too quickly (< 30 seconds) and we have quite
a few SB rules, its time to disable ATR for good. Since SB + ATR rules
is most likely making the FD table unstable.

ATR can be re-enabled by turning ntuple off (ethtool -K ntuple off)
and will remain off after turning ntuple on till it gets unstable again.

Change-ID: I2154a2e0a5d44851a2f0eb8731e2f1d4a4d1acbc
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Avoid logs while adding/deleting FD-SB filters

It is not necessary to print FD filter add/delete log with
normal debug settings because ethtool -n ethx shows all the FD-SB
filters on an interface. The log can still be turned on through higher
debug levels and it will continue to print a log if there was an error
in the add/delete process.

Change-ID: I67db2baf49e2075d2f537de40f7895e5b02cd610
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: print port stats only on partition 1

Only print the port and veb stats if this is the first partition
of a multiplexed port.

Change-ID: I7ce0c323cdee5cfd2e54d8bea5b0b9102987e671
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Move code to enable/disable Loopback to the main file

Since changes made to enable or disable loopback for all VSIs, not only SR-IOV
or PCIOV, then it became necessary to move the associated functions to main
file - so that other non-SRIOV supported driver can take advantage of the
changes.

Change-ID: I59a49fd23a6136acda5e16f8d1e5ac7fd9c5fc05
Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: rework vector reservation

The initial problem solved here is that the vector allocation was trying
too hard to save vectors for VMDq, to the point of not giving the PF enough
when in a tight situation such as an NPAR partition. This change makes
sure that the PF will get all the queues and vectors it wants to fill
out its destiny. Essentially, nothing is specially reserved for VMDq,
it simply gets whatever is left after the PF, FCoE, and FD sideband get
what they want.

Additionally, the calculations for the reservations were harder to follow
than necessary, so I've made it more straight forward.

Change-ID: I99b384f104535b686c690b8ef0a787559485c8d4
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: clean up debug_read_register

There were some additional spaces and strange (double swapping) logic
in this function that I started looking at because sparse was warning.

This fixes the sparse warning and fixes up the other issues.

Change-ID: I72a91a4197cd45921602649040e6bd25e5f17c0a
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: store msg_enable in the right size

The kernel returns a u32 for netif_msg_init, and we were storing
it in a u16. Fix the width of the datatype.

Change-ID: I4b23326e5707c91cd59325c5a1ccb2ba7a3974fc
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Remove unneeded conversion

Remove LE16 to CPU endianes conversion from i40e_read_nvm_word_srctl
function, as it's already done by register read function.

Change-ID: I739f0f20a9b8e18223e54c0ca5443e63d75da878
Signed-off-by: Kamil Krawczyk <kamil.krawczyk@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: Remove duplicate code

This series of code was repeated twice, remove one of them.

Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf: Refactor i40e_debug_aq and make some functions static

A sparse complaint in i40e_debug_aq in a funky buffer write goes away by
straightening out the code out to something less convoluted.

Also fix some other sparse warnings while we are at it, making some
functions static and using NULL instead of 0.

Change-ID: I93907534fe1f1f675830774b3d14ecf1c6ffc9a0
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

rocker: sparse: fix dynamic allocation on stack warning

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rocker: quiet sparce endianess warnings

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Reviewed-by: Jonathan Toppins <jtoppins@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ebpf: bpf_map_*: fix linker error on avr32 and openrisc arch

Fengguang reported, that on openrisc and avr32 architectures, we
get the following linker errors on *_defconfig builds that have
no bpf syscall support:

  net/built-in.o:(.rodata+0x1cd0): undefined reference to `bpf_map_lookup_elem_proto'
  net/built-in.o:(.rodata+0x1cd4): undefined reference to `bpf_map_update_elem_proto'
  net/built-in.o:(.rodata+0x1cd8): undefined reference to `bpf_map_delete_elem_proto'

Fix it up by providing built-in weak definitions of the symbols,
so they can be overridden when the syscall is enabled. I think
the issue might be that gcc is not able to optimize all that away.
This patch fixes the linker errors for me, tested with Fengguang's
make.cross [1] script.

  [1] https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Fixes: d4052c4aea0c ("ebpf: remove CONFIG_BPF_SYSCALL ifdefs in socket filter code")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: gro: remove obsolete code from skb_gro_receive()

Some drivers use copybreak to copy tiny frames into smaller skb,
and this smaller skb might not have skb->head_frag set for various
reasons.

skb_gro_receive() currently doesn't allow to aggregate the smaller skb
into the previous GRO packet if this GRO packet has at least 2 MSS in
it.

Following workload easily demonstrates the problem.

netperf -t TCP_RR -H target -- -r 3000,3000

(tcpdump shows one GRO packet with 2 MSS, plus one additional packet of
104 bytes that should have been appended.)

It turns out that we can remove code from skb_gro_receive(), because
commit 8a29111c7ca6 ("net: gro: allow to build full sized skb") and its
followups removed the assumption that a GRO packet with a frag_list had
to have an empty head.

Removing this code allows the aggregation of the last (incomplete) frame
in some RPC workloads. Note that tcp_gro_receive() already takes care of
forcing a flush if necessary, including this case.

If we want to avoid using frag_list in the first place (in forwarding
workloads for example, as the outgoing NIC is generally not able to cope
with skbs having a frag_list), we need to address this separately.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

ax25: remove unneeded NULL test in ax_xmit()

We get a static checker warning here on devel kernels:

drivers/net/hamradio/mkiss.c:560 ax_xmit()
warn: variable dereferenced before check 'skb' (see line 532)

It turns out that the NULL check can be deleted.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlx4-qcn'

Or Gerlitz says:

====================
Add QCN support to the DCB NL layer

This series from Shani Michaeli adds support for the IEEE QCN attribute
to the kernel DCB NL stack, and implementation in the mlx4 driver which
programs the firmware according to the admin directives.

changes from V0:

- applied feedback from John and added his acked-by to patch #1
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_en: Add QCN parameters and statistics handling

Implement the IEEE DCB handlers for set/get QCN parameters and
statistics reading per TC.

Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_core: Add basic elements for QCN

Add device capability, firmware command opcode and etc prior elements
needed for QCN suppprt. Disable SRIOV VF view/access for QCN is disabled.

While here, remove a redundant offset definition into the
QUERY_DEV_CAP mailbox.

Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/dcb: Add IEEE QCN attribute

As specified in 802.1Qau spec. Add this optional attribute to the
DCB netlink layer. To allow for application to use the new attribute,
NIC drivers should implement and register the callbacks ieee_getqcn,
ieee_setqcn and ieee_getqcnstats.

The QCN attribute holds a set of parameters for management, and
a set of statistics to provide informative data on Congestion-Control
defined by this spec.

Signed-off-by: Shani Michaeli <shanim@mellanox.com>
Signed-off-by: Shachar Raindel <raindel@mellanox.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'fib_trie-next'

Alexander Duyck says:

====================
The rest of the FIB patches (add key_vector to fib_table)

This patch series is the rest of what I had originally planned for this kernel
release.  It adds a structure called key_vector which is embedded within
every tnode, leaf, and the trie root itself.  By doing this we can navigate
from any point within the trie to any other point fairly quickly and
avoiding NULL pointer checks in the case of a backtrace.  As a result we
can pipeline things a bit further since we don't have to worry about
dereferencing NULL in a backtrace.  This can amount to significant savings
on a long backtrace.

I decided to drop the up-level code as that conflicts with combining the
main and local tries.  I have one patch as an RFC that currently combines
the tries however it still needs some work as we have to split the local
and main tries in the event of custom rules being defined.  As such we are
probably going to be doing some more hacking on fib_table_flush_external as
that will also need to flush the local entries from the main trie and place
them back in the local trie.

v2: Rebased on the switchdev FIB offload work
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

fib_trie: Add key vector to root, return parent key_vector in resize

This change makes it so that the root of the trie contains a key_vector, by
doing this we make room to essentially collapse the entire trie by at least
one cache line as we can store the information about the tnode or leaf that
is pointed to in the root.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fib_trie: Move parent from key_vector to tnode

This change pulls the parent pointer from the key_vector and places it in
the tnode structure.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fib_trie: Pull empty_children and full_children into tnode

This pulls the information about the child array out of the key_vector and
places it in the tnode since that is where it is needed.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fib_trie: Move rcu from key_vector to tnode, add accessors.

RCU is only needed once for the entire node, not once per key_vector so we
can pull that out and move it to the tnode structure.

In addition add accessors to be used inside the RCU functions so that we
can more easily get from the key vector to either the tnode or the trie
pointers.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fib_trie: Add tnode struct as a container for fields not needed in key_vector

This change pulls the fields not explicitly needed in the key_vector and
placed them in the new tnode structure. By doing this we will eventually
be able to reduce the key_vector down to 16 bytes on 64 bit systems, and
12 bytes on 32 bit systems.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fib_trie: Rename tnode_child_length to child_length

We are now checking the length of a key_vector instead of a tnode so it
makes sense to probably just rename this to child_length since it would
probably even be applicable to a leaf.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fib_trie: replace tnode_get_child functions with get_child macros

I am replacing the tnode_get_child call with get_child since we are
techically pulling the child out of a key_vector now and not a tnode.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fib_trie: Rename tnode to key_vector

Rename the tnode to key_vector. The key_vector will be the eventual
container for all of the information needed by either a leaf or a tnode.
The final result should be much smaller than the 40 bytes currently needed
for either one.

This also updates the trie struct so that it contains an array of size 1 of
tnode pointers. This is to bring the structure more inline with how an
actual tnode itself is configured.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fib_trie: Return pointer to tnode pointer in resize/inflate/halve

Resize related functions now all return a pointer to the pointer that
references the object that was resized.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fib_trie: Minor cleanups to fib_table_flush_external

This change just does a couple of minor cleanups on
fib_table_flush_external. Specifically it addresses the fact that resize
was being called even though nothing was being removed from the table, and
it drops an unecessary indent since we could just call continue on the
inverse of the fi && flag check.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: macb: Fix multi queue support for xilinx ZynqMP

ZynqMP soc has single interrupt for all the queue events. So,
passing the IRQF_SHARED flag for interrupt registration call.

Signed-off-by: Punnaiah Choudary Kalluri <punnaia@xilinx.com>
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: macb: Include multi queue support for xilinx ZynqMP ethernet version

Include multi queue support for the ethernet IP version in xilinx ZynqMP
SoC.

Signed-off-by: Punnaiah Choudary Kalluri <punnaia@xilinx.com>
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'wireless-drivers-next-for-davem-2015-03-06' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next

Major changes:

brcmfmac:

* sdio improvements
* add a debugfs file so users can provide us all the revinfo we could
ask for

iwlwifi:

* add triggers for firmware dump collection
* remove support for -9.ucode
* new statitics API
* rate control improvements

ath9k:

* add per-vif TX power capability
* BT coexistance fixes

ath10k:

* qca6174: enable STA transmit beamforming (TxBF) support
* disable multi-vif power save by default

bcma:

* enable support for PCIe Gen 2 host devices

Signed-off-by: David S. Miller <davem@davemloft.net>

fib: make netdev_switch_fib_ipv4_abort in header file static inline

When building without CONFIG_NET_SWITCHDEV,
netdev_switch_fib_ipv4_abort is defined in the header file. It must
be static inline to avoid build failure at link time.

Fixes: 8e05fd7166c6 ("fib: hook IPv4 fib for hardware offload")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mpls: Properly validate RTA_VIA payload length

If the nla length is less than 2 then the nla data could be accessed
beyond the accessible bounds. So ensure that the nla is big enough to
at least read the via_family before doing so. Replace magic value of
2.

Fixes: 03c0566542f4 ("mpls: Basic support for adding and removing routes")
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bcmgenet-next'

Petri Gynther says:

====================
net: bcmgenet: preparation for multiple Rx queues

Three small patches in preparation for supporting multiple Rx queues:
1. set hw_params->rx_queues = 0
2. adjust the call to alloc_etherdev_mqs()
3. add GENET_Q16_RX_BD_CNT and hw_params->rx_bds_per_q
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: bcmgenet: add GENET_Q16_RX_BD_CNT and hw_params->rx_bds_per_q

In preparation for supporting multiple Rx queues, add GENET_Q16_RX_BD_CNT
and hw_params->rx_bds_per_q.

Signed-off-by: Petri Gynther <pgynther@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bcmgenet: adjust the call to alloc_etherdev_mqs()

In preparation for supporting multiple Rx queues, adjust the call to
alloc_etherdev_mqs() to allow max GENET_MAX_MQ_CNT + 1 Rx queues.

The actual number of Rx queues in use is correctly adjusted with:
netif_set_real_num_rx_queues(priv->dev, priv->hw_params->rx_queues + 1);

Signed-off-by: Petri Gynther <pgynther@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bcmgenet: set hw_params->rx_queues = 0

bcmgenet driver doesn't yet support multiple Rx queues.
Set hw_params->rx_queues = 0 accordingly.
The default Rx queue (Q16) is still created and operational.

Signed-off-by: Petri Gynther <pgynther@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2015-03-06

This series contains updates to e1000, e1000e and igb.

Yanir provides updates to e1000e based on the patches provided by John
Linville.  First updates the code comment to better describe the changes
and the impact on the driver.  Second removed calls to ioremap/unmap for
i219 since this is only relevant to older hardware only.  Starting with
i219, the NVM will not be mapped to its one BAR but to a address region
in another bar.

Alex Duyck provides two fixes for igb, first fixes a compile warning
where a variable may be used uninitialized, so Alex initializes it.
Second fixes an issue where all of the pin register values were having
to be pushed onto the stack each time the function was called, so to
avoid this, Alex made them static const so that they should only need
to be allocated once and we can avoid all the instructions to get them
onto the stack.

Eliezer found an issue in e1000 where we needed to be calling
netif_carrier_off earlier in the down() to prevent the stack from
queuing more packets to the interface.

Sabrina Dubroca resolved a potential race condition by adding a
dummy allocator.  There was a race condition between e1000_change_mtu()
cleanups and netpoll, when changing the MTU across jumbo sizes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'pmtu-probe'

Fan Du says:

====================
Improvements for TCP PMTU

This patchset performs some improvements and enhancement
for current TCP PMTU as per RFC4821 with the aim to find
optimal mms size quickly, and also be adaptive to route
changes like enlarged path MTU. Then TCP PMTU could be
used to probe a effective pmtu in absence of ICMP message
for tunnels(e.g. vxlan) across different networking stack.

Patch1/4: Set probe mss base to 1024 Bytes per RFC4821
Patch2/4: Do not double probe_size for each probing,
          use a simple binary search to gain maximum performance.
  mss for next probing.
Patch3/4: Create a probe timer to detect enlarged path MTU.
Patch4/4: Update ip-sysctl.txt for new sysctl knobs.

Changelog:
v5:
  - Zero probe_size before resetting search range.
  - Update ip-sysctl.txt for new sysctl knobs.
v4:
  - Convert probe_size to mss, not directly from search_low/high
  - Clamp probe_threshold
  - Don't adjust search_high in blackhole probe, so drop orignal patch3
v3:
  - Update commit message for patch2
  - Fix pseudo timer delta calculation in patch4
v2:
  - Introduce sysctl_tcp_probe_threshold to control when
    probing will stop, as suggested by John Heffner.
  - Add patch3 to shrink current mss value for search low boundary.
  - Drop cannonical timer usages, implements pseudo timer based on
    32bits jiffies tcp_time_stamp, as suggested by Eric Dumazet.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Documenting two sysctls for tcp PMTU probe

Namely tcp_probe_interval to control how often to restart
a probe. And tcp_probe_threshold to control when stop the
probing in respect to the width of search range in bytes

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Create probe timer for tcp PMTU as per RFC4821

As per RFC4821 7.3. Selecting Probe Size, a probe timer should
be armed once probing has converged. Once this timer expired,
probing again to take advantage of any path PMTU change. The
recommended probing interval is 10 minutes per RFC1981. Probing
interval could be sysctled by sysctl_tcp_probe_interval.

Eric Dumazet suggested to implement pseudo timer based on 32bits
jiffies tcp_time_stamp instead of using classic timer for such
rare event.

Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Use binary search to choose tcp PMTU probe_size

Current probe_size is chosen by doubling mss_cache,
the probing process will end shortly with a sub-optimal
mss size, and the link mtu will not be taken full
advantage of, in return, this will make user to tweak
tcp_base_mss with care.

Use binary search to choose probe_size in a fine
granularity manner, an optimal mss will be found
to boost performance as its maxmium.

In addition, introduce a sysctl_tcp_probe_threshold
to control when probing will stop in respect to
the width of search range.

Test env:
Docker instance with vxlan encapuslation(82599EB)
iperf -c 10.0.0.24 -t 60

before this patch:
1.26 Gbits/sec

After this patch: increase 26%
1.59 Gbits/sec

Signed-off-by: Fan Du <fan.du@intel.com>
Acked-by: John Heffner <johnwheffner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: Raise tcp PMTU probe mss base size

Quotes from RFC4821 7.2.  Selecting Initial Values

   It is RECOMMENDED that search_low be initially set to an MTU size
   that is likely to work over a very wide range of environments.  Given
   today's technologies, a value of 1024 bytes is probably safe enough.
   The initial value for search_low SHOULD be configurable.

Moreover, set a small value will introduce extra time for the search
to converge. So set the initial probe base mss size to 1024 Bytes.

Signed-off-by: Fan Du <fan.du@intel.com>
Acked-by: John Heffner <johnwheffner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

DECnet: Only use neigh_ops for adding the link layer header

Other users users of the neighbour table use neigh->output as the method
to decided when and which link-layer header to place on a packet.
DECnet has been using neigh->output to decide which DECnet headers to
place on a packet depending which neighbour the packet is destined for.

The DECnet usage isn't totally wrong but it can run into problems if the
neighbour output function is run for a second time as the teql driver
and the bridge netfilter code can do.

Therefore to avoid pathologic problems later down the line and make the
neighbour code easier to understand by refactoring the decnet output
code to only use a neighbour method to add a link layer header to a
packet.

This is done by moving the neigbhour operations lookup from
dn_to_neigh_output to dn_neigh_output_packet.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bonding: implement bond_poll_controller()

This patches implements the poll_controller support for all
bonding driver. If the slaves have poll_controller net_op defined,
this implementation calls them. This is mode agnostic implementation
and iterates through all slaves (based on mode) and calls respective
handler.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rocker: fix some sparse warnings

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

switchdev: fix CONFIG_IP_MULTIPLE_TABLES compile issue

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

e1000: add dummy allocator to fix race condition between mtu change and netpoll

There is a race condition between e1000_change_mtu's cleanups and
netpoll, when we change the MTU across jumbo size:

Changing MTU frees all the rx buffers:
    e1000_change_mtu -> e1000_down -> e1000_clean_all_rx_rings ->
        e1000_clean_rx_ring

Then, close to the end of e1000_change_mtu:
    pr_info -> ... -> netpoll_poll_dev -> e1000_clean ->
        e1000_clean_rx_irq -> e1000_alloc_rx_buffers -> e1000_alloc_frag

And when we come back to do the rest of the MTU change:
    e1000_up -> e1000_configure -> e1000_configure_rx ->
        e1000_alloc_jumbo_rx_buffers

alloc_jumbo finds the buffers already != NULL, since data (shared with
page in e1000_rx_buffer->rxbuf) has been re-alloc'd, but it's garbage,
or at least not what is expected when in jumbo state.

This results in an unusable adapter (packets don't get through), and a
NULL pointer dereference on the next call to e1000_clean_rx_ring
(other mtu change, link down, shutdown):

BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff81194d6e>] put_compound_page+0x7e/0x330

    [...]

Call Trace:
[<ffffffff81195445>] put_page+0x55/0x60
[<ffffffff815d9f44>] e1000_clean_rx_ring+0x134/0x200
[<ffffffff815da055>] e1000_clean_all_rx_rings+0x45/0x60
[<ffffffff815df5e0>] e1000_down+0x1c0/0x1d0
[<ffffffff811e2260>] ? deactivate_slab+0x7f0/0x840
[<ffffffff815e21bc>] e1000_change_mtu+0xdc/0x170
[<ffffffff81647050>] dev_set_mtu+0xa0/0x140
[<ffffffff81664218>] do_setlink+0x218/0xac0
[<ffffffff814459e9>] ? nla_parse+0xb9/0x120
[<ffffffff816652d0>] rtnl_newlink+0x6d0/0x890
[<ffffffff8104f000>] ? kvm_clock_read+0x20/0x40
[<ffffffff810a2068>] ? sched_clock_cpu+0xa8/0x100
[<ffffffff81663802>] rtnetlink_rcv_msg+0x92/0x260

By setting the allocator to a dummy version, netpoll can't mess up our
rx buffers.  The allocator is set back to a sane value in
e1000_configure_rx.

Fixes: edbbb3ca1077 ("e1000: implement jumbo receive with partial descriptors")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

e1000: call netif_carrier_off early on down

When bringing down an interface netif_carrier_off() should be
one the first things we do, since this will prevent the stack
from queuing more packets to this interface.
This operation is very fast, and should make the device behave
much nicer when trying to bring down an interface under load.

Also, this would Do The Right Thing (TM) if this device has some
sort of fail-over teaming and redirect traffic to the other IF.

Move netif_carrier_off as early as possible.

Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: Make arrays on stack static const to avoid reallocation

While addressing the pin problem I noticed that all of the pin register
values where having to be pushed onto the stack each time the function was
called.  To avoid that I am making them static const so that they should
only need to be allocated once and we can avoid all the instructions to get
them onto the stack..

size before:
   text    data     bss     dec     hex filename
161477   10512       8 171997   29fdd drivers/net/ethernet/intel/igb/igb.ko

size after:
   text    data     bss     dec     hex filename
161205   10512       8 171725   29ecd drivers/net/ethernet/intel/igb/igb.ko

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

igb: Fix warning pin may be used uninitialized

When building the kernel using the gcc 4.8.3 compiler included in Fedora 20
I was repeatedly seeing the warning:

drivers/net/ethernet/intel/igb/igb_ptp.c: In function ‘igb_ptp_feature_enable_i210’:
drivers/net/ethernet/intel/igb/igb_ptp.c:395:21: warning: ‘pin’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
   tssdp &= ~ts_sdp_en[pin];
                     ^
drivers/net/ethernet/intel/igb/igb_ptp.c:471:6: note: ‘pin’ was declared here
   int pin;
       ^

To resolve it I am assigning the pin a value of -1 when it is instantiated.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

e1000e: remove calls to ioremap/unmap for NVM addr

Starting I219, the NVM will not be mapped to its own BAR, but to an
address region in another bar. The mapping/unmapping is relevant
to older HW only.

CC: John W Linville <linville@tuxdriver.com>
Reported-by: John W Linville <linville@tuxdriver.com>
Signed-off-by: Yanir Lubetkin <yanirx.lubetkin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

e1000e: fix obscure comments

The interface to the device flash was modified in i219 and later HW.
This patch better describes the change and the impact on the driver.

CC: John W Linville <linville@tuxdriver.com>
Reported-by: John W Linville <linville@tuxdriver.com>
Signed-off-by: Yanir Lubetkin <yanirx.lubetkin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ipv4: Fix unused variable warnings in fib_table_flush_external.

net/ipv4/fib_trie.c: In function ‘fib_table_flush_external’:
net/ipv4/fib_trie.c:1572:6: warning: unused variable ‘found’ [-Wunused-variable]
  int found = 0;
      ^
net/ipv4/fib_trie.c:1571:16: warning: unused variable ‘slen’ [-Wunused-variable]
  unsigned char slen;
                ^

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'l3_hw_offload'

Scott Feldman says:

====================
switchdev: add IPv4 routing offload

v4:

  - Add NETIF_F_NETNS_LOCAL to rocker port feature list to keep rocker
    ports in the default netns.  Rocker hardware can't be partitioned
    to support multiple namespaces, currently.  It would be interesting
    to add netns support to rocker device by basically adding another
    match field to each table to match on some unique netns ID, with
    a port knowing it's netns ID.  Future work TDB.
  - Up-level the RTNH_F_EXTERNAL marking of routes installed to offload
    device from driver to switchdev common code.  Now driver can't skip
    routes.  Either it can install the route or it cannot.  Yes or No.
    If no on any route, all offloading is aborted by removing routes
    from offload device and setting ipv4.fib_offload_disabled so no more
    routes can be offloaded.  This is harsh, but it's our starting point.
    We can refine the policies in follow-up work.
  - Add new net.ipv4.fib_offload_disabled bool that is set if anything
    goes wrong with route offloading.  We can refine this later to make
    the setting per-device or per-device-port-netdev, but let's start
    here simple and refine in follow-up work.
  - Rebase against Alex's latest FIB changes.  I think I did everything
    correctly, and didn't run into any issues with testing, but I'd like
    Alex to look over the changes and maybe follow-up with any cleanups.

v3:

Changes based on v2 review comments:

  - Move check for custom rules up earlier in patch set, to keep git bisect
    safe.
  - Simplify the route add/modify failure handling to simple try until
    failure, and then on failure, undo everything.  The switchdev driver
    will return err when route can normally be installed to device, but
    the install fails for one reason or another (no space left on device,
    etc).  If a failure happens, uninstall all routes from the device,
    punting forwarding for all routes back to the kernel.
  - Scan route's full nexthop list, ensuring all nexthop devs belong
    to the same switchdev device, otherwise don't try to install route
    to device.

v2:

Changes based on v1 review comments and discussions at netconf:

  - Allow route modification, but use same ndo op used for adding route.
    Driver/device is expected to modify route in-place, if it can, to avoid
    interruption of service.
  - Add new RTNH_F_EXTERNAL flag to mark FIB entries offloaded externally.
  - Don't offload routes if using custom IP rules.  If routes are already
    offloaded, and custom IP rules are turned on, flush routes from offload
    device.  (Offloaded routes are marked with RTNH_F_EXTERNAL).
  - Use kernel's neigh resolution code to resolve route's nexthops' neigh
    MAC addrs.  (Thanks davem, works great!).
  - Use fib->fib_priority in rocker driver to give priorities to routes in
    OF-DPA unicast route table.

v1:

This patch set adds L3 routing offload support for IPv4 routes.  The idea is to
mirror routes installed in the kernel's FIB down to a hardware switch device to
offload the data forwarding path for L3.  Only the data forwarding path is
intercepted.  Control and management of the kernel's FIB remains with the
kernel.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

rocker: implement IPv4 fib offloading

The driver implements ndo_switch_fib_ipv4_add/del ops to add/del/mod IPv4
routes to/from switchdev device.  Once a route is added to the device, and the
route's nexthops are resolved to neighbor MAC address, the device will forward
matching pkts rather than the kernel.  This offloads the L3 forwarding path
from the kernel to the device.  Note that control and management planes are
still mananged by Linux; only the data plane is offloaded.  Standard routing
control protocols such as OSPF and BGP run on Linux and manage the kernel's FIB
via standard rtm netlink msgs...nothing changes here.

A new hash table is added to rocker to track neighbors.  The driver listens for
neighbor updates events using netevent notifier NETEVENT_NEIGH_UPDATE.  Any ARP
table updates for ports on this device are recorded in this table.  Routes
installed to the device with nexthops that reference neighbors in this table
are "qualified".  In the case of a route with nexthops not resolved in the
table, the kernel is asked to resolve the nexthop.

The driver uses fib_info->fib_priority for the priority field in rocker's
unicast routing table.

The device can only forward to pkts matching route dst to resolved nexthops.
Currently, the device only supports single-path routes (i.e. routes with one
nexthop).  Equal Cost Multipath (ECMP) route support will be added in followup
patches.

This patch is driver support for unicast IPv4 routing only.  Followup patches
will add driver and infrastructure for IPv6 routing and multicast routing.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>

fib: hook IPv4 fib for hardware offload

Call into the switchdev driver any time an IPv4 fib entry is
added/modified/deleted from the kernel's FIB.  The switchdev driver may or
may not install the route to the offload device.  In the case where the
driver tries to install the route and something goes wrong (device's routing
table is full, etc), then all of the offloaded routes will be flushed from the
device, route forwarding falls back to the kernel, and no more routes are
offloading.

We can refine this logic later.  For now, use the simplist model of offloading
routes up to the point of failure, and then on failure, undo everything and
mark IPv4 offloading disabled.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: add net bool fib_offload_disabled

If something goes wrong with IPv4 FIB offload, mark entire net offload
disabled.  This is brute force policy to basically shut down IPv4 FIB offload
permanently if there is a problem offloading any route to an external device.
We can refine the policy in the future, to handle failures on a per-device or
per-route basis, but for now, this policy is per-net.

What we're trying to avoid is an inconsistent split between the kernel's FIB
and the offload device's FIB.  We don't want the device to fwd a pkt
inconsitent with what the kernel would do.  An example of a split is if device
has 10.0.0.0/16 and kernel has 10.0.0.0/16 and 10.0.0.0/24, the device wouldn't
see the longest prefix 10.0.0.0/24 and potentially forward pkts incorrectly.

Limited capacity or limited capability are two ways a route may fail to install
to the offload device.  We'll not differentiate between failures at this time,
and treat any failure as fatal and mark the net as fib_offload_disabled.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

switchdev: implement IPv4 fib ndo wrappers

Flesh out ndo wrappers to call into device driver.  To call into device driver,
the wrapper must interate over route's nexthops to ensure all nexthop devs
belong to the same switch device.  Currently, there is no support for route's
nexthops spanning offloaded and non-offloaded devices, or spanning ports of
multiple offload devices.

Since switch device ports may be stacked under virtual interfaces (bonds and/or
bridges), and the route's nexthop may be on the virtual interface, the wrapper
will traverse the nexthop dev down to the base dev.  It's the base dev that's
passed to the switchdev driver's ndo ops.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

switchdev: don't support custom ip rules, for now

Keep switchdev FIB offload model simple for now and don't allow custom ip
rules.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

switchdev: add IPv4 fib ndo ops wrappers

Add IPv4 fib ndo wrapper funcs and stub them out for now.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netdevice: add IPv4 fib add/del ops

Add two new ndo ops for IPv4 fib offload support, add and del. Add uses
modifiy semantics if fib entry already offloaded. Drivers implementing the new
ndo ops will return err<0 if programming device fails, for example if device's
tables are full.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rtnetlink: add RTNH_F_EXTERNAL flag for fib offload

Add new RTNH_F_EXTERNAL flag to mark fib entries offloaded externally, for
example to a switchdev switch device.

Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tg3: use napi_complete_done()

Using napi_complete_done() instead of napi_complete() allows
us to use /sys/class/net/ethX/gro_flush_timeout

GRO layer can aggregate more packets if the flush is delayed a bit,
without having to set too big coalescing parameters that impact
latencies.

Tested:

lpx:~# echo 0 >/sys/class/net/eth1/gro_flush_timeout

lpx:~# sar -n DEV 1 10 | grep eth1
10:36:25 AM      eth1  81290.00  40617.00 120479.67   2777.01      0.00      0.00      0.00
10:36:26 AM      eth1  81283.00  40608.00 120481.81   2778.13      0.00      0.00      1.00
10:36:27 AM      eth1  81304.00  40639.00 120518.42   2778.28      0.00      0.00      0.00
10:36:28 AM      eth1  81255.00  40605.00 120437.34   2775.95      0.00      0.00      1.00
10:36:29 AM      eth1  81306.00  40630.00 120521.44   2777.70      0.00      0.00      0.00
10:36:30 AM      eth1  81286.00  40564.00 120480.20   2773.31      0.00      0.00      0.00
10:36:31 AM      eth1  81256.00  40599.00 120438.81   2776.27      0.00      0.00      0.00
10:36:32 AM      eth1  81287.00  40594.00 120480.69   2776.69      0.00      0.00      0.00
10:36:33 AM      eth1  81279.00  40601.00 120478.53   2775.84      0.00      0.00      0.00
10:36:34 AM      eth1  81277.00  40610.00 120476.94   2776.25      0.00      0.00      0.00
Average:         eth1  81282.30  40606.70 120479.39   2776.54      0.00      0.00      0.20

lpx:~# echo 13000 >/sys/class/net/eth1/gro_flush_timeout

lpx:~# sar -n DEV 1 10 | grep eth1
10:36:43 AM      eth1  81257.00   7747.00 120437.44    530.00      0.00      0.00      0.00
10:36:44 AM      eth1  81278.00   7748.00 120480.00    529.85      0.00      0.00      0.00
10:36:45 AM      eth1  81282.00   7752.00 120479.09    531.09      0.00      0.00      0.00
10:36:46 AM      eth1  81282.00   7751.00 120478.80    530.90      0.00      0.00      0.00
10:36:47 AM      eth1  81276.00   7745.00 120478.31    529.64      0.00      0.00      0.00
10:36:48 AM      eth1  81278.00   7747.00 120478.50    529.81      0.00      0.00      0.00
10:36:49 AM      eth1  81282.00   7749.00 120478.88    530.01      0.00      0.00      0.00
10:36:50 AM      eth1  81284.00   7751.00 120481.52    530.20      0.00      0.00      0.00
10:36:51 AM      eth1  81299.00   7769.00 120481.74    533.81      0.00      0.00      0.00
10:36:52 AM      eth1  81281.00   7748.00 120478.62    529.96      0.00      0.00      0.00
Average:         eth1  81279.90   7750.70 120475.29    530.53      0.00      0.00      0.00

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa-next'

Florian Fainelli says:

====================
net: dsa: code re-organization

This pull request contains the first part of the patches required to implement
the grand plan detailed here:

http://www.spinics.net/lists/netdev/msg295942.html

These are mostly code re-organization and function bodies re-arrangement to
allow different callers of lower-level initialization functions for 'struct
dsa_switch' and 'struct dsa_switch_tree' to be later introduced.

There is no functional code change at this point.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: extract dsa switch tree setup and removal

Extract the core logic that setups a 'struct dsa_switch_tree' and
removes it, update dsa_probe() and dsa_remove() to use the two helper
functions. This will be useful to allow for other callers to setup
this structure differently.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: let switches specify their tagging protocol

In order to support the new DSA device driver model, a dsa_switch should
be able to advertise the type of tagging protocol supported by the
underlying switch device. This also removes constraints on how tagging
can be stacked to each other.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: split dsa_switch_setup into two functions

Split the part of dsa_switch_setup() which is responsible for allocating
and initializing a 'struct dsa_switch' and the part which is doing a
given switch device setup and slave network device creation.

This is a preliminary change to allow a separate caller of
dsa_switch_setup_one() which may have externally initialized the
dsa_switch structure, outside of dsa_switch_setup().

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: allow deferred probing

In preparation for allowing a different model to register DSA switches,
update dsa_of_probe() and dsa_probe() to return -EPROBE_DEFER where
appropriate.

Failure to find a phandle or Device Tree property is still fatal, but
looking up the internal device structure associated with a Device Tree
node is something that might need to be delayed based on driver probe
ordering.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: update dsa_of_{probe, remove} to use a device pointer

In preparation for allowing a different mechanism to register DSA switch
devices and driver, update dsa_of_probe and dsa_of_remove to take a
struct device pointer since neither of these two functions uses the
struct platform_device pointer.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>