Eric Dumazet [Fri, 6 May 2016 16:46:18 +0000 (09:46 -0700)]
ipv4: tcp: ip_send_unicast_reply() is not BH safe
I forgot that ip_send_unicast_reply() is not BH safe (yet).
Disabling preemption before calling it was not a good move.
Fixes: c10d9310edf5 ("tcp: do not assume TCP code is non preemptible") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andres Lagar-Cavilla <andreslc@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 6 May 2016 20:01:55 +0000 (16:01 -0400)]
Merge branch 'bpf-direct-pkt-access'
Alexei Starovoitov says:
====================
bpf: introduce direct packet access
This set of patches introduce 'direct packet access' from
cls_bpf and act_bpf programs (which are root only).
Current bpf programs use LD_ABS, LD_INS instructions which have
to do 'if (off < skb_headlen)' for every packet access.
It's ok for socket filters, but too slow for XDP, since single
LD_ABS insn consumes 3% of cpu. Therefore we have to amortize the cost
of length check over multiple packet accesses via direct access
to skb->data, data_end pointers.
The existing packet parser typically look like:
if (load_half(skb, offsetof(struct ethhdr, h_proto)) != ETH_P_IP)
return 0;
if (load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol)) != IPPROTO_UDP ||
load_byte(skb, ETH_HLEN) != 0x45)
return 0;
...
with 'direct packet access' the bpf program becomes:
void *data = (void *)(long)skb->data;
void *data_end = (void *)(long)skb->data_end;
struct eth_hdr *eth = data;
struct iphdr *iph = data + sizeof(*eth);
if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end)
return 0;
if (eth->h_proto != htons(ETH_P_IP))
return 0;
if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
return 0;
...
which is more natural to write and significantly faster.
See patch 6 for performance tests:
21Mpps(old) vs 24Mpps(new) with just 5 loads.
For more complex parsers the performance gain is higher.
The other approach implemented in [1] was adding two new instructions
to interpreter and JITs and was too hard to use from llvm side.
The approach presented here doesn't need any instruction changes,
but the verifier has to work harder to check safety of the packet access.
Patch 1 prepares the code and Patch 2 adds new checks for direct
packet access and all of them are gated with 'env->allow_ptr_leaks'
which is true for root only.
Patch 3 improves search pruning for large programs.
Patch 4 wires in verifier's changes with net/core/filter side.
Patch 5 updates docs
Patches 6 and 7 add tests.
parse_simple.c - packet parser exapmle with single length check that
filters out udp packets for port 9
parse_varlen.c - variable length parser that understand multiple vlan headers,
ipip, ipip6 and ip options to filter out udp or tcp packets on port 9.
The packet is parsed layer by layer with multitple length checks.
parse_ldabs.c - classic style of packet parsing using LD_ABS instruction.
Same functionality as parse_simple.
explain how verifier checks safety of packet access
and update email addresses.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
allow cls_bpf and act_bpf programs access skb->data and skb->data_end pointers.
The bpf helpers that change skb->data need to update data_end pointer as well.
The verifier checks that programs always reload data, data_end pointers
after calls to such bpf helpers.
We cannot add 'data_end' pointer to struct qdisc_skb_cb directly,
since it's embedded as-is by infiniband ipoib, so wrapper struct is needed.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
since UNKNOWN_VALUE type is weaker than CONST_IMM we can un-teach
verifier its recognition of constants in conditional branches
without affecting safety.
Ex:
if (reg == 123) {
.. here verifier was marking reg->type as CONST_IMM
instead keep reg as UNKNOWN_VALUE
}
Two verifier states with UNKNOWN_VALUE are equivalent, whereas
CONST_IMM_X != CONST_IMM_Y, since CONST_IMM is used for stack range
verification and other cases.
So help search pruning by marking registers as UNKNOWN_VALUE
where possible instead of CONST_IMM.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Extended BPF carried over two instructions from classic to access
packet data: LD_ABS and LD_IND. They're highly optimized in JITs,
but due to their design they have to do length check for every access.
When BPF is processing 20M packets per second single LD_ABS after JIT
is consuming 3% cpu. Hence the need to optimize it further by amortizing
the cost of 'off < skb_headlen' over multiple packet accesses.
One option is to introduce two new eBPF instructions LD_ABS_DW and LD_IND_DW
with similar usage as skb_header_pointer().
The kernel part for interpreter and x64 JIT was implemented in [1], but such
new insns behave like old ld_abs and abort the program with 'return 0' if
access is beyond linear data. Such hidden control flow is hard to workaround
plus changing JITs and rolling out new llvm is incovenient.
Therefore allow cls_bpf/act_bpf program access skb->data directly:
int bpf_prog(struct __sk_buff *skb)
{
struct iphdr *ip;
if (skb->data + sizeof(struct iphdr) + ETH_HLEN > skb->data_end)
/* packet too small */
return 0;
ip = skb->data + ETH_HLEN;
/* access IP header fields with direct loads */
if (ip->version != 4 || ip->saddr == 0x7f000001)
return 1;
[...]
}
This solution avoids introduction of new instructions. llvm stays
the same and all JITs stay the same, but verifier has to work extra hard
to prove safety of the above program.
For XDP the direct store instructions can be allowed as well.
The skb->data is NET_IP_ALIGNED, so for common cases the verifier can check
the alignment. The complex packet parsers where packet pointer is adjusted
incrementally cannot be tracked for alignment, so allow byte access in such cases
and misaligned access on architectures that define efficient_unaligned_access
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
cleanup verifier code and prepare it for addition of "pointer to packet" logic
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 6 May 2016 19:55:30 +0000 (15:55 -0400)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
40GbE Intel Wired LAN Driver Updates 2016-05-05
This series contains updates to i40e and i40evf.
The theme behind this series is code reduction, yeah! Jesse provides
most of the changes starting with a refactor of the interpretation of
a tunnel which lets us start using the hardware's parsing. Removed
the packet split receive routine and ancillary code in preparation
for the Rx-refactor. The refactor of the receive routine,
aligns the receive routine with the one in ixgbe which was highly
optimized. The hardware supports a 16 byte descriptor for receive,
but the driver was never using it in production. There was no performance
benefit to the real driver of 16 byte descriptors, so drop a whole lot
of complexity while getting rid of the code. Fixed a bug where while
changing the number of descriptors using ethtool, the driver did not
test the limits of the system memory before permanently assuming it
would be able to get receive buffer memory.
Mitch fixes a memory leak of one page each time the driver is opened by
allocating the correct number of receive buffers and do not fiddle with
next_to_use in the VF driver.
Arnd Bergmann fixed a indentation issue by adding the appropriate
curly braces in i40e_vc_config_promiscuous_mode_msg().
Julia Lawall fixed an issue found by Coccinelle, where i40e_client_ops
structure can be const since it is never modified.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maxwell [Wed, 4 May 2016 23:55:51 +0000 (09:55 +1000)]
cnic: call cp->stop_hw() in cnic_start_hw() on allocation failure
We recently had a system crash in the cnic module. Vmcore analysis confirmed
that "ip link up" was executed which failed due to an allocation failure
because of memory fragmentation. Futher analysis revealed that the cnic irq
vector was still allocated after the "ip link up" that failed. When
"ip link down" was executed it called free_msi_irqs() which crashed the system
because the cnic irq was still inuse.
PANIC: "kernel BUG at drivers/pci/msi.c:411!"
The code execution was:
cnic_netdev_event()
if (event == NETDEV_UP) {
.
.
▹ if (!cnic_start_hw(dev))
cnic_start_hw()
calls cnic_cm_open() which failed with -ENOMEM
cnic_start_hw() then took the err1 path:
err1:↩
cp->free_resc(dev);↩ <---- frees resources but not irq vector
pci_dev_put(dev->pcidev);↩
return err;↩
}↩
This returns control back to cnic_netdev_event() but now the cnic irq vector
is still allocated even although cnic_cm_open() failed. The next
"ip link down" while trigger the crash.
The cnic_start_hw() routine is not handling the allocation failure correctly.
Fix this by checking whether CNIC_DRV_STATE_HANDLES_IRQ flag is set indicating
that the hardware has been started in cnic_start_hw(). If it has then call
cp->stop_hw() which frees the cnic irq vector and cnic resources. Otherwise
just maintain the previous behaviour and free cnic resources.
I reproduced this by injecting an ENOMEM error into cnic_cm_alloc_mem()s return
code.
# ip link set dev enpX down
# ip link set dev enpX up <--- hit's allocation failure
# ip link set dev enpX down <--- crashes here
With this patch I confirmed there was no crash in the reproducer.
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Julia Lawall [Sun, 1 May 2016 12:07:23 +0000 (14:07 +0200)]
i40e: constify i40e_client_ops structure
The i40e_client_ops structure is never modified, so declare it as const.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Newly added code in i40e_vc_config_promiscuous_mode_msg() is indented
in a way that gcc rightly complains about:
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c: In function 'i40e_vc_config_promiscuous_mode_msg':
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:1543:4: error: this 'if' clause does not guard... [-Werror=misleading-indentation]
if (f->vlan >= 0 && f->vlan <= I40E_MAX_VLANID)
^~
drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:1550:5: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the 'if'
aq_err = pf->hw.aq.asq_last_status;
From the context, it looks like the aq_err assignment was meant to be
inside of the conditional expression, so I'm adding the appropriate
curly braces now.
When testing on systems with very limited amounts of RAM, a bug was
found where, while changing the number of descriptors using ethtool,
the driver didn't test the limits of system memory before permanently
assuming it would be able to get receive buffer memory.
Work around this issue by pre-allocation of the receive buffer
memory, in the "ghost" ring, which is then used during reinit
using the new ring length.
Change-Id: I92d7a5fb59a6c884b2efdd1ec652845f101c3359 Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mitch Williams [Mon, 18 Apr 2016 18:33:48 +0000 (11:33 -0700)]
i40evf: Allocate Rx buffers properly
Allocate the correct number of RX buffers, and don't fiddle with
next_to_use. The common RX code handles all of this. This fixes a memory
leak of one page each time the driver is opened.
Change-Id: Id06eca353086e084921f047acad28c14745684ee Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The hardware supports a 16 byte descriptor for receive, but the
driver was never using it in production. There was no performance
benefit to the real driver of 16 byte descriptors, so drop a whole
lot of complexity while getting rid of the code.
Also since the previous patch made us use no-split mode all the
time, drop any support in the driver for any other value in dtype
and assume it is always zero (aka no-split).
Hooray for code removal!
Change-ID: I2257e902e4dad84a07b94db6d2e6f4ce69b27bc0 Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This is part 2 of the Rx refactor series, just including
changes to i40evf.
This refactor aligns the receive routine with the one in
ixgbe which was highly optimized. This reduces the code
we have to maintain and allows for (hopefully) more readable
and maintainable RX hot path.
In order to do this:
- consolidate the receive path into a single function that doesn't
use packet split but *does* use pages for Rx buffers.
- remove the old _1buf routine
- consolidate several routines into helper functions
- remove VF ethtool control over packet split
- remove priv_flags interface since it is unused
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
As part of preparation for the rx-refactor, remove the
packet split receive routine and ancillary code.
Some of the split related context set up code stays in
i40e_virtchnl_pf.c in case an older VF driver tries to load
and still wants to use packet split.
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This is part 1 of the Rx refactor series, just including
changes to i40e.
This refactor aligns the receive routine with the one in
ixgbe which was highly optimized. This reduces the code
we have to maintain and allows for (hopefully) more readable
and maintainable RX hot path.
In order to do this:
- consolidate the receive path into a single function that doesn't
use packet split but *does* use pages for Rx buffers.
- remove the old _1buf routine
- consolidate several routines into helper functions
- remove ethtool control over packet split
Change-ID: I5ca100721de65992aa0114f8b4bac844b84758e0 Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The dma_alloc_coherent() function returns a virtual address which can
be used for coherent access to the underlying memory. On some
architectures, like arm64, undefined behavior results if this memory is
also accessed via virtual mappings that are not coherent. Because of
their undefined nature, operations like virt_to_page() return garbage
when passed virtual addresses obtained from dma_alloc_coherent(). Any
subsequent mappings via vmap() of the garbage page values are unusable
and result in bad things like bus errors (synchronous aborts in ARM64
speak).
The mlx4 driver contains code that does the equivalent of:
vmap(virt_to_page(dma_alloc_coherent)), this results in an OOPs when the
device is opened.
Prevent Ethernet driver to run this problematic code by forcing it to
allocate contiguous memory. As for the Infiniband driver, at first we
are trying to allocate contiguous memory, but in case of failure roll
back to work with fragmented memory.
Signed-off-by: Haggai Abramovsky <hagaya@mellanox.com> Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Reported-by: David Daney <david.daney@cavium.com> Tested-by: Sinan Kaya <okaya@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Jeff Kirsher [Wed, 4 May 2016 22:49:39 +0000 (15:49 -0700)]
MAINTAINERS: Cleanup Intel Wired LAN maintainers list
With the recent "retirements" and other changes, make the maintainers
list a lot less confusing and a bit more straight forward.
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Acked-by: Shannon Nelson <sln@onemain.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 4 May 2016 22:27:29 +0000 (15:27 -0700)]
tcp: two more missing bh disable
percpu_counter only have protection against preemption.
TCP stack uses them possibly from BH, so we need BH protection
in contexts that could be run in process context
Fixes: c10d9310edf5 ("tcp: do not assume TCP code is non preemptible") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 4 May 2016 21:13:34 +0000 (17:13 -0400)]
Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
10GbE Intel Wired LAN Driver Updates 2016-05-04
This series contains updates to ixgbe, ixgbevf and traffic class helpers.
Sridhar adds helper functions to the tc_mirred header to access tcf_mirred
information and then implements them for ixgbe to enable redirection to
a SRIOV VF or an offloaded MACVLAN device queue via tc 'mirred' action.
Amritha adds support to set filters with multiple header fields (L3,L4)
to match on.
KY Srinivasan from Microsoft add Hyper-V support into ixgbevf.
Emil adds 82599 sub-device IDs that were missing from the list of parts
that support WoL. Then simplified the logic we use to determine WoL
support by reading the EEPROM bits for MACs X540 and newer.
Preethi cleaned up duplicate and unused device IDs. Fixed our ethtool
stat reporting where we were ignoring higher 32 bits of stats registers,
so fill out 64 bit stat values into two 32 bit words.
Babu Moger from Oracle improves VF performance issues on SPARC.
Alex Duyck cleans up some of the Hyper-V implementation from KY so that
we can just use function pointers instead of having to identify if a
given VF is running on a Linux or Windows PF.
Usha makes sure that DCB and FCoE is disabled for X550EM_x/a MACs and
cleans up the DCB initialization in the process.
Tony cleans up the API for ixgbevf_update_xcast_mode() so we do not
have to pass in the netdev parameter, since it was never used in the
function.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
kbuild test robot reported a build failure on s390.
While at it, also fix missing conversion in the tilera driver.
Fixes: 9b36627acecd5792 ("net: remove dev->trans_start") Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
bonding: update documentation section after dev->trans_start removal
Drivers that use LLTX need to update trans_start of the netdev_queue.
(Most drivers don't use LLTX; stack does this update if .ndo_start_xmit
returned TX_OK).
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 4 May 2016 00:10:50 +0000 (17:10 -0700)]
tcp: must block bh in __inet_twsk_hashdance()
__inet_twsk_hashdance() might be called from process context,
better block BH before acquiring bind hash and established locks
Fixes: c10d9310edf5 ("tcp: do not assume TCP code is non preemptible") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 3 May 2016 23:56:03 +0000 (16:56 -0700)]
tcp: fix lockdep splat in tcp_snd_una_update()
tcp_snd_una_update() and tcp_rcv_nxt_update() call
u64_stats_update_begin() either from process context or BH handler.
This triggers a lockdep splat on 32bit & SMP builds.
We could add u64_stats_update_begin_bh() variant but this would
slow down 32bit builds with useless local_disable_bh() and
local_enable_bh() pairs, since we own the socket lock at this point.
I add sock_owned_by_me() helper to have proper lockdep support
even on 64bit builds, and new u64_stats_update_begin_raw()
and u64_stats_update_end_raw methods.
Fixes: c10d9310edf5 ("tcp: do not assume TCP code is non preemptible") Reported-by: Fabio Estevam <festevam@gmail.com> Diagnosed-by: Francois Romieu <romieu@fr.zoreil.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Fabio Estevam <fabio.estevam@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
In this pull request you have:
- two changes to the MAINTAINERS file where one marks our mailing list
as moderated and the other adds a missing documentation file
- kernel-doc fixes
- code refactoring and various cleanups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Tue, 3 May 2016 20:14:41 +0000 (23:14 +0300)]
mdio_bus: don't return NULL from mdiobus_scan()
I've finally noticed that mdiobus_scan() also returns either NULL or error
value on failure. Return ERR_PTR(-ENODEV) instead of NULL since this is
the error value already filtered out by the callers that want to ignore
the MDIO address scan failure...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
gets replaced with new helper:
netif_trans_update(dev);
3. This helper is then changed to set
netdev_get_tx_queue(dev, 0)->trans_start
instead of dev->trans_start.
After this dev->trans_start can be removed.
It should be noted that after this series several instances
of netif_trans_update() are useless (if they occur in
.ndo_start_xmit and driver doesn't set LLTX flag -- stack already
did an update).
Comments welcome.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
previous patches removed all direct accesses to dev->trans_start,
so change the netif_trans_update helper to update trans_start of
netdev queue 0 instead and then remove trans_start from struct net_device.
AFAICS a lot of the netif_trans_update() invocations are now useless
because they occur in ndo_start_xmit and driver doesn't set LLTX
(i.e. stack already took care of the update).
As I can't test any of them it seems better to just leave them alone.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
drivers: replace dev->trans_start accesses with dev_trans_start
a trans_start struct member exists twice:
- in struct net_device (legacy)
- in struct netdev_queue
Instead of open-coding dev->trans_start usage to obtain the current
trans_start value, use dev_trans_start() instead.
This is not exactly the same, as dev_trans_start also considers
the trans_start values of the netdev queues owned by the device
and provides the most recent one.
For legacy devices this doesn't matter as dev_trans_start can cope
with netdev trans_start values of 0 (they are ignored).
This is a prerequisite to eventual removal of dev->trans_start.
Cc: linux-rdma@vger.kernel.org Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Tue, 3 May 2016 15:19:57 +0000 (17:19 +0200)]
gre6: add Kconfig dependency for NET_IPGRE_DEMUX
The ipv6 gre implementation was cleaned up to share more code
with the ipv4 version, but it can be enabled even when NET_IPGRE_DEMUX
is disabled, resulting in a link error:
net/built-in.o: In function `gre_rcv':
:(.text+0x17f5d0): undefined reference to `gre_parse_header'
ERROR: "gre_parse_header" [net/ipv6/ip6_gre.ko] undefined!
This adds a Kconfig dependency to prevent that now invalid
configuration.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 308edfdf1563 ("gre6: Cleanup GREv6 receive path, call common GRE functions") Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 4 May 2016 18:11:32 +0000 (14:11 -0400)]
Merge branch 'gre-teb'
Jiri Benc says:
====================
gre: receive also TEB packets for lwtunnels
NOTE: this patchset needs net merged to net-next.
This allows lwtunnel users to get also packets with ETH_P_TEB protocol
specified in GRE header through an ipgre interface. There's really nothing
special about these packets in the case of lwtunnels - it's just an inner
protocol like any other. The only complications stem from keeping
compatibility with other uses of GRE.
This will be used by openvswitch to support eth_push and eth_pop actions.
I'd also like to see tc support for lwtunnels (this feature included) in the
future.
The first patch is not directly related and can be submitted standalone if
needed.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Tue, 3 May 2016 15:10:08 +0000 (17:10 +0200)]
gre: receive also TEB packets for lwtunnels
For ipgre interfaces in collect metadata mode, receive also traffic with
encapsulated Ethernet headers. The lwtunnel users are supposed to sort this
out correctly. This allows to have mixed Ethernet + L3-only traffic on the
same lwtunnel interface. This is the same way as VXLAN-GPE behaves.
To keep backwards compatibility and prevent any surprises, gretap interfaces
have priority in receiving packets with Ethernet headers.
Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Tue, 3 May 2016 15:10:06 +0000 (17:10 +0200)]
gre: remove superfluous pskb_may_pull
The call to gre_parse_header is either followed by iptunnel_pull_header, or
in the case of ICMP error path, the actual header is not accessed at all.
In the first case, iptunnel_pull_header will call pskb_may_pull anyway and
it's pointless to do it twice. The only difference is what call will fail
with what error code but the net effect is still the same in all call sites.
In the second case, pskb_may_pull is pointless, as skb->data is at the outer
IP header and not at the GRE header.
Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This series introduces new features and upgrades for mlx5 etherenet SRIOV,
while the first patch provides a bug fixes for a compilation issue introduced
buy the previous aRFS series for when CONFIG_RFS_ACCEL=y and CONFIG_MLX5_CORE_EN=n.
Changes from V0:
- 1st patch: Don't add a new Kconfig flag. Instead, compile out en_arfs.c \
contents when CONFIG_RFS_ACCEL=n
SRIOV upgrades:
- Use synchronize_irq instead of the vport events spin_lock
- Fix memory leak in error flow
- Added full VST support
- Spoofcheck support
- Trusted VF promiscuous and allmulti support
VST and Spoofcheck in details:
- Adding Low level firmware commands support for creating ACLs
(Access Control Lists) Flow tables. ACLs are regular flow tables with
the only exception that they are bound to a specific e-Switch vport (VF)
and they can be one of two types
> egress ACL: filters traffic going from e-Switch to VF.
> ingress ACL: filters traffic going from VF to e-Switch.
- Ingress/Egress ACLs (per vport) for VF VST mode filtering.
- Ingress/Egress ACLs (per vport) for VF spoofcheck filtering.
- Ingress/Egress ACLs (per vport) configuration:
> Created only when at least one of (VST, spoofcheck) is configured.
> if (!spoofchk && !vst) allow all traffic. i.e. no ACLs.
> if (spoofchk && vst) allow only untagged traffic with smac=original mac \
sent from the VF.
> if (spoofchk && !vst) allow only traffic with smac=original mac sent from \
the VF. > if (!spoofchk && vst) allow only untagged traffic.
Trusted VF promiscuous and allmulti support in details:
- Added two flow groups for allmulti and promisc VFs to the e-Switch FDB table
> Allmulti group: One rule that forwards any mcast traffic coming from
either uplink or VFs/PF vports.
> Promisc group: One rule that forwards all unmatched traffic coming from \
uplink.
- Add vport context change event handling for promisc and allmulti
If VF is trusted respect the request and:
> if allmulti request: add the vport to the allmulti group.
and to all other L2 mcast address in the FDB table.
> if promisc request: add the vport to the promisc group.
> Note: A promisc VF can only see traffic that was not explicitly matched to
or requested by any other VF.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- Add support to configure trusted vf attribute through trust_vf_ndo.
- Upon VF trust setting change we update vport context to refresh
allmulti/promisc or any trusted vf attributes that we didn't trust the
VF for before.
- Lock the eswitch state lock on vport event in order to synchronise the
vport context updates , this will prevent contention with vport trust
setting change which will trigger vport mac list update.
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Add promisc_change as a trigger to vport context change event.
Add set vport promisc/allmulti functions to add vport to promiscuous
flowtable rules.
Upon promisc/allmulti rx mode vf request add the vport to
the relevant promiscuous group (Allmulti/Promisc group) so the relevant
traffic will be forwarded to it.
Upon allmulti vf request add the vport to each existing multicast fdb
rule.
Upon adding/removing mcast address from a vport, update all other
allmulti vports.
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/mlx5: E-Switch, Add promiscuous and allmulti FDB flowtable groups
Add promiscuous and allmulti steering groups in FDB table.
Besides the full match L2 steering rules group, we added
two more groups to catch the "miss" rules traffic:
* Allmulti group: One rule that forwards any mcast traffic coming from
either uplink or VFs/PF vports
* Promisc group: One rule that forwards all unmatched traffic coming
from uplink.
Needed for downstream privileged VF promisc and allmulti support.
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/mlx5: E-Switch, Use vport event handler for vport cleanup
Remove the usage of explicit cleanup function and use existing vport
change handler. Calling vport change handler while vport
is disabled will cleanup the vport resources.
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/mlx5: E-Switch, Enable/disable ACL tables on demand
Enable ingress/egress ACL tables only when we need to configure ACL
rules.
Disable ingress/egress ACL tables once all ACL rules are removed.
All VF outgoing/incoming traffic need to go through the ingress/egress ACL
tables.
Adding/Removing these tables on demand will save unnecessary hops in the
flow steering when the ACL tables are empty.
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/mlx5: E-Switch, Vport ingress/egress ACLs rules for spoofchk
Configure ingress and egress vport ACL rules according to spoofchk
admin parameters.
Ingress ACL flow table rules:
if (!spoofchk && !vst) allow all traffic.
else :
1) one of the following rules :
* if (spoofchk && vst) allow only untagged traffic with smac=original
mac sent from the VF.
* if (spoofchk && !vst) allow only traffic with smac=original mac sent
from the VF.
* if (!spoofchk && vst) allow only untagged traffic.
2) drop all traffic that didn't hit #1.
Add support for set vf spoofchk ndo.
Add non zero mac validation in case of spoofchk to set mac ndo:
when setting new mac we need to validate that the new mac is
not zero while the spoofchk is on because it is illegal
combination.
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/mlx5: E-Switch, Vport ingress/egress ACLs rules for VST mode
Configure ingress and egress vport ACL rules according to
vlan and qos admin parameters.
Ingress ACL flow table rules:
1) drop any tagged packet sent from the VF
2) allow other traffic (default behavior)
Egress ACL flow table rules:
1) allow only tagged traffic with vlan_tag=vst_vid.
2) drop other traffic.
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Create egress/ingress ACLs per VF vport at vport enable.
Ingress ACL:
- one flow group to drop all tagged traffic in VST mode.
Egress ACL:
- one flow group that allows only untagged traffic with
smac that is equals to the original mac (anti-spoofing).
- one flow group that allows only untagged traffic.
- one flow group that allows only smac that is equals
to the original mac (anti-spoofing).
(note: only one of the above group has active rule)
- star rule will be used to drop all other traffic.
By default no rules are generated, unless VST is explicitly requested.
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/mlx5: E-Switch, Replace vport spin lock with synchronize_irq()
Vport spin lock can be replaced with synchronize_irq() in the right
place, this will remove the need of locking inside irq context.
Locking in esw_enable_vport is not required since vport events are yet
to be enabled, and at esw_disable_vport it is sufficient to
synchronize_irq() to guarantee no further vport events handlers will be
scheduled.
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Update the relevant flow steering device structs and commands to
support vport.
Update the flow steering core API to receive vport number.
Add ingress and egress ACL flow table name spaces.
Add ACL flow table support:
* ACL (Access Control List) flow table is a table that contains
only allow/drop steering rules.
* We have two types of ACL flow tables - ingress and egress.
* ACLs handle traffic sent from/to E-Switch FDB table, Ingress refers to
traffic sent from Vport to E-Switch and Egress refers to traffic sent
from E-Switch to vport.
* Ingress ACL flow table allow/drop rules is checked against traffic
sent from VF.
* Egress ACL flow table allow/drop rules is checked against traffic sent
to VF.
Signed-off-by: Mohamad Haj Yahia <mohamad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Maor Gottlieb [Tue, 3 May 2016 14:13:53 +0000 (17:13 +0300)]
net/mlx5e: Fix aRFS compilation dependency
en_arfs.o should be compiled only if both CONFIG_MLX5_CORE_EN
and CONFIG_RFS_ACCEL are enabled. en_arfs calls to rps_may_expire_flow
which is compiled only if CONFIG_RFS_ACCEL is defined.
Move en_arfs.o compilation dependency to be under CONFIG_MLX5_CORE_EN
and wrap the en_arfs.c content with ifdef of CONFIG_RFS_ACCEL.
Fixes: 1cabe6b0965e ('net/mlx5e: Create aRFS flow tables') Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 4 May 2016 17:59:28 +0000 (13:59 -0400)]
Merge branch 'cxgb4-mbox'
Hariprasad Shenai says:
====================
cxgb4: mbox enhancements for cxgb4
This patch series checks for firmware errors when we are waiting for
mbox response in a loop and breaks out. When negative timeout is passed
to mailbox code, don't sleep. Negative timeout is passed only from
interrupt context.
This patch series has been created against net-next tree and includes
patches on cxgb4 driver.
We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
cxgb4: Don't sleep when mbox cmd is issued from interrupt context
When link goes down, from the interrupt handler DCB priority for the
Tx queues needs to be unset. We issue mbox command to unset the Tx queue
priority with negative timeout. In t4_wr_mbox_meat_timeout() do not sleep
when negative timeout is passed, since it is called from interrupt context.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 4 May 2016 17:32:29 +0000 (13:32 -0400)]
Merge branch 'tunnel-features-and-gso-partial'
Alexander Duyck says:
====================
Fix Tunnel features and enable GSO partial for several drivers
This patch series is meant to allow us to get the best performance possible
for Mellanox ConnectX-3/4 and Broadcom NetXtreme-C/E adapters in terms of
VXLAN and GRE tunnels.
The first 3 patches address issues I found in regards to GSO_PARTIAL and
TSO_MANGLEID.
The next 4 patches go through and enable GSO_PARTIAL for VXLAN tunnels that
have an outer checksum enabled, and then enable IPv6 support where I can.
One outstanding issue is that I wasn't able to get offloads working with
outer IPv6 headers on mlx4. However that wasn't a feature that was enabled
before so it isn't technically a regression, however I believe Engineers
from Mellanox said they would look into it since they thought it should be
supported.
The last patch enables GSO_PARTIAL for VXLAN and GRE tunnels on the bnxt
driver. One piece of feedback I received on the patch was that the
hardware has globally set IPv6 UDP tunnels to always have the checksum
field computed. I plan to work with Broadcom to get that addressed so that
we only populate the checksum field if it was requested by the network
stack.
v2: Rebased patches off of latest changes to the mlx4/mlx5 drivers.
Added bnxt driver patch as I received feedback on the RFC.
v3: Moved 2 patches into series for net as they were generic fixes.
Added patch to disable GSO partial if frame is less than 2x size of MSS
There are outstanding issues as called out above that need to be
addressed, however they were present before these patches so it isn't
as if they introduce a regression. In addition gains can be easily
seen so there should be no issue with applying the driver patches while
the IPv6 mlx4_en and bnxt issues are being researched.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 2 May 2016 16:38:55 +0000 (09:38 -0700)]
bnxt: Add support for segmentation of tunnels with outer checksums
This patch assumes that the bnxt hardware will ignore existing IPv4/v6
header fields for length and checksum as well as the length and checksum
fields for outer UDP and GRE headers.
I have been told by Michael Chan that this is working. Though this might
be somewhat redundant for IPv6 as they are forcing the checksum to be
computed for all IPv6 frames that are offloaded. A follow-up patch may be
necessary in order to fix this as it is essentially mangling the outer IPv6
headers to add a checksum where none was requested.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 2 May 2016 16:38:49 +0000 (09:38 -0700)]
net/mlx5e: Fix IPv6 tunnel checksum offload
The mlx5 driver exposes support for TSO6 but not IPv6 csum for hardware
encapsulated tunnels. This leads to issues as it triggers warnings in
skb_checksum_help as it ends up being called as we report supporting the
segmentation but not the checksumming for IPv6 frames.
This patch corrects that and drops 2 features that don't actually need to
be supported in hw_enc_features since they are Rx features and don't
actually impact anything by being present in hw_enc_features.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 2 May 2016 16:38:43 +0000 (09:38 -0700)]
net/mlx5e: Add support for UDP tunnel segmentation with outer checksum offload
This patch assumes that the mlx5 hardware will ignore existing IPv4/v6
header fields for length and checksum as well as the length and checksum
fields for outer UDP headers.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 2 May 2016 16:38:37 +0000 (09:38 -0700)]
net/mlx4_en: Add support for inner IPv6 checksum offloads and TSO
>From what I can tell the ConnectX-3 will support an inner IPv6 checksum and
segmentation offload, however it cannot support outer IPv6 headers. This
assumption is based on the fact that I could see the checksum being
offloaded for inner header on IPv4 tunnels, but not on IPv6 tunnels.
For this reason I am adding the feature to the hw_enc_features and adding
an extra check to the features_check call that will disable GSO and
checksum offload in the case that the encapsulated frame has an outer IP
version of that is not 4. The check in mlx4_en_features_check could be
removed if at some point in the future a fix is found that allows the
hardware to offload segmentation/checksum on tunnels with an outer IPv6
header.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 2 May 2016 16:38:30 +0000 (09:38 -0700)]
net/mlx4_en: Add support for UDP tunnel segmentation with outer checksum offload
This patch assumes that the mlx4 hardware will ignore existing IPv4/v6
header fields for length and checksum as well as the length and checksum
fields for outer UDP headers.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 2 May 2016 16:38:24 +0000 (09:38 -0700)]
net: Fix netdev_fix_features so that TSO_MANGLEID is only available with TSO
This change makes it so that we will strip the TSO_MANGLEID bit if TSO is
not present. This way we will also handle ECN correctly of TSO is not
present.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 2 May 2016 16:38:18 +0000 (09:38 -0700)]
gso: Only allow GSO_PARTIAL if we can checksum the inner protocol
This patch addresses a possible issue that can occur if we get into any odd
corner cases where we support TSO for a given protocol but not the checksum
or scatter-gather offload. There are few drivers floating around that
setup their tunnels this way and by enforcing the checksum piece we can
avoid mangling any frames.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 2 May 2016 16:38:12 +0000 (09:38 -0700)]
gso: Do not perform partial GSO if number of partial segments is 1 or less
In the event that the number of partial segments is equal to 1 we don't
really need to perform partial segmentation offload. As such we should
skip multiplying the MSS and instead just clear the partial_segs value
since it will not provide any gain to advertise the frame as being GSO when
it is a single frame.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Tue, 3 May 2016 13:00:21 +0000 (15:00 +0200)]
gre: change gre_parse_header to return the header length
It's easier for gre_parse_header to return the header length instead of
filing it into a parameter. That way, the callers that don't care about the
header length can just check whether the returned value is lower than zero.
In gre_err, the tunnel header must not be pulled. See commit b7f8fe251e46
("gre: do not pull header in ICMP error processing") for details.
This patch reduces the conflict between the mentioned commit and commit 95f5c64c3c13 ("gre: Move utility functions to common headers").
Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 3 May 2016 04:49:25 +0000 (21:49 -0700)]
tcp: guarantee forward progress in tcp_sendmsg()
Under high rx pressure, it is possible tcp_sendmsg() never has a
chance to allocate an skb and loop forever as sk_flush_backlog()
would always return true.
Fix this by calling sk_flush_backlog() only if one skb had been
allocated and filled before last backlog check.
Fixes: d41a69f1d390 ("tcp: make tcp_sendmsg() aware of socket backlog") Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Nguyen [Wed, 27 Apr 2016 21:14:14 +0000 (14:14 -0700)]
ixgbevf: Remove unused parameter
ixgbevf_update_xcast_mode() is not using the netdev parameter;
removing it since it's unnecessary.
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
ixgbe: Disable DCB and FCoE for X550EM_x and x550em_a
This patch adds IXGBE_FLAG_DCB_CAPABLE flag that is set
for all MACs other than X550EM_x and x550em_a. DCB and
FCoE is disabled for these MACS. DCB initialization
code is moved to a separate function.
Signed-off-by: Usha Ketineni <usha.k.ketineni@intel.com> Tested-by: Ronald Bynoe <ronald.j.bynoe@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Fri, 22 Apr 2016 17:18:26 +0000 (13:18 -0400)]
ixgbevf: Use mac_ops instead of trying to identify NIC type
This change makes it so that we can just use function pointers instead of
having to identify if a given VF is running on a Linux or Windows PF. By
doing this we can avoid having to pull too much information out of the
lower layers and can instead just make use of the mac_ops pointers since
they should differ between the two types of VFs anyway.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
ixgbevf: Change the relaxed order settings in VF driver for sparc
We noticed performance issues with VF interface on sparc compared
to PF. Setting the RX to IXGBE_DCA_RXCTRL_DATA_WRO_EN brings it
on far with PF. Also this matches to the default sparc setting in
PF driver.
Signed-off-by: Babu Moger <babu.moger@oracle.com> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
ixgbe: Revise populating few registers and macro definitions
Revise populating few registers in ixgbe_get_regs() and macro
definitions.
Before applying patch:
$ du -k objs/drivers/net/ethernet/intel/ixgbe/ixgbe.ko
8572 objs/drivers/net/ethernet/intel/ixgbe/ixgbe.ko
After applying patch:
$ du -k objs/drivers/net/ethernet/intel/ixgbe/ixgbe.ko
8568 objs/drivers/net/ethernet/intel/ixgbe/ixgbe.ko
Signed-off-by: Preethi Banala <preethi.banala@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Emil Tantilov [Thu, 21 Apr 2016 18:37:12 +0000 (11:37 -0700)]
ixgbe: check EEPROM for WOL support for X540 and above
This change aims to simplify the logic we use to determine WOL
support by reading the EEPROM bits for MACs X540 and newer.
Also some cleanups in ixgbe_wol_supported() - changed return type to
bool and removed redundant return variable by simply using return after
the checks.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Emil Tantilov [Thu, 21 Apr 2016 18:37:06 +0000 (11:37 -0700)]
ixgbe: add WoL support for some 82599 subdevice IDs
We had some 82599 subdevice IDs missing from the list of parts that
support WoL.
Reported-by: Neil Horman <nhorman@redhat.com> Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
KY Srinivasan [Wed, 20 Apr 2016 02:17:57 +0000 (19:17 -0700)]
ixgbevf: Support Windows hosts (Hyper-V)
On Hyper-V, the VF/PF communication is a via software mediated path
as opposed to the hardware mailbox. Make the necessary
adjustments to support Hyper-V.
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
ixgbe: Match on multiple headers for cls_u32 offloads
Adds support to set filters with multiple header fields (L3,L4)to match on.
This is achieved in the following order:
1. Create a leaf hash table for the next header.
2. Create a link to the leaf hash table from the base hash table with
matches on next header type and current header fields.
3. Add filter in leaf hash table with match on next header fields and
action.
Verified with the following filters :
Match TCP and DIP:
handle 1: u32 divisor 1
u32 ht 800: order 1 link 1: \
offset at 0 mask 0f00 shift 6 plus 0 eat \
match ip protocol 6 ff match ip dst 10.0.0.1/32
match tcp src 28 ffff action drop
Delete the filter:
Match on DIP, SIP, UDP (SPort, DPort):
handle 2: u32 divisor 1
u32 ht 800: order 2 link 2: \
offset at 0 mask 0f00 shift 6 plus 0 eat \
match ip dst 15.0.0.2/32 match ip protocol 17 ff \
match ip src 15.0.0.1/32
match udp src 30 ffff match udp dst 32 ffff action drop
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
ixgbe: Add support for redirect action to cls_u32 offloads
This patch enables 'redirect' to a SRIOV VF or a offloaded macvlan
device queue via tc 'mirred' action.
Verified with the following script that creates SRIOV VFs, offloaded
macvlan and adds tc u32 filters with redirect action to the associated
netdevs.
# add ingress qdisc.
tc qdisc add dev p4p1 ingress
# enable hw tc offload.
ethtool -K p4p1 hw-tc-offload on
# create 4 sriov VFs and bring up the first one.
echo 4 > /sys/class/net/p4p1/device/sriov_numvfs
sleep 1
ip link set p4p1 up
ip link set p4p1_0 up
# create a offloaded macvlan device and bring it up.
ethtool -K p4p1 l2-fwd-offload on
ip link add link p4p1 name mvlan_1 type macvlan
ip link set mvlan_1 up
# add u32 filter with action to redirect to VF netdev
tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
handle 800:0:1 u32 ht 800: \
match ip src 192.168.1.3/32 \
action mirred egress redirect dev p4p1_0
# add u32 filter with action to redirect to macvlan netdev
tc filter add dev p4p1 parent ffff: protocol ip prio 99 \
handle 800:0:2 u32 ht 800: \
match ip src 192.168.2.3/32 \
action mirred egress redirect dev mvlan_1
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Pull networking fixes from David Miller:
"Some straggler bug fixes:
1) Batman-adv DAT must consider VLAN IDs when choosing candidate
nodes, from Antonio Quartulli.
2) Fix botched reference counting of vlan objects and neigh nodes in
batman-adv, from Sven Eckelmann.
3) netem can crash when it sees GSO packets, the fix is to segment
then upon ->enqueue. Fix from Neil Horman with help from Eric
Dumazet.
4) Fix VXLAN dependencies in mlx5 driver Kconfig, from Matthew
Finlay.
5) Handle VXLAN ops outside of rcu lock, via a workqueue, in mlx5,
since it can sleep. Fix also from Matthew Finlay.
6) Check mdiobus_scan() return values properly in pxa168_eth and macb
drivers. From Sergei Shtylyov.
7) If the netdevice doesn't support checksumming, disable
segmentation. From Alexandery Duyck.
8) Fix races between RDS tcp accept and sending, from Sowmini
Varadhan.
9) In macb driver, probe MDIO bus before we register the netdev,
otherwise we can try to open the device before it is really ready
for that. Fix from Florian Fainelli.
10) Netlink attribute size for ILA "tunnels" not calculated properly,
fix from Nicolas Dichtel"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
ipv6/ila: fix nlsize calculation for lwtunnel
net: macb: Probe MDIO bus before registering netdev
RDS: TCP: Synchronize accept() and connect() paths on t_conn_lock.
RDS:TCP: Synchronize rds_tcp_accept_one with rds_send_xmit when resetting t_sock
vxlan: Add checksum check to the features check function
net: Disable segmentation if checksumming is not supported
net: mvneta: Remove superfluous SMP function call
macb: fix mdiobus_scan() error check
pxa168_eth: fix mdiobus_scan() error check
net/mlx5e: Use workqueue for vxlan ops
net/mlx5e: Implement a mlx5e workqueue
net/mlx5: Kconfig: Fix MLX5_EN/VXLAN build issue
net/mlx5: Unmap only the relevant IO memory mapping
netem: Segment GSO packets on enqueue
batman-adv: Fix reference counting of hardif_neigh_node object for neigh_node
batman-adv: Fix reference counting of vlan object for tt_local_entry
batman-adv: B.A.T.M.A.N V - make sure iface is reactivated upon NETDEV_UP event
batman-adv: fix DAT candidate selection (must use vid)
Linus Torvalds [Tue, 3 May 2016 21:23:58 +0000 (14:23 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
"Fix a regression and update the MAINTAINERS entry for fuse"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: update mailing list in MAINTAINERS
fuse: Fix return value from fuse_get_user_pages()
Nicolas Dichtel [Tue, 3 May 2016 07:58:27 +0000 (09:58 +0200)]
ipv6/ila: fix nlsize calculation for lwtunnel
The handler 'ila_fill_encap_info' adds one attribute: ILA_ATTR_LOCATOR.
Fixes: 65d7ab8de582 ("net: Identifier Locator Addressing module") CC: Tom Herbert <tom@herbertland.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang [Tue, 3 May 2016 04:40:07 +0000 (21:40 -0700)]
ipv6: add new struct ipcm6_cookie
In the sendmsg function of UDP, raw, ICMP and l2tp sockets, we use local
variables like hlimits, tclass, opt and dontfrag and pass them to corresponding
functions like ip6_make_skb, ip6_append_data and xxx_push_pending_frames.
This is not a good practice and makes it hard to add new parameters.
This fix introduces a new struct ipcm6_cookie similar to ipcm_cookie in
ipv4 and include the above mentioned variables. And we only pass the
pointer to this structure to corresponding functions. This makes it easier
to add new parameters in the future and makes the function cleaner.
Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: macb: Probe MDIO bus before registering netdev
The current sequence makes us register for a network device prior to
registering and probing the MDIO bus which could lead to some unwanted
consequences, like a thread of execution calling into ndo_open before
register_netdev() returns, while the MDIO bus is not ready yet.
Rework the sequence to register for the MDIO bus, and therefore attach
to a PHY prior to calling register_netdev(), which implies reworking the
error path a bit.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 3 May 2016 20:03:45 +0000 (16:03 -0400)]
Merge branch 'rds-fixes'
Sowmini Varadhan says:
====================
RDS: TCP: sychronization during connection startup
This patch series ensures that the passive (accept) side of the
TCP connection used for RDS-TCP is correctly synchronized with
any concurrent active (connect) attempts for a given pair of peers.
Patch 1 in the series makes sure that the t_sock in struct
rds_tcp_connection is only reset after any threads in rds_tcp_xmit
have completed (otherwise a null-ptr deref may be encountered).
Patch 2 synchronizes rds_tcp_accept_one() with the rds_tcp*connect()
path.
v2: review comments from Santosh Shilimkar, other spelling corrections
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
RDS: TCP: Synchronize accept() and connect() paths on t_conn_lock.
An arbitration scheme for duelling SYNs is implemented as part of
commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()") which ensures that both nodes
involved will arrive at the same arbitration decision. However, this
needs to be synchronized with an outgoing SYN to be generated by
rds_tcp_conn_connect(). This commit achieves the synchronization
through the t_conn_lock mutex in struct rds_tcp_connection.
The rds_conn_state is checked in rds_tcp_conn_connect() after acquiring
the t_conn_lock mutex. A SYN is sent out only if the RDS connection is
not already UP (an UP would indicate that rds_tcp_accept_one() has
completed 3WH, so no SYN needs to be generated).
Similarly, the rds_conn_state is checked in rds_tcp_accept_one() after
acquiring the t_conn_lock mutex. The only acceptable states (to
allow continuation of the arbitration logic) are UP (i.e., outgoing SYN
was SYN-ACKed by peer after it sent us the SYN) or CONNECTING (we sent
outgoing SYN before we saw incoming SYN).
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
RDS:TCP: Synchronize rds_tcp_accept_one with rds_send_xmit when resetting t_sock
There is a race condition between rds_send_xmit -> rds_tcp_xmit
and the code that deals with resolution of duelling syns added
by commit 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()").
Specifically, we may end up derefencing a null pointer in rds_send_xmit
if we have the interleaving sequence:
rds_tcp_accept_one rds_send_xmit
The race condition can be avoided without adding the overhead of
additional locking in the xmit path: have rds_tcp_accept_one wait
for rds_tcp_xmit threads to complete before resetting callbacks.
The synchronization can be done in the same manner as rds_conn_shutdown().
First set the rds_conn_state to something other than RDS_CONN_UP
(so that new threads cannot get into rds_tcp_xmit()), then wait for
RDS_IN_XMIT to be cleared in the conn->c_flags indicating that any
threads in rds_tcp_xmit are done.
Fixes: 241b271952eb ("RDS-TCP: Reset tcp callbacks if re-using an
outgoing socket in rds_tcp_accept_one()") Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 2 May 2016 17:56:27 +0000 (10:56 -0700)]
net: add __sock_wfree() helper
Hosts sending lot of ACK packets exhibit high sock_wfree() cost
because of cache line miss to test SOCK_USE_WRITE_QUEUE
We could move this flag close to sk_wmem_alloc but it is better
to perform the atomic_sub_and_test() on a clean cache line,
as it avoid one extra bus transaction.
skb_orphan_partial() can also have a fast track for packets that either
are TCP acks, or already went through another skb_orphan_partial()
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 3 May 2016 20:00:55 +0000 (16:00 -0400)]
Merge branch 'tunnel-csum-and-sg-offloads'
Alexander Duyck says:
====================
Fixes for tunnel checksum and segmentation offloads
This patch series is a subset of patches I had submitted for net-next. I
plan to drop these two patches from the v3 of "Fix Tunnel features and
enable GSO partial for several drivers" and I am instead submitting them
for net since these are truly fixes and likely will need to be backported
to stable branches.
This series addresses 2 specific issues. The first is that we could
request TSO on a v4 inner header while not supporting checksum offload of
the outer IPv6 header. The second is that we could request an IPv6 inner
checksum offload without validating that we could actually support an inner
IPv6 checksum offload.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 2 May 2016 16:25:16 +0000 (09:25 -0700)]
vxlan: Add checksum check to the features check function
We need to perform an additional check on the inner headers to determine if
we can offload the checksum for them. Previously this check didn't occur
so we would generate an invalid frame in the case of an IPv6 header
encapsulated inside of an IPv4 tunnel. To fix this I added a secondary
check to vxlan_features_check so that we can verify that we can offload the
inner checksum.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>