Vivien Didelot [Mon, 1 May 2017 18:05:12 +0000 (14:05 -0400)]
net: dsa: mv88e6xxx: move VTU Operation accessors
Move the helper functions to access the Global 1 VTU Operation register
to a new global1_vtu.c file, and get rid of the old underscore prefix
naming convention. This file will be extended will all VTU/STU related
code.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Mon, 1 May 2017 18:05:11 +0000 (14:05 -0400)]
net: dsa: mv88e6xxx: split VTU entry data member
VLAN aware Marvell chips can program 802.1Q VLAN membership as well as
802.1s per VLAN Spanning Tree state using the same 3 VTU Data registers.
Some chips such as 88E6185 use different Data registers offsets for
ports state and membership, and program them in a single operation.
Other chips such as 88E6352 use the same register layout but program
them in distinct operations (an indirect table is used for 802.1s.)
Newer chips such as 88E6390 use the same offsets for both state and
membership in distinct operations, thus require multiple data accesses.
To correctly abstract this, split the "data" structure member of
mv88e6xxx_vtu_entry in two "state" and "member" members, before adding
VTU support for newer chips.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Mon, 1 May 2017 18:05:10 +0000 (14:05 -0400)]
net: dsa: mv88e6xxx: add max VID to info
Some chips don't have a VLAN Table Unit, most of them do have a 4K
table, some others as the 88E6390 family has a 13th bit for the VID.
Add a new max_vid member to the info structure, used to check the
presence of a VTU as well as the value used to iterate from in VTU
GetNext operations.
This makes the MV88E6XXX_FLAG_VTU obsolete, thus remove it.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Ilan Tayari [Sun, 30 Apr 2017 13:51:19 +0000 (16:51 +0300)]
xfrm: Indicate xfrm_state offload errors
Current code silently ignores driver errors when configuring
IPSec offload xfrm_state, and falls back to host-based crypto.
Fail the xfrm_state creation if the driver has an error, because
the NIC offloading was explicitly requested by the user program.
This will communicate back to the user that there was an error.
Fixes: d77e38e612a0 ("xfrm: Add an IPsec hardware offloading API") Signed-off-by: Ilan Tayari <ilant@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ilan Tayari [Sun, 30 Apr 2017 13:34:38 +0000 (16:34 +0300)]
net/esp4: Fix invalid esph pointer crash
Both esp_output and esp_xmit take a pointer to the ESP header
and place it in esp_info struct prior to calling esp_output_head.
Inside esp_output_head, the call to esp_output_udp_encap
makes sure to update the pointer if it gets invalid.
However, if esp_output_head itself calls skb_cow_data, the
pointer is not updated and stays invalid, causing a crash
after esp_output_head returns.
Update the pointer if it becomes invalid in esp_output_head
Fixes: fca11ebde3f0 ("esp4: Reorganize esp_output") Signed-off-by: Ilan Tayari <ilant@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The IPv6 tunneling code tries to insert IPV6_TLV_TNL_ENCAP_LIMIT and
IPV6_TLV_PADN options when an encapsulation limit is defined (the
default is a limit of 4). An MTU adjustment is done to account for
these options as well. However, the options are never present in the
generated packets.
The issue appears to be a subtlety between IPV6_DSTOPTS and
IPV6_RTHDRDSTOPTS defined in RFC 3542. When the IPIP tunnel driver was
written, the encap limit options were included as IPV6_RTHDRDSTOPTS in
dst0opt of struct ipv6_txoptions. Later, ipv6_push_nfrags_opts was
(correctly) updated to require IPV6_RTHDR options when IPV6_RTHDRDSTOPTS
are to be used. This caused the options to no longer be included in v6
encapsulated packets.
The fix is to use IPV6_DSTOPTS (in dst1opt of struct ipv6_txoptions)
instead. IPV6_DSTOPTS do not have the additional IPV6_RTHDR requirement.
Fixes: 1df64a8569c7: ("[IPV6]: Add ip6ip6 tunnel driver.") Fixes: 333fad5364d6: ("[IPV6]: Support several new sockopt / ancillary data in Advanced API (RFC3542)") Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Here's one last batch of Bluetooth patches in the bluetooth-next tree
targeting the 4.12 kernel.
- Remove custom ECDH implementation and use new KPP API instead
- Add protocol checks to hci_ldisc
- Add module license to HCI UART Nokia H4+ driver
- Minor fix for 32bit user space - 64 bit kernel combination
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Liam Beguin [Mon, 1 May 2017 15:02:01 +0000 (11:02 -0400)]
switchdev: documentation: fix whitespace issues
Figure 1 is full of whitespaces; fix it
Signed-off-by: Liam Beguin <lbeguin@tycoint.com> Signed-off-by: Sylvain Lemieux <slemieux@tycoint.com> Acked-by: Ivan Vecera <ivecera@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When a netdev is enslaved to a VRF master, its router interface (RIF)
needs to be destroyed (if exists) and a new one created using the
corresponding virtual router (VR).
>From the driver's perspective, the above is equivalent to an inetaddr
event sent for this netdev. Therefore, when a port netdev (or its
uppers) are enslaved to a VRF master, call the same function that
would've been called had a NETDEV_UP was sent for this netdev in the
inetaddr notification chain.
This patch also fixes a bug when a LAG netdev with an existing RIF is
enslaved to a VRF. Before this patch, each LAG port would drop the
reference on the RIF, but would re-join the same one (in the wrong VR)
soon after. With this patch, the corresponding RIF is first destroyed
and a new one is created using the correct VR.
Fixes: 7179eb5acd59 ("mlxsw: spectrum_router: Add support for VRFs") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 1 May 2017 15:47:10 +0000 (11:47 -0400)]
Merge tag 'mlx5-updates-2017-04-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
mlx5-updates-2017-04-30
Or says:
================
mlx5 neigh update
This series (whose code name is 'neigh update') from Hadar, enhances the
mlx5 TC IP tunnel offloads to deal with changes to tunnel destination
neighbours used in offloaded flows which involved encapsulation.
In order to keep track on the validity state of such neighbours, we register
a netevent notifier callback and act on NEIGH_UPDATE events: if a neighbour
becomes valid, offload the related flows to HW (the other way around when
neigh becomes invalid) and similarly when a neigh mac addresses changes.
Since this traffic is offloaded from the host OS, the neighbour for the IP
tunnel destination can mistakenly become STALE and deleted by the kernel
since its 'used' value wasn't changed. To address that, we proactively
update the neighbour 'used' value every DELAY_PROBE_TIME seconds, using
time stamps generated by the existing driver code for HW flow counters.
We use the DELAY_PROBE_TIME_UPDATE event to adjust the frequency of the updates.
Prior to the core of the series, there's a patch from Saeed that introduces an
extendable vport representor implementation scheme. It provides a separation
between the eswitch to the netdev related aspects of the representors.
We would like to thank Ido Schimmel and Ilya Lesokhin for their coaching && advice
through the long design and review cycles while we struggled to understand and
(hopefully correctly) implement the locking around the different driver flows(..) .
- Or.
=================
Misc Updates:
From Tariq:
Some small performance and trivial code optimization for mlx5 netdev driver
- Optimize poll ICOSQ completion queue
- Use prefetchw when a write is to follow
- Use u8 as ownership type in mlx5e_get_cqe()
From Eran:
- Disable LRO by default on specific setups
From Eli:
- Small cleanup for E-Switch to avoid redundant allocation
Signed-off-by: David S. Miller <davem@davemloft.net>
After removing the PTP related initialization from slowpath start,
the remaining PTT entry is required only in case CONFIG_RFS_ACCEL is set.
Otherwise, it leads to a warning due to it being unused.
Fixes: d179bd1699fc ("qed: Acquire/release ptt_ptp lock when enabling/disabling PTP") Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 1 May 2017 15:42:16 +0000 (11:42 -0400)]
Merge branch 'qed-RoCE-fixes'
Yuval Mintz says:
====================
qed: RoCE related pseudo-fixes
This series contains multiple small corrections to the RoCE logic
in qed plus some debug information and inter-module parameter
meant to prevent issues further along.
- #1, #6 Share information with protocol driver
[either new or filling missing bits in existing API].
- #2, #3 correct error flows in qed.
- #4 add debug related information.
- #5 fixes a minor issue in the HW configuration.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Output to the RDMA driver whether DPM mode is enabled or disabled in
the HW and if so what is the number of WIDs it supports
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When calculating doorbell BAR partitioning round up the number of
CPUs to the nearest power of 2 so the size of the DPI (per user
section) configured in the hardware will be stored properly and
not truncated.
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Add mechanism to verify RoCE resources are released prior to freeing the
bitmaps. If this is not the case, print what resources were not released.
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
qed: add error handling flow to TID deregistratin posting failure
If the posting of the ramrod for the purpose of TID deregistration
fails, abort the deregistration operation without using the FW's
return code.
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The internal RoCE SQE QP state isn't being used. Instead we mark the
QP as in regular error state.
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ram Amrani <Ram.Amrani@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yonghong Song [Sun, 30 Apr 2017 05:52:42 +0000 (22:52 -0700)]
bpf: enhance verifier to understand stack pointer arithmetic
llvm 4.0 and above generates the code like below:
....
440: (b7) r1 = 15
441: (05) goto pc+73
515: (79) r6 = *(u64 *)(r10 -152)
516: (bf) r7 = r10
517: (07) r7 += -112
518: (bf) r2 = r7
519: (0f) r2 += r1
520: (71) r1 = *(u8 *)(r8 +0)
521: (73) *(u8 *)(r2 +45) = r1
....
and the verifier complains "R2 invalid mem access 'inv'" for insn #521.
This is because verifier marks register r2 as unknown value after #519
where r2 is a stack pointer and r1 holds a constant value.
Teach verifier to recognize "stack_ptr + imm" and
"stack_ptr + reg with const val" as valid stack_ptr with new offset.
Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Since several of the the netlink attributes used to configure the flower
classifier's MPLS TC, BOS and Label fields have additional bits which are
unused, check those bits to ensure that they are actually 0 as suggested
by Jamal.
Signed-off-by: Benjamin LaHaise <benjamin.lahaise@netronome.com> Cc: David Miller <davem@davemloft.net> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Simon Horman <simon.horman@netronome.com> Cc: Jakub Kicinski <kubakici@wp.pl> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
Netfilter/IPVS updates for net-next
The following patchset contains Netfilter updates for your net-next
tree. A large bunch of code cleanups, simplify the conntrack extension
codebase, get rid of the fake conntrack object, speed up netns by
selective synchronize_net() calls. More specifically, they are:
1) Check for ct->status bit instead of using nfct_nat() from IPVS and
Netfilter codebase, patch from Florian Westphal.
2) Use kcalloc() wherever possible in the IPVS code, from Varsha Rao.
5) Enforce expectation limit from userspace conntrack helper,
from Gao Feng.
6) Add nf_ct_remove_expect() helper function, from Gao Feng.
7) NAT mangle helper function return boolean, from Gao Feng.
8) ctnetlink_alloc_expect() should only work for conntrack with
helpers, from Gao Feng.
9) Add nfnl_msg_type() helper function to nfnetlink to build the
netlink message type.
10) Get rid of unnecessary cast on void, from simran singhal.
11) Use seq_puts()/seq_putc() instead of seq_printf() where possible,
also from simran singhal.
12) Use list_prev_entry() from nf_tables, from simran signhal.
13) Remove unnecessary & on pointer function in the Netfilter and IPVS
code.
14) Remove obsolete comment on set of rules per CPU in ip6_tables,
no longer true. From Arushi Singhal.
15) Remove duplicated nf_conntrack_l4proto_udplite4, from Gao Feng.
16) Remove unnecessary nested rcu_read_lock() in
__nf_nat_decode_session(). Code running from hooks are already
guaranteed to run under RCU read side.
17) Remove deadcode in nf_tables_getobj(), from Aaron Conole.
18) Remove double assignment in nf_ct_l4proto_pernet_unregister_one(),
also from Aaron.
19) Get rid of unsed __ip_set_get_netlink(), from Aaron Conole.
20) Don't propagate NF_DROP error to userspace via ctnetlink in
__nf_nat_alloc_null_binding() function, from Gao Feng.
21) Revisit nf_ct_deliver_cached_events() to remove unnecessary checks,
from Gao Feng.
22) Kill the fake untracked conntrack objects, use ctinfo instead to
annotate a conntrack object is untracked, from Florian Westphal.
23) Remove nf_ct_is_untracked(), now obsolete since we have no
conntrack template anymore, from Florian.
24) Add event mask support to nft_ct, also from Florian.
25) Move nf_conn_help structure to
include/net/netfilter/nf_conntrack_helper.h.
26) Add a fixed 32 bytes scratchpad area for conntrack helpers.
Thus, we don't deal with variable conntrack extensions anymore.
Make sure userspace conntrack helper doesn't go over that size.
Remove variable size ct extension infrastructure now this code
got no more clients. From Florian Westphal.
27) Restore offset and length of nf_ct_ext structure to 8 bytes now
that wraparound is not possible any longer, also from Florian.
28) Allow to get rid of unassured flows under stress in conntrack,
this applies to DCCP, SCTP and TCP protocols, from Florian.
29) Shrink size of nf_conntrack_ecache structure, from Florian.
30) Use TCP_MAX_WSCALE instead of hardcoded 14 in TCP tracker,
from Gao Feng.
31) Register SYNPROXY hooks on demand, from Florian Westphal.
32) Use pernet hook whenever possible, instead of global hook
registration, from Florian Westphal.
33) Pass hook structure to ebt_register_table() to consolidate some
infrastructure code, from Florian Westphal.
34) Use consume_skb() and return NF_STOLEN, instead of NF_DROP in the
SYNPROXY code, to make sure device stats are not fooled, patch
from Gao Feng.
35) Remove NF_CT_EXT_F_PREALLOC this kills quite some code that we
don't need anymore if we just select a fixed size instead of
expensive runtime time calculation of this. From Florian.
36) Constify nf_ct_extend_register() and nf_ct_extend_unregister(),
from Florian.
37) Simplify nf_ct_ext_add(), this kills nf_ct_ext_create(), from
Florian.
38) Attach NAT extension on-demand from masquerade and pptp helper
path, from Florian.
39) Get rid of useless ip_vs_set_state_timeout(), from Aaron Conole.
40) Speed up netns by selective calls of synchronize_net(), from
Florian Westphal.
41) Silence stack size warning gcc in 32-bit arch in snmp helper,
from Florian.
42) Inconditionally call nf_ct_ext_destroy(), even if we have no
extensions, to deal with the NF_NAT_MANIP_SRC case. Patch from
Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
samples/bpf: fix XDP_FLAGS_SKB_MODE detach for xdp_tx_iptunnel
The xdp_tx_iptunnel program can be terminated in two ways, after
N-seconds or via Ctrl-C SIGINT. The SIGINT code path does not
handle detatching the correct XDP program, in-case the program
was attached with XDP_FLAGS_SKB_MODE.
Fix this by storing the XDP flags as a global variable, which is
available for the SIGINT handler function.
Fixes: 3993f2cb983b ("samples/bpf: Add support for SKB_MODE to xdp1 and xdp_tx_iptunnel") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: David S. Miller <davem@davemloft.net>
samples/bpf: fix SKB_MODE flag to be a 32-bit unsigned int
The kernel side of XDP_FLAGS_SKB_MODE is unsigned, and the rtnetlink
IFLA_XDP_FLAGS is defined as NLA_U32. Thus, userspace programs under
samples/bpf/ should use the correct type.
Fixes: 3993f2cb983b ("samples/bpf: Add support for SKB_MODE to xdp1 and xdp_tx_iptunnel") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 1 May 2017 14:35:48 +0000 (10:35 -0400)]
Merge branch 'xdp-netlink-ext-ack'
Jakub Kicinski says:
====================
xdp: use netlink extended ACK reporting
This series is an attempt to make XDP more user friendly by
enabling exploiting the recently added netlink extended ACK
reporting to carry messages to user space.
David Ahern's iproute2 ext ack patches for ip link are sufficient
to show the errors like this:
Error: nfp: MTU too large w/ XDP enabled
Where the message is coming directly from the driver. There could
still be a bit of a leap for a complete novice from the message
above to the right settings, but it's a big improvement over the
standard "Invalid argument" message.
v1/non-rfc:
- add a separate macro in patch 1;
- add KBUILD_MODNAME as part of the message (Daniel);
- don't print the error to logs in patch 1.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 1 May 2017 04:46:47 +0000 (21:46 -0700)]
nfp: make use of extended ack message reporting
Try to carry error messages to the user via the netlink extended
ack message attribute.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 1 May 2017 04:46:46 +0000 (21:46 -0700)]
xdp: propagate extended ack to XDP setup
Drivers usually have a number of restrictions for running XDP
- most common being buffer sizes, LRO and number of rings.
Even though some drivers try to be helpful and print error
messages experience shows that users don't often consult
kernel logs on netlink errors. Try to use the new extended
ack mechanism to carry the message back to user space.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 1 May 2017 04:46:45 +0000 (21:46 -0700)]
netlink: add NULL-friendly helper for setting extended ACK message
As we propagate extended ack reporting throughout various paths in
the kernel it may be that the same function is called with the
extended ack parameter passed as NULL. One place where that happens
is in drivers which have a centralized reconfiguration function
called both from ndos and from ethtool_ops. Add a new helper for
setting the error message in such conditions.
Existing helper is left as is to encourage propagating the ext act
fully wherever possible. It also makes it clear in the code which
messages may be lost due to ext ack being NULL.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
netfilter: nf_ct_ext: invoke destroy even when ext is not attached
For NF_NAT_MANIP_SRC, we will insert the ct to the nat_bysource_table,
then remove it from the nat_bysource_table via nat_extend->destroy.
But now, the nat extension is attached on demand, so if the nat extension
is not attached, we will not be notified when the ct is destroyed, i.e.
we may fail to remove ct from the nat_bysource_table.
So just keep it simple, even if the extension is not attached, we will
still invoke the related ext->destroy. And this will also preserve the
flexibility for the future extension.
Fixes: 9a08ecfe74d7 ("netfilter: don't attach a nat extension by default") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nf_queue: only call synchronize_net twice if nf_queue is active
nf_unregister_net_hook(s) can avoid a second call to synchronize_net,
provided there is no nfqueue active in that net namespace (which is
the common case).
This also gets rid of the extra arg to nf_queue_nf_hook_drop(), normally
this gets called during netns cleanup so no packets should be queued.
For the rare case of base chain being unregistered or module removal
while nfqueue is in use the extra hiccup due to the packet drops isn't
a big deal.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: batch synchronize_net calls during hook unregister
synchronize_net is expensive and slows down netns cleanup a lot.
We have two APIs to unregister a hook:
nf_unregister_net_hook (which calls synchronize_net())
and
nf_unregister_net_hooks (calls nf_unregister_net_hook in a loop)
Make nf_unregister_net_hook a wapper around new helper
__nf_unregister_net_hook, which unlinks the hook but does not free it.
Then, we can call that helper in nf_unregister_net_hooks and then
call synchronize_net() only once.
Andrey Konovalov reports this change improves syzkaller fuzzing speed at
least twice.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch allows users to enable/disable internal TX and/or RX
clock delay for BCM5481x series PHYs so as to satisfy RGMII timing
specifications.
On a particular platform, whether TX and/or RX clock delay is required
depends on how PHY connected to the MAC IP. This requirement can be
specified through "phy-mode" property in the platform device tree.
Signed-off-by: Abhishek Shah <abhishek.shah@broadcom.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Wood [Sat, 29 Apr 2017 00:17:41 +0000 (19:17 -0500)]
bnx2x: Align RX buffers
The bnx2x driver is not providing proper alignment on the receive buffers it
passes to build_skb(), causing skb_shared_info to be misaligned.
skb_shared_info contains an atomic, and while PPC normally supports
unaligned accesses, it does not support unaligned atomics.
Aligning the size of rx buffers will ensure that page_frag_alloc() returns
aligned addresses.
This can be reproduced on PPC by setting the network MTU to 1450 (or other
non-multiple-of-4) and then generating sufficient inbound network traffic
(one or two large "wget"s usually does it), producing the following oops:
net: bridge: Fix improper taking over HW learned FDB
Commit 7e26bf45e4cb ("net: bridge: allow SW learn to take over HW fdb
entries") added the ability to "take over an entry which was previously
learned via HW when it shows up from a SW port".
However, if an entry was learned via HW and then a control packet
(e.g., ARP request) was trapped to the CPU, the bridge driver will
update the entry and remove the externally learned flag, although the
entry is still present in HW. Instead, only clear the externally learned
flag in case of roaming.
Fixes: 7e26bf45e4cb ("net: bridge: allow SW learn to take over HW fdb entries") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Arkadi Sharashevsky <arkadis@mellanox.com> Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Fri, 28 Apr 2017 17:04:29 +0000 (10:04 -0700)]
ipv4: get rid of ip_ra_lock
After commit 1215e51edad1 ("ipv4: fix a deadlock in ip_ra_control")
we always take RTNL lock for ip_ra_control() which is the only place
we update the list ip_ra_chain, so the ip_ra_lock is no longer needed.
As Eric points out, BH does not need to disable either, RCU readers
don't care.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
samples/bpf: bpf_load.c detect and abort if ELF maps section size is wrong
The struct bpf_map_def was extended in commit fb30d4b71214 ("bpf: Add tests
for map-in-map") with member unsigned int inner_map_idx. This changed the size
of the maps section in the generated ELF _kern.o files.
Unfortunately the loader in bpf_load.c does not detect or handle this. Thus,
older _kern.o files became incompatible, and caused hard-to-debug errors
where the syscall validation rejected BPF_MAP_CREATE request.
This patch only detect the situation and aborts load_bpf_file(). It also
add code comments warning people that read this loader for inspiration
for these pitfalls.
Fixes: fb30d4b71214 ("bpf: Add tests for map-in-map") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Fri, 28 Apr 2017 13:03:48 +0000 (16:03 +0300)]
lwtunnel: fix error path in lwtunnel_fill_encap()
We recently added a check to see if nla_nest_start() fails. There are
two issues with that. First, if it fails then I don't think we should
call nla_nest_cancel(). Second, it's slightly convoluted but the
current code returns success but we should return -EMSGSIZE instead.
Fixes: a50fe0ffd76f ("lwtunnel: check return value of nla_nest_start") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Fri, 28 Apr 2017 12:57:15 +0000 (15:57 +0300)]
liquidio: silence a locking static checker warning
Presumably we never hit this return, but static checkers complain that
we need to unlock so we may as well fix that.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Felix Manlunas <felix.manlunas@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Fri, 28 Apr 2017 12:56:09 +0000 (15:56 +0300)]
qed: Unlock on error in qed_vf_pf_acquire()
My static checker complains that we're holding a mutex on this error
path. Let's goto exit instead of returning directly.
Fixes: b0bccb69eba3 ("qed: Change locking scheme for VF channel") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
In the hip06 and hip07 SoCs, phy connect to mdio bus.The mdio
module is probed with module_init, and, as such,
is not guaranteed to probe before the HNS driver. So we need
to support deferred probe.
We check for probe deferral in the mac init, so we not init DSAF
when there is no mdio, and free all resource, to later learn that
we need to defer the probe.
Signed-off-by: lipeng <lipeng321@huawei.com> Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com> Reviewed-by: Matthias Brugger <mbrugger@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: hns: support deferred probe when can not obtain irq
In the hip06 and hip07 SoCs, the interrupt lines from the
DSAF controllers are connected to mbigen hw module.
The mbigen module is probed with module_init, and, as such,
is not guaranteed to probe before the HNS driver. So we need
to support deferred probe.
Signed-off-by: lipeng <lipeng321@huawei.com> Reviewed-by: Yisen Zhuang <yisen.zhuang@huawei.com> Reviewed-by: Matthias Brugger <mbrugger@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 1 May 2017 02:37:01 +0000 (22:37 -0400)]
Merge branch 'nfp-XDP_TX-optimizations'
Jakub Kicinski says:
====================
nfp: optimize XDP TX and small fixes
This series optimizes the nfp XDP TX performance a little bit.
I run quick tests on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
Single core/queue performance for both touch and drop and touch and
forward is above 20Mpps @64B packets, drop being 2Mpps faster.
I think this is max for a single queue on the low power NFPs.
There are also a few minor fixes included for code in net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 28 Apr 2017 04:06:20 +0000 (21:06 -0700)]
nfp: provide 256 bytes of XDP headroom in all configurations
For legacy reasons NFP FW may be compiled to DMA packets to a constant
offset into the buffer and use the space before it for metadata. This
ensures that packets data always start at a certain offset regardless of
the amount of preceding metadata.
If rx offset is set to 0 there may still be up to 64 bytes of metadata
but metadata will start at the beginning of the buffer, instead of:
data_start_offset = rx_offset - meta_len
Even though we make the buffers larger to accommodate up to 64 bytes of
metadata, if there is only N bytes of metadata, we will end up with
N bytes of headroom and 64 - N bytes of tailroom. Therefore we can't
rely on that space for XDP headroom. Make sure we always allocate
full 256 bytes. This, unfortunately, means we can't fit the headroom
on an u8 any more.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 28 Apr 2017 04:06:19 +0000 (21:06 -0700)]
nfp: don't completely refuse to work with old flashes
Right now the required Service Process ABI version is still tied
to max ID of known commands. For new NSP commands we are adding
we are checking if NSP version is recent enough on command-by-command
basis. The driver doesn't have to force the device to have the
very latest flash, anything newer than 0.8 should do.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 28 Apr 2017 04:06:18 +0000 (21:06 -0700)]
nfp: avoid reading TX queue indexes from the device
Reading TX queue indexes from the device memory on each interrupt
is expensive. It's doubly expensive with XDP running since we have
two TX rings to check there. If the software indexes indicate that
the TX queue is completely empty, however, we don't need to look at
the device completion index at all.
The queuing CPU is doing a wmb() before kicking the device TX so
we should be safe to assume on the CPU handling the completions will
never see old value of the software copy of the index.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 28 Apr 2017 04:06:17 +0000 (21:06 -0700)]
nfp: do simple XDP TX buffer recycling
On the RX path we follow the "drop if allocation of replacement
buffer fails" rule. With XDP we extended that to the TX action,
so if XDP prog returned TX but allocation of replacement RX buffer
failed, we will drop the packet.
To improve our XDP TX performance extend the idea of rings being
always full to XDP TX rings. Pre-fill the XDP TX rings with RX
buffers, and when XDP prog returns TX action swap the RX buffer
with the next buffer from the TX ring.
XDP TX complete will no longer free the buffers but let them
sit on the TX ring and wait for swap with RX buffer, instead.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Thu, 27 Apr 2017 21:40:23 +0000 (22:40 +0100)]
net: Initialise init_net.count to 1
Initialise init_net.count to 1 for its pointer from init_nsproxy lest
someone tries to do a get_net() and a put_net() in a process in which
current->ns_proxy->net_ns points to the initial network namespace.
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
geneve: fix incorrect setting of UDP checksum flag
Creating a geneve link with 'udpcsum' set results in a creation of link
for which UDP checksum will NOT be computed on outbound packets, as can
be seen below.
11: gen0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN
link/ether c2:85:27:b6:b4:15 brd ff:ff:ff:ff:ff:ff promiscuity 0
geneve id 200 remote 192.168.13.1 dstport 6081 noudpcsum
Similarly, creating a link with 'noudpcsum' set results in a creation
of link for which UDP checksum will be computed on outbound packets.
Fixes: 9b4437a5b870 ("geneve: Unify LWT and netdev handling.") Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Acked-by: Lance Richardson <lrichard@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The message "Cannot bind port X, err=Y" creates only confusion. In metadata
based mode, failure of IPv6 socket creation is okay if IPv6 is disabled and
no error message should be printed. But when IPv6 tunnel was requested, such
failure is fatal. The vxlan_socket_create does not know when the error is
harmless and when it's not.
Instead of passing such information down to vxlan_socket_create, remove the
message completely. It's not useful. We propagate the error code up to the
user space and the port number comes from the user space. There's nothing in
the message that the process creating vxlan interface does not know.
Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When IPv6 is compiled but disabled at runtime, __vxlan_sock_add returns
-EAFNOSUPPORT. For metadata based tunnels, this causes failure of the whole
operation of bringing up the tunnel.
Ignore failure of IPv6 socket creation for metadata based tunnels caused by
IPv6 not being available.
Fixes: b1be00a6c39f ("vxlan: support both IPv4 and IPv6 sockets in a single vxlan device") Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Shevchenko [Thu, 27 Apr 2017 13:36:59 +0000 (16:36 +0300)]
bnx2x: Replace custom scnprintf()
Use scnprintf() when printing version instead of custom open coded variants.
Signed-off-by: Andy Shevchenko <andy.shevchenko@gmail.com> Acked-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The description inside uapi/linux/bpf.h about bpf_get_socket_uid
helper function is no longer valid. It returns overflowuid rather
than 0 when failed.
Signed-off-by: Chenbo Feng <fengc@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
avoid direct access to sk->sk_state when tcp_poll() is called on a socket
using active TCP fastopen with deferred connect. Use local variable
'state', which stores the result of sk_state_load(), like it was done in
commit 00fd38d938db ("tcp: ensure proper barriers in lockless contexts").
Fixes: 19f6d3f3c842 ("net/tcp-fastopen: Add new API support") Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 26 Apr 2017 16:09:23 +0000 (09:09 -0700)]
bpf: restore skb->sk before pskb_trim() call
While testing a fix [1] in ___pskb_trim(), addressing the WARN_ON_ONCE()
in skb_try_coalesce() reported by Andrey, I found that we had an skb
with skb->sk set but no skb->destructor.
This invalidated heuristic found in commit 158f323b9868 ("net: adjust
skb->truesize in pskb_expand_head()") and in cited patch.
Considering the BUG_ON(skb->sk) we have in skb_orphan(), we should
restrain the temporary setting to a minimal section.
[1] https://patchwork.ozlabs.org/patch/755570/
net: adjust skb->truesize in ___pskb_trim()
Fixes: 8f917bba0042 ("bpf: pass sk to helper functions") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Since 83a77e9ec415, the phydev irq is explicitly set to PHY_POLL when
there is no pdata. It doesn't work on DT enabled platforms because the
phydev irq is already set by libphy before.
Fixes: 83a77e9ec415 ("net: macb: Added PCI wrapper for Platform Driver.") Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 30 Apr 2017 15:33:37 +0000 (11:33 -0400)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
40GbE Intel Wired LAN Driver Updates 2017-04-30
This series contains updates to i40e and i40evf only.
Jake provides majority of the changes in this series, starting with the
renaming of a flag to avoid confusion. Then renamed a variable to a
more meaningful name to clarify what is actually being done and to
reduce confusion. Amortizes the wait time when initializing or disabling
lots of VFs by using i40e_reset_all_vfs() and
i40e_vsi_stop_rings_no_wait(). Cleaned up a unnecessary delay since
pci_disable_sriov() already has its own delay, so need to add a additional
delay when removing VFs. Avoid using the same name flags for both
vsi->state and pf->state, to make code review easier and assist future
work to use the correct state field when checking bits. Use
DECLARE_BITMAP() to ensure that we always allocate enough space for flags.
Replace hw_disabled_flags with the new _AUTO_DISABLED flags, which are
more readable because we are not setting an *_ENABLED flag to
disable the feature.
Alex corrects a oversight where we were not reprogramming the ports
after a reset, which was causing us to lose all of the receive tunnel
offloads.
Arnd Bergmann moves the declaration of a local variable to avoid a
warning seen on architectures with larger pages about an unused variable.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 30 Apr 2017 15:30:13 +0000 (11:30 -0400)]
Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
1GbE Intel Wired LAN Driver Updates 2017-04-30
This series contains updates to e1000e only.
Jarod Wilson fixes an issue where the workaround for 82574 & 82583
is needed for i218 as well, so set the appropriate flags.
Sasha adds support for the upcoming new i219 devices for the client
platform (CannonLake), which includes the support for 38.4MHz frequency
to support PTP on CannonLake.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the ECDH key generation takes a different path, it needs to be
tested as well. For this generate the public debug key from the private
debug key and compare both.
This also moves the seeding of the private key into the SMP calling code
to allow for easier re-use of the ECDH key generation helper.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
struct esw_mc_addr is a small struct that can be part of struct
mlx5_eswitch. Define it as a field and not as a pointer and save the
kzalloc call and then error flow handling.
Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Eran Ben Elisha [Wed, 26 Apr 2017 10:42:04 +0000 (13:42 +0300)]
net/mlx5e: Disable HW LRO when PCI is slower than link on striding RQ
We will activate the HW LRO only on servers with PCI BW > MAX LINK BW,
or when PCI BW > 16Gbps. On other cases we do not want LRO by default as
LRO sessions might get timeout and add redundant software overhead.
Tested:
ethtool -k <ifs-name> | grep large-receive-offload
On systems with and without the limitations.
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Cc: kernel-team@fb.com Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Tariq Toukan [Wed, 15 Feb 2017 15:05:39 +0000 (17:05 +0200)]
net/mlx5e: Use prefetchw when a write is to follow
"prefetchw()" prefetches the cacheline for write. Use it for
skb->data, as soon we'll be copying the packet header there.
Performance:
Single-stream packet-rate tested with pktgen.
Packets are dropped in tc level to zoom into driver data-path.
Larger gain is expected for smaller packets, as less time
is spent on handling SKB fragments, making the path shorter
and the improvement more significant.
---------------------------------------------
packet size | before | after | gain |
64B | 4,113,306 | 4,778,720 | 16% |
1024B | 3,633,819 | 3,950,593 | 8.7% |
Tariq Toukan [Wed, 29 Mar 2017 08:46:10 +0000 (11:46 +0300)]
net/mlx5e: Optimize poll ICOSQ completion queue
UMR operations are more frequent and important.
Check them first, and add a compiler branch predictor hint.
According to current design, ICOSQ CQ can contain at most one
pending CQE per napi. Poll function is optimized accordingly.
Performance:
Single-stream packet-rate tested with pktgen.
Packets are dropped in tc level to zoom into driver data-path.
Larger gain is expected for larger packet sizes, as BW is higher
and UMR posts are more frequent.
---------------------------------------------
packet size | before | after | gain |
64B | 4,092,370 | 4,113,306 | 0.5% |
1024B | 3,421,435 | 3,633,819 | 6.2% |
Hadar Hen Zion [Tue, 14 Feb 2017 09:20:24 +0000 (11:20 +0200)]
net/mlx5e: Act on delay probe time updates
The user can change delay_first_probe_time parameter through sysctl.
Listen to NETEVENT_DELAY_PROBE_TIME_UPDATE notifications and update the
intervals for updating the neighbours 'used' value periodic task and
for flow HW counters query periodic task.
Both of the intervals will be update only in case the new delay prob
time value is lower the current interval.
Since the driver saves only one min interval value and not per device,
the users will be able to set lower interval value for updating
neighbour 'used' value periodic task but they won't be able to schedule
a higher interval for this periodic task.
The used interval for scheduling neighbour 'used' value periodic task is
the minimal delay prob time parameter ever seen by the driver.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Hadar Hen Zion [Fri, 24 Feb 2017 10:16:33 +0000 (12:16 +0200)]
net/mlx5e: Update neighbour 'used' state using HW flow rules counters
When IP tunnel encapsulation rules are offloaded, the kernel can't see
the traffic of the offloaded flow. The neighbour for the IP tunnel
destination of the offloaded flow can mistakenly become STALE and
deleted by the kernel since its 'used' value wasn't changed.
To make sure that a neighbour which is used by the HW won't become
STALE, we proactively update the neighbour 'used' value every
DELAY_PROBE_TIME period, when packets were matched and counted by the HW
for one of the tunnel encap flows related to this neighbour.
The periodic task that updates the used neighbours is scheduled when a
tunnel encap rule is successfully offloaded into HW and keeps re-scheduling
itself as long as the representor's neighbours list isn't empty.
Add, remove, lookup and status change operations done over the
representor's neighbours list or the neighbour hash entry encaps list
are all serialized by RTNL lock.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Hadar Hen Zion [Mon, 20 Mar 2017 10:56:47 +0000 (12:56 +0200)]
net/mlx5e: Add support to neighbour update flow
In order to offload TC encap rules, the driver does a lookup for the IP
tunnel neighbour according to the output device and the destination IP
given by the user.
To keep tracking after the validity state of such neighbours, we keep
the neighbours information (pair of device pointer and destination IP)
in a hash table maintained at the relevant egress representor and
register to get NETEVENT_NEIGH_UPDATE events. When getting neighbour update
netevent, we search for a match among the cached neighbours entries used for
encapsulation.
In case the neighbour isn't valid, we can't offload the flow into the
HW. We cache the flow (requested matching and actions) in the driver and
offload the rule later, when the neighbour is resolved and becomes
valid.
When a flow is only cached in the driver and not offloaded into HW
yet, we use EAGAIN return value to mark it internally, the TC ndo still
returns success.
Listen to kernel neighbour update netevents to trace relevant neighbours
validity state:
1. If a neighbour becomes valid, offload the related rules to HW.
2. If the neighbour becomes invalid, remove the related rules from HW.
3. If the neighbour mac address was changed, update the encap header.
Remove all the offloaded rules using the old encap header from the HW
and insert new rules to HW with updated encap header.
Access to the neighbors hash table is protected by RTNL lock of its
caller or by the table's spinlock.
Details of the locking/synchronization among the different actions
applied on the neighbour table:
Add/remove operations - protected by RTNL lock of its caller (all TC
commands are protected by RTNL lock). Add and remove operations are
initiated only when the user inserts/removes a TC rule into/from the driver.
Lookup/remove operations - since the lookup operation is done from
netevent notifier block, RTNL lock can't be used (atomic context).
Use the table's spin lock to protect lookups from TC user removal operation.
bh is used since netevent can be called from a softirq context.
Lookup/add operations - The hash table access functions are taking
care of the protection between lookup and add operations.
When adding/removing encap headers and rules to/from the HW, RTNL lock
is used. It can happen when:
1. The user inserts/removes a TC rule into/from the driver (TC commands
are protected by RTNL lock of it's caller).
2. The driver gets neighbour notification event, which reports about
neighbour validity status change. Before adding/removing encap headers
and rules to/from the HW, RTNL lock is taken.
A neighbour hash table entry should be freed when its encap list is empty.
Since The neighbour update netevent notification schedules a neighbour
update work that uses the neighbour hash entry, it can't be freed
unconditionally when the encap list becomes empty during TC delete rule flow.
Use reference count to protect from freeing neighbour hash table entry
while it's still in use.
When the user asks to unregister a netdvice used by one of the neigbours,
neighbour removal notification is received. Then we take a reference on the
neighbour and don't free it until the relevant encap entries (and flows) are
marked as invalid (not offloaded) and removed from HW.
As long as the encap entry is still valid (checked under RTNL lock) we
can safely access the neighbour device saved on mlx5e_neigh struct.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Hadar Hen Zion [Thu, 2 Feb 2017 14:43:35 +0000 (16:43 +0200)]
net/mlx5e: Add neighbour hash table to the representors
Add hash table to the representors which is to be used by the next patch
to save neighbours information in the driver.
In order to offload IP tunnel encapsulation rules, the driver must find
the tunnel dst neighbour according to the output device and the
destination address given by the user. The next patch will cache the
neighbors information in the driver to allow support in neigh update
flow for tunnel encap rules.
The neighbour entries are also saved in a list so we easily iterate over
them when querying statistics in order to provide 'used' feedback to the
kernel neighbour NUD core.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Hadar Hen Zion [Thu, 2 Feb 2017 14:14:21 +0000 (16:14 +0200)]
net/mlx5e: Read neigh parameters with proper locking
The nud_state and hardware address fields are protected by the neighbour
lock, we should acquire it before accessing those parameters.
Use this lock to avoid inconsistency between the neighbour validity state
and it's hardware address.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Hadar Hen Zion [Tue, 14 Feb 2017 12:03:45 +0000 (14:03 +0200)]
net/mlx5e: Use flag to properly monitor a flow rule offloading state
Instead of relaying on the 'flow->rule' pointer value which can be
valid or invalid (in case the FW returns an error while trying to offload
the rule), monitor the rule state using a flag.
In downstream patch which adds support to IP tunneling neigh update
flow, a TC rule could be cached in the driver and not offloaded into the
HW. In this case, the flow handle pointer stays NULL.
Check the offloaded flag to properly deal with rules which are currently
not offloaded when querying rule statistics.
This patch doesn't add any new functionality.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Passing output device parameter to the helper functions that deal with
creation of encapsulation headers is redundant. Output device parameter
can be defined inside those helpers, no need to pass it. Refactor the code by
removing the parameter from the function signature.
This patch doesn't change any functionality.
Signed-off-by: Hadar Hen Zion <hadarh@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Or Gerlitz [Tue, 25 Apr 2017 12:30:08 +0000 (15:30 +0300)]
net/mlx5: Remove encap entry pointer from the eswitch flow attributes
Encap wise, the tc eswitch flow attribute struct needs to have
only the encap ID which is programmed later to the HW and none
of the higher level encap params, fix that.
This patch doesn't change any functionality.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
net/mlx5e: Extendable vport representor netdev private data
Make representor netdev private data extendable by adding new struct
"mlx5e_rep_priv" and use it as the rep netdev private data struct
instead of directly pointing to mlx5_eswitch_rep.
Added new en_rep.h header file to contain all representor related
definitions and prototypes, and moved all representor specific logic
into en_rep.c.
Needed for downstream patches to extend representor functionality to
support neighbour update.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Add support for 38.4MHz frequency is required for PTP
on CannonLake. SYSTIM frequency adjustment attributes for TIMINCA are
get/set dependent on the hardware clock frequency for a different
types of adapters. 38.4MHz frequency supported by CannonLake
and active once time synchronisation mechanism was enabled
Changed abbreviation from Hz to HZ to be compliant checkpatch code style
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Reviewed-by: Raanan Avargil <raanan.avargil@intel.com> Reviewed-by: Dima Ruinskiy <dima.ruinskiy@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
i219 (6) and i219 (7) are the next LOM generations that will be
available on the nextIntel Client platform (CannonLake)
This patch provides the initial support for these devices
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com> Reviewed-by: Raanan Avargil <raanan.avargil@intel.com> Reviewed-by: Dima Ruinskiy <dima.ruinskiy@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jarod Wilson [Tue, 26 Jul 2016 18:25:35 +0000 (14:25 -0400)]
e1000e: fix PTP on e1000_pch_lpt variants
I've got reports that the Intel I-218V NIC in Intel NUC5i5RYH systems used
as a PTP slave experiences random ~10 hour clock jumps, which are resolved
if the same workaround for the 82574 and 82583 is employed, so set the
appropriate flag2 in e1000_pch_lpt_info too.
Reported-by: Rupesh Patel <rupatel@redhat.com> Signed-off-by: Jarod Wilson <jarod@redhat.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
On architectures with larger pages, we get a warning about an unused variable:
drivers/net/ethernet/intel/i40evf/i40evf_main.c: In function 'i40evf_configure_rx':
drivers/net/ethernet/intel/i40evf/i40evf_main.c:690:21: error: unused variable 'netdev' [-Werror=unused-variable]
This moves the declaration into the #ifdef to avoid the warning.
Fixes: dab86afdbbd1 ("i40e/i40evf: Change the way we limit the maximum frame size for Rx") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Wed, 19 Apr 2017 13:25:59 +0000 (09:25 -0400)]
i40evf: allocate queues before we setup the interrupts and q_vectors
This matches the ordering of how we free stuff during reset and remove.
It also makes logical sense because we set the interrupts based on the
number of queues. Currently this doesn't really matter in practice.
However a future patch moves the assignment of num_active_queues into
i40evf_alloc_queues, which is required by
i40evf_set_interrupt_capability.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Wed, 19 Apr 2017 13:25:58 +0000 (09:25 -0400)]
i40evf: remove I40E_FLAG_FDIR_ATR_ENABLED
The flag used by the common code and PF code is I40E_FLAG_FD_ATR_ENABLED,
not *FDIR*. It turns out none of the txrx code actually shared with the
VF driver actually checks the ATR flag. This is made even more obvious
by the typo in the VF header file.
Let's just remove the flag from the VF driver since it's not needed.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Wed, 19 Apr 2017 13:25:57 +0000 (09:25 -0400)]
i40e: remove hw_disabled_flags in favor of using separate flag bits
The hw_disabled_flags field was added as a way of signifying that
a feature was automatically or temporarily disabled. However, we
actually only use this for FDir features. Replace its use with new
_AUTO_DISABLED flags instead. This is more readable, because you aren't
setting an *_ENABLED flag to *disable* the feature.
Additionally, clean up a few areas where we used these bits. First, we
don't really need to set the auto-disable flag for ATR if we're fully
disabling the feature via ethtool.
Second, we should always clear the auto-disable bits in case they somehow
got set when the feature was disabled. However, avoid displaying
a message that we've re-enabled the feature.
Third, we shouldn't be re-enabling ATR in the SB ntuple add flow,
because it might have been disabled due to space constraints. Instead,
we should just wait for the fdir_check_and_reenable to be called by the
watchdog.
Overall, this change allows us to simplify some code by removing an
extra field we didn't need, and the result should make it more clear as
to what we're actually doing with these flags.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Wed, 19 Apr 2017 13:25:56 +0000 (09:25 -0400)]
i40evf: remove needless min_t() on num_online_cpus()*2
We already set pairs to the value of adapter->num_active_queues. This
value is limited by vsi_res->num_queue_pairs and num_online_cpus(). This
means that pairs by definition is already smaller than
num_online_cpus()*2, so we don't even need to bother with this check.
Lets just remove it and update the comment.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Wed, 19 Apr 2017 13:25:55 +0000 (09:25 -0400)]
i40e: use DECLARE_BITMAP for state fields
Instead of assuming our flags fit within an unsigned long, use
DECLARE_BITMAP which will ensure that we always allocate enough space.
Additionally, use __I40E_STATE_SIZE__ markers as the last element of the
enumeration so that the size of the BITMAP is compile-time assigned
rather than programmer-time assigned. This ensures that potential future
flag additions do not actually overrun the array. This is especially
important as 32bit systems would only have 32bit longs instead of 64bit
longs as we generally have assumed in the prior code.
This change also removes a dereference of the state fields throughout
the code, so it does have a bit of code churn. The conversions were
automated using sed replacements with an alternation
For debugfs, we modify the printing so that we can display chunks of the
state value on new lines. This ensures that we can print the entire set
of state values. Additionally, we now print them as 08lx to ensure that
they display nicely.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Fri, 21 Apr 2017 20:38:05 +0000 (13:38 -0700)]
i40e: separate PF and VSI state flags
Avoid using the same named flags for both vsi->state and pf->state. This
makes code review easier, as it is more likely that future authors will
use the correct state field when checking bits. Previous commits already
found issues with at least one check, and possibly others may be
incorrect.
This reduces confusion as it is more clear what each flag represents,
and which flags are valid for which state field.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Wed, 19 Apr 2017 13:25:53 +0000 (09:25 -0400)]
i40e: remove unnecessary msleep() delay in i40e_free_vfs
The delay was added because of a desire to ensure that the VF driver can
finish up removing. However, pci_disable_sriov already has its own
ssleep() call that will sleep for an entire second, so there is no
reason to add extra delay on top of this by using msleep here. In
practice, an msleep() won't have a huge impact on timing but there is no
real value in keeping it, so lets just simplify the code and remove it.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>