Robert Shearman [Fri, 19 Feb 2016 09:43:16 +0000 (09:43 +0000)]
lwtunnel: autoload of lwt modules
The lwt implementations using net devices can autoload using the
existing mechanism using IFLA_INFO_KIND. However, there's no mechanism
that lwt modules not using net devices can use.
Therefore, add the ability to autoload modules registering lwt
operations for lwt implementations not using a net device so that
users don't have to manually load the modules.
Only users with the CAP_NET_ADMIN capability can cause modules to be
loaded, which is ensured by rtnetlink_rcv_msg rejecting non-RTM_GETxxx
messages for users without this capability, and by
lwtunnel_build_state not being called in response to RTM_GETxxx
messages.
Signed-off-by: Robert Shearman <rshearma@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Zhang Shengju [Thu, 18 Feb 2016 10:00:11 +0000 (10:00 +0000)]
vlan: turn on unicast filtering on vlan device
Currently vlan device inherits unicast filtering flag from underlying
device. If underlying device doesn't support unicast filter, this will
put vlan device into promiscuous mode when it's stacked.
Turn on IFF_UNICAST_FLT on the vlan device in any case so that it does
not go into promiscuous mode needlessly. If underlying device does not
support unicast filtering, that device will enter promiscuous mode.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 20 Feb 2016 05:21:44 +0000 (00:21 -0500)]
Merge branch 'bpf-get-stackid'
Alexei Starovoitov says:
====================
bpf_get_stackid() and stack_trace map
This patch set introduces a new map type to store stack traces and
the corresponding bpf_get_stackid() helper.
BPF programs can already walk the stack via an unrolled loop
of bpf_probe_read()s, which is ok for simple analysis, but it's
not efficient and is limited to <30 frames; beyond that the programs
don't fit into MAX_BPF_STACK. With the bpf_get_stackid() helper
the programs can collect up to PERF_MAX_STACK_DEPTH frames, both
user and kernel.
Using stack traces as a key in a map turned out to be very useful
for generating flame graphs, off-cpu graphs, waker and chain graphs.
Patch 3 is a simplified version of 'offwaketime' tool which is
described in detail here:
http://brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html
Earlier versions of this patch used the save_stack_trace() helper,
but 'unreliable' frames add too much noise and two equivalent
stack traces produce different 'stackid's.
Using the lockdep style of storing frames with MAX_STACK_TRACE_ENTRIES is
great for lockdep, but not acceptable for bpf, since the stack_trace
map needs to be freed when the user hits Ctrl-C on the tool.
The ftrace style with per_cpu(struct ftrace_stack) is great, but it's
tightly coupled with ftrace ring buffer and has the same 'unreliable'
noise. perf_event's perf_callchain() mechanism is also very efficient
and it only needed minor generalization which is done in patch 1
to be used by bpf stack_trace maps.
Peter, please take a look at patch 1.
If you're ok with it, I'd like to take the whole set via net-next.
Patch 1 - generalization of perf_callchain()
Patch 2 - stack_trace map done as lock-less hashtable without linked list
to avoid spinlock on insertion, which is the critical path when
the bpf_get_stackid() helper is called for every task switch event
Patch 3 - offwaketime example
After the patch, the 'perf report' for an artificial 'sched_bench'
benchmark that does pthread_cond_wait/signal, with the 'offwaketime'
example running in the background:
16.35% swapper [kernel.vmlinux] [k] intel_idle
2.18% sched_bench [kernel.vmlinux] [k] __switch_to
2.18% sched_bench libpthread-2.12.so [.] pthread_cond_signal@@GLIBC_2.3.2
1.72% sched_bench libpthread-2.12.so [.] pthread_mutex_unlock
1.53% sched_bench [kernel.vmlinux] [k] bpf_get_stackid
1.44% sched_bench [kernel.vmlinux] [k] entry_SYSCALL_64
1.39% sched_bench [kernel.vmlinux] [k] __call_rcu.constprop.73
1.13% sched_bench libpthread-2.12.so [.] pthread_mutex_lock
1.07% sched_bench libpthread-2.12.so [.] pthread_cond_wait@@GLIBC_2.3.2
1.07% sched_bench [kernel.vmlinux] [k] hash_futex
1.05% sched_bench [kernel.vmlinux] [k] do_futex
1.05% sched_bench [kernel.vmlinux] [k] get_futex_key_refs.isra.13
The hottest part of bpf_get_stackid() is inlined jhash2, so we may consider
using some faster hash in the future, but it's good enough for now.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a simplified version of Brendan Gregg's offwaketime:
This program shows kernel stack traces and task names that were blocked and
"off-CPU", along with the stack traces and task names for the threads that woke
them, and the total elapsed time from when they blocked to when they were woken
up. The combined stacks, task names, and total time are summarized in kernel
context for efficiency.
Example:
$ sudo ./offwaketime | flamegraph.pl > demo.svg
Open demo.svg in the browser as FlameGraph visualization.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
add new map type to store stack traces and corresponding helper
bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
@ctx: struct pt_regs*
@map: pointer to stack_trace map
@flags: bits 0-7 - number of stack frames to skip
        bit 8 - collect user stack instead of kernel
        bit 9 - compare stacks by hash only
        bit 10 - if two different stacks hash into the same stackid
                 discard old
        other bits - reserved
Return: >= 0 stackid on success or negative error
stackid is a 32-bit integer handle that can be further combined with
other data (including other stackid) and used as a key into maps.
Userspace will access stackmap using standard lookup/delete syscall commands to
retrieve full stack trace for given stackid.
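As a rough illustration of the flag layout described above, the following user-space sketch decodes a flags word and combines two stackids into one map key. The macro, struct and function names are made up for this example and are not taken from the kernel headers; rejecting reserved bits with an error is an assumption based on the "reserved" wording.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names only; they mirror the bit layout in the text above. */
#define STACKID_SKIP_MASK   0xffULL       /* bits 0-7: frames to skip */
#define STACKID_USER_STACK  (1ULL << 8)   /* bit 8: user stack instead of kernel */
#define STACKID_HASH_CMP    (1ULL << 9)   /* bit 9: compare stacks by hash only */
#define STACKID_DISCARD_OLD (1ULL << 10)  /* bit 10: on collision, discard old */
#define STACKID_KNOWN_BITS  (STACKID_SKIP_MASK | STACKID_USER_STACK | \
                             STACKID_HASH_CMP | STACKID_DISCARD_OLD)

struct stackid_flags {
    unsigned skip;       /* number of frames to skip */
    bool user_stack;
    bool hash_only_cmp;
    bool discard_old;
};

/* Returns 0 and fills *out, or -1 if any reserved bit is set (assumed policy). */
int stackid_decode_flags(uint64_t flags, struct stackid_flags *out)
{
    if (flags & ~STACKID_KNOWN_BITS)
        return -1;
    out->skip = (unsigned)(flags & STACKID_SKIP_MASK);
    out->user_stack = (flags & STACKID_USER_STACK) != 0;
    out->hash_only_cmp = (flags & STACKID_HASH_CMP) != 0;
    out->discard_old = (flags & STACKID_DISCARD_OLD) != 0;
    return 0;
}

/* A stackid is a 32-bit handle, so two of them fit into one 64-bit key. */
uint64_t stackid_pair_key(uint32_t first, uint32_t second)
{
    return ((uint64_t)first << 32) | second;
}
```

A key built this way (for example from the stackid of the blocked task and the stackid of its waker) is one possible way to index a hash map; other data could be packed alongside in the same manner.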
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 19 Feb 2016 23:29:30 +0000 (00:29 +0100)]
net: use skb_postpush_rcsum instead of own implementations
Replace individual implementations with the recently introduced
skb_postpush_rcsum() helper.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Tom Herbert <tom@herbertland.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Fri, 19 Feb 2016 23:35:29 +0000 (00:35 +0100)]
phy: marvell/micrel: Fix Unpossible condition
commit 2b2427d06426 ("phy: micrel: Add ethtool statistics counters")
from Dec 30, 2015, leads to the following static checker
warning:
drivers/net/phy/micrel.c:609 kszphy_get_stat()
warn: unsigned 'val' is never less than zero.
drivers/net/phy/micrel.c
   602  static u64 kszphy_get_stat(struct phy_device *phydev, int i)
   603  {
   604          struct kszphy_hw_stat stat = kszphy_hw_stats[i];
   605          struct kszphy_priv *priv = phydev->priv;
   606          u64 val;
   607
   608          val = phy_read(phydev, stat.reg);
   609          if (val < 0) {
                    ^^^^^^^
                    Unpossible!
   610                  val = UINT64_MAX;
   611          } else {
   612                  val = val & ((1 << stat.bits) - 1);
   613                  priv->stats[i] += val;
   614                  val = priv->stats[i];
   615          }
   616
   617          return val;
   618  }
The same problem exists in the Marvell driver. Fix both.
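One way to avoid this class of bug, sketched here in plain user-space C, is to keep the raw read result in a signed int so the error check is meaningful, and only widen to 64 bits afterwards. This is an illustration of the idea, not the patch itself: read_stat() and the two fake readers are hypothetical stand-ins for the driver function and phy_read(), and bits is assumed to be smaller than 64.

```c
#include <stdint.h>

/* Hypothetical register reader: returns a value >= 0, or a negative error. */
typedef int (*reg_read_fn)(int reg);

/* Fake readers for demonstration only. */
int fake_read_ok(int reg)  { (void)reg; return 0x1234; }
int fake_read_err(int reg) { (void)reg; return -5; }

/*
 * Accumulate the low 'bits' bits of a register into *accum.
 * The read result stays in a signed int, so "val < 0" can actually be true.
 */
uint64_t read_stat(reg_read_fn read_reg, int reg, unsigned bits, uint64_t *accum)
{
    int val = read_reg(reg);

    if (val < 0)
        return UINT64_MAX;      /* report the error, leave *accum untouched */

    *accum += (uint64_t)val & ((1ULL << bits) - 1);
    return *accum;
}
```

With the result typed as u64, the negative return of the read would be converted to a huge positive number and silently added to the counter, which is what the static checker warning points at.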
Fixes: 2b2427d06426 ("phy: micrel: Add ethtool statistics counters") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: Julia.Lawall <julia.lawall@lip6.fr> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 20 Feb 2016 03:54:10 +0000 (22:54 -0500)]
Merge branch 'ethtool-perqueue-params'
Kan Liang says:
====================
ethtool per queue parameters support
Modern network interface controllers usually support multiple receive
and transmit queues. Each queue may have its own parameters. For
example, Intel XL710/X710 hardware supports per queue interrupt
moderation. However, current ethtool does not support a per queue
parameters option. The user has to set parameters for the whole NIC.
This series extends ethtool to support a per queue parameters option.
Since the support of per queue parameters varies with different cards,
it is impossible to address all cards in one patch. This series only
supports per queue coalesce options on i40e driver. The framework used
in the patch can be easily extended to other cards and parameters.
The lib bitmap needs to be extended to facilitate exchanging queue bitmaps
between user space and kernel space. Two patches from David's latest V8
patch series are also cited in this series. You may refer to
https://lkml.org/lkml/2016/2/9/919 for more details.
Changes since V6:
- Rebase on commit 76d13b568776. Did minor change in patch 6.
Changes since V5:
- Add test_bitmap.c and bitmap.sh to the series. They were mistakenly
left out previously.
- Update the first two patches to David's latest V8 version. The changes
include
- bitmap u32 API returns number of bits copied, unit tests updated
- module_exit in test_bitmap
- Also change the mode of bitmap.sh to 755 according to Ben's suggestion
Changes since V4:
- Modify set/get_per_queue_coalesce function description
- Change the queue number to be u32
- Correct an error of calculating coalesce backup buffer address
- Rename queue_num to n_queues
- Don't log error message in __i40e_get_coalesce
Changes since V3:
- Based on David's lib bitmap.
- ETHTOOL_PERQUEUE should be handled before the containing switch
- Make the rollback code unconditional
- some minor changes according to Ben's feedback
Changes since V2:
- Add queue-specific settings for interrupt moderation in i40e
Changes since V1:
- Checking the sub-command number to determine whether the command
requires CAP_NET_ADMIN
- Refine the struct ethtool_per_queue_op and improve the comments
- Use bitmap functions to parse queue mask
- Improve comments
- Add rollback support
- Correct the way to find the vector for specific queue.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Kan Liang [Fri, 19 Feb 2016 14:24:06 +0000 (09:24 -0500)]
i40e/ethtool: support coalesce setting by queue
This patch implements set_per_queue_coalesce for i40e driver.
Signed-off-by: Kan Liang <kan.liang@intel.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kan Liang [Fri, 19 Feb 2016 14:24:05 +0000 (09:24 -0500)]
i40e/ethtool: support coalesce getting by queue
This patch implements get_per_queue_coalesce for i40e driver.
Signed-off-by: Kan Liang <kan.liang@intel.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kan Liang [Fri, 19 Feb 2016 14:24:04 +0000 (09:24 -0500)]
i40e: queue-specific settings for interrupt moderation
For i40e driver, each vector has its own ITR register. However, there
is no concept of queue-specific settings in the driver proper. Only a
global variable is used to store ITR values. That causes problems,
especially when resetting the vector: the specific ITR values could be
lost.
This patch moves rx_itr_setting and tx_itr_setting to i40e_ring to store
the specific ITR value for each queue.
i40e_get_coalesce and i40e_set_coalesce are also modified accordingly to
support queue-specific settings. To make it compatible with old ethtool,
if the user doesn't specify the queue number, i40e_get_coalesce will return
queue 0's value, while i40e_set_coalesce will apply the value to all queues.
Signed-off-by: Kan Liang <kan.liang@intel.com> Acked-by: Shannon Nelson <shannon.nelson@intel.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kan Liang [Fri, 19 Feb 2016 14:24:03 +0000 (09:24 -0500)]
net/ethtool: support set coalesce per queue
This patch implements sub command ETHTOOL_SCOALESCE for ioctl
ETHTOOL_PERQUEUE. It introduces an interface set_per_queue_coalesce to
set coalesce of each masked queue to device driver. The wanted coalesce
information is stored in "data" for each masked queue, which is copied
from userspace.
If setting coalesce on the device driver fails, a rollback is attempted
for the values that were already set on specific queues.
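The set-then-rollback behaviour can be sketched with a toy device model. Everything here is hypothetical (struct toy_dev, toy_set(), set_masked_queues(), the fixed queue count); it only shows the pattern of backing up each masked queue before changing it and restoring the already-changed queues when a later one fails.

```c
#include <stddef.h>

#define N_QUEUES 4

/* Toy device: one coalesce value per queue; bad_queue rejects writes (-1: none). */
struct toy_dev {
    unsigned rx_usecs[N_QUEUES];
    int bad_queue;
};

/* Stand-in for the driver callback: 0 on success, -1 on failure. */
static int toy_set(struct toy_dev *d, unsigned q, unsigned usecs)
{
    if ((int)q == d->bad_queue)
        return -1;
    d->rx_usecs[q] = usecs;
    return 0;
}

/*
 * Apply wanted[] to every queue whose bit is set in mask. wanted[] holds one
 * entry per masked queue, in queue order. On failure, restore the queues that
 * were already changed and return -1.
 */
int set_masked_queues(struct toy_dev *d, unsigned long mask, const unsigned *wanted)
{
    unsigned backup[N_QUEUES];
    unsigned q, i = 0;

    for (q = 0; q < N_QUEUES; q++) {
        if (!(mask & (1UL << q)))
            continue;
        backup[q] = d->rx_usecs[q];         /* remember the old value first */
        if (toy_set(d, q, wanted[i++]) != 0)
            goto rollback;
    }
    return 0;

rollback:
    /* Best effort: walk back over the masked queues below the failing one. */
    while (q-- > 0) {
        if (mask & (1UL << q))
            (void)toy_set(d, q, backup[q]);
    }
    return -1;
}
```

The rollback is best effort in this sketch: a failure while restoring is ignored, which matches the "attempted" wording above but is a choice made for the example.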
Signed-off-by: Kan Liang <kan.liang@intel.com> Reviewed-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
Kan Liang [Fri, 19 Feb 2016 14:24:02 +0000 (09:24 -0500)]
net/ethtool: support get coalesce per queue
This patch implements sub command ETHTOOL_GCOALESCE for ioctl
ETHTOOL_PERQUEUE. It introduces an interface get_per_queue_coalesce to
get coalesce of each masked queue from device driver. Then the interrupt
coalescing parameters will be copied back to user space one by one.
Signed-off-by: Kan Liang <kan.liang@intel.com> Reviewed-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
Aimed at transferring bitmaps to/from user-space in a 32/64-bit agnostic
way.
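To illustrate what "32/64-bit agnostic" means here, the sketch below copies a bitmap laid out as an array of 32-bit words into a native unsigned long bitmap one bit at a time, so the result does not depend on the size of long. The function and macro names are invented for this example and are not the API added by the patch.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BITS_PER_ULONG    (sizeof(unsigned long) * CHAR_BIT)
#define ULONGS_FOR(nbits) (((nbits) + BITS_PER_ULONG - 1) / BITS_PER_ULONG)

/*
 * Copy nbits bits from src (bit i lives in src[i / 32] at position i % 32)
 * into dst, a bitmap of unsigned long words. dst must hold ULONGS_FOR(nbits)
 * words. Working per bit keeps the layout independent of sizeof(long).
 */
void bitmap_from_u32(unsigned long *dst, const uint32_t *src, size_t nbits)
{
    size_t i;

    memset(dst, 0, ULONGS_FOR(nbits) * sizeof(*dst));
    for (i = 0; i < nbits; i++) {
        if (src[i / 32] & (UINT32_C(1) << (i % 32)))
            dst[i / BITS_PER_ULONG] |= 1UL << (i % BITS_PER_ULONG);
    }
}

/* Return 1 if the given bit is set in the native bitmap, 0 otherwise. */
int bitmap_test(const unsigned long *map, size_t bit)
{
    return (int)((map[bit / BITS_PER_ULONG] >> (bit % BITS_PER_ULONG)) & 1UL);
}
```

A word-at-a-time copy would be faster, but it has to special-case 32-bit versus 64-bit longs; the per-bit loop trades speed for clarity in this sketch.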
Tested:
unit tests (next patch) on qemu i386, x86_64, ppc, ppc64 BE and LE,
ARM.
Signed-off-by: David Decotigny <decot@googlers.com> Reviewed-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
hv_netvsc: add software transmit timestamp support
Enable skb_tx_timestamp in hyperv netvsc.
Signed-off-by: Simon Xiao <sixiao@microsoft.com> Reviewed-by: K. Y. Srinivasan <kys@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang [Wed, 17 Feb 2016 21:58:22 +0000 (13:58 -0800)]
ipv6: pass up EMSGSIZE msg for UDP socket in Ipv6
In ipv4, when the machine receives an ICMP_FRAG_NEEDED message, the
connected UDP socket will get EMSGSIZE message on its next read from the
socket.
However, this is not the case for ipv6.
This fix modifies the udp err handler in IPv6 for ICMP6_PKT_TOOBIG to
make it similar to ipv4 behavior. That is, when the machine gets an
ICMP6_PKT_TOOBIG message, the connected UDP socket will get EMSGSIZE
message on its next read from the socket.
Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
be2net: Fix pcie error recovery in case of NIC+RoCE adapters
Interrupts registered by RoCE driver are not unregistered when
msix interrupts are disabled during error recovery causing a
crash. Detach the adapter instance from RoCE driver when error
is detected to complete the cleanup. Attach the driver again after
the adapter is recovered from error.
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sergio Prado [Tue, 16 Feb 2016 23:10:45 +0000 (21:10 -0200)]
net: macb: make magic-packet property generic
As requested by Rob Herring on patch
https://patchwork.ozlabs.org/patch/580862/.
This is a new property that is still in net-next and has never been
used in production, so we are not breaking anything with the
incompatible binding change.
Signed-off-by: Sergio Prado <sergio.prado@e-labworks.com> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 19 Feb 2016 20:27:37 +0000 (15:27 -0500)]
Merge branch 'bridge-mdb-attrs'
Nikolay Aleksandrov says:
====================
bridge: mdb: add support for extended attributes
This small set allows extending the attributes exported per mdb entry.
Before this set we had only a structure exported, which couldn't be changed
because we would've broken user-space. After this we extend the attribute
that was used for the structure and add per-mdb entry attributes after the
struct has been added (see patch 02 for more details). Note that the reason
we can't simply add an attribute after MDBA_MDB_ENTRY_INFO is that current
users (e.g. iproute2) walk over the attribute list directly without
checking for the attribute type.
Patch 01 is a simple change to reduce one indentation level in order to
avoid over 80 char lines.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
bridge: mdb: add support for more attributes and export timer
Currently mdb entries are exported directly as a structure inside
MDBA_MDB_ENTRY_INFO attribute; we can't really extend it without
breaking user-space. In order to export new mdb fields, I've converted
the MDBA_MDB_ENTRY_INFO into a nested attribute which starts like before
with struct br_mdb_entry (without header, as it's cast directly in
iproute2) and continues with MDBA_MDB_EATTR_ attributes. This way we
keep compatibility with older users and can export new data.
I've tested this with iproute2, both with and without support for the
added attribute and it works fine.
So basically we again have MDBA_MDB_ENTRY_INFO with struct br_mdb_entry
inside but it may contain also some additional MDBA_MDB_EATTR_ attributes
such as MDBA_MDB_EATTR_TIMER which can be parsed by user-space.
So the new structure is:
[MDBA_MDB] = {
     [MDBA_MDB_ENTRY] = {
         [MDBA_MDB_ENTRY_INFO]
         [MDBA_MDB_ENTRY_INFO] { <- Nested attribute
             struct br_mdb_entry <- nla_put_nohdr()
             [MDBA_MDB_ENTRY attributes] <- normal netlink attributes
         }
     }
}
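A user-space parser for such a payload could look roughly like the sketch below: it skips the leading fixed-size struct, then walks the attributes that follow. struct fixed_entry is a hypothetical stand-in for struct br_mdb_entry, struct attr_hdr only mimics the shape of a netlink attribute header (16-bit length including the header, 16-bit type, 4-byte padding), and find_eattr() is an invented name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Attributes are padded to a multiple of 4 bytes. */
#define ATTR_ALIGN(len) (((size_t)(len) + 3) & ~(size_t)3)

/* Hypothetical leading struct; stands in for struct br_mdb_entry. */
struct fixed_entry {
    uint32_t ifindex;
    uint32_t state;
};

/* Same shape as a netlink attribute header: total length, then type. */
struct attr_hdr {
    uint16_t len;   /* header + payload, excluding padding */
    uint16_t type;
};

/*
 * Look for an attribute of type 'wanted' placed after the fixed entry.
 * Returns a pointer to its payload and stores the payload size in *out_len,
 * or returns NULL if it is absent or the buffer is malformed.
 */
const void *find_eattr(const void *payload, size_t payload_len,
                       uint16_t wanted, size_t *out_len)
{
    const unsigned char *p = payload;
    size_t off = ATTR_ALIGN(sizeof(struct fixed_entry));

    while (off + sizeof(struct attr_hdr) <= payload_len) {
        struct attr_hdr h;

        memcpy(&h, p + off, sizeof(h));      /* avoid unaligned access */
        if (h.len < sizeof(h) || h.len > payload_len - off)
            return NULL;                     /* malformed attribute */
        if (h.type == wanted) {
            *out_len = h.len - sizeof(h);
            return p + off + sizeof(h);
        }
        off += ATTR_ALIGN(h.len);
    }
    return NULL;
}
```

An older reader that only casts the start of the payload to the fixed struct keeps working, because the extra attributes sit strictly after it; that is the compatibility property the layout above relies on.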
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sasha Levin [Fri, 19 Feb 2016 18:53:10 +0000 (13:53 -0500)]
bpf: grab rcu read lock for bpf_percpu_hash_update
bpf_percpu_hash_update() expects rcu lock to be held and warns if it's not,
which pointed out a missing rcu read lock.
Fixes: 15a07b338 ("bpf: add lookup/update support for per-cpu hash and array maps") Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 19 Feb 2016 16:16:11 +0000 (11:16 -0500)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
40GbE Intel Wired LAN Driver Updates 2016-02-19
This series contains updates to i40e/i40evf only.
Alex Duyck splits up the descriptor count function from the function that
stops the ring to have access to the descriptor count used for the data
portion of the frame. Rewrites the logic for how we determine if we
can transmit the frame or if it needs to be linearized. Places the checksum
close to TSO since they have a lot in common and it can help to reduce the
decision tree for how to handle the frame as the first check in TSO is to
see if checksumming is offloaded.
Carolyn adds functions to blink leds on devices using 10GBaseT PHY since
MAC registers used in other designs do not work in this device configuration.
Fixes an issue where a previously removed message has returned.
Kevin increases the timeout when checking GLGEN_RSTAT_DEVSTATE bit since,
when linking with particular PHY types, the amount of time it takes for the
GLGEN_RSTAT_DEVSTATE to be set increases greatly.
Neerav changes the receive queues to not wait to be disabled before DCB
has been reconfigured, like transmit queues.
Anjali adds new register definitions for programming the parser, flow
director and RSS blocks in the hardware.
Shannon adds the new opcodes and structures used for asking the firmware
to update receive control registers that need extra care when being
accessed while under heavy traffic. Integrates the new AdminQ functions
for safely accessing the receive control registers that may be affected
by heavy small packet traffic.
Mitch provides another colorful patch description on letting go of
the stale local VSI pointer when the VF resets.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mitch Williams [Thu, 18 Feb 2016 00:12:23 +0000 (16:12 -0800)]
i40e: let go of the past
If we reset a VF, its VSI goes away, and it gets a new one. So don't
hang on to the now-stale local VSI pointer. It just leads to suffering
and kernel panics.
Change-ID: Ia8823b4e85893e95e963acee284968022b29177a Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
We need to suspend scheduling of any pending service task during the driver
unload process, so that no new task will be scheduled. This patch sets
the suspend flag bit during reload which avoids service task execution.
Change-ID: I017c57b5d6656564556e3c5387da671369a572ac Signed-off-by: Pandi Kumar Maharajan <pandi.maharajan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Thu, 18 Feb 2016 00:12:21 +0000 (16:12 -0800)]
i40e: Use the new rx ctl register helpers. Don't use AQ calls from clear_hw.
Use the new AdminQ functions for safely accessing the Rx control
registers that may be affected by heavy small packet traffic.
We can't use AdminQ calls in i40e_clear_hw() because the HW is being
initialized and the AdminQ is not alive. We recently added an AQ
related replacement for reading PFLAN_QALLOC, and this patch puts
back the original register read.
Change-ID: Ib027168c954a5733299aa3a4ce5f8218c6bb5636 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Thu, 18 Feb 2016 00:12:20 +0000 (16:12 -0800)]
i40e: implement and use Rx CTL helper functions
Use the new AdminQ functions for safely accessing the Rx control
registers that may be affected by heavy small packet traffic.
Change-ID: Ibb00983e8dcba71f4b760222a609a5fcaa726f18 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Thu, 18 Feb 2016 00:12:19 +0000 (16:12 -0800)]
i40e: add adminq commands for Rx CTL registers
Add the new opcodes and struct used for asking the firmware to update Rx
control registers that need extra care when being accessed while under
heavy traffic - e.g. sustained 64byte packets at line rate on all ports.
The firmware will take extra steps to be sure the register accesses
are successful.
Change-ID: I56c8144000da66ad99f68948d8a184b2ec2aeb3e Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
John Underwood [Thu, 18 Feb 2016 17:19:24 +0000 (09:19 -0800)]
i40e: add check for null VSI
Return from i40e_vsi_reinit_setup() if vsi param is NULL.
This makes this code consistent with all the other code that
checks for NULL before using one of the VSI pointers accessed
with an indexed variable. (Indexed VSI pointers are
intentionally set to NULL in i40e_vsi_clear() and
i40e_remove().)
Change-ID: I3bc8b909c70fd2439334eeae994d151f61480985 Signed-off-by: John Underwood <johnx.underwood@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Carolyn Wyborny [Thu, 18 Feb 2016 00:12:16 +0000 (16:12 -0800)]
i40e: Fix for unexpected messaging
This fixes an issue where a previously removed message
has returned. Changing the message type to dev_dbg
leaves the info, if desired, but takes it out of normal
everyday usage. Also changed the call to only provide port
data when it's valid and not when it's not (delete case).
Change-ID: Ief6f33b915f6364c24fa8e5789c2fc3168b5e2ed Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Neerav Parikh [Thu, 18 Feb 2016 00:12:15 +0000 (16:12 -0800)]
i40e: Do not wait for Rx queue disable in DCB reconfig
Just like Tx queues, don't wait for Rx queues to be disabled before
DCB has been reconfigured.
Check the queues are disabled only after the DCB configuration has
been applied to the VSI(s) managed by the PF driver.
In case of any timeout, issue a PF reset to recover.
Change-ID: Ic51e94c25baf9a5480cee983f35d15575a88642c Signed-off-by: Neerav Parikh <neerav.parikh@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Kevin Scott [Thu, 18 Feb 2016 00:12:13 +0000 (16:12 -0800)]
i40e: Increase timeout when checking GLGEN_RSTAT_DEVSTATE bit
When linking with particular PHY types (ex: copper PHY), the amount of
time it takes for the GLGEN_RSTAT_DEVSTATE to be set increases greatly,
which can lead to a timeout and failure to load the driver.
Change-ID: If02be0dfcd7c57fdde2d5c81cd63651260cd2029 Signed-off-by: Kevin Scott <kevin.c.scott@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Carolyn Wyborny [Thu, 18 Feb 2016 00:12:12 +0000 (16:12 -0800)]
i40e: Fix led blink capability for 10GBaseT PHY
This patch fixes a problem where the ethtool identify adapter
functionality did not work for some copper PHYs. Without this
patch, the blink LED functionality fails on some parts. This
patch adds PHY write code to blink LEDs on parts where this
functionality is contained in the PHY rather than the MAC.
Change-ID: Iee7b3453f61d5ffd0b3d03f720ee4f17f919fcc2 Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Carolyn Wyborny [Thu, 18 Feb 2016 00:12:11 +0000 (16:12 -0800)]
i40e: Add functions to blink led on 10GBaseT PHY
This patch adds functions to blink led on devices using
10GBaseT PHY since MAC registers used in other designs
do not work in this device configuration.
Change-ID: Id4b88c93c649fd2b88073a00b42867a77c761ca3 Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Wed, 17 Feb 2016 19:02:56 +0000 (11:02 -0800)]
i40e/i40evf: Move Tx checksum closer to TSO
On all of the other Intel drivers we place checksum close to TSO as they
have a significant amount in common and it can help to reduce the decision
tree for how to handle the frame as the first check in TSO is to see if
checksumming is offloaded, and if it is not we can skip _BOTH_ TSO and Tx
checksum offload based on a single check.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Wed, 17 Feb 2016 19:02:50 +0000 (11:02 -0800)]
i40e/i40evf: Rewrite logic for 8 descriptor per packet check
This patch is meant to rewrite the logic for how we determine if we can
transmit the frame or if it needs to be linearized.
The previous code for this function was using a mix of division and modulus
division as a part of computing if we need to take the slow path. Instead
I have replaced this by simply working with a sliding window which will
tell us if the frame would be capable of causing a single packet to span
several descriptors.
The logic for the scan is fairly simple. If the total size of any given
group of 6 fragments is less than gso_size - 1 then it is possible for us
to have one byte coming out of the first fragment, 6 fragments, and one or
more bytes coming out of the last fragment. This gives us a total of 8
fragments, which exceeds what we can allow, so we send such frames to be
linearized.
Arguably the use of modulus might be more exact as the approach I propose
may generate some false positives. However the likelihood of us taking much
of a hit for those false positives is fairly low, and I would rather not
add more overhead in the case where we are receiving a frame composed of 4K
pages.
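The sliding-window scan described above can be sketched in a few lines of standalone C. This is a simplified model rather than the driver code: needs_linearize() and WINDOW are names chosen for the example, fragment sizes are passed as a plain array, and packets with fewer than 6 fragments are assumed never to need the slow path.

```c
#include <stdbool.h>
#include <stddef.h>

#define WINDOW 6   /* fragments per group, as in the description above */

/*
 * Return true if any WINDOW consecutive fragments together hold fewer than
 * gso_size - 1 bytes, i.e. one segment could straddle that whole group plus
 * a piece of the fragment on either side.
 */
bool needs_linearize(const unsigned *frag_size, size_t nr_frags, unsigned gso_size)
{
    unsigned long sum = 0;
    size_t i;

    if (gso_size == 0 || nr_frags < WINDOW)
        return false;   /* nothing to check in this simplified model */

    for (i = 0; i < nr_frags; i++) {
        sum += frag_size[i];
        if (i >= WINDOW)
            sum -= frag_size[i - WINDOW];   /* slide: drop the oldest fragment */
        if (i >= WINDOW - 1 && sum < gso_size - 1)
            return true;
    }
    return false;
}
```

Each fragment is added once and removed once, so the scan is linear in the number of fragments and needs no division or modulus. As the text notes, the test is conservative and can flag frames that an exact computation would let through.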
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Wed, 17 Feb 2016 19:02:43 +0000 (11:02 -0800)]
i40e/i40evf: Break up xmit_descriptor_count from maybe_stop_tx
In an upcoming patch I would like to have access to the descriptor count
used for the data portion of the frame. For this reason I am splitting up
the descriptor count function from the function that stops the ring.
Also in order to try and reduce unnecessary duplication of code I am moving
the slow-path portions of the code out of being inline calls so that we can
just jump to them and process them instead of having to build them into
each function that calls them.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David S. Miller [Fri, 19 Feb 2016 04:47:04 +0000 (23:47 -0500)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
40GbE Intel Wired LAN Driver Updates 2016-02-18
This series contains updates to i40e and i40evf only.
Alex Duyck provides all the patches in the series to update and fix the
drivers. Fixed the driver to drop the outer checksum offload on UDP
tunnels, since the issue is that the upper levels of the stack never
requested such an offload and it results in possible errors. Updates the
TSO function to just use u64 values, so we do not have to end up casting
u32 values. In the TSO path, factored out the L4 header offsets allowing
us to ignore the L4 header offsets when dealing with the L3 checksum and
length update. Consolidates all of the spots where we were updating
either the TCP or IP checksums in the TSO and checksum path into the TSO
function. Fixed two issues by adding support for IPv4 encapsulated in
IPv6, first issue was the fact that iphdr(skb)->protocol was being used to
test for the outer transport protocol which breaks IPv6 support. The second
was that we cleared the flag for v4 going to v6, but we did not take care
of txflags going the other way. Added support for IPv6 extension headers
in setting up the Tx checksum. Added exception handling to the Tx
checksum path so that we can handle cases of TSO where the frame is bad,
or Tx checksum where we did not recognize a protocol. Fixed a number of
issues to make certain that we are using the correct protocols when
parsing both the inner and outer headers of a frame that is mixed between
IPv4 and IPv6 for inner and outer. Updated the feature flags to reflect
the newly enabled/added features.
Sorry, no witty patch descriptions this time around, probably should
let Mitch help in writing patch descriptions for Alex. :-)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuval Mintz [Wed, 17 Feb 2016 11:15:14 +0000 (13:15 +0200)]
bnx2x: Add missing HSI for big-endian machines
Commit e5d3a51cefbb ("bnx2x: extend DCBx support") was missing HSI
changes for big-endian machines, breaking compilation on such
platforms.
Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Tue, 26 Jan 2016 03:32:54 +0000 (19:32 -0800)]
i40e: Add support for ATR w/ IPv6 extension headers
This patch updates the code for determining the L4 protocol and L3 header
length so that when IPv6 extension headers are being used we can determine
the offset and type of the L4 protocol.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Mon, 25 Jan 2016 05:17:57 +0000 (21:17 -0800)]
i40evf: Update feature flags to reflect newly enabled features
Recent changes should have enabled support for IPv6 based tunnels and
support for TSO with outer UDP checksums. As such we can update the
feature flags to reflect that.
In addition we can clean-up the flags that aren't needed such as SCTP and
RXCSUM since having the bits there doesn't add any value.
I also found one spot where we were setting the same flag twice. It looks
like it was probably a git merge error that resulted in the line being
duplicated. As such I have dropped it in this patch.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Anjali Singhai Jain <anjali.singhai@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Mon, 25 Jan 2016 05:17:50 +0000 (21:17 -0800)]
i40e: Update feature flags to reflect newly enabled features
Recent changes should have enabled support for IPv6 based tunnels and
support for TSO with outer UDP checksums. As such we can update the
feature flags to reflect that.
In addition we can clean-up the flags that aren't needed such as SCTP and
RXCSUM since having the bits there doesn't add any value.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Mon, 25 Jan 2016 05:17:43 +0000 (21:17 -0800)]
i40e: Do not drop support for IPv6 VXLAN or GENEVE tunnels
None of the documentation in the datasheets for the XL710 calls out
any reason to exclude support for IPv6 based tunnels. As such I am
dropping the code that was excluding these tunnel types from having their
port numbers recognized. This way we can take advantage of things such as
checksum offload for inner headers over IPv6 based VXLAN or GENEVE
tunnels.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Mon, 25 Jan 2016 05:17:36 +0000 (21:17 -0800)]
i40e: Fix ATR in relation to tunnels
This patch contains a number of fixes to make certain that we are using
the correct protocols when parsing both the inner and outer headers of a
frame that is mixed between IPv4 and IPv6 for inner and outer.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Kiran Patil <kiran.patil@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Mon, 25 Jan 2016 05:17:29 +0000 (21:17 -0800)]
i40e/i40evf: Enable support for SKB_GSO_UDP_TUNNEL_CSUM
The XL722 has support for providing the outer UDP tunnel checksum on
transmits. Make use of this feature to support segmenting UDP tunnels with
outer checksums enabled.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Mon, 25 Jan 2016 05:17:22 +0000 (21:17 -0800)]
i40e/i40evf: Clean-up Rx packet checksum handling
This is mostly a minor clean-up for the Rx checksum path in order to avoid
some of the unnecessary conditional checks that were being applied.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
This series adds vlan filtering offload to qede.
First patch introduces small additional infrastructure needed in
qed to support it, while second contains the main bulk of driver changes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Device would start receiving only vlan-tagged traffic with tags matching
that of one of the configured vlan IDs, unless:
- Device is explicitly placed in PROMISC mode.
- Device exhausts its vlan filter credits.
Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yuval Mintz [Thu, 18 Feb 2016 15:00:39 +0000 (17:00 +0200)]
qed: Lay infrastructure for vlan filtering offload
Today, interfaces work in vlan-promisc mode. But once vlan filtering
offload is supported, we'll need a method to control it directly
[e.g., when setting device to PROMISC, or when running out of vlan
credits].
This adds the necessary API for L2 client to manually choose whether to
accept all vlans or only those for which filters were configured.
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Files in sysfs are created using the name from the phy_driver struct.
When two names are the same we may get a duplicate filename warning.
Fix this.
Reported-by: kernel test robot <ying.huang@linux.intel.com> Signed-off-by: Andrew F. Davis <afd@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Wed, 17 Feb 2016 19:23:55 +0000 (11:23 -0800)]
net: Optimize local checksum offload
This patch takes advantage of several assumptions we can make about the
headers of the frame in order to reduce overall processing overhead for
computing the outer header checksum.
First we can assume the entire header is in the region pointed to by
skb->head as this is what csum_start is based on.
Second, as a result of our first assumption, we can just call csum_partial
instead of making a call to skb_checksum which would end up having to
configure things so that we could walk through the frags list.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
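The arithmetic that lets the outer checksum be computed from the headers alone
can be sketched as follows. This is an illustration, not the kernel code: the
Python helper below only borrows the csum_partial name, and
lco_outer_checksum, outer_pseudo and inner_pseudo are hypothetical names. It
assumes that the inner checksum field currently holds the inner pseudo-header
sum, so that once the device fills in the real checksum, the region starting at
csum_start sums to the complement of that value.

```python
def csum_partial(data: bytes, seed: int = 0) -> int:
    """Folded 16-bit one's-complement sum of data, starting from seed."""
    if len(data) % 2:
        data += b"\x00"
    total = seed
    for i in range(0, len(data), 2):
        total += int.from_bytes(data[i:i + 2], "big")
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def lco_outer_checksum(outer_hdr: bytes, outer_pseudo: int,
                       inner_pseudo: int) -> int:
    """Outer checksum computed without touching the inner payload."""
    # The filled-in inner region will sum to ~inner_pseudo, so seed with that
    # and walk only the outer header bytes that precede csum_start.
    seed = csum_partial(b"", outer_pseudo + (~inner_pseudo & 0xFFFF))
    return ~csum_partial(outer_hdr, seed) & 0xFFFF
```

Because only outer_hdr is walked, the cost no longer depends on the payload
length or on how the payload is spread across fragments.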
David S. Miller [Thu, 18 Feb 2016 19:35:02 +0000 (14:35 -0500)]
Merge branch 'iptunnel-pkt-scrub-consolidate'
Jiri Benc says:
====================
iptunnel: scrub packet in iptunnel_pull_header
As every IP tunnel has to scrub skb on decapsulation, iptunnel_pull_header
tried to do that and open coded part of skb_scrub_packet. Various tunneling
protocols (VXLAN, Geneve) then called full skb_scrub_packet on their own,
duplicating part of the scrubbing already done.
Consolidate the code, calling skb_scrub_packet from iptunnel_pull_header.
This will allow additional cleanups in VXLAN code, as the packet is scrubbed
early during rx processing after this patchset and VXLAN can start filling
out skb fields earlier.
The full picture of vxlan cleanup patches can be seen at:
https://github.com/jbenc/linux-vxlan/commits/master
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Tue, 16 Feb 2016 15:09:51 +0000 (10:09 -0500)]
net: bridge: log port STP state on change
Remove the shared br_log_state function and print the info directly in
br_set_state, where the net_bridge_port state is actually changed.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Acked-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 18 Feb 2016 19:16:13 +0000 (14:16 -0500)]
Merge branch 'cxgb4-addr-sync'
Hariprasad Shenai says:
====================
cxgb4: Use __dev_[um]c_[un]sync for MAC address syncing
This patch series adds support to use __dev_uc_sync/__dev_mc_sync to add
MAC address and __dev_uc_unsync/__dev_mc_unsync to delete MAC address.
This patch series has been created against net-next tree and includes
patches on cxgb4 and cxgb4vf driver.
We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 25 Jan 2016 05:17:10 +0000 (21:17 -0800)]
i40e/i40evf: Add exception handling for Tx checksum
Add exception handling to the Tx checksum path so that we can handle cases
of TSO where the frame is bad, or Tx checksum where we didn't recognize a
protocol.
Drop I40E_TX_FLAGS_CSUM as it is unused, move the CHECKSUM_PARTIAL check
into the function itself so that we can decrease indent.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Mon, 25 Jan 2016 05:17:01 +0000 (21:17 -0800)]
i40e/i40evf: Do not write to descriptor unless we complete
This patch defers writing to the Tx descriptor bits until we know we have
successfully completed a given operation. So for example we defer updating
the tunnelling portion of the context descriptor until we have fully
identified the type.
The advantage to this approach is that we can assemble values as we go
instead of having to try and kludge everything together all at once. As a
result we can significantly clean up the tunneling configuration for
instance as we can just do a pointer walk and do the math for the distance
between each set of points.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jiri Benc [Thu, 18 Feb 2016 18:19:29 +0000 (19:19 +0100)]
vxlan: tun_id is 64bit, not 32bit
The tun_id field in struct ip_tunnel_key is __be64, not __be32. We need to
convert the vni to tun_id correctly.
Fixes: 54bfd872bf16 ("vxlan: keep flags and vni in network byte order") Reported-by: Paolo Abeni <pabeni@redhat.com> Tested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
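A rough sketch of the width issue fixed above, under stated assumptions: the
VNI is taken as the 4-byte field from the VXLAN header (24-bit VNI followed by
one reserved byte), and the tunnel id is assumed to be a 64-bit big-endian
value carrying the VNI in its low-order bits. The function name
vni_field_to_tun_id is hypothetical and this is not the kernel helper.

```python
def vni_field_to_tun_id(vni_field: bytes) -> bytes:
    """Convert the on-wire 4-byte VNI field to an 8-byte big-endian tun_id."""
    if len(vni_field) != 4:
        raise ValueError("expected the 4-byte VNI field from the VXLAN header")
    # Drop the reserved low byte to get the 24-bit VNI as a plain integer.
    vni = int.from_bytes(vni_field, "big") >> 8
    # Widen to 64 bits; treating the result as a 32-bit value would place the
    # VNI bytes at the wrong position inside the 8-byte key.
    return vni.to_bytes(8, "big")
```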
Alexander Duyck [Mon, 25 Jan 2016 05:16:54 +0000 (21:16 -0800)]
i40e/i40evf: Handle IPv6 extension headers in checksum offload
This patch adds support for IPv6 extension headers in setting up the Tx
checksum. Without this patch extension headers would cause IPv6 traffic to
fail as the transport protocol could not be identified.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Mon, 25 Jan 2016 05:16:48 +0000 (21:16 -0800)]
i40e/i40evf: Add support for IPv4 encapsulated in IPv6
This patch fixes two issues. First was the fact that iphdr(skb)->protocol
was being used to test for the outer transport protocol. This completely
breaks IPv6 support. Second was the fact that we cleared the flag for v4
going to v6, but we didn't take care of txflags going the other way. As
such we would have the v6 flag still set even if the inner header was v4.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Mon, 25 Jan 2016 05:16:42 +0000 (21:16 -0800)]
i40e/i40evf: Replace header pointers with unions of pointers in Tx checksum path
The Tx checksum path was maintaining a set of 3 pointers and two lengths in
order to prepare the packet for being checksummed. The thing is we only
really needed 2 pointers, and the lengths that were being maintained can
easily be computed.
As such we can replace the IPv4 and IPv6 header pointers with one single
union that represents both, or a generic pointer to the start of the
network header. For the L4 headers we can do the same with TCP and a
generic pointer to the start of the transport header. The length of the
TCP header is obtained by simply multiplying doff by 4, and the network
header length can be obtained by subtracting the network header pointer
from the transport header pointer.
While I was at it I renamed l4_hdr to l4_proto to make it a bit more clear
and less likely to be confused with l4.hdr which is the transport header
pointer.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
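The two length computations described in the patch above can be sketched as
follows. The function name header_lengths and the use of byte offsets in place
of pointers are assumptions made for illustration. The only protocol fact
relied on is that the TCP data offset (doff) is the high nibble of byte 12 of
the TCP header, counted in 32-bit words.

```python
def header_lengths(frame: bytes, network_offset: int, transport_offset: int):
    """Return (l3_len, l4_len) in bytes for a TCP segment inside frame."""
    # Network header length is the distance between the two header starts,
    # which also covers IPv4 options or IPv6 extension headers.
    l3_len = transport_offset - network_offset
    # doff is stored in 32-bit words, so multiply by 4 to get bytes.
    doff = frame[transport_offset + 12] >> 4
    l4_len = doff * 4
    return l3_len, l4_len
```

With both lengths derived on demand, there is no separate length state to keep
in sync with the pointers.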
Alexander Duyck [Mon, 25 Jan 2016 05:16:35 +0000 (21:16 -0800)]
i40e/i40evf: Consolidate all header changes into TSO function
This patch goes through and pulls all of the spots where we were updating
either the TCP or IP checksums in the TSO and checksum path into the TSO
function. The general idea here is that we should only be updating the
header after we verify we have completed a skb_cow_head check to verify the
head is writable.
One other advantage to doing this is that it makes things much more
obvious. For example, in the case of IPv6 there was one spot where the
offset of the IPv4 header checksum was being updated which is obviously
incorrect.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Mon, 25 Jan 2016 05:16:28 +0000 (21:16 -0800)]
i40e/i40evf: Factor out L4 header and checksum from L3 bits in TSO path
This patch makes it so that the L4 header offsets and such can be ignored
when dealing with the L3 checksum and length update. This is done making
use of two things.
First we can just use the offset from the L4 header to the start of the
packet to determine the L4 offset, and from that we can then make use of
the data offset to determine the full length of the headers.
As far as adjusting the checksum to remove the length we can simply add the
inverse of the length instead of having to recompute the entire
pseudo-header without the length. In the case of an IPv6 header this
should be significantly cheaper since we can make use of a value we already
needed instead of having to read the source and destination address out of
the packet.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
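The "add the inverse of the length" step relies on one's-complement
arithmetic, where adding the bitwise complement of a value subtracts it. A
minimal sketch, with hypothetical helper names and an IPv4 pseudo-header
chosen only as example data:

```python
def csum_add(a: int, b: int) -> int:
    """16-bit one's-complement addition with end-around carry."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)


def csum_words(data: bytes) -> int:
    """One's-complement sum of the big-endian 16-bit words in data."""
    total = 0
    for i in range(0, len(data), 2):
        total = csum_add(total, int.from_bytes(data[i:i + 2], "big"))
    return total


def remove_length(pseudo_sum: int, length: int) -> int:
    """Take the length word back out of a pseudo-header sum."""
    return csum_add(pseudo_sum, ~length & 0xFFFF)


# Example data: 192.168.0.1 -> 192.168.0.2, protocol 6 (TCP), length 1480.
addrs_proto = bytes([192, 168, 0, 1, 192, 168, 0, 2, 0, 6])
with_len = csum_words(addrs_proto + (1480).to_bytes(2, "big"))
without_len = remove_length(with_len, 1480)
```

The adjustment is a single addition on a value already in hand, whereas
recomputing the pseudo-header for IPv6 would mean reading 32 bytes of
addresses back out of the packet.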
Alexander Duyck [Mon, 25 Jan 2016 05:16:20 +0000 (21:16 -0800)]
i40e/i40evf: Use u64 values instead of casting them in TSO function
Instead of casting u32 values to u64 it makes more sense to just start out
with u64 values in the first place. This way we don't need to create a
mess with all of the casts needed to populate a 64b value.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Mon, 25 Jan 2016 05:16:13 +0000 (21:16 -0800)]
i40e/i40evf: Drop outer checksum offload that was not requested
The i40e and i40evf drivers contained code for inserting an outer checksum
on UDP tunnels. The issue however is that the upper levels of the stack
never requested such an offload and it results in possible errors.
In addition the same logic was being applied to the Rx side where it was
attempting to validate the outer checksum, but the logic there was
incorrect in that it was testing for the resultant sum to be equal to the
header checksum instead of being equal to 0.
Since this code is so massively flawed, and is doing things that we didn't
ask it to do, I am just dropping it, and will bring it back later to use as
an offload for SKB_GSO_UDP_TUNNEL_CSUM which can make use of such a
feature.
As far as the Rx feature I am dropping it completely since it would need to
be massively expanded and applied to IPv4 and IPv6 checksums for all parts,
not just the one that supports Tx checksum offload for the outer.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
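The Rx validation rule mentioned above (the result must be 0, not equal to the
checksum stored in the header) can be shown with a small sketch of the
Internet checksum from RFC 1071. The helper names are hypothetical, and the
IPv4 header bytes in the usage lines are example data only.

```python
def ones_complement_sum(data: bytes) -> int:
    """Folded 16-bit one's-complement sum over data."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += int.from_bytes(data[i:i + 2], "big")
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def checksum(data: bytes) -> int:
    """Value to store in the checksum field (field must be zero in data)."""
    return ~ones_complement_sum(data) & 0xFFFF


def is_valid(data_with_checksum: bytes) -> bool:
    """Correct receive check: summing over the filled-in field gives 0."""
    return checksum(data_with_checksum) == 0


# Example IPv4 header with the checksum field (bytes 10-11) zeroed.
hdr = bytearray.fromhex("450000730000400040110000c0a80001c0a800c7")
stored = checksum(bytes(hdr))
hdr[10:12] = stored.to_bytes(2, "big")
ok = is_valid(bytes(hdr))
```

Comparing the recomputed value against the stored checksum, as the removed
code did, fails on valid frames because the stored checksum is itself part of
the data being summed.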
David S. Miller [Thu, 18 Feb 2016 16:42:41 +0000 (11:42 -0500)]
Merge branch 'netlink-mmap-remove'
Florian Westphal says:
====================
netlink: remove mmapped netlink support
As discussed during netconf 2016 in Seville, this series removes
CONFIG_NETLINK_MMAP.
Close to three years after it was merged it has retained several problems
that do not appear to be fixable.
No official netfilter libmnl release contains support for mmap backed netlink
sockets. No openvswitch release makes use of it either.
To use the mmap interface, userspace not only has to probe for mmap netlink
support, it also has to implement a recv/socket receive path in order to
handle messages that exceed the size of an rx ring element (NL_MMAP_STATUS_COPY).
So if there are odd programs out there that attempt to use MMAP netlink
they should continue to work as they already need a socket based code path
to work properly.
The actual revert (first patch) has a list of problems.
The followup patches remove a couple of helpers that are no longer needed
after the revert.
I did a few tests with mmap vs. socket based interface on a 4.4 based
kernel on an i7-4790 box and there are no performance advantages:
loopback, single nfqueue, queueing in -t filter INPUT:
traffic generated by 8 * ping -q -f localhost:
socket backend:
real 0m27.325s
user 0m3.993s
sys 0m23.292s
with mmap ring backend:
real 0m29.054s
user 0m4.924s
sys 0m24.127s
with single tcp stream, unidirectional, loopback mtu set at 1500
(nc localhost discard < /dev/zero > /dev/null):
socket interface:
time nfqdump -b $((8 * 1024 * 1024 * 1024)) -w /dev/null
real 0m15.960s
user 0m1.756s
sys 0m11.143s
mmap ring:
real 0m16.441s
user 0m3.040s
sys 0m13.687s
socket interface nfqdump[1] with --gso option (i.e. MTU is exceeded,
no kernel-side segmentation and checksum fixups) completes in about 5s.
I also tested dumping a conntrack table with 1m entries.
On my box this takes about 2.4 seconds for both mmap and socket backend:
time LD_PRELOAD=../../src/.libs/libmnl.so ./nfct-dump-sk > /dev/null
mnl_cb_run: Success
messages: 1000000
real 0m2.485s
user 0m1.085s
sys 0m1.400s
time LD_PRELOAD=../../src/.libs/libmnl.so ./nfct-dump-mmap > /dev/null
messages: 1000000
real 0m2.451s
user 0m1.124s
sys 0m1.328s
Florian Westphal [Thu, 18 Feb 2016 14:03:24 +0000 (15:03 +0100)]
netlink: remove mmapped netlink support
mmapped netlink has a number of unresolved issues:
- TX zerocopy support had to be disabled more than a year ago via
commit 4682a0358639b29cf ("netlink: Always copy on mmap TX.")
because the content of the mmapped area can change after netlink
attribute validation but before message processing.
- RX support was implemented mainly to speed up nfqueue dumping packet
payload to userspace. However, since commit ae08ce0021087a5d812d2
("netfilter: nfnetlink_queue: zero copy support") we avoid one copy
with the socket-based interface too (via the skb_zerocopy helper).
The other problem is that skbs attached to mmaped netlink socket
behave differently from normal skbs:
- they don't have a shinfo area, so all functions that use skb_shinfo()
(e.g. skb_clone) cannot be used.
- reserving headroom prevents userspace from seeing the content as
it expects message to start at skb->head.
See for instance
commit aa3a022094fa ("netlink: not trim skb for mmaped socket when dump").
- skbs handed e.g. to netlink_ack must have non-NULL skb->sk, else we
crash because it needs the sk to check if a tx ring is attached.
Also not obvious, leads to non-intuitive bug fixes such as 7c7bdf359
("netfilter: nfnetlink: use original skbuff when acking batches").
mmaped netlink also didn't play nicely with the skb_zerocopy helper
used by nfqueue and openvswitch. Daniel Borkmann fixed this via
commit 6bb0fef489f6 ("netlink, mmap: fix edge-case leakages in nf queue
zero-copy"), but at the cost of also needing to provide remaining
length to the allocation function.
nfqueue also has problems when used with mmaped rx netlink:
- mmaped netlink doesn't allow use of nfqueue batch verdict messages.
Problem is that in the mmap case, the allocation time also determines
the ordering in which the frame will be seen by userspace (A
allocating before B means that A is located in earlier ring slot,
but this also means that B might get a lower sequence number than A
since seqno is decided later). To fix this we would need to extend the
spinlocked region to also cover the allocation and message setup which
isn't desirable.
- nfqueue can now be configured to queue large (GSO) skbs to userspace.
Queueing GSO packets is faster than having to force a software segmentation
in the kernel, so this is a desirable option. However, with a mmap based
ring one has to use 64kb per ring slot element, else mmap has to fall back
to the socket path (NL_MMAP_STATUS_COPY) for all large packets.
To use the mmap interface, userspace not only has to probe for mmap netlink
support, it also has to implement a recv/socket receive path in order to
handle messages that exceed the size of an rx ring element.
Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Ken-ichirou MATSUZAWA <chamaken@gmail.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Jamal Hadi Salim [Thu, 18 Feb 2016 13:04:43 +0000 (08:04 -0500)]
net_sched: Improve readability of filter processing
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Thu, 18 Feb 2016 13:01:46 +0000 (14:01 +0100)]
bridge: switchdev: Offload VLAN flags to hardware bridge
When VLANs are created / destroyed on a VLAN filtering bridge (MASTER
flag set), the configuration is passed down to the hardware. However,
when only the flags (e.g. PVID) are toggled, the configuration is done
in the software bridge alone.
While it is possible to pass these flags to hardware when invoked with
the SELF flag set, this creates inconsistency with regards to the way
the VLANs are initially configured.
Pass the flags down to the hardware even when the VLAN already exists
and only the flags are toggled.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Roese [Thu, 18 Feb 2016 09:59:07 +0000 (10:59 +0100)]
net: phy: Add SGMII support for Marvell 88E1510/1512/1514/1518
Add code to select SGMII-to-copper mode upon SGMII interface selection.
Signed-off-by: Stefan Roese <sr@denx.de> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Florian Fainelli <f.fainelli@gmail.com> Cc: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Alison Schofield [Thu, 18 Feb 2016 06:35:11 +0000 (22:35 -0800)]
isdn: divamnt: use y2038-safe ktime_get_ts64() for trace data timestamps
divamnt stores a start_time at module init and uses it to calculate
elapsed time. The elapsed time, stored in secs and usecs, is part of
the trace data the driver maintains for the DIVA Server ISDN cards.
No change to the format of that time data is required.
To avoid overflow on 32-bit systems use ktime_get_ts64() to return
the elapsed monotonic time since system boot.
This is a change from real to monotonic time. Since the driver only
stores elapsed time, monotonic time is sufficient and more robust
against real time clock changes. These new monotonic values can be
more useful for debugging because they can be easily compared to
other monotonic timestamps.
Note elapsed time values will now start at system boot time rather
than module load time, so they will differ slightly from previously
reported values.
Remove declaration and init of previously unused time constants:
start_sec, start_usec.
Signed-off-by: Alison Schofield <amsfield22@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
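As a userspace analogy for the change above (not the driver code), the split
of a monotonic reading into the secs/usecs pair can be sketched like this. The
function name elapsed_secs_usecs is hypothetical. Note that Python does not
define the reference point of time.monotonic_ns(), so "since boot" is an
assumption that depends on the platform clock.

```python
import time


def elapsed_secs_usecs(now_ns: int):
    """Split a monotonic nanosecond reading into (seconds, microseconds)."""
    secs, rem_ns = divmod(now_ns, 1_000_000_000)
    return secs, rem_ns // 1_000


# A monotonic clock is unaffected by wall-clock adjustments, so the pair
# never jumps backwards when the real time clock is changed.
secs, usecs = elapsed_secs_usecs(time.monotonic_ns())
```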
David S. Miller [Thu, 18 Feb 2016 15:32:18 +0000 (10:32 -0500)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
40GbE Intel Wired LAN Driver Updates 2016-02-17
This series contains updates to i40e/i40evf once again.
Mitch updates the code to use a define instead of a magic number. Adds
support for packet split receive on VFs, which is disabled by default.
Expands on a code comment which was not verbose or really helpful. Fixes
an issue where, if a reset failed to complete, the driver was not properly
setting the adapter state, which would cause a panic on rmmod; the adapter
state is now set to DOWN to avoid the panic.
Jesse cleans up a "dump" in debugfs that never panned out to be useful.
Anjali adds a workaround for cases where we might have interrupts that get
lost but write-back (WB) happened. Fixes an issue by falling back to
enabling unicast, multicast and broadcast promiscuous mode when the driver
must disable its use of "default port" (defport mode) due to internal
incompatibility with Multiple Function per Port (MFP). Fixes an issue
where queues should never be enabled/disabled in the interrupt handler.
Kiran cleans up the code which used hard coded base VEB SEID since it was
removed from the specification.
Shannon adds a few bits for better debug messages. Fixes an obscure corner
case, where it was possible to clear the NVM update wait flag when no
update_done message was actually received.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Mon, 15 Feb 2016 22:54:47 +0000 (22:54 +0000)]
net-sysfs: remove unused fmt_long_hex
Ever since commit 04ed3e741d0f133e02bed7fa5c98edba128f90e7
("net: change netdev->features to u32") the format string
fmt_long_hex has not been used, so we may as well remove it.
Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
i40e: When in promisc mode apply promisc mode to Tx Traffic as well
In MFP mode particularly when we were setting the PF VSI in limited
promiscuous, the HW switch was still mirroring the outgoing packets
from other VSIs (VF/VMdq) onto the PF VSI.
With this new bit set, the mirroring doesn't happen any more and so
we are in limited promiscuous on the PF VSI in MFP which is similar
to defport.
An API check is not required, since this bit is reserved for FW API
version < 1.5
Also update copyright year in file headers.
Change-ID: I9840cb95f11dde733d943cb03ce84f68b9611bc8 Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Fri, 15 Jan 2016 22:33:20 +0000 (14:33 -0800)]
i40e: clean event descriptor before use
In one obscure corner case, it was possible to clear the NVM update wait
flag when no update_done message was actually received. This patch
cleans the event descriptor before use, and moves the opcode check to
where it won't get done if there was no event to clean.
Also update copyright year in file headers.
Change-ID: I68bbc41965e93f4adf07cbe98b9dfd63d41509a4 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mitch Williams [Fri, 15 Jan 2016 22:33:19 +0000 (14:33 -0800)]
i40evf: set adapter state on reset failure
If a reset fails to complete, the driver gets its affairs in order and
awaits the cold solace of rmmod. Unfortunately, it was not properly
setting the adapter state, which would cause a panic on rmmod, instead
of the desired surcease.
Set the adapter state to DOWN in this case, and avoid a panic.
Change-ID: I6fdd9906da52e023f8dc744f7da44b5d95278ca9 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Fri, 15 Jan 2016 22:33:18 +0000 (14:33 -0800)]
i40e: better error reporting for nvmupdate
Make sure we return EBUSY while finishing up a reset, and add a few bits
for better debug messages.
Change-ID: I23f6c28a8d96d7aa171abcc265737cec7826c292 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mitch Williams [Fri, 15 Jan 2016 22:33:17 +0000 (14:33 -0800)]
i40e: expand comment
Explain why we cannot remove this code, even though it works differently
than any of our other interrupt cause handling code.
Change-ID: Ie66203bd037a466066036611c31d44f759ec5176 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
i40e: Do not disable queues in the Legacy/MSI Interrupt handler
The queues should never be enabled/disabled in the interrupt handler,
ICR0 interrupt enable should be the only thing that needs to be
dynamically changed in the handler.
This patch fixes that. Without this patch X722 platforms were
seeing weird ping timings when in Legacy mode since it takes
a whole lot of time for the HW/FW to re-enable queues.
Change-ID: If065afc45d81c5a19d4a94a00cd5b8f61cefc40c Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mitch Williams [Fri, 15 Jan 2016 22:33:15 +0000 (14:33 -0800)]
i40e/i40evf: avoid atomics
In the case where we have a page fully used by receive data, we need to
release the page fully to the stack. Instead of calling get_page (which
increments the page count) followed by free_page (which decrements the
page count), just donate our reference to the stack. Although this
donation is not tax deductible, it does allow us to avoid two very
expensive atomic operations that reverse each other.
Change-ID: If70739792d5748995fc175ec92ac2171ed4ad8fc Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
i40e: Fix PROMISC mode for Multi-function per port (MFP) devices
This patch falls back to enabling unicast, multicast and
broadcast promiscuous mode when the driver must disable its use
of "default port" aka defport mode (which is normally used to
provide a promiscuous mode), due to internal incompatibility
with Multiple Function per Port (aka MFP).
The situation that requires this patch is when Physical
Function 0 is the device being used, and it can support SR-IOV
when MFP is enabled, via the driver creating a VEB on an MFP
enabled adapter.
Change-ID: Ie90b00d0d58782a5dfcf2c3c9725a2eb90bd63d8 Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>