Keller, Jacob E [Tue, 9 Feb 2016 00:05:03 +0000 (16:05 -0800)]
ethtool: correctly ensure {GS}CHANNELS doesn't conflict with GS{RXFH}
Ethernet drivers implementing both {GS}RXFH and {GS}CHANNELS ethtool ops
incorrectly allow SCHANNELS when it would conflict with the settings
from SRXFH. This occurs because it is not possible for drivers to
understand whether their Rx flow indirection table has been configured
or is in the default state. In addition, drivers currently behave in
various ways when increasing the number of Rx channels.
Some drivers will always destroy the Rx flow indirection table when this
occurs, whether it has been set by the user or not. Other drivers will
attempt to preserve the table even if the user has never modified it
from the default driver settings. Neither of these situation is
desirable because it leads to unexpected behavior or loss of user
configuration.
The correct behavior is to simply return -EINVAL when SCHANNELS would
conflict with the current Rx flow table settings. However, it should
only do so if the current settings were modified by the user. If we
required that the new settings never conflict with the current (default)
Rx flow settings, we would force users to first reduce their Rx flow
settings and then reduce the number of Rx channels.
This patch proposes a solution implemented in net/core/ethtool.c which
ensures that all drivers behave correctly. It checks whether the RXFH
table has been configured to non-default settings, and stores this
information in a private netdev flag. When the number of channels is
requested to change, it first ensures that the current Rx flow table is
not going to assign flows to now disabled channels.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 12 Feb 2016 10:52:41 +0000 (05:52 -0500)]
Merge branch 'local-checksum-offload'
Edward Cree says:
====================
Local Checksum Offload
Re-tested VxLAN; everything else is unchanged from v4.
Changes from v4:
* Rebased series to fix conflicts with vxlan/vxlan6 merge.
Changes from v3:
* Fixed inverted checksum values introduced in v3.
* Don't mangle zero checksums in GRE.
* Clear skb->encapsulation in iptunnel_handle_offloads when not using
CHECKSUM_PARTIAL, lest drivers incorrectly interpret that as a request
for inner checksum offload.
Changes from v2:
* Added support for IPv4 GRE.
* Split out 'always set up for checksum offload' into its own patch.
* Removed csum_help from iptunnel_handle_offloads.
* Rewrote LCO callers to only fold once.
* Simplified nocheck handling.
Changes from v1:
* Enabled support in more encapsulation protocols.
I think it now covers everything except GRE.
* Wrote up some documentation covering TX checksum offload, LCO and RCO.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Thu, 11 Feb 2016 20:49:40 +0000 (20:49 +0000)]
net: udp: always set up for CHECKSUM_PARTIAL offload
If the dst device doesn't support it, it'll get fixed up later anyway
by validate_xmit_skb(). Also, this allows us to take advantage of LCO
to avoid summing the payload multiple times.
Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Thu, 11 Feb 2016 20:48:04 +0000 (20:48 +0000)]
net: local checksum offload for encapsulation
The arithmetic properties of the ones-complement checksum mean that a
correctly checksummed inner packet, including its checksum, has a ones
complement sum depending only on whatever value was used to initialise
the checksum field before checksumming (in the case of TCP and UDP,
this is the ones complement sum of the pseudo header, complemented).
Consequently, if we are going to offload the inner checksum with
CHECKSUM_PARTIAL, we can compute the outer checksum based only on the
packed data not covered by the inner checksum, and the initial value of
the inner checksum field.
Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 12 Feb 2016 10:28:38 +0000 (05:28 -0500)]
Merge branch 'tcp_dccp_ports'
Eric Dumazet says:
====================
tcp/dccp: better use of ephemeral ports
Big servers have bloated bind table, making very hard to succeed
ephemeral port allocations, without special containers/namespace tricks.
This patch series extends the strategy added in commit 07f4c90062f8
("tcp/dccp: try to not exhaust ip_local_port_range in connect()").
Since ports used by connect() are much likely to be shared among them,
we give a hint to both bind() and connect() to keep the crowds separated
if possible.
Of course, if on a specific host an application needs to allocate ~30000
ports using bind(), it will still be able to do so. Same for ~30000 connect()
to a unique 2-tuple (dst addr, dst port)
New implemetation is also more friendly to softirqs and reschedules.
v2: rebase after TCP SO_REUSEPORT changes
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 12 Feb 2016 00:28:49 +0000 (16:28 -0800)]
tcp/dccp: better use of ephemeral ports in connect()
In commit 07f4c90062f8 ("tcp/dccp: try to not exhaust ip_local_port_range
in connect()"), I added a very simple heuristic, so that we got better
chances to use even ports, and allow bind() users to have more available
slots.
It gave nice results, but with more than 200,000 TCP sessions on a typical
server, the ~30,000 ephemeral ports are still a rare resource.
I chose to go a step further, by looking at all even ports, and if none
was available, fallback to odd ports.
The companion patch does the same in bind(), but in opposite way.
I've seen exec times of up to 30ms on busy servers, so I no longer
disable BH for the whole traversal, but only for each hash bucket.
I also call cond_resched() to be gentle to other tasks.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patchset is the first real use-case for kmem_cache bulk _free_.
The use of bulk _alloc_ is NOT included in this patchset. The full use
have previously been posted here [1].
The bulk free side have the largest benefit for the network stack
use-case, because network stack is hitting the kmem_cache/SLUB
slowpath when freeing SKBs, due to the amount of outstanding SKBs.
This is solved by using the new API kmem_cache_free_bulk().
Introduce new API napi_consume_skb(), that hides/handles bulk freeing
for the caller. The drivers simply need to use this call when freeing
SKBs in NAPI context, e.g. replacing their calles to dev_kfree_skb() /
dev_consume_skb_any().
ixgbe: bulk free SKBs during TX completion cleanup cycle
There is an opportunity to bulk free SKBs during reclaiming of
resources after DMA transmit completes in ixgbe_clean_tx_irq. Thus,
bulk freeing at this point does not introduce any added latency.
Simply use napi_consume_skb() which were recently introduced. The
napi_budget parameter is needed by napi_consume_skb() to detect if it
is called from netpoll.
Benchmarking IPv4-forwarding, on CPU i7-4790K @4.2GHz (no turbo boost)
Single CPU/flow numbers: before: 1982144 pps -> after : 2064446 pps
Improvement: +82302 pps, -20 nanosec, +4.1%
(SLUB and GCC version 5.1.1 20150618 (Red Hat 5.1.1-4))
Joint work with Alexander Duyck.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bulk free SKBs that were delay free'ed due to IRQ context
The network stack defers SKBs free, in-case free happens in IRQ or
when IRQs are disabled. This happens in __dev_kfree_skb_irq() that
writes SKBs that were free'ed during IRQ to the softirq completion
queue (softnet_data.completion_queue).
These SKBs are naturally delayed, and cleaned up during NET_TX_SOFTIRQ
in function net_tx_action(). Take advantage of this a use the skb
defer and flush API, as we are already in softirq context.
For modern drivers this rarely happens. Although most drivers do call
dev_kfree_skb_any(), which detects the situation and calls
__dev_kfree_skb_irq() when needed. This due to netpoll can call from
IRQ context.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bulk free infrastructure for NAPI context, use napi_consume_skb
Discovered that network stack were hitting the kmem_cache/SLUB
slowpath when freeing SKBs. Doing bulk free with kmem_cache_free_bulk
can speedup this slowpath.
NAPI context is a bit special, lets take advantage of that for bulk
free'ing SKBs.
In NAPI context we are running in softirq, which gives us certain
protection. A softirq can run on several CPUs at once. BUT the
important part is a softirq will never preempt another softirq running
on the same CPU. This gives us the opportunity to access per-cpu
variables in softirq context.
Extend napi_alloc_cache (before only contained page_frag_cache) to be
a struct with a small array based stack for holding SKBs. Introduce a
SKB defer and flush API for accessing this.
Introduce napi_consume_skb() as replacement for e.g. dev_consume_skb_any()
when running in NAPI context. A small trick to handle/detect if we
are called from netpoll is to see if budget is 0. In that case, we
need to invoke dev_consume_skb_irq().
Joint work with Alexander Duyck.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This small set is a follow-up for the recent patches that added ethtool
get/set settings. Patch 1 changes the speed validation routine to check
if the speed is between 0 and INT_MAX (or SPEED_UNKNOWN) and patch 2 adds
port validation to virtio_net and better validation comment.
This set is on top of Michael's patch which explains that speeds from 0
to INT_MAX are valid:
http://patchwork.ozlabs.org/patch/578911/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
virtio_net: validate ethtool port setting and explain the user validation
We should validate the port setting that we got from the user and check
if it's what we've set it to (PORT_OTHER), also add explanation that
ignoring advertising is good as long as we don't have autonegotiation.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ethtool: make validate_speed accept all speeds between 0 and INT_MAX
Devices these days can have any speed and as was recently pointed out
any speed from 0 to INT_MAX is valid so adjust speed validation to
accept such values.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew F. Davis [Sun, 7 Feb 2016 17:47:19 +0000 (11:47 -0600)]
net: phy: dp83848: Reorganize code for readability and safety
Reorganize code by moving the desired interrupt mask definition
out of function. Also rearrange the enable/disable interrupt function
to prevent accidental over-writing of values in registers.
Signed-off-by: Andrew F. Davis <afd@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Thu, 11 Feb 2016 11:42:26 +0000 (12:42 +0100)]
be2net: don't report EVB for older chipsets when SR-IOV is disabled
The EVB (virtual bridge) functionality should be disabled on older BE3
and Lancer chips if SR-IOV is disabled in the NIC's BIOS. This setting
is identified by the zero value of total VFs reported by the card.
The GET_HSW_CONFIG command cannot be used as it is not supported by
these older chipset's FW.
v2: added the comment
Cc: Sathya Perla <sathya.perla@broadcom.com> Cc: Ajit Khaparde <ajit.khaparde@broadcom.com> Cc: Padmanabh Ratnakar <padmanabh.ratnakar@broadcom.com> Cc: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Cc: Somnath Kotur <somnath.kotur@broadcom.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Acked-by: Sathya Perla <sathya.perla@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 16:34:31 +0000 (11:34 -0500)]
Merge branch 'spi_ks8995'
Helmut Buchsbaum says:
====================
Add support for MICREL KSZ8795CLX 5-port switch
This patch series refactors the spi-ks8995 driver to finally add support
for the MICREL KSZ8795CLX. Additionally support for controlling a GPIO
line for resetting the switch is added.
Helmut
Changes since v2:
- use GPIO_ACTIVE_LOW according to Andrew's remark.
- use ePAPR compliant node name in example, thanks to Sergei for
pointing out
Changes since v1:
- removed initializing registers from Device Tree following Florian's
advice
- fixed GPIO handling for reset according to Andrew's remark.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
net: phy: spi_ks8995: generalize creation of SPI commands
Prepare creating SPI reads and writes for other switch families.
The KS8995 family uses the straight forward
<8bit CMD><8bit ADDR>
sequence.
To be able to support KSZ8795 family, which uses
<3bit CMD><12bit ADDR><1 bit TR>
make the SPI command creation chip variant dependent.
Signed-off-by: Helmut Buchsbaum <helmut.buchsbaum@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: phy: spi_ks8995: add support for resetting switch using GPIO
When using device tree it is no more possible to reset the PHY at board
level. Furthermore, doing in the driver allows to power down the switch
when it is not used any more.
The patch introduces a new optional property "reset-gpios" denoting an
appropriate GPIO handle, e.g.:
reset-gpios = <&gpio0 46 GPIO_ACTIVE_LOW>
Signed-off-by: Helmut Buchsbaum <helmut.buchsbaum@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 16:30:32 +0000 (11:30 -0500)]
Merge branch 'thunderx-irq-hints'
Sunil Goutham says:
====================
net: thunderx: Setting IRQ affinity hints and other optimizations
This patch series contains changes
- To add support for virtual function's irq affinity hint
- Replace napi_schedule() with napi_schedule_irqoff()
- Reduce page allocation overhead by allocating pages
of higher order when pagesize is 4KB.
- Add couple of stats which helps in debugging
- Some miscellaneous changes to BGX driver.
Changes from v1:
- As suggested changed MAC address invalid log message
to dev_err() instead of dev_warn().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Richter [Thu, 11 Feb 2016 16:20:25 +0000 (21:50 +0530)]
net: thunderx: bgx: Add log message when setting mac address
Signed-off-by: Robert Richter <rrichter@cavium.com> Signed-off-by: Sunil Goutham <sgoutham@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Daney [Thu, 11 Feb 2016 16:20:24 +0000 (21:50 +0530)]
net: thunderx: bgx: Use standard firmware node infrastructure.
In the case of OF device tree, the firmware information is attached to
the BGX device structure in the standard manner, so use the firmware
iterators and accessors where possible.
Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Sunil Goutham <sgoutham@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Thu, 11 Feb 2016 16:20:23 +0000 (21:50 +0530)]
net: thunderx: Assign affinity hints to vf's interrupts
This affinity hint can be used by user space irqbalance tool to set
preferred CPU mask for irqs registered by this VF. Irqbalance needs
to be in 'exact' mode to set irq affinity same as indicated by
affinity hint.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Thu, 11 Feb 2016 16:20:22 +0000 (21:50 +0530)]
net: thunderx: Use napi_schedule_irqoff()
napi_schedule is being called from hard irq context, hence
switch to napi_schedule_irqoff which avoids unneeded call
to local_irq_save and local_irq_restore.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When system is low on atomic memory, too many error messages are logged.
Since this is not a total failure but a simple switch to non-atomic allocation
better to have a stat.
Also add a stat for reset, kicked due to transmit watchdog timeout.
Signed-off-by: Thanneeru Srinivasulu <tsrinivasulu@caviumnetworks.com> Signed-off-by: Sunil Goutham <sgoutham@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 14:59:28 +0000 (09:59 -0500)]
Merge branch 'igmp-ns'
Nikolay Borisov says:
====================
Make igmp sysctl knobs namespace aware
This series continue making more of the net related sysctls
namespace aware. The first 2 and last patches are straight
forward and convert sysctls which weren't defined to be
namespace aware. The only thing in them is that each removes
a define which is used in only one place (to initialise
the respective sysctl) so I don't think this is a huge loss.
The third patch however, converts igmp_llm_reports which was
already defined in the ipv4_net_table but wasn't using any of
the net namespace infrastructure.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Borisov [Mon, 8 Feb 2016 22:13:50 +0000 (00:13 +0200)]
igmp: Namespaceify igmp_llm_reports sysctl knob
This was initially introduced in df2cf4a78e488d26 ("IGMP: Inhibit
reports for local multicast groups") by defining the sysctl in the
ipv4_net_table array, however it was never implemented to be
namespace aware. Fix this by changing the code accordingly.
Signed-off-by: David S. Miller <davem@davemloft.net>
Zhang Shengju [Tue, 9 Feb 2016 10:37:46 +0000 (10:37 +0000)]
bonding: use return instead of goto
Replace 'goto' with 'return' to remove unnecessary check at label:
err_undo_flags.
The reason is that 'err_undo_flags' do two things for the first slave device:
1.revert bond mac address if it is set by the slave device.
2.revert bond device type if it's not ARPHRD_ETHER.
It's not necessary for the following three places, they changed neither bond
mac address nor type. It's straightforward to return directly.
Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: hamradio: baycom_ser_fdx: Replace timeval with timespec64
32 bit systems using 'struct timeval' will break in the year 2038, so
we replace the code appropriately. However, this driver is not broken
in 2038 since we are only using microseconds portion of the time.
This patch replaces 'struct timeval' with 'struct timespec64'. We only
need to find elapsed microseconds rather than absolute time, so it's
better to use monotonic time, so using ktime_get_ts64() makes the code
more efficient and more robust against concurrent settimeofday()
calls.
Signed-off-by: Amitoj Kaur Chawla <amitoj1606@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Thomas Sailer <t.sailer@alumni.ethz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Tycho Andersen [Fri, 5 Feb 2016 16:20:52 +0000 (09:20 -0700)]
openvswitch: allow management from inside user namespaces
Operations with the GENL_ADMIN_PERM flag fail permissions checks because
this flag means we call netlink_capable, which uses the init user ns.
Instead, let's introduce a new flag, GENL_UNS_ADMIN_PERM for operations
which should be allowed inside a user namespace.
The motivation for this is to be able to run openvswitch in unprivileged
containers. I've tested this and it seems to work, but I really have no
idea about the security consequences of this patch, so thoughts would be
much appreciated.
v2: use the GENL_UNS_ADMIN_PERM flag instead of a check in each function
v3: use separate ifs for UNS_ADMIN_PERM and ADMIN_PERM, instead of one
massive one
Reported-by: James Page <james.page@canonical.com> Signed-off-by: Tycho Andersen <tycho.andersen@canonical.com> CC: Eric Biederman <ebiederm@xmission.com> CC: Pravin Shelar <pshelar@ovn.org> CC: Justin Pettit <jpettit@nicira.com> CC: "David S. Miller" <davem@davemloft.net> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
ethtool: future-proof interface for speed extensions
Many virtual and not quite virtual devices allow any speed to be set
through ethtool. In particular, this applies to the virtio-net devices.
Document this fact to make sure people don't assume the enum lists all
possible values. Reserve values greater than INT_MAX for future
extension and to avoid conflict with SPEED_UNKNOWN.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 13:55:42 +0000 (08:55 -0500)]
Merge branch 'gso-checksums'
Alexander Duyck says:
====================
Add GSO support for outer checksum w/ inner checksum offloads
This patch series updates the existing segmentation offload code for
tunnels to make better use of existing and updated GSO checksum
computation. This is done primarily through two mechanisms. First we
maintain a separate checksum in the GSO context block of the sk_buff. This
allows us to maintain two checksum values, one offloaded with values stored
in csum_start and csum_offset, and one computed and tracked in
SKB_GSO_CB(skb)->csum. By maintaining these two values we are able to take
advantage of the same sort of math used in local checksum offload so that
we can provide both inner and outer checksums with minimal overhead.
Below is the performance for a netperf session between an ixgbe PF and VF
on the same host but in different namespaces. As can be seen a significant
gain in performance can be had from allowing the use of Tx checksum offload
on the inner headers while performing a software offload on the outer
header computation:
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB
Changes from v1:
* Dropped use of CHECKSUM_UNNECESSARY for remote checksum offload
* Left encap_hdr_csum as it will likely be needed in future for SCTP GSO
* Broke the changes out over many more patches
* Updated GRE segmentation to more closely match UDP tunnel segmentation
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:28:26 +0000 (15:28 -0800)]
net: Allow tunnels to use inner checksum offloads with outer checksums needed
This patch enables us to use inner checksum offloads if provided by
hardware with outer checksums computed by software.
It basically reduces encap_hdr_csum to an advisory flag for now, but based
on the fact that SCTP may be getting segmentation support before long I
thought we may want to keep it as it is possible we may need to support
CRC32c and 1's compliment checksum in the same packet at some point in the
future.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:28:20 +0000 (15:28 -0800)]
udp: Use uh->len instead of skb->len to compute checksum in segmentation
The segmentation code was having to do a bunch of work to pull the
skb->len and strip the udp header offset before the value could be used to
adjust the checksum. Instead of doing all this work we can just use the
value that goes into uh->len since that is the correct value with the
correct byte order that we need anyway. By using this value we can save
ourselves a bunch of pain as there is no need to do multiple byte swaps.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:28:14 +0000 (15:28 -0800)]
udp: Clean up the use of flags in UDP segmentation offload
This patch goes though and cleans up the logic related to several of the
control flags used in UDP segmentation. Specifically the use of dont_encap
isn't really needed as we can just check the skb for CHECKSUM_PARTIAL and
if it isn't set then we don't need to update the internal headers. As such
we can just drop that value.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:28:01 +0000 (15:28 -0800)]
gre: Use GSO flags to determine csum need instead of GRE flags
This patch updates the gre checksum path to follow something much closer to
the UDP checksum path. By doing this we can avoid needing to do as much
header inspection and can just make use of the fields we were already
reading in the sk_buff structure.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:27:55 +0000 (15:27 -0800)]
net: Move skb_has_shared_frag check out of GRE code and into segmentation
The call skb_has_shared_frag is used in the GRE path and skb_checksum_help
to verify that no frags can be modified by an external entity. This check
really doesn't belong in the GRE path but in the skb_segment function
itself. This way any protocol that might be segmented will be performing
this check before attempting to offload a checksum to software.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:27:49 +0000 (15:27 -0800)]
net: Store checksum result for offloaded GSO checksums
This patch makes it so that we can offload the checksums for a packet up
to a certain point and then begin computing the checksums via software.
Setting this up is fairly straight forward as all we need to do is reset
the values stored in csum and csum_start for the GSO context block.
One complication for this is remote checksum offload. In order to allow
the inner checksums to be offloaded while computing the outer checksum
manually we needed to have some way of indicating that the offload wasn't
real. In order to do that I replaced CHECKSUM_PARTIAL with
CHECKSUM_UNNECESSARY in the case of us computing checksums for the outer
header while skipping computing checksums for the inner headers. We clean
up the ip_summed flag and set it to either CHECKSUM_PARTIAL or
CHECKSUM_NONE once we hand the packet off to the next lower level.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:27:43 +0000 (15:27 -0800)]
net: Update remote checksum segmentation to support use of GSO checksum
This patch addresses two main issues.
First in the case of remote checksum offload we were avoiding dealing with
scatter-gather issues. As a result it would be possible to assemble a
series of frames that used frags instead of being linearized as they should
have if remote checksum offload was enabled.
Second I have updated the code so that we now let GSO take care of doing
the checksum on the data itself and drop the special case that was added
for remote checksum offload.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:27:37 +0000 (15:27 -0800)]
net: Move GSO csum into SKB_GSO_CB
This patch moves the checksum maintained by GSO out of skb->csum and into
the GSO context block in order to allow for us to work on outer checksums
while maintaining the inner checksum offsets in the case of the inner
checksum being offloaded, while the outer checksums will be computed.
While updating the code I also did a minor cleanu-up on gso_make_checksum.
The change is mostly to make it so that we store the values and compute the
checksum instead of computing the checksum and then storing the values we
needed to update.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Fri, 5 Feb 2016 23:27:31 +0000 (15:27 -0800)]
net: Drop unecessary enc_features variable from tunnel segmentation functions
The enc_features variable isn't necessary since features isn't used
anywhere after we create enc_features so instead just use a destructive AND
on features itself and save ourselves the variable declaration.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
hv_netvsc: cleanup netdev feature flags for netvsc
1. Adding NETIF_F_TSO6 feature flag;
2. Adding NETIF_F_HW_CSUM. NETIF_F_IPV6_CSUM and NETIF_F_IP_CSUM are
being deprecated;
3. Cleanup the coding style of flag assignment by using macro.
Signed-off-by: Simon Xiao <sixiao@microsoft.com> Reviewed-by: K. Y. Srinivasan <kys@microsoft.com> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 12:16:29 +0000 (07:16 -0500)]
Merge branch 'ethtool-nfc-ipv6'
Edward Cree says:
====================
IPv6 NFC
This series adds support for steering IPv6 flows using the ethtool NFC
interface, and implements it for sfc devices.
Tested using an in-development patch to the ethtool utility.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Fri, 5 Feb 2016 11:16:50 +0000 (11:16 +0000)]
sfc: implement IPv6 NFC (and IPV4_USER_FLOW)
Signed-off-by: Edward Cree <ecree@solarflare.com> Reviewed-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Fri, 5 Feb 2016 11:16:21 +0000 (11:16 +0000)]
ethtool: add IPv6 to the NFC API
Signed-off-by: Edward Cree <ecree@solarflare.com> Reviewed-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 12:13:29 +0000 (07:13 -0500)]
Merge branch 'cxgb4-tos'
Hariprasad Shenai says:
====================
Add TOS support and some cleanup
This series adds TOS support for iWARP and also does some cleanup to make
code more readable. Patch series is created against infiniband tree and
includes patches on iw_cxgb4 and cxgb4 driver.
We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
This series provides support for iWARP applications to specify a TOS
value and have that map to a VLAN Priority for iw_cxgb4 iWARP connections.
In iw_cxgb4, when allocating an L2T entry, pass the skb_priority based
on the tos value in the cm_id. Also pass the correct tos value during
connection setup so the passive side gets the client's desired tos.
When sending the FLOWC work request to FW, if the egress device is
in a vlan, then use the vlan priority bits as the scheduling class.
This allows associating RDMA connections with scheduling classes to
provide traffic shaping per flow.
Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Don't log errors if a listening endpoint is going away when procesing a
PASS_ACCEPT_REQ message. This can happen. Change the error printk to
a PDBG() debug log entry
Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
iw_cxgb4: make queue allocation code more readable
Rename local mm* variables to more meaningful names
Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 11:15:00 +0000 (06:15 -0500)]
Merge branch 'fec-next'
Troy Kisky says:
====================
net: fec: cleanup/fixes
V2 is a rebase on top of johannes endian-safe patch and
is only the 1st eight patches.
The testing for this series was done on a nitrogen6x.
The base commit was
commit b45efa30a626e915192a6c548cd8642379cd47cc
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Testing showed no change in performance.
Testing used imx_v6_v7_defconfig + CONFIG_MICREL_PHY.
The processor was running at 996Mhz.
The following commands were used to get the transfer rates.
On an x86 ubunto system,
iperf -s -i.5 -u
On a nitrogen6x board, running via SD Card.
I first stopped some background processes
There is a branch available on github with this series, and the rest of
my fec patches, for those who would like to test it.
https://github.com:boundarydevices/linux-imx6.git branch net-next_master
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Troy Kisky [Fri, 5 Feb 2016 21:52:47 +0000 (14:52 -0700)]
net: fec: add variable reg_desc_active to speed things up
There is no need for complex macros every time we need to activate
a queue. Also, no need to call skb_get_queue_mapping when we already
know which queue it is using.
Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Troy Kisky [Fri, 5 Feb 2016 21:52:45 +0000 (14:52 -0700)]
net: fec: fix fec_enet_get_free_txdesc_num
When first initialized, cur_tx points to the 1st
entry in the queue, and dirty_tx points to the last.
At this point, fec_enet_get_free_txdesc_num will
return tx_ring_size -2. If tx_ring_size -2 entries
are now queued, then fec_enet_get_free_txdesc_num
should return 0, but it returns tx_ring_size instead.
Signed-off-by: Troy Kisky <troy.kisky@boundarydevices.com> Signed-off-by: David S. Miller <davem@davemloft.net>
bonding: 3ad: allow to set ad_actor settings while the bond is up
No need to require the bond down while changing these settings, the change
will be reflected immediately and the 3ad mode will sort itself out.
For faster convergence set port->ntt to true in order to generate new
LACPDUs immediately.
CC: Jay Vosburgh <j.vosburgh@gmail.com> CC: Veaceslav Falico <vfalico@gmail.com> CC: Andy Gospodarek <gospo@cumulusnetworks.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Thu, 4 Feb 2016 12:31:20 +0000 (13:31 +0100)]
ipv6: add option to drop unsolicited neighbor advertisements
In certain 802.11 wireless deployments, there will be NA proxies
that use knowledge of the network to correctly answer requests.
To prevent unsolicitd advertisements on the shared medium from
being a problem, on such deployments wireless needs to drop them.
Enable this by providing an option called "drop_unsolicited_na".
Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Thu, 4 Feb 2016 12:31:19 +0000 (13:31 +0100)]
ipv6: add option to drop unicast encapsulated in L2 multicast
In order to solve a problem with 802.11, the so-called hole-196 attack,
add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if
enabled, causes the stack to drop IPv6 unicast packets encapsulated in
link-layer multi- or broadcast frames. Such frames can (as an attack)
be created by any member of the same wireless network and transmitted
as valid encrypted frames since the symmetric key for broadcast frames
is shared between all stations.
Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Thu, 4 Feb 2016 12:31:18 +0000 (13:31 +0100)]
ipv4: add option to drop gratuitous ARP packets
In certain 802.11 wireless deployments, there will be ARP proxies
that use knowledge of the network to correctly answer requests.
To prevent gratuitous ARP frames on the shared medium from being
a problem, on such deployments wireless needs to drop them.
Enable this by providing an option called "drop_gratuitous_arp".
Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Thu, 4 Feb 2016 12:31:17 +0000 (13:31 +0100)]
ipv4: add option to drop unicast encapsulated in L2 multicast
In order to solve a problem with 802.11, the so-called hole-196 attack,
add an option (sysctl) called "drop_unicast_in_l2_multicast" which, if
enabled, causes the stack to drop IPv4 unicast packets encapsulated in
link-layer multi- or broadcast frames. Such frames can (as an attack)
be created by any member of the same wireless network and transmitted
as valid encrypted frames since the symmetric key for broadcast frames
is shared between all stations.
Additionally, enabling this option provides compliance with a SHOULD
clause of RFC 1122.
Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Siva Reddy Kallam <siva.kallam@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Acked-by: Prashant Sreedharan <prashant@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 2 Feb 2016 16:17:07 +0000 (08:17 -0800)]
net: Add support for filtering link dump by master device and kind
Add support for filtering link dumps by master device and kind, similar
to the filtering implemented for neighbor dumps.
Each net_device that exists adds between 1196 bytes (eth) and 1556 bytes
(bridge) to the link dump. As the number of interfaces increases so does
the amount of data pushed to user space for a link list. If the user
only wants to see a list of specific devices (e.g., interfaces enslaved
to a specific bridge or a list of VRFs) most of that data is thrown away.
Passing the filters to the kernel to have only relevant data returned
makes the dump more efficient.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 11 Feb 2016 08:54:23 +0000 (03:54 -0500)]
Merge branch 'tcp-fast-so_reuseport'
Craig Gallek says:
====================
Faster SO_REUSEPORT for TCP
This patch series complements an earlier series (6a5ef90c58da)
which added faster SO_REUSEPORT lookup for UDP sockets by
extending the feature to TCP sockets. It uses the same
array-based data structure which allows for socket selection
after finding the first listening socket that matches an incoming
packet. Prior to this feature, every socket in the reuseport
group needed to be found and examined before a selection could be
made.
With this series the SO_ATTACH_REUSEPORT_CBPF and
SO_ATTACH_REUSEPORT_EBPF socket options now work for TCP sockets
as well. The test at the end of the series includes an example of
how to use these options to select a reuseport socket based on the
cpu core id handling the incoming packet.
There are several refactoring patches that precede the feature
implementation. Only the last two patches in this series
should result in any behavioral changes.
v4
- Fix build issue when compiling IPv6 as a module. This required
moving the ipv6_rcv_saddr_equal into an object that is included as a
built-in object. I included this change in the second patch which
adds inet6_hash since that is where ipv6_rcv_saddr_equal will
later be called from non-module code.
v3:
- Another warning in the first patch caught by a build bot. Return 0 in
the no-op UDP hash function.
v2:
- In the first patched I missed a couple of hash functions that should now be
returning int instead of void. I missed these the first time through as it
only generated a warning and not an error :\
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Wed, 10 Feb 2016 16:50:41 +0000 (11:50 -0500)]
soreuseport: BPF selection functional test for TCP
Unfortunately the existing test relied on packet payload in order to
map incoming packets to sockets. In order to get this to work with TCP,
TCP_FASTOPEN needed to be used.
Since the fast open path is slightly different than the standard TCP path,
I created a second test which sends to reuseport group members based
on receiving cpu core id. This will probably serve as a better
real-world example use as well.
Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Wed, 10 Feb 2016 16:50:40 +0000 (11:50 -0500)]
soreuseport: fast reuseport TCP socket selection
This change extends the fast SO_REUSEPORT socket lookup implemented
for UDP to TCP. Listener sockets with SO_REUSEPORT and the same
receive address are additionally added to an array for faster
random access. This means that only a single socket from the group
must be found in the listener list before any socket in the group can
be used to receive a packet. Previously, every socket in the group
needed to be considered before handing off the incoming packet.
This feature also exposes the ability to use a BPF program when
selecting a socket from a reuseport group.
Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Wed, 10 Feb 2016 16:50:39 +0000 (11:50 -0500)]
soreuseport: Prep for fast reuseport TCP socket selection
Both of the lines in this patch probably should have been included
in the initial implementation of this code for generic socket
support, but weren't technically necessary since only UDP sockets
were supported.
First, the sk_reuseport_cb points to a structure which assumes
each socket in the group has this pointer assigned at the same
time it's added to the array in the structure. The sk_clone_lock
function breaks this assumption. Since a child socket shouldn't
implicitly be in a reuseport group, the simple fix is to clear
the field in the clone.
Second, the SO_ATTACH_REUSEPORT_xBPF socket options require that
SO_REUSEPORT also be set first. For UDP sockets, this is easily
enforced at bind-time since that process both puts the socket in
the appropriate receive hlist and updates the reuseport structures.
Since these operations can happen at two different times for TCP
sockets (bind and listen) it must be explicitly checked to enforce
the use of SO_REUSEPORT with SO_ATTACH_REUSEPORT_xBPF in the
setsockopt call.
Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Wed, 10 Feb 2016 16:50:38 +0000 (11:50 -0500)]
inet: refactor inet[6]_lookup functions to take skb
This is a preliminary step to allow fast socket lookup of SO_REUSEPORT
groups. Doing so with a BPF filter will require access to the
skb in question. This change plumbs the skb (and offset to payload
data) through the call stack to the listening socket lookup
implementations where it will be used in a following patch.
Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>