mlx4: use napi_consume_skb API to get bulk free operations
Bulk freeing of SKBs happens transparently through the napi_consume_skb()
API call. The napi budget parameter is normally needed by napi_consume_skb()
to detect whether it is called from netpoll. In this patch it has an extra
meaning.
For the mlx4 driver, the mlx4_en_stop_port() call is made outside
NAPI/softirq context and cleans up the entire TX ring via
mlx4_en_free_tx_buf(). The mlx4_en_free_tx_desc() code for
freeing SKBs is shared with NAPI calls.
To handle this shared use, the zero budget indication is reused
and handled appropriately in napi_consume_skb(). To reflect this, the
variable is called napi_mode in the function call that needs
this distinction.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: adjust napi_consume_skb to handle non-NAPI callers
Some drivers reuse/share code paths that free SKBs between NAPI
and non-NAPI calls. Adjust napi_consume_skb to handle this
use-case.
Before, calls from netpoll (w/ IRQs disabled) were handled and
indicated by a budget of zero. Use the same zero
indication to handle calls not originating from NAPI/softirq.
This is handled simply by calling dev_consume_skb_any().
This adds an extra branch+call for the netpoll case (checking
in_irq() + irqs_disabled()), but that is okay as this is a slowpath.
Suggested-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
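The budget convention described in the commit above can be sketched as a small user-space model. This is only an illustration under stated assumptions: the enum and the function choose_free_path() are hypothetical names, and the real napi_consume_skb() operates on struct sk_buff rather than returning a label.

```c
/* Hypothetical labels for the two ways an skb can be released in this model. */
enum free_path {
    FREE_PATH_ANY_CONTEXT, /* stands in for dev_consume_skb_any() */
    FREE_PATH_NAPI_BULK    /* stands in for the NAPI bulk-free path */
};

/*
 * A budget of zero means "not called from NAPI/softirq context"
 * (netpoll, or a process-context cleanup such as a port stop).
 * Any other budget means the caller is inside a NAPI poll.
 */
static enum free_path choose_free_path(int budget)
{
    if (budget == 0)
        return FREE_PATH_ANY_CONTEXT;
    return FREE_PATH_NAPI_BULK;
}
```

A driver that shares its TX cleanup code between NAPI and non-NAPI callers would pass 0 from the non-NAPI caller and the real poll budget from the NAPI handler.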
Jiri Pirko [Thu, 10 Mar 2016 22:10:21 +0000 (23:10 +0100)]
mlxsw: pci: Implement reset done check
Firmware now tells us that the reset is done by passing a magic value
via register. Use it to shorten the wait in case this is supported.
With old firmware, we still wait until the timeout is reached.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
sctp: allow sctp_transmit_packet and others to use gfp
Currently sctp_sendmsg() triggers some calls that will allocate memory
with GFP_ATOMIC even when not necessary. In the case of
sctp_packet_transmit it will allocate a linear skb that will be used to
construct the packet and this may cause sends to fail due to ENOMEM more
often than anticipated, especially with big MTUs.
This patch thus allows it to inherit gfp flags from upper calls so that
it can use GFP_KERNEL if it was triggered by a sctp_sendmsg call or
similar. All others, like retransmits or flushes started from BH, are
still allocated using GFP_ATOMIC.
In netperf tests this didn't result in any performance drawbacks when
memory is not too fragmented and made it trigger ENOMEM way less often.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Samuel Gauthier [Thu, 10 Mar 2016 16:14:59 +0000 (17:14 +0100)]
ovs: allow nl 'flow set' to use ufid without flow key
When we want to change a flow using netlink, we have to identify it to
be able to perform a lookup. Both the flow key and unique flow ID
(ufid) are valid identifiers, but we always have to specify the flow
key in the netlink message. When both attributes are there, the ufid
is used. The flow key is used to validate the actions provided by
the userland.
This commit allows the ufid to be used without having to provide the flow
key, as is already done in the netlink 'flow get' and 'flow del'
paths. The flow key remains mandatory when an action is provided.
Signed-off-by: Samuel Gauthier <samuel.gauthier@6wind.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Acked-by: Pravin B Shelar <pshelar@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Ferre [Thu, 10 Mar 2016 15:44:32 +0000 (16:44 +0100)]
net: macb: fix default configuration for GMAC on AT91
On AT91 SoCs, the User Register (USRIO) exposes a switch to configure the
"Reduced" or "Traditional" version of the Media Independent Interface
(RMII vs. MII or RGMII vs. GMII).
As on the older EMAC version, on GMAC, this switch is set by default to the
non-reduced type of interface, so use the existing capability and extend it to
GMII as well. We then keep the current logic in the macb_init() function.
The capabilities of sama5d2, sama5d4 and sama5d3 GEM interface are updated in
the macb_config structure to be able to properly enable them with a traditional
interface (GMII or MII).
Reported-by: Romain HENRIET <romain.henriet@l-acoustics.com> Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
LABBE Corentin [Thu, 10 Mar 2016 12:58:58 +0000 (13:58 +0100)]
phy: remove documentation of removed members of phy_device structure
Commit e5a03bfd873c ("phy: Add an mdio_device structure") removed the addr,
bus and dev members of the phy_device structure.
This patch removes the documentation about those members.
Signed-off-by: LABBE Corentin <clabbe.montjoie@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
xen-netback: fix multiple extra info handling
If a frontend passes multiple extra info fragments to netback on the guest
transmit side, because xen-netback does not account for this properly, only
a single ack response will be sent. This will eventually cause processing
of the shared ring to wedge.
This series re-imports the canonical netif.h from Xen, where the ring
protocol documentation has been updated, fixes this issue in xen-netback
and also adds a patch to reduce log spam.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Durrant [Thu, 10 Mar 2016 12:30:28 +0000 (12:30 +0000)]
xen-netback: reduce log spam
Remove the "prepare for reconnect" pr_info in xenbus.c. It's largely
uninteresting and the states of the frontend and backend can easily be
observed by watching the (o)xenstored log.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Durrant [Thu, 10 Mar 2016 12:30:27 +0000 (12:30 +0000)]
xen-netback: support multiple extra info fragments passed from frontend
The code does not currently support a frontend passing multiple extra info
fragments to the backend in a tx request. The xenvif_get_extras() function
handles multiple extra_info fragments but make_tx_response() assumes there
is only ever a single extra info fragment.
This patch modifies xenvif_get_extras() to pass back a count of extra
info fragments, which is then passed to make_tx_response() (after
possibly being stashed in pending_tx_info for deferred responses).
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
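Why a miscounted acknowledgement wedges the ring can be illustrated with a toy model of the slot accounting. This is a sketch under assumptions, not the netback code: struct ring_model and all function names are hypothetical, and the "single extra" behaviour is modelled from the description in the commit message above.

```c
/* Simplified, hypothetical model of the shared ring indices. */
struct ring_model {
    unsigned int req_cons; /* request slots consumed by the backend */
    unsigned int rsp_prod; /* response slots produced by the backend */
};

/* A tx request occupies one slot plus one slot per extra info fragment. */
static void consume_tx_request(struct ring_model *ring, unsigned int extra_count)
{
    ring->req_cons += 1 + extra_count;
}

/* Behaviour described as broken: at most one extra slot is acknowledged. */
static void respond_single_extra(struct ring_model *ring, unsigned int extra_count)
{
    ring->rsp_prod += 1 + (extra_count > 0 ? 1 : 0);
}

/* Behaviour described by the patch: one response per consumed slot. */
static void respond_all_extras(struct ring_model *ring, unsigned int extra_count)
{
    ring->rsp_prod += 1 + extra_count;
}

/* Slots consumed but never acknowledged; these are never handed back. */
static unsigned int unacked_slots(const struct ring_model *ring)
{
    return ring->req_cons - ring->rsp_prod;
}
```

In this model, every request carrying more than one extra fragment leaks slots under the old accounting, so the ring eventually runs out of free slots.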
Paul Durrant [Thu, 10 Mar 2016 12:30:26 +0000 (12:30 +0000)]
xen-netback: re-import canonical netif header
The canonical netif header (in the Xen source repo) and the Linux variant
have diverged significantly. Recently much documentation has been added to
the canonical header which is highly useful for developers making
modifications to either xen-netfront or xen-netback. This patch therefore
re-imports the canonical header in its entirety.
To maintain compatibility and some style consistency with the old Linux
variant, the header was stripped of its emacs boilerplate, and
post-processed and copied into place with the following commands:
ed -s netif.h << EOF
H
,s/NETTXF_/XEN_NETTXF_/g
,s/NETRXF_/XEN_NETRXF_/g
,s/NETIF_/XEN_NETIF_/g
,s/XEN_XEN_/XEN_/g
,s/netif/xen_netif/g
,s/xen_xen_/xen_/g
,s/^typedef.*$//g
,s/^ /${TAB}/g
w
$
w
EOF
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Thu, 10 Mar 2016 07:31:57 +0000 (15:31 +0800)]
sctp: fix the transports round robin issue when init is retransmitted
Prior to this patch, if we have two paths in one assoc at the beginning,
they may have the same params other than last_time_heard, and the paths
are tried like this:
1st cycle:
try trans1, fail.
then trans2 is selected (because its last_time_heard is after trans1's).
2nd cycle:
try trans2, fail.
then trans2 is selected (because its last_time_heard is after trans1's).
3rd cycle:
try trans2, fail.
then trans2 is selected (because its last_time_heard is after trans1's).
....
trans1 will never have a chance to be selected, which is not what we expect.
We should keep round-robining over all the paths if they have just been
added at the beginning.
So first, every transport's last_time_heard should be initialized to 0, so
that they all have the same value at the beginning. Only then can
all the transports get an equal chance to be selected.
Then sctp_trans_elect_best should return trans_next when
*trans == *trans_next, so that we can try the next one if it fails, but
currently it always returns trans. We fix this by exchanging these two
params when we call sctp_trans_elect_tie().
Fixes: 4c47af4d5eb2 ('net: sctp: rework multihoming retransmission path selection to rfc4960') Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
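The effect of swapping the two arguments in the commit above can be sketched with a toy tie-breaker. This is a model under assumptions, not the SCTP implementation: struct transport_model, elect_tie() and elect_next() are hypothetical names, and the only criterion modelled is last_time_heard.

```c
/* Hypothetical, reduced view of a transport: only the field the text mentions. */
struct transport_model {
    int id;
    long last_time_heard;
};

/*
 * Toy tie-breaker: the second argument wins only if it was heard from
 * strictly later; on a full tie the FIRST argument is returned.
 */
static const struct transport_model *
elect_tie(const struct transport_model *first, const struct transport_model *second)
{
    if (second->last_time_heard > first->last_time_heard)
        return second;
    return first;
}

/*
 * Passing the candidate ("next") first makes it win ties, so a failed
 * current transport is not re-selected when everything else is equal.
 */
static const struct transport_model *
elect_next(const struct transport_model *current, const struct transport_model *next)
{
    return elect_tie(next, current);
}
```

With both transports initialised to the same last_time_heard, elect_next() alternates between them, whereas calling elect_tie(current, next) keeps returning the current transport.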
David Howells [Wed, 9 Mar 2016 23:22:56 +0000 (23:22 +0000)]
rxrpc: Replace all unsigned with unsigned int
Replace all "unsigned" types with "unsigned int" types.
Reported-by: David Miller <davem@davemloft.net> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 13 Mar 2016 19:03:34 +0000 (15:03 -0400)]
Merge tag 'wireless-drivers-next-for-davem-2016-03-09' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
Kalle Valo says:
====================
wireless-drivers patches for 4.6
Major changes:
ath10k
* dt: add bindings for ipq4019 wifi block
* start adding support for qca4019 chip
ath9k
* add device ID for Toshiba WLM-20U2/GN-1080
* allow more than one interface on DFS channels
bcma
* move flash detection code to ChipCommon core driver
brcmfmac
* IPv6 Neighbor discovery offload
* driver settings that can be populated from different sources
* country code setting in firmware
* length checks to validate firmware events
* new way to determine device memory size needed for BCM4366
* various offloads during Wake on Wireless LAN (WoWLAN)
* full Management Frame Protection (MFP) support
iwlwifi
* add support for thermal device / cooling device
* improvements in scheduled scan without profiles
* new firmware support (-21.ucode)
* add MSIX support for 9000 devices
* enable MU-MIMO and take care of firmware restart
* add support for large SKBs in mvm to reach A-MSDU
* add support for filtering frames from a BA session
* start implementing the new Rx path for 9000 devices
* enable the new Radio Resource Management (RRM) nl80211 feature flag
* add a new module parameter to disable VHT
* build infrastructure for Dynamic Queue Allocation
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
====================
A couple of minor clean-ups and optimizations
This patch series is basically just a v2 of a couple patches I recently
submitted.
The two patches aren't technically related; they are just items I found
while cleaning up and prepping some further work to enable Tx checksums for
tunnels.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Wed, 9 Mar 2016 17:25:26 +0000 (09:25 -0800)]
csum: Update csum_block_add to use rotate instead of byteswap
The code for csum_block_add was doing a funky byteswap to swap the even and
odd bytes of the checksum if the offset was odd. Instead of doing this we
can save ourselves some trouble and just rotate by 8 as this should have the
same effect in terms of the final checksum value and only requires one
instruction.
In addition we can update csum_block_sub to just use csum_block_add with an
inverse value for csum2. This way we follow the same code path as
csum_block_add without having to duplicate it.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
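The equivalence claimed in the csum_block_add commit above can be checked in user space. The sketch below is an illustration, not the kernel code: all function names are hypothetical, and it uses plain uint32_t instead of the kernel's __wsum type. It relies on the fact that the final 16-bit fold is one's-complement arithmetic, where both a byte swap and a 32-bit rotate by 8 amount to multiplying by 2^8.

```c
#include <stdint.h>

/* Fold a 32-bit partial sum into 16 bits with end-around carry. */
static uint16_t csum_fold16(uint32_t sum)
{
    sum = (sum & 0xffffu) + (sum >> 16);
    sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Older approach: swap the even and odd bytes of the 32-bit sum. */
static uint32_t swap_even_odd_bytes(uint32_t sum)
{
    return ((sum & 0x00ff00ffu) << 8) | ((sum & 0xff00ff00u) >> 8);
}

/* Newer approach: a single rotate right by 8 bits. */
static uint32_t rotate_right_8(uint32_t sum)
{
    return (sum >> 8) | (sum << 24);
}

/* One's-complement 32-bit addition (carry is added back in). */
static uint32_t csum_add32(uint32_t a, uint32_t b)
{
    uint32_t res = a + b;
    return res + (res < b);
}

/* Subtraction expressed as addition of the bitwise inverse. */
static uint32_t csum_sub32(uint32_t a, uint32_t b)
{
    return csum_add32(a, ~b);
}
```

For example, 0x12345678 rotates to 0x78123456 and byte-swaps to 0x34127856; both fold to 0xac68, so the final checksum is the same either way.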
Alexander Duyck [Wed, 9 Mar 2016 17:24:23 +0000 (09:24 -0800)]
gro: Defer clearing of flush bit in tunnel paths
This patch updates the GRO handlers for GRE, VXLAN, GENEVE, and FOU so that
we do not clear the flush bit until after we have called the next level GRO
handler. Previously this was being cleared before parsing through the list
of frames; however, this resulted in several paths where either the bit
needed to be reset but wasn't, as in the case of FOU, or where it was
being set, as in GENEVE. By just deferring the clearing of the bit until
after the next level protocol has been parsed we can avoid any unnecessary
bit twiddling and avoid bugs.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This series contains several changes to driver interaction with the
management fw.
The biggest [& most significant] change here is a change in the locking
scheme and re-definition of the 'critical section' when accessing shared
resources toward the goal of interacting with the management firmware.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yuval Mintz [Wed, 9 Mar 2016 07:16:26 +0000 (09:16 +0200)]
qed: Enlarge the drain timeout
In the scenario where slowpath configuration isn't passing due to
various pause configurations affecting the chip, the theoretical time
required in worst-case-scenario to empty hw fifos sufficiently to
guarantee that slowpath configuration would flow is currently
insufficient.
This increases such a drain request to the theoretical maximum.
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Zvi Nachmani [Wed, 9 Mar 2016 07:16:25 +0000 (09:16 +0200)]
qed: Notify of transceiver changes
Handle a new message from the MFW, one that indicates that the transceiver
state has changed, and log that into the system logs.
Signed-off-by: Zvi Nachmani <Zvi.Nachmani@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tomer Tayar [Wed, 9 Mar 2016 07:16:24 +0000 (09:16 +0200)]
qed: Major changes to MB locking
Driver interaction with the management firmware is done via mailbox
commands, which the management firmware periodically samples, as well
as by placing additional data in set places in the shared memory.
Each PF has a single designated mailbox address, and all flows that
require messaging to the management should use it.
This patch does 2 things:
1. It re-defines the critical section surrounding the mailbox sending -
that section should include the setting of the shared memory as well as
the sending of the command [otherwise a race might send a command with
the data of a different command].
2. It moves the locking scheme from using mutexes to using spinlocks.
This lays the groundwork for sending MFW commands from non-sleepable
contexts.
Signed-off-by: Tomer Tayar <Tomer.Tayar@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When the device is configured for Multi-function mode, some older management
firmware might incorrectly notify interfaces of link changes while they
haven't requested the physical link configuration to be set.
This can create bizarre race conditions where unloading interfaces are
getting notified that the link is up.
Let the driver compensate - store the logical requested state of the link
and don't propagate notifications after the protocol driver explicitly
requires the link to be unset.
Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 11 Mar 2016 20:14:27 +0000 (15:14 -0500)]
Merge branch 'bpf-flow-labels'
Daniel Borkmann says:
====================
BPF support for flow labels
This set adds support for tunnel key flow labels for vxlan
and geneve devices in collect meta data mode and eBPF support
for managing these. For details please see individual patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 9 Mar 2016 02:00:05 +0000 (03:00 +0100)]
bpf: support flow label for bpf_skb_{set, get}_tunnel_key
This patch extends bpf_tunnel_key with a tunnel_label member, that maps
to ip_tunnel_key's label so underlying backends like vxlan and geneve
can propagate the label to udp_tunnel6_xmit_skb(), where it's being set
in the IPv6 header. It allows for having 20 more bits to encode/decode
flow related meta information programmatically. Tested with vxlan and
geneve.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
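The "20 more bits" mentioned in the commit above correspond to the flow label field in the first 32-bit word of the IPv6 header, next to the 4-bit version and the 8-bit traffic class. The sketch below models that packing in host byte order only; the helper names and the mask macro are hypothetical, and the kernel itself works on network-byte-order values.

```c
#include <stdint.h>

/* The flow label occupies the low 20 bits of the first header word (host order). */
#define FLOW_LABEL_MASK_HOST 0x000fffffu

/* Build the first 32-bit word of an IPv6 header in host byte order. */
static uint32_t ipv6_first_word(uint8_t tclass, uint32_t label)
{
    return (6u << 28) | ((uint32_t)tclass << 20) | (label & FLOW_LABEL_MASK_HOST);
}

/* Extract the 20-bit flow label from that word. */
static uint32_t ipv6_flow_label(uint32_t first_word)
{
    return first_word & FLOW_LABEL_MASK_HOST;
}
```

Any bits above the low 20 in the supplied label are dropped by the mask, so a program that wants to encode metadata has exactly 20 bits to work with.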
Daniel Borkmann [Wed, 9 Mar 2016 02:00:04 +0000 (03:00 +0100)]
geneve: support setting IPv6 flow label
This work adds support for setting the IPv6 flow label for geneve per
device and through collect metadata (ip_tunnel_key) frontends. Also here,
the geneve dst cache does not need any special considerations, for the
cases where caches can be used, the label is static per cache.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 9 Mar 2016 02:00:03 +0000 (03:00 +0100)]
vxlan: support setting IPv6 flow label
This work adds support for setting the IPv6 flow label for vxlan per
device and through collect metadata (ip_tunnel_key) frontends. The
vxlan dst cache does not need any special considerations here, for
the cases where caches can be used, the label is static per cache.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 9 Mar 2016 02:00:02 +0000 (03:00 +0100)]
ip_tunnel: add support for setting flow label via collect metadata
This patch extends udp_tunnel6_xmit_skb() to pass in the IPv6 flow label
from call sites. Currently, there's no such option and it's always set to
zero when writing ip6_flow_hdr(). Add a label member to ip_tunnel_key, so
that flow-based tunnels via collect metadata frontends can make use of it.
vxlan and geneve will be converted to add flow label support separately.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
This fixes a regression in the bridge ageing time caused by:
commit c62987bbd8a1 ("bridge: push bridge setting ageing_time down to switchdev")
There are users of Linux bridge which use the feature that if ageing time
is set to 0 it causes entries to never expire. See:
https://www.linuxfoundation.org/collaborate/workgroups/networking/bridge
For a pure software bridge, it is unnecessary for the code to have
arbitrary restrictions on what values are allowable.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Tue, 8 Mar 2016 20:59:34 +0000 (12:59 -0800)]
rocker: set FDB cleanup timer according to lowest ageing time
In rocker, ageing time is a per-port attribute, so the next time the FDB
cleanup timer fires should be set according to the lowest ageing time.
This will later allow us to delete the BR_MIN_AGEING_TIME macro, which was
added to guarantee minimum ageing time in the bridge layer, thereby breaking
existing behavior.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
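The idea of arming one cleanup timer from several per-port ageing times, as in the rocker commit above, can be sketched as a simple minimum search. This is an illustration only: the function name, the array representation and the fallback parameter are assumptions, not the driver's data structures.

```c
#include <stddef.h>

/*
 * Return the lowest ageing time among the ports, or `fallback` when there
 * are no ports. The next cleanup run would be scheduled this far ahead.
 */
static unsigned long lowest_ageing_time(const unsigned long *port_ageing_time,
                                        size_t n_ports, unsigned long fallback)
{
    unsigned long lowest;
    size_t i;

    if (n_ports == 0)
        return fallback;

    lowest = port_ageing_time[0];
    for (i = 1; i < n_ports; i++) {
        if (port_ageing_time[i] < lowest)
            lowest = port_ageing_time[i];
    }
    return lowest;
}
```

Using the minimum means entries on the port with the shortest ageing time are never kept longer than that port allows, at the cost of waking up more often for the other ports.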
Ido Schimmel [Tue, 8 Mar 2016 20:59:33 +0000 (12:59 -0800)]
mlxsw: spectrum: Check requested ageing time is valid
Commit c62987bbd8a1 ("bridge: push bridge setting ageing_time down to
switchdev") added a check for minimum and maximum ageing time, but this
breaks existing behaviour where one can set ageing time to 0 for a
non-learning bridge.
Push this check down to the driver and allow the check in the bridge
layer to be removed. Currently ageing time 0 is refused by the driver,
but we can later add support for this functionality.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The stack expects link layer headers in the skb linear section.
Macvtap can create skbs with llheader in frags in edge cases:
when (IFF_VNET_HDR is off or vnet_hdr.hdr_len < ETH_HLEN) and
prepad + len > PAGE_SIZE and vnet_hdr.flags has no or bad csum.
Add checks to ensure linear is always at least ETH_HLEN.
At this point, len is already ensured to be >= ETH_HLEN.
For backwards compatibility, round up short vnet_hdr.hdr_len.
This differs from tap and packet, which return an error.
Fixes: b9fb9ee07e67 ("macvtap: add GSO/csum offload support") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
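The rounding rule in the macvtap commit above can be sketched as a clamp on the linear length. This is a simplified model, not the driver code: linear_len() is a hypothetical helper, ETH_HLEN is defined locally as 14 (the Ethernet header size), and the caller is assumed to have already ensured len >= ETH_HLEN as the commit message states.

```c
#define ETH_HLEN 14 /* Ethernet header length in bytes */

/*
 * Choose how many bytes go into the linear part of the skb.
 * A short hdr_len is rounded up to ETH_HLEN instead of being rejected,
 * and the result never exceeds the packet length.
 * Assumes len >= ETH_HLEN.
 */
static unsigned int linear_len(unsigned int hdr_len, unsigned int len)
{
    unsigned int linear = hdr_len;

    if (linear < ETH_HLEN)
        linear = ETH_HLEN;
    if (linear > len)
        linear = len;
    return linear;
}
```

Because len is at least ETH_HLEN, the second clamp can never push the result below ETH_HLEN, so the link layer header always ends up in the linear section.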
Amir Vadai [Fri, 11 Mar 2016 09:08:45 +0000 (11:08 +0200)]
net/flower: Fix pointer cast
Cast pointer to unsigned long instead of u64, to fix compilation warning
on 32 bit arch, spotted by 0day build.
Fixes: 5b33f48 ("net/flower: Introduce hardware offload support") Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 10 Mar 2016 21:24:03 +0000 (16:24 -0500)]
Merge branch 'flower-offload'
Amir Vadai says:
====================
cls_flower hardware offload support
Please see changes from V2 at the bottom.
This patchset introduces cls_flower hardware offload support over the
ConnectX-4 driver; more hardware vendors are welcome to use it too.
This patchset is based on John's infrastructure for tc offloading [2] to add
hardware offload support to the flower filter. It also extends the support to
an additional tc action - skbedit mark operation.
The NIC driver that was used is ConnectX-4. The feature is off by default and
can be turned on using ethtool.
$TC filter add dev $ETH protocol ip prio 30 parent ffff: \
flower ip_proto 6 \
indev $ETH \
action skbedit mark 0x1234
$TC filter add dev $ETH protocol ip prio 10 parent ffff: \
handle 0x1234 fw action pass
The code was tested and applied on top of commit 3ebeac1 ("Merge branch
'cxgb4-next'")
Changes from V2:
- patch 1/10 ("net/flower: Introduce hardware offload support")
- Remove unused variable [Dave]
- Don't fail command when HW can't offload filter [John]
- patch 3/10 ("net/sched: Macro instead of CONFIG_NET_CLS_ACT ifdef")
- Mention in changelog that struct tc_action is now exposed out of the ifdef.
- patch 4/10 ("net/act_skbedit: Utility functions for mark action")
- Document clearly that is_tcf_skbedit_mark() is returning true if and only
if the only action is mark [Dave]
- patch 8/10 ("net/mlx5e: Introduce tc offload support")
- make mlx5e_tc_add_flow() static
Changes from V1:
- patch 3/10 ("net/sched: Macro instead of CONFIG_NET_CLS_ACT ifdef")
- fixed return value of tc_no_actions
Changes from V0:
- Use tc_no_actions and tc_for_each_action instead of ifdef CONFIG_NET_CLS_ACT
- Replace ENOTSUPP (and some EINVAL) with EOPNOTSUPP
- Name the flower command enum
- fl_hw_destroy_filter() to return void - nobody uses the return value
- mlx5e_tc_init() and mlx5e_tc_cleanup() to be called from the right places.
- When adding HW rule fails - fail the command
- Rules are added to be processed both by HW and SW unless SKIP_HW is given
- Adding patch 6/10 ("net/mlx5e: Relax ndo_setup_tc handle restriction")
Main changes from the RFC [1]:
- API
- Using ndo_setup_tc() instead of switchdev
- act_skbedit, act_gact
- Actions are not serialized to NIC driver, instead using access functions.
- cls_flower
- prevent double classification by software by not adding
successfully offloaded filters to the hashtable
- Fixed some bugs in original RFC with rule delete
- mlx5
- Adding flow table to kernel namespace instead of a new namespace
- s/offload/tc/ in many places
- no need for a special kconfig since switchdev is not used
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Amir Vadai [Tue, 8 Mar 2016 10:42:36 +0000 (12:42 +0200)]
net/mlx5e: Introduce tc offload support
Extend ndo_setup_tc() to support ingress tc offloading. Will be used by
later patches to offload tc flower filter.
Feature is off by default and could be enabled by issuing:
# ethtool -K eth0 hw-tc-offload on
Offloads flow table is dynamically created when first filter is
added.
Rules are saved in a hash table that is maintained by the consumer (for
example - the flower offload in the next patch).
When last filter is removed and no filters exist in the hash table, the
offload flow table is destroyed.
Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Amir Vadai [Tue, 8 Mar 2016 10:42:35 +0000 (12:42 +0200)]
net/mlx5e: Add a new priority for kernel flow tables
Move the vlan and main flow tables to use priority 1. This will allow
the upcoming TC offload logic to use a higher priority (0) for the
offload steering table.
Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Amir Vadai [Tue, 8 Mar 2016 10:42:33 +0000 (12:42 +0200)]
net/mlx5_core: Set flow steering dest only for forward rules
We need to handle flow table entry destinations only if the action
associated with the rule is forwarding (MLX5_FLOW_CONTEXT_ACTION_FWD_DEST).
Fixes: 26a8145390b3 ('net/mlx5_core: Introduce flow steering firmware commands') Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Amir Vadai [Tue, 8 Mar 2016 10:42:31 +0000 (12:42 +0200)]
net/sched: Macro instead of CONFIG_NET_CLS_ACT ifdef
Introduce the macros tc_no_actions and tc_for_each_action to make code
clearer.
Extracted struct tc_action out of the ifdef to make calls to
is_tcf_gact_shot() and similar functions valid, even when it is a nop.
Acked-by: Jiri Pirko <jiri@mellanox.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Suggested-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: David S. Miller <davem@davemloft.net>
Amir Vadai [Tue, 8 Mar 2016 10:42:30 +0000 (12:42 +0200)]
net/flow_dissector: Make dissector_uses_key() and skb_flow_dissector_target() public
Will be used in a following patch to query whether a key is being used, and
what its value is in the target object.
Acked-by: John Fastabend <john.r.fastabend@intel.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: David S. Miller <davem@davemloft.net>
Amir Vadai [Tue, 8 Mar 2016 10:42:29 +0000 (12:42 +0200)]
net/flower: Introduce hardware offload support
This patch is based on a patch made by John Fastabend.
It adds support for offloading cls_flower.
When NETIF_F_HW_TC is on:
flags = 0 => Rule will be processed twice - by hardware, and if
still relevant, by software.
flags = SKIP_HW => Rule will be processed by software only
If the hardware fails or is not capable of applying the rule, the operation
will NOT fail. The filter will be processed by SW only.
Acked-by: Jiri Pirko <jiri@mellanox.com> Suggested-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Amir Vadai <amir@vadai.me> Signed-off-by: David S. Miller <davem@davemloft.net>
This series adds support for the Mediatek ethernet core found on current ARM
based SoCs. The driver works on MT2701 and MT7623 SoCs.
Instead of trying to upstream everything at once I decided to concentrate on
the important parts required to make current generation silicon work. The V3
series only includes the code required to make dual MAC setups work and only
supports the newer QDMA engine.
Changes in V5
* reduce the mdio timeout to HZ
* add a call to usleep_range() which schedules in the background.
Changes in V4
* remove ugly _FE macro, use offsetof() instead
Changes in V3
* only include code for MT2701/7623 support
* drop support for PDMA and older MIPS based SoCs
* drop switch support
Changes in V2
* change the namespace of the functions from fe_* to mtk_*
* add support for the latest generation of ARM SoCs
* add dual MAC support
* remove the swconfig specific bits
* remove most of the magic values and replace them with defines
* add verbose descriptions to the patches
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
John Crispin [Tue, 8 Mar 2016 10:29:55 +0000 (11:29 +0100)]
net-next: mediatek: add support for MT7623 ethernet
Add ethernet support for MediaTek SoCs from the MT7623 family. These have
dual GMAC. Depending on the exact version, there might be a built-in
Gigabit switch (MT7530). The core does not have the typical DMA ring setup.
Instead there is a linked list that we add descriptors to. There is only
one linked list that both MACs use together. There is a special field
inside the TX descriptors called the VQID. This allows us to assign packets
to different internal queues. By using a separate id for each MAC we are
able to get deterministic results for BQL. Additionally we need to
provide the core with a block of scratch memory that is the same size as
the RX ring and data buffer. This is really needed to make the HW datapath
work. Although the driver does not support this yet, we still need to
assign the memory and tell the core about it for RX to work.
Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: Michael Lee <igvtee@gmail.com> Signed-off-by: John Crispin <blogic@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net>
This adds the binding documentation for the MediaTek Ethernet
controller.
Signed-off-by: John Crispin <blogic@openwrt.org> Acked-by: Rob Herring <robh@kernel.org> Cc: devicetree@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
Neil Armstrong [Tue, 8 Mar 2016 09:36:20 +0000 (10:36 +0100)]
net: dsa: Fix cleanup resources upon module removal
The initial commit was badly merged into the dsa_resume method instead
of the dsa_remove_dst method.
As a consequence, dst->master_netdev->dsa_ptr is not set to NULL on
removal and re-bind of the dsa device fails with error -17.
Fixes: b0dc635d923c ("net: dsa: cleanup resources upon module removal ") Signed-off-by: Neil Armstrong <narmstrong@baylibre.com> Acked-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Manish Chopra [Tue, 8 Mar 2016 09:09:44 +0000 (04:09 -0500)]
qede: Fix net-next "make ARCH=x86_64"
'commit 55482edc25f0606851de42e73618f813f310d009
("qede: Add slowpath/fastpath support and enable hardware GRO")'
introduces below error when compiling net-next with "make ARCH=x86_64"
drivers/built-in.o: In function `qede_rx_int':
qede_main.c:(.text+0x6101a0): undefined reference to `tcp_gro_complete'
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 10 Mar 2016 21:15:54 +0000 (16:15 -0500)]
Merge branch 'qlcnic-next'
Rajesh Borundia says:
====================
qlcnic fixes
This series adds following fixes.
o While processing a mailbox request, if the driver gets a spurious mailbox
interrupt, it leads to premature completion of the next
mailbox request. Added a guard against this by checking the current
state of the mailbox and ignoring the spurious interrupt.
Added a stats counter to record this condition.
v2:
o Added patch that removes usage of atomic_t as we are not implementing
atomicity by using atomic_t value.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Rajesh Borundia [Tue, 8 Mar 2016 07:39:58 +0000 (02:39 -0500)]
qlcnic: Fix mailbox completion handling during spurious interrupt
o While the driver is in the middle of a MB completion processing
and it receives a spurious MB interrupt, it is mistaken as a good MB
completion interrupt leading to premature completion of the next MB
request. Fix the driver to guard against this by checking the current
state of MB processing and ignore the spurious interrupt.
Also added a stats counter to record this condition.
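A minimal sketch of the guard described above, as a userspace C model. The
names (struct mailbox, mbx_state, mbx_interrupt) are hypothetical and are not
taken from the qlcnic driver; the sketch only shows the idea of checking the
current mailbox state and counting interrupts that arrive when no request is
outstanding.

```c
/* Hypothetical mailbox state: is a request waiting for its completion? */
enum mbx_state { MBX_IDLE, MBX_POSTED };

struct mailbox {
	enum mbx_state state;
	unsigned long spurious;  /* stats counter for ignored interrupts */
	unsigned long completed; /* number of genuine completions */
};

/* Returns 1 if the interrupt completed a posted request, 0 if ignored. */
static int mbx_interrupt(struct mailbox *m)
{
	if (m->state != MBX_POSTED) {
		/* No request outstanding: count and ignore the interrupt. */
		m->spurious++;
		return 0;
	}
	m->state = MBX_IDLE;
	m->completed++;
	return 1;
}
```

With this check, an interrupt that arrives while the mailbox is idle cannot
complete the next request early.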
Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 10 Mar 2016 21:12:25 +0000 (16:12 -0500)]
Merge branch 'cxgb4-next'
Hariprasad Shenai says:
====================
cxgb4vf: Interrupt and queue configuration changes
This series fixes some issues and some changes in the queue and interrupt
configuration for cxgb4vf driver. We need to enable interrupts before we
register our network device, so that we don't lose link up interrupts.
Allocate rx queues based on interrupt type. Set number of tx/rx queues in
probe function only. Also adds check for some invalid configurations.
This patch series has been created against net-next tree and includes
patches on cxgb4vf driver.
We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
cxgb4vf: Configure queue based on resource and interrupt type
The Queue Set Configuration code was always reserving room for a
Forwarded interrupt Queue even in the cases where we weren't using it.
Figure out how many Ports and Queue Sets we can support. This depends on
knowing our Virtual Function Resources and may be called a second time
if we fall back from MSI-X to MSI Interrupt Mode. This change fixes that
problem.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
cxgb4vf: Enable interrupts before we register our network devices
This avoids a race condition on a system that has network devices set up
to be automatically configured, where we get the first Port Link Status
message from the firmware on the Asynchronous Firmware Event Queue before
we've enabled interrupts. If that happens, we end up losing the interrupt
and never realizing that the link has actually come up.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Mon, 7 Mar 2016 23:24:39 +0000 (18:24 -0500)]
net: dsa: mv88e6xxx: read then write PVID
The port register 0x07 contains more options than just the default VID,
even though they are not used yet. So prefer a read then write operation
over a direct write.
This also allows to keep track of the change through dynamic debug.
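A minimal sketch of the read-then-write pattern, in userspace C. The register
array, the reg_read()/reg_write() helpers and the 12-bit PVID_MASK are
assumptions made for illustration, not the driver's actual code.

```c
#include <stdint.h>

#define PORT_DEFAULT_VLAN 0x07   /* port register holding the default VID */
#define PVID_MASK         0x0fff /* assumed: low 12 bits carry the VID */

/* Hypothetical register file standing in for the switch port registers. */
static uint16_t port_regs[32];

static uint16_t reg_read(int reg)              { return port_regs[reg]; }
static void     reg_write(int reg, uint16_t v) { port_regs[reg] = v; }

/* Read the register, replace only the VID bits, then write it back. */
static void port_set_pvid(uint16_t pvid)
{
	uint16_t val = reg_read(PORT_DEFAULT_VLAN);

	val &= ~PVID_MASK;
	val |= pvid & PVID_MASK;
	reg_write(PORT_DEFAULT_VLAN, val);
}
```

Bits outside the mask keep whatever value they had before the update, which
a direct write would have cleared.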
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Mon, 7 Mar 2016 23:24:17 +0000 (18:24 -0500)]
net: dsa: mv88e6xxx: rework port state setter
Apply a few non-functional changes on the port state setter:
* add a dynamic debug message with state names to track changes
* explicit states checking instead of assuming their numeric values
* lock mutex only once when changing several port states
* use bitmap macros to declare and access port_state_update_mask
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Tested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Mon, 7 Mar 2016 22:37:09 +0000 (01:37 +0300)]
sh_eth: advance 'rxdesc' later in sh_eth_ring_format()
Iff dma_map_single() fails, 'rxdesc' should point to the last filled RX
descriptor, so that it can be marked as the last one, however the driver
would have already advanced it by that time. In order to fix that, only
fill an RX descriptor once all the data for it is ready.
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Mon, 7 Mar 2016 22:36:28 +0000 (01:36 +0300)]
sh_eth: fix NULL pointer dereference in sh_eth_ring_format()
In a low memory situation, if netdev_alloc_skb() fails on a first RX ring
loop iteration in sh_eth_ring_format(), 'rxdesc' is still NULL. Avoid
kernel oops by adding the 'rxdesc' check after the loop.
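The shape of the fix can be sketched as a userspace C model. Everything here
is hypothetical (ring_format, struct rx_desc, the RD_LAST flag and the `avail`
parameter that simulates how many buffer allocations succeed); it only shows
why the pointer has to be checked after the loop.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4
#define RD_LAST   0x40000000u /* hypothetical "last descriptor" flag */

struct rx_desc { uint32_t status; };

/* Fill descriptors until RING_SIZE is reached or allocation "fails".
 * `avail` models how many allocations succeed. Returns the number filled. */
static int ring_format(struct rx_desc *ring, int avail)
{
	struct rx_desc *rxdesc = NULL;
	int i;

	for (i = 0; i < RING_SIZE; i++) {
		if (i >= avail)     /* allocation failed: stop filling */
			break;
		rxdesc = &ring[i];  /* advance only once the data is ready */
		rxdesc->status = 1; /* descriptor handed to the hardware */
	}

	/* If the very first allocation failed, rxdesc is still NULL. */
	if (rxdesc)
		rxdesc->status |= RD_LAST;

	return i;
}
```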
Reported-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Thu, 10 Mar 2016 18:31:12 +0000 (19:31 +0100)]
kcm: mark helper functions inline
The stub helper functions for the newly added kcm_proc_init/exit interfaces
are defined as 'static' in a header file, which leads to build warnings for
each file that includes them without calling them:
include/net/kcm.h:183:12: error: 'kcm_proc_init' defined but not used [-Werror=unused-function]
include/net/kcm.h:184:13: error: 'kcm_proc_exit' defined but not used [-Werror=unused-function]
This marks the two functions as 'static inline' instead, which avoids the
warnings and is obviously what was meant here.
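For illustration, the stub pattern looks roughly like this. The function
names come from the warning text above; the empty bodies and the return value
are assumptions about what a stub for a compiled-out feature would contain.

```c
/* Header stubs for a configuration where the proc interface is compiled out.
 * 'static inline' lets every includer skip them without an unused-function
 * warning; plain 'static' would emit one per translation unit. */
static inline int kcm_proc_init(void) { return 0; }
static inline void kcm_proc_exit(void) { }
```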
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: cd6e111bf5be ("kcm: Add statistics and proc interfaces") Signed-off-by: David S. Miller <davem@davemloft.net>
this is a pull request of 5 patches for net-next/master.
Marek Vasut contributes 4 patches for the ifi CAN driver, which makes
it work on real hardware. There is one patch by Ramesh Shanmugasundaram
for the rcar_can driver that adds support for the 3rd generation IP
core.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Marek Vasut [Thu, 3 Mar 2016 19:45:58 +0000 (20:45 +0100)]
can: ifi: Add obscure bit swap for EFF frame IDs
In case of CAN2.0 EFF frame, the controller handles frame IDs in a
rather bizarre way. The ID is split into an extended part, IDX[28:11]
and standard part, ID[10:0]. In the TX path, the core first sends the
top 11 bits of the IDX, followed by ID and finally the rest of IDX.
In the RX path, the core stores the ID in the LSbit part of the IDX field,
followed by the LSbit parts of real IDX. The MSbit parts of IDX are
stored in ID field of the register.
This patch implements the necessary bit shuffling to mitigate this
obscure behavior. In case two of these controllers are connected
together, the RX and TX bit swapping nullifies itself and the issue
does not manifest. The issue only manifests when talking to another
different CAN controller.
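A simplified model of such a shuffle, in plain C. This is one reading of the
description above, assuming the register keeps the 11-bit ID field in the low
bits and the 18-bit IDX field above it; the macro and function names are
hypothetical and the real register layout may differ.

```c
#include <stdint.h>

#define EFF_MASK  0x1fffffffu /* 29-bit extended frame ID */
#define STD_WIDTH 11          /* width of ID[10:0] */
#define XTD_WIDTH 18          /* width of IDX[28:11] */
#define STD_MASK  ((1u << STD_WIDTH) - 1)
#define XTD_MASK  ((1u << XTD_WIDTH) - 1)

/* CAN ID -> register: the top 11 bits land in the low (ID) field,
 * the low 18 bits land in the high (IDX) field. */
static uint32_t eff_id_to_reg(uint32_t can_id)
{
	can_id &= EFF_MASK;
	return (can_id >> XTD_WIDTH) | ((can_id & XTD_MASK) << STD_WIDTH);
}

/* Register -> CAN ID: the inverse shuffle. */
static uint32_t eff_reg_to_id(uint32_t reg)
{
	reg &= EFF_MASK;
	return (reg >> STD_WIDTH) | ((reg & STD_MASK) << XTD_WIDTH);
}
```

Because the two functions are inverses, two controllers applying the same
swap on TX and RX would cancel each other out, which matches the observation
that the issue only shows up against a different controller.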
Signed-off-by: Marek Vasut <marex@denx.de> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Oliver Hartkopp <socketcan@hartkopp.net> Cc: Wolfgang Grandegger <wg@grandegger.com> Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Marek Vasut [Thu, 3 Mar 2016 19:45:57 +0000 (20:45 +0100)]
can: ifi: Fix RX and TX ID mask
The RX and TX ID mask for CAN2.0 is 11 bits wide. This patch fixes
the incorrect mask, which caused the CAN IDs to miss the MSBit both
on receive and transmit.
Signed-off-by: Marek Vasut <marex@denx.de> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Oliver Hartkopp <socketcan@hartkopp.net> Cc: Wolfgang Grandegger <wg@grandegger.com> Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Marek Vasut [Thu, 3 Mar 2016 19:45:56 +0000 (20:45 +0100)]
can: ifi: Fix TX DLC configuration
The TX DLC, the transmission length information, was not written
into the transmit configuration register. When using the CAN core
with different CAN controller, the receiving CAN controller will
receive only the ID part of the CAN frame, but no data at all.
This patch adds the TX DLC into the register to fix this issue.
Signed-off-by: Marek Vasut <marex@denx.de> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Oliver Hartkopp <socketcan@hartkopp.net> Cc: Wolfgang Grandegger <wg@grandegger.com> Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Marek Vasut [Thu, 3 Mar 2016 19:45:55 +0000 (20:45 +0100)]
can: ifi: Fix clock generator configuration
The clock generation does not match reality when using the CAN IP
core outside of the FPGA design. This patch fixes the computation
of values which are programmed into the clock generator registers.
First, there are some off-by-one errors which manifest themselves
only when communicating with different controller, so those are
fixed.
Second, the bits in the clock generator registers have different
meaning depending on whether the core is in ISO CANFD mode or any
of the other modes (BOSCH CANFD or CAN2.0). Detect the ISO CANFD
mode and fix handling of this special case of clock configuration.
Finally, the CAN clock speed is in CANCLOCK register, not SYSCLOCK
register, so fix this as well.
Signed-off-by: Marek Vasut <marex@denx.de> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Oliver Hartkopp <socketcan@hartkopp.net> Cc: Wolfgang Grandegger <wg@grandegger.com> Reviewed-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
bpf: avoid copying junk bytes in bpf_get_current_comm()
Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but
the result is typically passed to print("%s", buf) and extra bytes
after zero don't cause any harm.
In bpf the result of bpf_get_current_comm() is used as the part of
map key and was causing spurious hash map mismatches.
Use strlcpy() to guarantee zero-terminated string.
bpf verifier checks that output buffer is zero-initialized,
so even for short task names the output buffer doesn't have junk bytes.
Note it's not a security concern, since kprobe+bpf is root only.
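The effect can be reproduced in a small userspace sketch. copy_comm() is a
hypothetical stand-in for the kernel's strlcpy(), and keys_match() is a toy
model of comparing two map keys byte by byte; neither is the actual bpf code.

```c
#include <string.h>

#define TASK_COMM_LEN 16

/* Bounded copy that always NUL-terminates (stand-in for strlcpy()). */
static void copy_comm(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	if (size == 0)
		return;
	if (len >= size)
		len = size - 1;
	memcpy(dst, src, len);
	dst[len] = '\0';
}

/* Returns 1 if keys built from two comm buffers compare equal bytewise. */
static int keys_match(const char *comm_a, const char *comm_b, int use_memcpy)
{
	char key_a[TASK_COMM_LEN] = {0}; /* zero-initialized output buffers */
	char key_b[TASK_COMM_LEN] = {0};

	if (use_memcpy) {
		/* Copies whatever follows the terminating NUL as well. */
		memcpy(key_a, comm_a, TASK_COMM_LEN);
		memcpy(key_b, comm_b, TASK_COMM_LEN);
	} else {
		copy_comm(key_a, comm_a, TASK_COMM_LEN);
		copy_comm(key_b, comm_b, TASK_COMM_LEN);
	}
	return memcmp(key_a, key_b, TASK_COMM_LEN) == 0;
}
```

Two buffers holding the same name but different bytes after the NUL compare
unequal with the memcpy() variant and equal with the bounded string copy.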
Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors") Reported-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
bpf: bpf_stackmap_copy depends on CONFIG_PERF_EVENTS
0-day bot reported build error:
kernel/built-in.o: In function `map_lookup_elem':
>> kernel/bpf/.tmp_syscall.o:(.text+0x329b3c): undefined reference to `bpf_stackmap_copy'
when CONFIG_BPF_SYSCALL is set and CONFIG_PERF_EVENTS is not.
Add weak definition to resolve it.
This code path in map_lookup_elem() is never taken
when CONFIG_PERF_EVENTS is not set.
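A weak definition can be sketched in userspace C as follows. The function
name stackmap_copy and the -ENOTSUP return value are assumptions for this
example; the attribute syntax is the GCC/Clang one.

```c
#include <errno.h>

/* Weak fallback: the linker uses it only when no other object file
 * provides a strong definition of the same symbol. */
__attribute__((weak)) int stackmap_copy(void *map, void *key, void *value)
{
	(void)map;
	(void)key;
	(void)value;
	return -ENOTSUP;
}
```

When the real implementation is built, its strong definition overrides this
one at link time; otherwise the undefined reference is resolved by the stub.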
Fixes: 557c0c6e7df8 ("bpf: convert stackmap to pre-allocation") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Allow device-specific validation of link layer headers. Existing
checks drop all packets shorter than hard_header_len. For variable
length protocols, such packets can be valid.
patch 1 adds header_ops.validate and dev_validate_header
patch 2 implements the protocol specific callback for AX25
patch 3 replaces ll_header_truncated with dev_validate_header
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Thu, 10 Mar 2016 02:58:34 +0000 (21:58 -0500)]
packet: validate variable length ll headers
Replace link layer header validation check ll_header_truncated with
more generic dev_validate_header.
Validation based on hard_header_len incorrectly drops valid packets
in variable length protocols, such as AX25. dev_validate_header
calls header_ops.validate for such protocols to ensure correctness
below hard_header_len.
See also http://comments.gmane.org/gmane.linux.network/401064
Fixes: 9c7077622dd9 ("packet: make packet_snd fail on len smaller than l2 header") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Thu, 10 Mar 2016 02:58:33 +0000 (21:58 -0500)]
ax25: add link layer header validation function
As variable length protocol, AX25 fails link layer header validation
tests based on a minimum length. header_ops.validate allows protocols
to validate headers that are shorter than hard_header_len. Implement
this callback for AX25.
See also http://comments.gmane.org/gmane.linux.network/401064
Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Thu, 10 Mar 2016 02:58:32 +0000 (21:58 -0500)]
net: validate variable length ll headers
Netdevice parameter hard_header_len is variously interpreted both as
an upper and lower bound on link layer header length. The field is
used as upper bound when reserving room at allocation, as lower bound
when validating user input in PF_PACKET.
Clarify the definition to be maximum header length. For validation
of untrusted headers, add an optional validate member to header_ops.
Allow bypassing of validation by passing CAP_SYS_RAWIO, for instance
for deliberate testing of corrupt input. In this case, pad trailing
bytes, as some device drivers expect completely initialized headers.
See also http://comments.gmane.org/gmane.linux.network/401064
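The validation order described above can be modelled in userspace C. This is
a sketch under assumptions: the structs are cut-down stand-ins for the kernel
types, the `privileged` argument replaces the CAP_SYS_RAWIO check, and
min_two_bytes() is a made-up protocol callback.

```c
#include <stdbool.h>
#include <string.h>

struct header_ops {
	/* Optional: true if a header of `len` bytes is well formed. */
	bool (*validate)(const char *ll_header, unsigned int len);
};

struct net_device {
	unsigned int hard_header_len;        /* maximum header length */
	const struct header_ops *header_ops; /* may be NULL */
};

/* Hypothetical callback for a variable length protocol. */
static bool min_two_bytes(const char *ll_header, unsigned int len)
{
	(void)ll_header;
	return len >= 2;
}

static bool validate_header(const struct net_device *dev, char *ll_header,
			    unsigned int len, bool privileged)
{
	if (len >= dev->hard_header_len)
		return true;
	if (privileged) {
		/* Pad trailing bytes so drivers see an initialized header. */
		memset(ll_header + len, 0, dev->hard_header_len - len);
		return true;
	}
	if (dev->header_ops && dev->header_ops->validate)
		return dev->header_ops->validate(ll_header, len);
	return false;
}
```

A device without a validate callback keeps the old behaviour: anything
shorter than hard_header_len is rejected unless the caller is privileged.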
Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
The motivation for this is based on the observation that although
TCP is byte stream transport protocol with no concept of message
boundaries, a common use case is to implement a framed application
layer protocol running over TCP. To date, most TCP stacks offer
byte stream API for applications, which places the burden of message
delineation, message I/O operation atomicity, and load balancing
in the application. With KCM an application can efficiently send
and receive application protocol messages over TCP using a
datagram interface.
In order to delineate message in a TCP stream for receive in KCM, the
kernel implements a message parser. For this we chose to employ BPF
which is applied to the TCP stream. BPF code parses application layer
messages and returns a message length. Nearly all binary application
protocols are parsable in this manner, so KCM should be applicable
across a wide range of applications. Other than message length
determination in receive, KCM does not require any other application
specific awareness. KCM does not implement any other application
protocol semantics-- these are provided in userspace or could be
implemented in a kernel module layered above KCM.
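As an illustration of what such a parser computes, here is a plain C
stand-in (not BPF) for an assumed framing in which each message starts with a
2-byte big-endian payload length. The framing and the function name
parse_msg_len are assumptions for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the total message length (2-byte header + payload), or 0 if
 * more bytes are needed before the length can be determined. */
static size_t parse_msg_len(const uint8_t *buf, size_t avail)
{
	if (avail < 2)
		return 0;
	return 2 + (((size_t)buf[0] << 8) | buf[1]);
}
```

The only application-specific knowledge is where the length lives in the
stream; everything after that is generic message delineation.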
KCM implements an NxM multiplexor in the kernel as diagrammed below:
The KCM sockets provide the datagram interface to applications,
Psocks are the state for each attached TCP connection (i.e. where
message delineation is performed on receive).
A description of the APIs and design can be found in the included
Documentation/networking/kcm.txt.
In this patch set:
- Add MSG_BATCH flag. This is used in sendmsg msg_hdr flags to
indicate that more messages will be sent on the socket. The stack
may batch messages up if it is beneficial for transmission.
- In sendmmsg, set MSG_BATCH in all sub messages except for the last
one.
- In order to allow sendmmsg to contain multiple messages with
SOCK_SEQPACKET we allow each msg_hdr in the sendmmsg to set MSG_EOR.
- Add KCM module
- This supports SOCK_DGRAM and SOCK_SEQPACKET.
- KCM documentation
v2:
- Added splice and page operations.
- Assemble receive messages in place on TCP socket (don't have a
separate assembly queue).
- Based on above, enforce maximum receive message to be the size
of the receive socket buffer.
- Support message assembly timeout. Use the timeout value in
sk_rcvtimeo on the TCP socket.
- Tested some with a couple of other production applications,
see ~5% improvement in application latency.
Testing:
Dave Watson has integrated KCM into Thrift and we intend to put these
changes into open source. Example of this is in:
Some initial KCM Thrift benchmark numbers (comment from Dave)
Thrift by default ties a single connection to a single thread. KCM is
instead able to load balance multiple connections across multiple epoll
loops easily.
A test sending ~5k bytes of data to a kcm thrift server, dropping the
bytes on recv:
               QPS      Latency / std dev Latency
without KCM    70336    209/123
with KCM       70353    191/124
A test sending a small request, then doing work in the epoll thread,
before serving more requests:
               QPS      Latency / std dev Latency
without KCM    14282    559/602
with KCM       23192    344/234
At the high end, there's definitely some additional kernel overhead:
Cranking the pipelining way up, with lots of small requests
               QPS        Latency / std dev Latency
without KCM    1863429    127/119
with KCM       1337713    192/241
---
So for a "realistic" workload, KCM performs pretty well (second case).
Under extreme conditions of highest tps we still have some work to do.
In its nature a multiplexor will spread work between CPUs which is
logically good for load balancing but can conflict with the goal of
promoting affinity. Batching messages on both send and receive is
the means to recoup performance.
Future support:
- Integration with TLS (TLS-in-kernel is a separate initiative).
- Page operations/splice support
- Unconnected KCM sockets. Will be able to attach sockets to different
destinations, AF_KCM addresses will be used in sendmsg and recvmsg
to indicate destination
- Explore more utility in performing BPF inline with a TCP data stream
(setting SO_MARK, rxhash for messages being sent or received on
KCM sockets).
- Performance work
- Diagnose performance issues under high message load
FAQ (Questions posted on LWN)
Q: Why do this in the kernel?
A: Because the kernel is good at scheduling threads and steering packets
to threads. KCM fits well into this model since it allows the unit
of work for scheduling and steering to be the application layer
messages themselves. KCM should be thought of as generic application
protocol acceleration. It adheres to the philosophy that the kernel provides
generic and extensible interfaces.
Q: How can adding code in the path yield better performance?
A: It is true that for just sending or receiving a single message there
would be some performance loss since the code path is longer (for
instance comparing netperf to KCM). But for real production
applications performance takes on many dynamics. Parallelism, context
switching, affinity, granularity of locking, and load balancing are
all relevant. The theory of KCM is that by providing an application-centric
interface, the kernel can provide better support for these
performance characteristics.
Q: Why not use an existing message-oriented protocol such as RUDP,
DCCP, SCTP, RDS, and others?
A: Because that would entail using a completely new transport protocol.
Deploying a new protocol at scale is either a huge undertaking or
fundamentally infeasible. This is true in both the Internet and in
the data center, due in large part to protocol ossification.
Besides, we want KCM to work with existing, well deployed application
protocols that we couldn't change even if we wanted to (e.g. http/2).
KCM simply defines a new interface method, it does not redefine any
aspect of the transport protocol nor application protocol, nor set
any new requirements on these. Neither does KCM attempt to implement
any application protocol logic other than message delineation in the
stream. These are fundamental requirements of KCM.
Q: How does this affect TCP?
A: It doesn't, not in the slightest. The use of KCM can be one-sided,
KCM has no effect on the wire.
Q: Why force TCP into doing something it's not designed for?
A: TCP is defined as transport protocol and there is no standard that
says the API into TCP must be stream based sockets, or for that
matter sockets at all (or even that TCP needs to be implemented in a
kernel). KCM is not inconsistent with the design of TCP just because
it makes a message based interface over TCP, if it were then every
application protocol sending messages over TCP would also be! :-)
Q: What about the problem of connections with a very slow rate of
incoming data? As a result your application can get storms of very
short reads. And it actually happens a lot with connection from
mobile devices and it is a problem for servers handling a lot of
connections.
A: The storm of short reads will occur regardless of whether KCM is used
or not. KCM does have one advantage in this scenario though, it will
only wake up the application when a full message has been received,
not for each packet that makes up part of a bigger message. If a
bunch of small messages are received, the application can receive
messages in batches using recvmmsg.
Q: Why not just use DPDK, or at least provide KCM like functionality in
DPDK?
A: DPDK, or more generally OS bypass presumably with a TCP stack in
userland, presents a different model of load balancing than that of
KCM (and the kernel). KCM implements load balancing of messages
across the threads of an application, whereas DPDK load balances
based on queues which are more static and coarse-grained since
multiple connections are bound to queues. DPDK works best when
processing of packets is silo'ed in a thread on the CPU processing
a queue, and packet processing (for both the stack and application)
is fairly uniform. KCM works well for applications where the amount
of work to process messages varies and application work is commonly
delegated to worker threads often on different CPUs.
The message based interface over TCP is something that could be
provided by a DPDK or OS bypass library.
Q: I'm not quite seeing this for HTTP. Maybe for HTTP/2, I guess, or web
sockets?
A: Yes. KCM is most appropriate for message based protocols over TCP
where it is easy to deduce the message length (e.g. a length field)
and the protocol implements its own message ordering semantics.
Fortunately this encompasses many modern protocols.
Q: How is memory limited and controlled?
A: In v2 all data for messages is now kept in socket buffers, either
those for TCP or KCM, so socket buffer limits are applicable.
This includes receive message assembly which is now done on the
TCP socket buffer instead of a separate queue-- this has the
consequence that the TCP socket buffer limit provides an
enforceable maximum message size.
Additionally, a timeout may be set for message assembly. The
value used for this is taken from sk_rcvtimeo of the TCP socket.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 7 Mar 2016 22:11:11 +0000 (14:11 -0800)]
kcm: Add receive message timeout
This patch adds receive timeout for message assembly on the attached TCP
sockets. The timeout is set when a new message is started and the whole
message has not been received by TCP (not in the receive queue). If the
complete message is subsequently received the timer is cancelled; if the
timer expires the RX side is aborted.
The timeout value is taken from the socket timeout (SO_RCVTIMEO) that is
set on a TCP socket (i.e. set by setsockopt before attaching a TCP socket
to KCM).
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 7 Mar 2016 22:11:10 +0000 (14:11 -0800)]
kcm: Add memory limit for receive message construction
Message assembly is performed on the TCP socket. This is logically
equivalent of an application that performs a peek on the socket to find
out how much memory is needed for a receive buffer. The receive socket
buffer also provides the maximum message size which is checked.
The receive algorithm is something like:
1) Receive the first skbuf for a message (or skbufs if multiple are
needed to determine message length).
2) Check the message length against the number of bytes in the TCP
receive queue (tcp_inq()).
- If all the bytes of the message are in the queue (including the
skbuf received), then proceed with message assembly (it should
complete with the tcp_read_sock)
- Else, mark the psock with the number of bytes needed to
complete the message.
3) In TCP data ready function, if the psock indicates that we are
waiting for the rest of the bytes of a message, check the number
of queued bytes against that.
- If there are still not enough bytes for the message, just
return
- Else, clear the waiting bytes and proceed to receive the
skbufs. The message should now be received in one
tcp_read_sock
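Steps 2 and 3 above can be sketched as a small state machine in userspace C.
The names (struct psock with a need_bytes field, rx_msg_len_known,
rx_data_ready) are hypothetical, and `queued` is assumed to already include
the bytes of the skbuf that was just received.

```c
#include <stddef.h>

struct psock {
	size_t need_bytes; /* 0: not waiting; else bytes needed to finish */
};

enum rx_action { RX_WAIT, RX_ASSEMBLE };

/* Step 2: the message length is known; compare it with queued bytes. */
static enum rx_action rx_msg_len_known(struct psock *ps, size_t msg_len,
				       size_t queued)
{
	if (queued >= msg_len)
		return RX_ASSEMBLE;
	ps->need_bytes = msg_len; /* remember what we are waiting for */
	return RX_WAIT;
}

/* Step 3: data-ready callback, possibly while a message is pending. */
static enum rx_action rx_data_ready(struct psock *ps, size_t queued)
{
	if (ps->need_bytes) {
		if (queued < ps->need_bytes)
			return RX_WAIT; /* still not enough bytes */
		ps->need_bytes = 0;     /* clear the waiting state */
	}
	return RX_ASSEMBLE;
}
```

The point of the waiting state is that partial data never triggers assembly,
so a message is normally pulled from the TCP socket in a single pass.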
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 7 Mar 2016 22:11:07 +0000 (14:11 -0800)]
kcm: Add statistics and proc interfaces
This patch adds various counters for KCM. These include counters for
messages and bytes received or sent, as well as counters for number of
attached/unattached TCP sockets and other error or edge events.
The statistics are exposed via a proc interface. /proc/net/kcm provides
statistics per KCM socket and per psock (attached TCP sockets).
/proc/net/kcm_stats provides aggregate statistics.
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 7 Mar 2016 22:11:06 +0000 (14:11 -0800)]
kcm: Kernel Connection Multiplexor module
This module implements the Kernel Connection Multiplexor.
Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
With KCM an application can efficiently send and receive application
protocol messages over TCP using datagram sockets.
For more information see the included Documentation/networking/kcm.txt
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 7 Mar 2016 22:11:05 +0000 (14:11 -0800)]
tcp: Add tcp_inq to get available receive bytes on socket
Create a common kernel function to get the number of bytes available
on a TCP socket. This is based on code in INQ getsockopt and we now call
the function for that getsockopt.
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 7 Mar 2016 22:11:03 +0000 (14:11 -0800)]
net: Add MSG_BATCH flag
Add a new msg flag called MSG_BATCH. This flag is used in sendmsg to
indicate that more messages will follow (i.e. a batch of messages is
being sent). This is similar to MSG_MORE except that the following
messages are not merged into one packet, they are sent individually.
sendmmsg is updated so that each contained message except for the
last one is marked as MSG_BATCH.
MSG_BATCH is a performance optimization in cases where a socket
implementation can benefit by transmitting packets in a batch.
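The sendmmsg rule can be sketched as a loop over per-message flags. This is
an illustration only: EX_MSG_BATCH is a local constant defined for the
example and mark_batch() is a hypothetical helper, not the kernel code.

```c
#include <stddef.h>

#define EX_MSG_BATCH 0x40000u /* illustrative flag value for this sketch */

/* Mark every message of a batch except the last with the batch flag. */
static void mark_batch(unsigned int *flags, size_t n)
{
	size_t i;

	for (i = 0; i + 1 < n; i++)
		flags[i] |= EX_MSG_BATCH;
}
```

The last message is left unmarked so the receiver of the flags knows that
nothing more follows and any held-back data can be transmitted.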
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Mon, 7 Mar 2016 22:11:02 +0000 (14:11 -0800)]
net: Allow MSG_EOR in each msghdr of sendmmsg
This patch allows setting MSG_EOR in each individual msghdr passed
in sendmmsg. This allows a sendmmsg to send multiple messages when
using SOCK_SEQPACKET.
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
this test calls bpf programs from different contexts:
from inside of slub, from rcu, from pretty much everywhere,
since it kprobes all spin_lock functions.
It stresses the bpf hash and percpu map pre-allocation,
deallocation logic and call_rcu mechanisms.
The user space part adds more stress by walking and deleting map elements.
Note that due to the nature of bpf_load.c the earlier kprobe+bpf programs are
already active while loader loads new programs, creates new kprobes and
attaches them.
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Tue, 8 Mar 2016 22:36:03 +0000 (23:36 +0100)]
ip_tunnel, bpf: ip_tunnel_info_opts_{get, set} depends on CONFIG_INET
Helpers like ip_tunnel_info_opts_{get,set}() are only available if
CONFIG_INET is set, thus add an empty definition into the header for
the !CONFIG_INET case, where already other empty inline helpers are
defined.
This avoids ifdef kludge inside filter.c, but also vxlan and geneve
themselves, where this facility can only be used with, depend on INET
being set. For the !INET case TUNNEL_OPTIONS_PRESENT would never be
set in flags.
Fixes: 14ca0751c96f ("bpf: support for access to tunnel options") Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>