consume_skb() isn't for error cases that kfree_skb() is more proper
one. At this patch, it fixed tpacket_rcv() and packet_rcv() to be
consistent for error or non-error cases letting perf trace its event
properly.
Signed-off-by: Weongyo Jeong <weongyo.linux@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tipc: fix a race condition leading to subscriber refcnt bug
Until now, the requests sent to topology server are queued
to a workqueue by the generic server framework.
These messages are processed by worker threads and trigger the
registered callbacks.
To reduce latency on uniprocessor systems, explicit rescheduling
is performed using cond_resched() after MAX_RECV_MSG_COUNT(25)
messages.
This implementation on SMP systems leads to an subscriber refcnt
error as described below:
When a worker thread yields by calling cond_resched() in a SMP
system, a new worker is created on another CPU to process the
pending workitem. Sometimes the sleeping thread wakes up before
the new thread finishes execution.
This breaks the assumption on ordering and being single threaded.
The fault is more frequent when MAX_RECV_MSG_COUNT is lowered.
If the first thread was processing subscription create and the
second thread processing close(), the close request will free
the subscriber and the create request oops as follows:
In this commit, we
- rename tipc_conn_shutdown() to tipc_conn_release().
- move connection release callback execution from tipc_close_conn()
to a new function tipc_sock_release(), which is executed before
we free the connection.
Thus we release the subscriber during connection release procedure
rather than connection shutdown procedure.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 14 Apr 2016 20:23:42 +0000 (16:23 -0400)]
Merge branch 'gro-fixed-id-gso-partial'
Alexander Duyck says:
====================
GRO Fixed IPv4 ID support and GSO partial support
This patch series sets up a few different things.
First it adds support for GRO of frames with a fixed IP ID value. This
will allow us to perform GRO for frames that go through things like an IPv6
to IPv4 header translation.
The second item we add is support for segmenting frames that are generated
this way. Most devices only support an incrementing IP ID value, and in
the case of TCP the IP ID can be ignored in many cases since the DF bit
should be set. So we can technically segment these frames using existing
TSO if we are willing to allow the IP ID to be mangled. As such I have
added a matching feature for the new form of GRO/GSO called TCP IPv4 ID
mangling. With this enabled we can assemble and disassemble a frame with
the sequence number fixed and the only ill effect will be that the IPv4 ID
will be altered which may or may not have any noticeable effect. As such I
have defaulted the feature to disabled.
The third item this patch series adds is support for partial GSO
segmentation. Partial GSO segmentation allows us to split a large frame
into two pieces. The first piece will have an even multiple of MSS worth
of data and the headers before the one pointed to by csum_start will have
been updated so that they are correct for if the data payload had already
been segmented. By doing this we can do things such as precompute the
outer header checksums for a frame to be segmented allowing us to perform
TSO on devices that don't support tunneling, or tunneling with outer header
checksums.
This patch set is based on the net-next tree, but I included "net: remove
netdevice gso_min_segs" in my tree as I assume it is likely to be applied
before this patch set will and I wanted to avoid a merge conflict.
v2: Fixed items reported by Jesse Gross
fixed missing GSO flag in MPLS check
adding DF check for MANGLEID
Moved extra GSO feature checks into gso_features_check
Rebased batches to account for "net: remove netdevice gso_min_segs"
Driver patches from the first patch set should still be compatible. However
I do have a few changes in them so I will submit a v2 of those to Jeff
Kirsher once these patches are accepted into net-next.
Example driver patches for i40e, ixgbe, and igb:
https://patchwork.ozlabs.org/patch/608221/
https://patchwork.ozlabs.org/patch/608224/
https://patchwork.ozlabs.org/patch/608225/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 11 Apr 2016 01:45:09 +0000 (21:45 -0400)]
Documentation: Add documentation for TSO and GSO features
This document is a starting point for defining the TSO and GSO features.
The whole thing is starting to get a bit messy so I wanted to make sure we
have notes somwhere to start describing what does and doesn't work.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 11 Apr 2016 01:45:03 +0000 (21:45 -0400)]
GSO: Support partial segmentation offload
This patch adds support for something I am referring to as GSO partial.
The basic idea is that we can support a broader range of devices for
segmentation if we use fixed outer headers and have the hardware only
really deal with segmenting the inner header. The idea behind the naming
is due to the fact that everything before csum_start will be fixed headers,
and everything after will be the region that is handled by hardware.
With the current implementation it allows us to add support for the
following GSO types with an inner TSO_MANGLEID or TSO6 offload:
NETIF_F_GSO_GRE
NETIF_F_GSO_GRE_CSUM
NETIF_F_GSO_IPIP
NETIF_F_GSO_SIT
NETIF_F_UDP_TUNNEL
NETIF_F_UDP_TUNNEL_CSUM
In the case of hardware that already supports tunneling we may be able to
extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
hardware can support updating inner IPv4 headers.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 11 Apr 2016 01:44:57 +0000 (21:44 -0400)]
GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID values
This patch does two things.
First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field. As
a result we should now be able to aggregate flows that were converted from
IPv6 to IPv4. In addition this allows us more flexibility for future
implementations of segmentation as we may be able to use a fixed IP ID when
segmenting the flow.
The second thing this does is that it places limitations on the outer IPv4
ID header in the case of tunneled frames. Specifically it forces the IP ID
to be incrementing by 1 unless the DF bit is set in the outer IPv4 header.
This way we can avoid creating overlapping series of IP IDs that could
possibly be fragmented if the frame goes through GRO and is then
resegmented via GSO.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 11 Apr 2016 01:44:51 +0000 (21:44 -0400)]
GSO: Add GSO type for fixed IPv4 ID
This patch adds support for TSO using IPv4 headers with a fixed IP ID
field. This is meant to allow us to do a lossless GRO in the case of TCP
flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
headers.
In addition I am adding a feature that for now I am referring to TSO with
IP ID mangling. Basically when this flag is enabled the device has the
option to either output the flow with incrementing IP IDs or with a fixed
IP ID regardless of what the original IP ID ordering was. This is useful
in cases where the DF bit is set and we do not care if the original IP ID
value is maintained.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 11 Apr 2016 01:44:44 +0000 (21:44 -0400)]
ethtool: Add support for toggling any of the GSO offloads
The strings were missing for several of the GSO offloads that are
available. This patch provides the missing strings so that we can toggle
or query any of them via the ethtool command.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 14 Apr 2016 20:22:12 +0000 (16:22 -0400)]
Merge branch 'mlxsw-devlink-shared-buffers'
Jiri Pirko says:
====================
devlink + mlxsw: add support for config and control of shared buffers
ASICs implement shared buffer for packet forwarding purposes and enable
flexible partitioning of the shared buffer for different flows and ports,
enabling non-blocking progress of different flows as well as separation
of lossy traffic from loss-less traffic when using Per-Priority Flow
Control (PFC). The shared buffer optimizes the buffer utilization for better
absorption of packet bursts.
This patchset implements API which is based on the model SAI uses. That is
aligned with multiple ASIC vendors so this API should be vendor neutral.
Userspace counterpart patchset for devlink iproute2 tool can be found here:
https://github.com/jpirko/iproute2_mlxsw/tree/devlink_sb
Couple of examples of usage:
switch$ devlink sb help
Usage: devlink sb show [ DEV [ sb SB_INDEX ] ]
devlink sb pool show [ DEV [ sb SB_INDEX ] pool POOL_INDEX ]
devlink sb pool set DEV [ sb SB_INDEX ] pool POOL_INDEX
size POOL_SIZE thtype { static | dynamic }
devlink sb port pool show [ DEV/PORT_INDEX [ sb SB_INDEX ]
pool POOL_INDEX ]
devlink sb port pool set DEV/PORT_INDEX [ sb SB_INDEX ]
pool POOL_INDEX th THRESHOLD
devlink sb tc bind show [ DEV/PORT_INDEX [ sb SB_INDEX ] tc TC_INDEX ]
devlink sb tc bind set DEV/PORT_INDEX [ sb SB_INDEX ] tc TC_INDEX
type { ingress | egress } pool POOL_INDEX
th THRESHOLD
devlink sb occupancy show { DEV | DEV/PORT_INDEX } [ sb SB_INDEX ]
devlink sb occupancy snapshot DEV [ sb SB_INDEX ]
devlink sb occupancy clearmax DEV [ sb SB_INDEX ]
Implement occupancy API introduced in devlink and mlxsw core. This is
done by accessing SBPM register for Port-Pool and SBSR for Port-TC
current and max occupancy values. Max clear is implemented using the
same registers.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
mlxsw: core: Introduce support for asynchronous EMAD register access
So far it was possible to have one EMAD register access at a time,
locked by mutex. This patch extends this interface to allow multiple
EMAD register accesses to be in fly at once. That allows faster
processing on firmware side avoiding unused time in between EMADs.
Measured speedup is ~30% for shared occupancy snapshot operation.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
mlxsw: core: Add mlxsw specific workqueue and use it for FDB notif. processing
Follow-up patch is going to need to use delayed work as well and
frequently. The FDB notification processing is already using that and
also quite frequently. It makes sense to create separate workqueue just
for mlxsw driver in this case and do not pollute system_wq.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
mlxsw: reg: Extend SBPM register for occupancy control
Since it is not possible to get and clear Port-Pool occupancy data using
SBSR register, there's a need to implement that using SBPM.
Extend pack helper and add unpack helper to get occupancy values.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
mlxsw: spectrum_buffers: Get max_buff defaults into limits exposed to user
Although the device supports max_buff magic values 0 and 0xff, these are
not exposed to the user via devlink.
Therefore, adjust the default values to be within configurable range.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
mlxsw: spectrum_buffers: Change initialization of PG 9
As explained in commit ff6551ec0c27 ("mlxsw: spectrum: Correctly
configure headroom size") control packets are directed to priority group
buffer 9 (PG9) in the ports' headroom buffers.
Since we don't want to drop control packets in case they can't be
admitted to the switch's shared buffer we bind PG9 to a different
ingress pool from the one used by all other PGs.
Unlike other PGs, we currently don't expose the binding between PG9 to a
pool and leave it fixed.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
mlxsw: spectrum_buffers: Remove eg pool 3 default init and CPU port TC binding to it
Since there is no congestion control for CPU port traffic, we can change
the CPU port TC binding to pool 0 with min_buff and max_buff zeroed.
Remove initialization for pool egress pool 3 since it is no longer used
by dafault.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
In order to achieve faster dumping of current setting and also in order
to provide possibility to get pool mode without a need to query hardware,
do cache the configuration in driver.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
User needs to monitor shared buffer occupancy. For that, he issues a
snapshot command in order to instruct hardware to catch current and
maximal occupancy values, and clear command in order to clear the
historical maximal values.
Also port-pool and tc-pool-bind command response messages are extended to
carry occupancy values.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Define userspace API and drivers API for configuration of shared
buffers. Four basic objects are defined:
shared buffer - attributes are size, number of pools and TCs
pool - chunk of sharedbuffer definition, it has some size and either
static or dynamic threshold
port pool threshold - to set per-port threshold for each pool
port tc threshold bind - to bind port and TC to specified pool
with threshold.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Sun, 10 Apr 2016 20:55:15 +0000 (23:55 +0300)]
ravb: make ravb_ptp_interrupt() *void*
When we have the ISS.CGIS bit set, we already know that gPTP interrupt has
happened, so an extra GIS register check at the end of ravb_ptp_interrupt()
seems superfluous. We can model the gPTP interrupt handler like all other
dedicated interrupt handlers in the driver and make it *void*.
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Adds support for the following via ethtool:
- UDP configuration of RSS based on 2-tuple/4-tuple.
- RSS hash key.
- RSS indirection table.
Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Adds the required API for passing RSS-related configuration from qede.
Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Inbox drivers don't need versioning scheme in order to guarantee
compatibility, as both qed and qede are compiled from same codebase.
Signed-off-by: Rahul Verma <rahul.verma@qlogic.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
bcmgenet_isr1() and bcmgenet_isr0() run in hard irq context,
we do not need to block irq again.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Petri Gynther <pgynther@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sat, 9 Apr 2016 05:06:40 +0000 (22:06 -0700)]
net: bcmgenet: use napi_complete_done()
By using napi_complete_done(), we allow fine tuning
of /sys/class/net/ethX/gro_flush_timeout for higher GRO aggregation
efficiency for a Gbit NIC.
Check commit 24d2e4a50737 ("tg3: use napi_complete_done()") for details.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Petri Gynther <pgynther@google.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Petri Gynther <pgynther@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 14 Apr 2016 03:04:44 +0000 (23:04 -0400)]
Merge branch 'sctp-delayed-wakeups'
Marcelo Ricardo Leitner says:
====================
sctp: delay calls to sk_data_ready() as much as possible
1st patch is a preparation for the 2nd. The idea is to not call
->sk_data_ready() for every data chunk processed while processing
packets but only once before releasing the socket.
v2: patchset re-checked, small changelog fixes
v3: on patch 2, make use of local vars to make it more readable
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
sctp: delay calls to sk_data_ready() as much as possible
Currently processing of multiple chunks in a single SCTP packet leads to
multiple calls to sk_data_ready, causing multiple wake up signals which
are costy and doesn't make it wake up any faster.
With this patch it will note that the wake up is pending and will do it
before leaving the state machine interpreter, latest place possible to
do it realiably and cleanly.
Note that sk_data_ready events are not dependent on asocs, unlike waking
up writers.
v2: series re-checked
v3: use local vars to cleanup the code, suggested by Jakub Sitnicki Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
sctp: compress bit-wide flags to a bitfield on sctp_sock
It wastes space and gets worse as we add new flags, so convert bit-wide
flags to a bitfield.
Currently it already saves 4 bytes in sctp_sock, which are left as holes
in it for now. The whole struct needs packing, which should be done in
another patch.
Note that do_auto_asconf cannot be merged, as explained in the comment
before it.
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
drivers/net/ethernet/jme.c: Deinline jme_reset_mac_processor, save 2816 bytes
This function compiles to 895 bytes of machine code.
Clearly, this isn't a time-critical function.
For one, it has a number of udelay(1) calls.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> CC: David S. Miller <davem@davemloft.net> CC: linux-kernel@vger.kernel.org CC: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 14 Apr 2016 02:42:33 +0000 (22:42 -0400)]
Merge branch 'bridge-sysfs-rtnl-notifications'
Xin Long says:
====================
bridge: support sending rntl info when we set attributes through sysfs/ioctl
This patchset is used to support sending rntl info to user in some places,
and ensure that whenever those attributes change internally or from sysfs,
that a netlink notification is sent out to listeners.
It also make some adjustment in bridge sysfs so that we can implement this
easily.
I've done some tests on this patchset, like:
[br_sysfs]
1. change all the attribute values of br or brif:
$ echo $value > /sys/class/net/br0/bridge/{*}
$ echo $value > /sys/class/net/br0/brif/eth1/{*}
2. meanwhile, on another terminal to observe the msg:
$ bridge monitor
[br_ioctl]
1. in bridge-utils package, do some changes in br_set, let brctl command
use ioctl to set attribute:
if ((ret = set_sysfs(path, value)) < 0) { -->
if (1) {
$ brctl set*
2. meanwhile, on another terminal to observe the msg:
$ bridge monitor
This test covers all the attributes that brctl and sysfs support to set.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Apr 2016 16:03:33 +0000 (00:03 +0800)]
bridge: a netlink notification should be sent when those attributes are changed by ioctl
Now when we change the attributes of bridge or br_port by netlink,
a relevant netlink notification will be sent, but if we change them
by ioctl or sysfs, no notification will be sent.
We should ensure that whenever those attributes change internally or from
sysfs/ioctl, that a netlink notification is sent out to listeners.
Also, NetworkManager will use this in the future to listen for out-of-band
bridge master attribute updates and incorporate them into the runtime
configuration.
This patch is used for ioctl.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Apr 2016 16:03:32 +0000 (00:03 +0800)]
bridge: a netlink notification should be sent when those attributes are changed by br_sysfs_if
Now when we change the attributes of bridge or br_port by netlink,
a relevant netlink notification will be sent, but if we change them
by ioctl or sysfs, no notification will be sent.
We should ensure that whenever those attributes change internally or from
sysfs/ioctl, that a netlink notification is sent out to listeners.
Also, NetworkManager will use this in the future to listen for out-of-band
bridge master attribute updates and incorporate them into the runtime
configuration.
This patch is used for br_sysfs_if, and we also move br_ifinfo_notify out
of store_flag.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Apr 2016 16:03:31 +0000 (00:03 +0800)]
bridge: a netlink notification should be sent when those attributes are changed by br_sysfs_br
Now when we change the attributes of bridge or br_port by netlink,
a relevant netlink notification will be sent, but if we change them
by ioctl or sysfs, no notification will be sent.
We should ensure that whenever those attributes change internally or from
sysfs/ioctl, that a netlink notification is sent out to listeners.
Also, NetworkManager will use this in the future to listen for out-of-band
bridge master attribute updates and incorporate them into the runtime
configuration.
This patch is used for br_sysfs_br. and we also need to remove some
rtnl_trylock in old functions so that we can call it in a common one.
For group_addr_store, we cannot make it use store_bridge_parm, because
it's not a string-to-long convert, we will add notification on it
individually.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Apr 2016 16:03:30 +0000 (00:03 +0800)]
bridge: simplify the stp_state_store by calling store_bridge_parm
There are some repetitive codes in stp_state_store, we can remove
them by calling store_bridge_parm.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Apr 2016 16:03:29 +0000 (00:03 +0800)]
bridge: simplify the forward_delay_store by calling store_bridge_parm
There are some repetitive codes in forward_delay_store, we can remove
them by calling store_bridge_parm.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Fri, 8 Apr 2016 16:03:28 +0000 (00:03 +0800)]
bridge: simplify the flush_store by calling store_bridge_parm
There are some repetitive codes in flush_store, we can remove
them by calling store_bridge_parm, also, it would send rtnl notification
after we add it in store_bridge_parm in the following patches.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: force inlining of netif_tx_start/stop_queue, sock_hold, __sock_put
Sometimes gcc mysteriously doesn't inline
very small functions we expect to be inlined. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
Arguably, gcc should do better, but gcc people aren't willing
to invest time into it, asking to use __always_inline instead.
With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
the following functions get deinlined many times.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> CC: David S. Miller <davem@davemloft.net> CC: linux-kernel@vger.kernel.org CC: netdev@vger.kernel.org CC: netfilter-devel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 8 Apr 2016 13:55:00 +0000 (15:55 +0200)]
ipv6, token: allow for clearing the current device token
The original tokenized iid support implemented via f53adae4eae5 ("net: ipv6:
add tokenized interface identifier support") didn't allow for clearing a
device token as it was intended that this addressing mode was the only one
active for globally scoped IPv6 addresses. Later we relaxed that restriction
via 617fe29d45bd ("net: ipv6: only invalidate previously tokenized addresses"),
and we should also allow for clearing tokens as there's no good reason why
it shouldn't be allowed.
Fixes: 617fe29d45bd ("net: ipv6: only invalidate previously tokenized addresses") Reported-by: Robin H. Johnson <robbat2@gentoo.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
sock: tigthen lockdep checks for sock_owned_by_user
sock_owned_by_user should not be used without socket lock held. It seems
to be a common practice to check .owned before lock reclassification, so
provide a little help to abstract this check away.
Cc: linux-cifs@vger.kernel.org Cc: linux-bluetooth@vger.kernel.org Cc: linux-nfs@vger.kernel.org Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
On GMAC4.xx each descriptor contains 2 buffers of 16KB (each).
Initially, those 2 buffers was filled in dwmac4_rd_prepare_tx_desc but
it is actually not needed. Indeed, stmmac driver supports frame up to
9000 bytes (jumbo). So only one buffer is needed.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 14 Apr 2016 02:24:52 +0000 (22:24 -0400)]
Merge branch 'udp-hdrs-fixes'
Willem de Bruijn says:
====================
fix two more udp pull header issues
Follow up patches to the fixes to RxRPC and SunRPC. A scan of the
code showed two other interfaces that expect UDP packets to have
a udphdr when queued: read packet length with ioctl SIOCINQ and
receive payload checksum with socket option IP_CHECKSUM.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
udp: do not expect udp headers in recv cmsg IP_CMSG_CHECKSUM
On udp sockets, recv cmsg IP_CMSG_CHECKSUM returns a checksum over
the packet payload. Since commit e6afc8ace6dd pulled the headers,
taking skb->data as the start of transport header is incorrect. Use
the transport header pointer.
Also, when peeking at an offset from the start of the packet, only
return a checksum from the start of the peeked data. Note that the
cmsg does not subtract a tail checkum when reading truncated data.
Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
On udp sockets, ioctl SIOCINQ returns the payload size of the first
packet. Since commit e6afc8ace6dd pulled the headers, the result is
incorrect when subtracting header length. Remove that operation.
Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 13 Apr 2016 22:15:24 +0000 (18:15 -0400)]
Merge branch 'dsa-refactoring-set-1'
Andrew Lunn says:
====================
DSA refactoring: set 1
There has been a long running effort to refractor DSA probing to make
the switches true linux devices. Here are a small collection of
patches moving in this direction. Most have been seen before.
We take a little step forward by passing the dsa device point to the
driver, thus allowing it to perform resource allocations using the
normal mechanisms. This device structure will later be replaced by the
devices own device structure.
Future patches will add a true driver probe function, so we rename the
current probe function, cleaning up the namespace.
phys_port_mask continually confuses me, thinking it is about PHYs. But
it is actually about ports enabled to the outside world. So rename it to
enabled_port_mask.
Lots more patches yet to follow, this is just doing some ground work.
v2:
enabled_port_mask instread of user_port_masks
Added Tested-by's and Reviewed-by.
====================
Tested-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Wed, 13 Apr 2016 00:40:45 +0000 (02:40 +0200)]
dsa: mv88e6xxx: Use bus in mv88e6xxx_lookup_name()
mv88e6xxx_lookup_name() returns the model name of a switch at a given
address on an MII bus. Using mii_bus to identify the bus rather than
the host device is more logical, so change the parameter.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Wed, 13 Apr 2016 00:40:44 +0000 (02:40 +0200)]
dsa: Rename phys_port_mask to enabled_port_mask
The phys in phys_port_mask suggests this mask is about PHYs. In fact,
it means physical ports. Rename to enabled_port_mask, indicating
external enabled ports of the switch, which is hopefully less
confusing.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Wed, 13 Apr 2016 00:40:43 +0000 (02:40 +0200)]
net: dsa: Rename DSA probe function.
Rename the function called from the DSA to perform a probe for the
switch. This makes the normal _probe() name available for a standard
Linux device driver probe function.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Wed, 13 Apr 2016 00:40:42 +0000 (02:40 +0200)]
net: dsa: Keep the mii bus and address in the private structure
Rather than looking up the mii bus and address every time, do it once
at probe, and keep it in the private structure. Centralise this probe
code in mv88e6xxx.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Wed, 13 Apr 2016 00:40:41 +0000 (02:40 +0200)]
net: dsa: Remove allocation of driver private memory
The drivers now allocate their own memory for private usage. Remove
the allocation from the core code.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Wed, 13 Apr 2016 00:40:40 +0000 (02:40 +0200)]
net: dsa: Have the switch driver allocate there own private memory
Now the switch devices have a dev pointer, make use of it for allocating
the drivers private data structures using a devm_kzalloc().
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Wed, 13 Apr 2016 00:40:39 +0000 (02:40 +0200)]
net: dsa: Pass the dsa device to the switch drivers
By passing a device structure to the switch devices, it allows them
to use devm_* methods for resource management.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 13 Apr 2016 21:58:51 +0000 (17:58 -0400)]
Merge tag 'mac80211-next-for-davem-2016-04-13' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Johannes Berg says:
====================
To synchronize with Kalle, here's just a big change that affects
all drivers - removing the duplicated enum ieee80211_band and
replacing it by enum nl80211_band. On top of that, just a small
documentation update.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Wed, 13 Apr 2016 15:45:47 +0000 (11:45 -0400)]
tipc: remove remnants of old broadcast code
We remove a couple of leftover fields in struct tipc_bearer. Those
were used by the old broadcast implementation, and are not needed
any longer. There is no functional changes in this commit.
Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 13 Apr 2016 02:41:33 +0000 (22:41 -0400)]
Merge branch 'mediatek-stress-test-fixes'
John Crispin says:
====================
net: mediatek: make the driver pass stress tests
While testing the driver we managed to get the TX path to stall and fail
to recover. When dual MAC support was added to the driver, the whole queue
stop/wake code was not properly adapted. There was also a regression in the
locking of the xmit function. The fact that watchdog_timeo was not set and
that the tx_timeout code failed to properly reset the dma, irq and queue
just made the mess complete.
This series make the driver pass stress testing. With this series applied
the testbed has been running for several days and still has not locked up.
We have a second setup that has a small hack patch applied to randomly stop
irqs and/or one of the queues and successfully manages to recover from these
simulated tx stalls.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
John Crispin [Thu, 7 Apr 2016 22:54:11 +0000 (00:54 +0200)]
net: mediatek: do not set the QID field in the TX DMA descriptors
The QID field gets set to the mac id. This made the DMA linked list queue
the traffic of each MAC on a different internal queue. However during long
term testing we found that this will cause traffic stalls as the multi
queue setup requires a more complete initialisation which is not part of
the upstream driver yet.
This patch removes the code setting the QID field, resulting in all
traffic ending up in queue 0 which works without any special setup.
Signed-off-by: John Crispin <blogic@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net>
John Crispin [Thu, 7 Apr 2016 22:54:10 +0000 (00:54 +0200)]
net: mediatek: move the pending_work struct to the device generic struct
The worker always touches both netdevs. It is ethernet core and not MAC
specific. We only need one worker, which belongs into the ethernets core
struct.
Signed-off-by: John Crispin <blogic@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net>
John Crispin [Thu, 7 Apr 2016 22:54:09 +0000 (00:54 +0200)]
net: mediatek: fix mtk_pending_work
The driver supports 2 MACs. Both run on the same DMA ring. If we hit a TX
timeout we need to stop both netdevs before restarting them again. If we
don't do this, mtk_stop() wont shutdown DMA and the consecutive call to
mtk_open() wont restart DMA and enable IRQs.
Signed-off-by: John Crispin <blogic@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net>
John Crispin [Thu, 7 Apr 2016 22:54:08 +0000 (00:54 +0200)]
net: mediatek: fix TX locking
Inside the TX path there is a lock inside the tx_map function. This is
however too late. The patch moves the lock to the start of the xmit
function right before the free count check of the DMA ring happens.
If we do not do this, the code becomes racy leading to TX stalls and
dropped packets. This happens as there are 2 netdevs running on the
same physical DMA ring.
Signed-off-by: John Crispin <blogic@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net>
John Crispin [Thu, 7 Apr 2016 22:54:07 +0000 (00:54 +0200)]
net: mediatek: fix stop and wakeup of queue
The driver supports 2 MACs. Both run on the same DMA ring. If we go
above/below the TX rings threshold value, we always need to wake/stop
the queue of both devices. Not doing to can cause TX stalls and packet
drops on one of the devices.
Signed-off-by: John Crispin <blogic@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net>
John Crispin [Thu, 7 Apr 2016 22:54:05 +0000 (00:54 +0200)]
net: mediatek: mtk_cal_txd_req() returns bad value
The code used to also support the PDMA engine, which had 2 packet pointers
per descriptor. Because of this we had to divide the result by 2 and round
it up. This is no longer needed as the code only supports QDMA.
Signed-off-by: John Crispin <blogic@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Here's a set of Bluetooth & 802.15.4 patches intended for the 4.7 kernel:
- Fix for race condition in vhci driver
- Memory leak fix for ieee802154/adf7242 driver
- Improvements to deal with single-mode (LE-only) Bluetooth controllers
- Fix for allowing the BT_SECURITY_FIPS security level
- New BCM2E71 ACPI ID
- NULL pointer dereference fix fox hci_ldisc driver
Let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 12 Apr 2016 13:56:15 +0000 (15:56 +0200)]
cfg80211: remove enum ieee80211_band
This enum is already perfectly aliased to enum nl80211_band, and
the only reason for it is that we get IEEE80211_NUM_BANDS out of
it. There's no really good reason to not declare the number of
bands in nl80211 though, so do that and remove the cfg80211 one.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
The roaming cases for the Connect command were not fully covered and
neither Connect nor Associate command uses of the prev_bssid parameter
were very clear. Add details to describe how the prev_bssid argument is
supposed to be used and when the driver should use association or
reassociation.
Signed-off-by: Jouni Malinen <jouni@qca.qualcomm.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Andrew Lunn [Mon, 11 Apr 2016 19:40:05 +0000 (21:40 +0200)]
net: mdio: Fix lockdep falls positive splat
MDIO devices can be stacked upon each other. The current code supports
two levels, which until recently has been enough for a DSA mdio bus on
top of another bus. Now we have hardware which has an MDIO mux in the
middle.
Define an MDIO MUTEX class with three levels.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 11 Apr 2016 19:34:42 +0000 (15:34 -0400)]
Merge branch 'rprpc-2nd-rewrite-part-1'
David Howells says:
====================
RxRPC: 2nd rewrite part 1
Okay, I'm in the process of rewriting the RxRPC rewrite. The primary aim of
this second rewrite is to strictly control the number of active connections we
know about and to get rid of connections we don't need much more quickly.
On top of this, there are fixes to the protocol handling which will all occur
in later parts.
Here's the first set of patches from the second go, aimed at net-next. These
are all fixes and cleanups preparatory to the main event.
Notable parts of this set include:
(1) A fix for the AFS filesystem to wait for outstanding calls to complete
before closing the RxRPC socket.
(2) Differentiation of local and remote abort codes. At a future point
userspace will get to see this via control message data on recvmsg().
(3) Absorb the rxkad module into the af_rxrpc module to prevent a dependency
loop.
(4) Create a null security module and unconditionalise calls into the
security module that's in force (there will always be a security module
applied to a connection, even if it's just the null one).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Thu, 7 Apr 2016 16:23:58 +0000 (17:23 +0100)]
rxrpc: Create a null security type and get rid of conditional calls
Create a null security type for security index 0 and get rid of all
conditional calls to the security operations. We expect normally to be
using security, so this should be of little negative impact.
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Thu, 7 Apr 2016 16:23:51 +0000 (17:23 +0100)]
rxrpc: Absorb the rxkad security module
Absorb the rxkad security module into the af_rxrpc module so that there's
only one module file. This avoids a circular dependency whereby rxkad pins
af_rxrpc and cached connections pin rxkad but can't be manually evicted
(they will expire eventually and cease pinning).
With this change, af_rxrpc can just be unloaded, despite having cached
connections.
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Thu, 7 Apr 2016 16:23:44 +0000 (17:23 +0100)]
rxrpc: Don't assume transport address family and size when using it
Don't assume transport address family and size when using the peer address
to send a packet. Instead, use the start of the transport address rather
than any particular element of the union and use the transport address
length noted inside the sockaddr_rxrpc struct.
This will be necessary when IPv6 support is introduced.
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Thu, 7 Apr 2016 16:23:37 +0000 (17:23 +0100)]
rxrpc: Don't pass gfp around in incoming call handling functions
Don't pass gfp around in incoming call handling functions, but rather hard
code it at the points where we actually need it since the value comes from
within the rxrpc driver and is always the same.
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Thu, 7 Apr 2016 16:23:30 +0000 (17:23 +0100)]
rxrpc: Differentiate local and remote abort codes in structs
In the rxrpc_connection and rxrpc_call structs, there's one field to hold
the abort code, no matter whether that value was generated locally to be
sent or was received from the peer via an abort packet.
Split the abort code fields in two for cleanliness sake and add an error
field to hold the Linux error number to the rxrpc_call struct too
(sometimes this is generated in a context where we can't return it to
userspace directly).
Furthermore, add a skb mark to indicate a packet that caused a local abort
to be generated so that recvmsg() can pick up the correct abort code. A
future addition will need to be to indicate to userspace the difference
between aborts via a control message.
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Thu, 7 Apr 2016 16:23:03 +0000 (17:23 +0100)]
afs: Wait for outstanding async calls before closing rxrpc socket
The afs filesystem needs to wait for any outstanding asynchronous calls
(such as FS.GiveUpCallBacks cleaning up the callbacks lodged with a server)
to complete before closing the AF_RXRPC socket when unloading the module.
This may occur if the module is removed too quickly after unmounting all
filesystems. This will produce an error report that looks like:
AFS: Assertion failed
1 == 0 is false
0x1 == 0x0 is false
------------[ cut here ]------------
kernel BUG at ../fs/afs/rxrpc.c:135!
...
RIP: 0010:[<ffffffffa004111c>] afs_close_socket+0xec/0x107 [kafs]
...
Call Trace:
[<ffffffffa004a160>] afs_exit+0x1f/0x57 [kafs]
[<ffffffff810c30a0>] SyS_delete_module+0xec/0x17d
[<ffffffff81610417>] entry_SYSCALL_64_fastpath+0x12/0x6b
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e6afc8ace6dd ("udp: remove headers from UDP packets before
queueing") modified udp receive processing to pull headers before
enqueue and to not expect them on dequeue.
The patch missed protocols on top of udp with in-kernel
implementations that have their own skb_recv_datagram calls and
dequeue logic. Modify these datapaths to also no longer expect
a udp header at skb->data.
Sunrpc and rxrpc are the only two protocols that call this
function and contain references to udphr (some others, like tipc,
are based on encap_rcv, which acts before enqueue, before the
the header pull).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e6afc8ace6dd modified the udp receive path by pulling the udp
header before queuing an skbuff onto the receive queue.
Rxrpc also calls skb_recv_datagram to dequeue an skb from a udp
socket. Modify this receive path to also no longer expect udp
headers.
Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Willem de Bruijn <willemb@google.com> Tested-by: Thierry Reding <treding@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Commit e6afc8ace6dd modified the udp receive path by pulling the udp
header before queuing an skbuff onto the receive queue.
Sunrpc also calls skb_recv_datagram to dequeue an skb from a udp
socket. Modify this receive path to also no longer expect udp
headers.
Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Reported-by: Franklin S Cooper Jr. <fcooper@ti.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Tested-by: Thierry Reding <treding@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:
root@h1:~# ip ro ls
...
12.0.0.0/24 dev swp1 proto kernel scope link src 12.0.0.1
15.0.0.0/16
nexthop via 12.0.0.2 dev swp1 weight 1
nexthop via 12.0.0.3 dev swp1 weight 1
...
If the link between tor3 and tor1 is down and the link between tor1
and tor2 then tor1 is effectively cut-off from h1. Yet the route lookups
in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
ssh 15.0.0.5 gets the other. Connections that attempt to use the
12.0.0.2 nexthop fail since that neighbor is not reachable:
root@h1:~# ip neigh show
...
12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
12.0.0.2 dev swp1 FAILED
...
The failed path can be avoided by considering known neighbor information
when selecting next hops. If the neighbor lookup fails we have no
knowledge about the nexthop, so give it a shot. If there is an entry
then only select the nexthop if the state is sane. This is similar to
what fib_detect_death does.
To maintain backward compatibility use of the neighbor information is
based on a new sysctl, fib_multipath_use_neigh.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>