Shradha Shah [Wed, 6 May 2015 00:00:07 +0000 (01:00 +0100)]
sfc: Bind the sfc driver to any available VF's
Add the device ID of the VF to the PCI device ID table.
Added a boolean flag is_vf in efx_nic_type to differentiate
between a VF and PF at probe time. This flag is useful in later
patches while setting MAC address specially in the
PCI-passthrough case.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Cooper [Tue, 5 May 2015 23:59:38 +0000 (00:59 +0100)]
sfc: Add use of shared RSS contexts.
Allow PFs to allocate shared RSS contexts if we exhaust our
exclusive RSS contexts. Make VFs use shared RSS contexts in
all cases.
Spruce up error handling so that the shadow copy of the RSS
table is updated after successful update, rather than in all
cases, so that we report the actual contents of the RSS table
after a failure to set it, rather than what we'd like it to be.
Populate context_size parameter when vacuously allocating RSS
context of size 1.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
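The shadow-copy rule described in this commit can be sketched in user-space C as follows. This is only an illustration under stated assumptions: `rss_ctx`, `hw_set_table()` and `rss_update()` are hypothetical names, the table size is arbitrary, and the "hardware" is a stub that fails on request.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RSS_TABLE_SIZE 128	/* illustrative size, not taken from the driver */

/* Hypothetical driver state: the shadow copy mirrors what the NIC holds. */
struct rss_ctx {
	unsigned int shadow[RSS_TABLE_SIZE];
	int hw_should_fail;	/* test hook standing in for a firmware error */
};

/* Stand-in for the firmware call; returns 0 on success, -1 on failure. */
int hw_set_table(struct rss_ctx *ctx, const unsigned int *table)
{
	(void)table;
	return ctx->hw_should_fail ? -1 : 0;
}

/* Update the shadow copy only after the hardware accepted the new table. */
int rss_update(struct rss_ctx *ctx, const unsigned int *new_table)
{
	int rc;

	assert(ctx != NULL && new_table != NULL);
	rc = hw_set_table(ctx, new_table);
	if (rc)
		return rc;	/* shadow still reflects the real contents */
	memcpy(ctx->shadow, new_table, sizeof(ctx->shadow));
	return 0;
}
```

After a failed update the shadow table is unchanged, so anything that reports the RSS table from it shows what the hardware actually holds.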
Edward Cree [Tue, 5 May 2015 23:59:18 +0000 (00:59 +0100)]
sfc: Cope with permissions enforcement added to firmware for SR-IOV
* Accept EPERM in some simple cases, the following cases are handled:
1) efx_mcdi_read_assertion()
Unprivileged PCI functions aren't allowed to GET_ASSERTS.
We return success as it's up to the primary PF to deal with asserts.
2) efx_mcdi_mon_probe() in efx_ef10_probe()
Unprivileged PCI functions aren't allowed to read sensor info, and
worrying about sensor data is the primary PF's job.
3) phy_op->reconfigure() in efx_init_port() and efx_reset_up()
Unprivileged functions aren't allowed to MC_CMD_SET_LINK, they just have
to accept the settings (including flow-control, which is what
efx_init_port() is worried about) they've been given.
4) Fallback to GET_WORKAROUNDS in efx_ef10_probe()
Unprivileged PCI functions aren't allowed to set workarounds. So if
efx_mcdi_set_workaround() fails EPERM, use efx_mcdi_get_workarounds()
to find out if workaround_35388 is enabled.
5) If DRV_ATTACH gets EPERM, try without specifying fw-variant
Unprivileged PCI functions have to use a FIRMWARE_ID of 0xffffffff
(MC_CMD_FW_DONT_CARE).
6) Don't try to exit_assertion unless one had fired
Previously we called efx_mcdi_exit_assertion even if
efx_mcdi_read_assertion had received MC_CMD_GET_ASSERTS_FLAGS_NO_FAILS.
This is unnecessary, and the resulting MC_CMD_REBOOT, even if the
AFTER_ASSERTION flag made it a no-op, would fail EPERM for unprivileged
PCI functions.
So make efx_mcdi_read_assertion return whether an assert happened, and only
call efx_mcdi_exit_assertion if it has.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
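Cases 1) and 6) above can be pictured with a small user-space sketch in C. It assumes a three-way return convention (negative for error, 0 for "no assertion", 1 for "assertion fired"); the enum, the counter and the function names are hypothetical stand-ins, not the driver's code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcomes of the GET_ASSERTS request, for illustration only. */
enum fake_mc_reply { MC_NO_FAILS, MC_ASSERT_FIRED, MC_EPERM, MC_EIO };

int exit_calls;	/* counts how often the exit path ran */

/* Returns <0 on error, 0 if no assertion fired, 1 if one did. */
int read_assertion(enum fake_mc_reply reply)
{
	switch (reply) {
	case MC_EPERM:
		/* Unprivileged function: asserts are the primary PF's job. */
		return 0;
	case MC_NO_FAILS:
		return 0;
	case MC_ASSERT_FIRED:
		return 1;
	default:
		return -EIO;
	}
}

/* Stand-in for the exit path, which would issue the reboot request. */
int exit_assertion(void)
{
	exit_calls++;
	return 0;
}

/* Only take the exit path when an assertion really happened. */
int handle_assertion(enum fake_mc_reply reply)
{
	int rc = read_assertion(reply);

	if (rc <= 0)
		return rc;
	assert(rc == 1);
	return exit_assertion();
}
```

With this shape an unprivileged function never reaches the exit path unless an assertion was actually reported.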
Shradha Shah [Tue, 5 May 2015 23:58:54 +0000 (00:58 +0100)]
sfc: manually allocate and free vadaptors
To be able to use MC_CMD_VADAPTOR_SET_MAC, vadaptors must be
manually allocated and freed as automatic vadaptors will disappear
when their reference_count reaches zero, which must happen before
the MAC address is changed.
Vadaptors are allocated and freed in the vswitching_probe/remove
functions for PFs and VFs, and this means that vadaptors are restored
correctly following an MC reboot or other reset when required.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Shradha Shah [Tue, 5 May 2015 23:58:31 +0000 (00:58 +0100)]
sfc: create vports for VFs and assign random MAC addresses
The parent PF creates vports for all its child VFs and adds MAC
addresses to these. When the VF driver loads, it can make an MCDI
call to get the MAC address that the parent PF assigned it.
The parent PF also assigns a mac address to its own vport because
implicit creation of a vAdaptor will only work on evb ports with
MAC addresses assigned.
The vport MAC address needs to be stored in the PF's nic_data
struct as it can later be changed on the vadaptor (and its net_dev
struct). When removing a vport the original MAC address must be
deleted.
A new flag is needed in the VF data structure to identify whether
a vport has been assigned to the VF. This is to determine whether
it needs to be un-assigned before freeing the vport. Also,
attempting to un-assign a vport which is not assigned will result
in an EALREADY error.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
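The role of the new "vport assigned" flag can be sketched as below. This is a user-space illustration only: `fake_fw`, `vf_state` and the helper names are hypothetical, and the firmware is reduced to a stub that returns -EALREADY when asked to un-assign a vport that is not assigned, as the commit text describes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stub firmware state for one vport. */
struct fake_fw {
	bool assigned;
};

/* Hypothetical per-VF record kept by the parent PF. */
struct vf_state {
	bool vport_assigned;
	struct fake_fw fw;
};

void fw_assign(struct fake_fw *fw)
{
	fw->assigned = true;
}

/* Un-assigning a vport that is not assigned fails with -EALREADY. */
int fw_unassign(struct fake_fw *fw)
{
	if (!fw->assigned)
		return -EALREADY;
	fw->assigned = false;
	return 0;
}

void vf_assign_vport(struct vf_state *vf)
{
	assert(vf != NULL);
	fw_assign(&vf->fw);
	vf->vport_assigned = true;
}

/* Consult the flag first so the firmware is never asked needlessly. */
int vf_free_vport(struct vf_state *vf)
{
	assert(vf != NULL);
	if (vf->vport_assigned) {
		int rc = fw_unassign(&vf->fw);

		if (rc)
			return rc;
		vf->vport_assigned = false;
	}
	/* Freeing the vport itself would follow here. */
	return 0;
}
```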
Daniel Pieczko [Tue, 5 May 2015 23:57:34 +0000 (00:57 +0100)]
sfc: create VEB vswitch and vport above default firmware setup
Adds functions to allocate and free vswitches and vports; vadaptors
are automatically allocated and freed when TX/RX queues are
initialised and finalised. This vswitching structure is only created
if the firmware supports it, so a check that full-featured firmware
is running is performed first.
If the MC resets, the vswitching infrastructure will need to be
recreated, so mark the "must_probe_vswitching" flag when an MC reboot
is detected.
Don't try to create a vswitch if vf-count=0
This allocation of vswitches and vports does not currently support
configuring VLAN tags, but that can be added in a future change.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Pieczko [Tue, 5 May 2015 23:57:14 +0000 (00:57 +0100)]
sfc: record the PF's vport ID in nic_data
The default port ID of EVB_PORT_ID_ASSIGNED is a "magic" number
for the MCFW to select the physical port of the PF. If other
vswitches and vports are created on top of the default firmware
configuration, the ID of the newly created vport is then required
when passed to MCDI commands. Currently, this doesn't happen so
the vport_id is never changed, but a subsequent patch will change
this behaviour so that other vswitches and vports are created.
The vport_id recorded in nic_data is only relevant for PFs.
VFs will have their vports created by their parent PF, and in
that case the parent PF will record the vport ID of each VF.
For a VF, nic_data->vport_id is expected to remain at the default
value.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Pieczko [Tue, 5 May 2015 23:56:55 +0000 (00:56 +0100)]
sfc: Record [rt]x_dpcpu_fw_id in EF10 nic_data
The (future) code to add/remove vswitches and vports will be
dependent on the firmware variant.
To simplify the checking of the firmware variant, record
values for rx_dpcpu_fw_id and tx_dpcpu_fw_id in EF10 nic_data.
There was only one place where this was previously used:
efx_mcdi_print_fwver() in ethtool.c.
The MC_CMD_GET_CAPABILITIES can be replaced and the values from
nic_data used instead.
Note that the printing of "?" if the MC command fails or if the
outlength is incorrect no longer applies, because errors are returned
in efx_ef10_init_datapath_caps() in both of these cases.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Pieczko [Tue, 5 May 2015 23:55:36 +0000 (00:55 +0100)]
sfc: Move and rename efx_vf struct to siena_vf
The efx_vf struct contains Siena-specific fields for VFs,
so rename to siena_vf.
Also move it into the siena_nic_data struct, as EF10 will
track its VFs in its own ef10_nic_data, storing much less
information about them since VFDI is no longer used.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Shradha Shah [Tue, 5 May 2015 23:55:13 +0000 (00:55 +0100)]
sfc: Own header for nic-specific sriov functions, single instance of netdev_ops and sriov removed from Falcon code
By putting all the efx_{siena,ef10}_sriov_* declarations in
{siena,ef10}_sriov.h, ensure they cannot be called from nic-generic code.
Also fixes up an instance of this, where mcdi.c was calling
efx_siena_sriov_flr.
The single instance of netdev_ops should call general high level
functions that can then call something adapter specific in efx_nic_type.
We should only do adapter specialisation via efx_nic_type.
Removal of sriov functionality from the Falcon code means that tests
are needed for the presence of some callbacks.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 9 May 2015 20:05:54 +0000 (16:05 -0400)]
Merge branch 'dsa-next'
Andrew Lunn says:
====================
More Marvell DSA refactoring and fixup
This patchset continues the refactoring and cleanup of the Marvell
DSA drivers.
Patch #1 Centralizes the duplicated parts of port setup and global
setup into the shared mv88e6xxx.
Patch #2 Centralizes looping over the ports setting them up
Patch #3 Uses mnemonics for the remaining register access in the
drivers.
Patch #4 The 6172 is actually a member of the 6352 family. This moves
the probe code into the correct driver.
Patch #5 Adds more members of the 6171 family to the 6171 driver. The
new devices are untested.
Patch #6 The 6185 is a member of the 6131 family. Add it to the probe
code of the 6131 driver.
Patch #7 and Patch #8 Simplify the mutexes in mv88e6xxx.c. The SMI bus
is the bottleneck, not the granularity of the mutexes, so simplify the
code down to a single mutex.
Patch #8 Fixes a false positive lockdep splat, due to nested uses of
MDIO busses.
Patch #9 Fixes another false positive lockdep splat with the transmit
queue because of stacked Ethernet devices.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Tue, 5 May 2015 23:09:56 +0000 (01:09 +0200)]
net: dsa: Add lockdep class to tx queues to avoid lockdep splat
DSA stacks an Ethernet device on top of an Ethernet device. This can
cause false positive lockdep splats for the transmit queue:
=============================================
[ INFO: possible recursive locking detected ]
4.0.0-rc7-01838-g70621a215fc7 #386 Not tainted
---------------------------------------------
kworker/0:0/4 is trying to acquire lock:
(_xmit_ETHER#2){+.-...}, at: [<c040e95c>] sch_direct_xmit+0xa8/0x1fc
but task is already holding lock:
(_xmit_ETHER#2){+.-...}, at: [<c03f4208>] __dev_queue_xmit+0x4d4/0x56c
other info that might help us debug this:
Possible unsafe locking scenario:
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
DSA can have nested MDIO busses, where the Ethernet MDIO bus is used
to access an MDIO bus within the switch which has the PHYs connected
to it. This nesting causes lockdep to give false positives. Use
mutex_lock_nested() to avoid this.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Tue, 5 May 2015 23:09:54 +0000 (01:09 +0200)]
net: dsa: mv88e6xxx: Replace stats mutex with SMI mutex
The SMI bus is the bottleneck in all switch operations, not the
granularity of locks. Replace the stats mutex by the SMI mutex to make
the locking concept simpler.
The REG_READ/REG_WRITE macros cannot be used while holding the SMI
mutex, since they try to acquire it. Replace with calls to the
appropriate function which does not try to get the mutex.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Tue, 5 May 2015 23:09:53 +0000 (01:09 +0200)]
net: dsa: mv88e6xxx: Replace PHY mutex by SMI mutex
The SMI bus is the bottleneck in all switch operations, not the
granularity of locks. Replace the PHY mutex by the SMI mutex to make
the locking concept simpler.
The REG_READ/REG_WRITE macros cannot be used while holding the SMI
mutex, since they try to acquire it. Replace with calls to the
appropriate function which does not try to get the mutex.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
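The locking rule in the two commits above (helpers that take the SMI mutex must not be called while it is already held) can be sketched in user-space C. Everything here is an assumption made for illustration: `toy_mutex` is a non-recursive lock that asserts instead of deadlocking, and `reg_read()`, `reg_read_unlocked()` and `read_counter32()` are hypothetical names rather than the driver's functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy non-recursive lock standing in for the SMI mutex. */
struct toy_mutex {
	bool held;
};

void toy_lock(struct toy_mutex *m)
{
	assert(!m->held);	/* a real mutex would deadlock here */
	m->held = true;
}

void toy_unlock(struct toy_mutex *m)
{
	assert(m->held);
	m->held = false;
}

struct toy_switch {
	struct toy_mutex smi_mutex;
	uint16_t regs[32];
};

/* Unlocked accessor: the caller must already hold smi_mutex. */
uint16_t reg_read_unlocked(struct toy_switch *sw, int reg)
{
	assert(sw->smi_mutex.held);
	return sw->regs[reg];
}

/* Locked wrapper: takes the mutex itself, so it cannot be nested. */
uint16_t reg_read(struct toy_switch *sw, int reg)
{
	uint16_t val;

	toy_lock(&sw->smi_mutex);
	val = reg_read_unlocked(sw, reg);
	toy_unlock(&sw->smi_mutex);
	return val;
}

/* A multi-register operation locks once and uses the unlocked accessor. */
uint32_t read_counter32(struct toy_switch *sw, int lo, int hi)
{
	uint32_t val;

	toy_lock(&sw->smi_mutex);
	val = reg_read_unlocked(sw, lo);
	val |= (uint32_t)reg_read_unlocked(sw, hi) << 16;
	toy_unlock(&sw->smi_mutex);
	return val;
}
```

Calling `reg_read()` from inside `read_counter32()` would trip the assertion in `toy_lock()`, which is the situation the commits avoid by switching to the non-locking functions.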
Andrew Lunn [Tue, 5 May 2015 23:09:50 +0000 (01:09 +0200)]
net: dsa: Move mv88e6172 support into mv88e6352 family driver
The mv88e6172 is part of the mv88e6352 family of devices. Move support
for it out of the mv88e6171 driver into the mv88e6352, which results
in some simplifications to the code.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Tue, 5 May 2015 23:09:47 +0000 (01:09 +0200)]
net: dsa: Centralise global and port setup code into mv88e6xxx.
The port setup code in the individual drivers is identical for 6123,
6171, and 6352, and very similar in 6131. Move it all into mv88e6xxx,
using the chip families to differentiate on features.
Similarly, the global setup is also very similar. Move the majority
into mv88e6xxx.
The chips themselves fall into families. Add helpers which use the
device IDs to determine if a device is a member of a family or not.
Add some additional device IDs to the existing list, to make these
helper functions more complete. However these IDs are not yet added to
the probe functions.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
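A family helper of the kind described above might look like the following sketch. The IDs are placeholders (the model number is used as the ID, which is not how the hardware reports it), and `toy_family`, `chip_family()` and `is_6352_family()` are hypothetical names chosen for the example. The family assignments for the 6172 and 6185 follow the cover letter of this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum toy_family { FAMILY_UNKNOWN, FAMILY_6131, FAMILY_6171, FAMILY_6352 };

struct id_entry {
	uint16_t id;		/* placeholder: model number, not a register value */
	enum toy_family family;
};

static const struct id_entry id_table[] = {
	{ 6131, FAMILY_6131 },
	{ 6185, FAMILY_6131 },
	{ 6171, FAMILY_6171 },
	{ 6172, FAMILY_6352 },
	{ 6352, FAMILY_6352 },
};

/* Map a device ID to its family; unknown IDs get FAMILY_UNKNOWN. */
enum toy_family chip_family(uint16_t id)
{
	for (size_t i = 0; i < sizeof(id_table) / sizeof(id_table[0]); i++)
		if (id_table[i].id == id)
			return id_table[i].family;
	return FAMILY_UNKNOWN;
}

/* Shared setup code can branch on the family instead of on single IDs. */
bool is_6352_family(uint16_t id)
{
	return chip_family(id) == FAMILY_6352;
}
```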
net: axienet: Removed _of_ prefix in probe and remove functions
Synchronize names with other drivers.
Signed-off-by: Srikanth Thokala <sthokal@xilinx.com> Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: axienet: Removed coding style errors and warnings
Removed checkpatch.pl errors and warnings.
Signed-off-by: Srikanth Thokala <sthokal@xilinx.com> Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds proper checks to handle the PHY-less case.
Signed-off-by: Srikanth Thokala <sthokal@xilinx.com> Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: axienet: Handle jumbo frames for lesser frame sizes
In the current implementation, jumbo frames are supported only
for the frame sizes > 16K. This patch corrects this logic to
handle jumbo frames for lesser frame sizes (< 16K) ensuring jumbo frame
MTU is within the limit of max frame size configured in the h/w
design.
Signed-off-by: Srikanth Thokala <sthokal@xilinx.com> Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The packet completion interrupts for TX and RX should be serviced before
the packets are consumed. This guards against the degenerate case when a
new completion interrupt is raised after the handler has exited but before
the interrupts are cleared. In this case it is possible for the ISR to clear
an unhandled interrupt (leading to potential deadlock).
Signed-off-by: Peter Crosthwaite <peter.crosthwaite@xilinx.com> Tested-by: Jason Wu <huanyu@xilinx.com> Acked-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The AXI-DMA rx-delay interrupt can sometimes be triggered
when there are 0 outstanding packets received. This is due
to the fact that the receive function will greedily consume
as many packets as possible on interrupt. So if two packets
(with a very particular timing) arrive in succession they
will each cause the rx-delay interrupt, but the first interrupt
will consume both packets.
This means the second interrupt is a 0 packet receive.
This is mostly OK, except that the tail pointer register is
updated unconditionally on receive. Currently the tail pointer
is always set to the current bd-ring descriptor under
the assumption that the hardware has moved onto the next
descriptor. For a zero-length receive, this means the current
descriptor, which the hardware has potentially yet to use, is
marked as the tail. This causes the hardware to think
it has run out of descriptors, deadlocking the whole rx path.
Fixed by updating the tail pointer to the most recent
successfully consumed descriptor.
Reported-by: Wendy Liang <wendy.liang@xilinx.com> Signed-off-by: Peter Crosthwaite <peter.crosthwaite@xilinx.com> Tested-by: Jason Wu <huanyu@xilinx.com> Acked-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
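The fix described above can be illustrated with a toy receive ring in user-space C. This is a sketch under assumptions: `toy_rx_ring`, its fields and `rx_poll()` are hypothetical, the ring size is arbitrary, and the "tail pointer register" is just a struct member whose writes are counted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RX_BD_NUM 8	/* illustrative ring size */

struct toy_bd {
	bool completed;	/* set by "hardware" when a packet has landed */
};

struct toy_rx_ring {
	struct toy_bd bd[RX_BD_NUM];
	size_t cur;		/* next descriptor software will examine */
	size_t tail;		/* last value written to the tail pointer */
	unsigned int tail_writes;
};

/* Consume every completed descriptor; returns the number of packets. */
unsigned int rx_poll(struct toy_rx_ring *r)
{
	unsigned int packets = 0;
	size_t last = 0;
	bool consumed = false;

	assert(r != NULL);
	while (r->bd[r->cur].completed) {
		r->bd[r->cur].completed = false;	/* hand it back */
		last = r->cur;
		consumed = true;
		r->cur = (r->cur + 1) % RX_BD_NUM;
		packets++;
	}

	/* Zero-packet interrupt: leave the tail pointer alone. */
	if (consumed) {
		r->tail = last;	/* most recent successfully consumed BD */
		r->tail_writes++;
	}
	return packets;
}
```

A second poll with nothing completed returns 0 and does not touch `tail`, so a descriptor the hardware has not used yet is never marked as the tail.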
This patch adds support for the RGMII. The h/w configuration
parameter C_PHY_TYPE, which represents the interface configured in
the design, is used to differentiate various interfaces supported
by AXI Ethernet.
Signed-off-by: Srikanth Thokala <sthokal@xilinx.com> Signed-off-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 5 May 2015 23:31:50 +0000 (19:31 -0400)]
Merge branch 'cxgb4-next'
Hariprasad Shenai says:
====================
Trivial fixes and changes for SGE
This patch series adds the following.
Discard packet if length is greater than MTU, move sge monitor code to a
new routine, add device node to ULD info, add congestion notification from
SGE for ingress queue and freelists and for T5, setting up the Congestion
Manager values of the new RX Ethernet Queue is done by firmware now.
This patch series has been created against net-next tree and includes
patches on cxgb4 driver.
We have included all the maintainers of respective drivers. Kindly review
the change and let us know in case of any review comments.
Thanks
V2: Align parenthesis for PATCH 2/6 and PATCH 5/6
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
cxgb4: Discard the packet if the length is greater than mtu
pktgen sends raw UDP packets and bypasses most of the
Linux networking stack. The user can specify different packet sizes.
Hence we need to discard the packet if its length is greater than the MTU.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
cxgb4: Pass in a Congestion Channel Map to t4_sge_alloc_rxq()
Passes a Congestion Channel Map to t4_sge_alloc_rxq()
for the Ethernet RX Queues based on the MPS Buffer Group Map
of the TX Channel rather than just the TX Channel Map.
Also, in t4_sge_alloc_rxq() for T5, setting up the
Congestion Manager values of the new RX Ethernet Queue is
done by firmware now.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
cxgb4: Make sure that Freelist size is larger than Egress Congestion Threshold
We need to make sure that the Free List Size, in pointers, is at
least 2 Egress Queue Units (8 pointers/each) larger than the SGE's Egress
Congestion Threshold (in pointers).
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
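The sizing rule stated above is simple arithmetic, sketched here in C. The function names are hypothetical and the threshold values used in any call are up to the caller; only the "2 units of 8 pointers" margin comes from the commit text.

```c
#include <assert.h>

#define PTRS_PER_EQ_UNIT 8u	/* from the text: 8 pointers per unit */

/* Smallest free list size, in pointers, that satisfies the rule. */
unsigned int min_fl_size(unsigned int egr_cong_thres)
{
	return egr_cong_thres + 2 * PTRS_PER_EQ_UNIT;
}

/* Raise a requested size to the minimum if it is too small. */
unsigned int clamp_fl_size(unsigned int requested, unsigned int egr_cong_thres)
{
	unsigned int min = min_fl_size(egr_cong_thres);

	return requested < min ? min : requested;
}
```

For example, with a threshold of 32 pointers the free list must hold at least 32 + 2 * 8 = 48 pointers.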
Thomas Graf [Tue, 5 May 2015 00:27:02 +0000 (02:27 +0200)]
rhashtable-test: Fix 64bit division
A 64bit division went in unnoticed. Use do_div() to accommodate
non 64bit architectures.
Reported-by: kbuild test robot Fixes: 1aa661f5c3df ("rhashtable-test: Measure time to insert, remove & traverse entries") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
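For readers unfamiliar with do_div(), the sketch below mimics its calling convention in user-space C: the 64-bit dividend is replaced by the quotient and the 32-bit remainder is returned. The kernel's do_div() is a macro that takes the dividend by name; `do_div_sketch()` and `avg_ns()` are hypothetical stand-ins written as ordinary functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stand-in with do_div()-like semantics: *n becomes the quotient,
 * the remainder is returned.
 */
uint32_t do_div_sketch(uint64_t *n, uint32_t base)
{
	uint32_t rem;

	assert(n != 0 && base != 0);
	rem = (uint32_t)(*n % base);
	*n /= base;
	return rem;
}

/* Example use: average time per entry, in nanoseconds. */
uint64_t avg_ns(uint64_t total_ns, uint32_t entries)
{
	do_div_sketch(&total_ns, entries);
	return total_ns;
}
```

In kernel code the point of using the helper is that a plain `/` on a 64-bit value may not link on 32-bit architectures.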
David S. Miller [Tue, 5 May 2015 23:29:50 +0000 (19:29 -0400)]
Merge branch 'ipvlan-mcast'
Mahesh Bandewar says:
====================
Multicast processing in IPvlan
Dan Willems pointed out that autoconf in IPvlan is broken because of the
way the broadcast bit gets set. Since broadcast processing is a real performance
drain, the broadcast bit in the multicast filter was only set when the interface
was configured with an IPv4 address. In the autoconf scenario, when there are
no addresses configured, this logic did not work and it would not allow
DHCPv4 to work. The only way around it was to add protocol-specific hacks to
avoid the unnecessary broadcast processing burden.
This juggling can be avoided if multicast / broadcast packets are taken
out of the fast path and processed in a work-queue. This enables us to set the
broadcast bit in all multicast filters without any impact on the performance of
the virtual device. This patch series does just that.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mahesh Bandewar [Tue, 5 May 2015 00:06:11 +0000 (17:06 -0700)]
ipvlan: Always set broadcast bit in multicast filter
Earlier tricks of setting the broadcast bit only when an IPv4 address is added
to the interface are not good enough, especially when autoconf comes into play.
Always setting it is a performance drag, but now that multicast /
broadcast is not processed in the fast path, enabling broadcast lets
autoconf work correctly without affecting the performance characteristics of
the device.
Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mahesh Bandewar [Tue, 5 May 2015 00:06:03 +0000 (17:06 -0700)]
ipvlan: Defer multicast / broadcast processing to a work-queue
Processing multicast / broadcast in the fast path drains performance,
and having more links means more cloning, bringing performance
down further.
Broadcast, in particular, needs to be given to all the virtual links.
Earlier tricks of enabling the broadcast bit only for IPv4 interfaces do not
really work since they break autoconf. That means broadcast has to be enabled
for all the links if protocol-specific hacks are to be kept out of
the driver.
This patch defers all (incoming as well as outgoing) multicast traffic to
a work-queue leaving only the unicast traffic in the fast-path. Now if we
need to apply any additional tricks to further reduce the impact of this
(multicast / broadcast) type of traffic, it can be implemented while
processing this work without affecting the fast-path.
Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
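The split between fast path and deferred work can be sketched in user-space C. This is an illustration under assumptions: `toy_port`, `toy_rx()` and `toy_process_backlog()` are hypothetical, the backlog is a fixed array instead of a kernel queue, and the work handler only counts packets where the driver would clone them to each link.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BACKLOG_MAX 16	/* illustrative queue depth */

struct toy_pkt {
	uint8_t dst[6];	/* destination MAC address */
};

struct toy_port {
	const struct toy_pkt *backlog[BACKLOG_MAX];
	size_t backlog_len;
	unsigned int fast_path_pkts;
	unsigned int deferred_pkts;
};

/* Group bit: lowest bit of the first octet (also set for broadcast). */
bool is_multicast(const uint8_t *addr)
{
	return (addr[0] & 0x01) != 0;
}

/* Fast path: unicast is handled inline, everything else is queued. */
bool toy_rx(struct toy_port *port, const struct toy_pkt *pkt)
{
	assert(port != NULL && pkt != NULL);
	if (!is_multicast(pkt->dst)) {
		port->fast_path_pkts++;
		return true;
	}
	if (port->backlog_len == BACKLOG_MAX)
		return false;	/* queue full: drop */
	port->backlog[port->backlog_len++] = pkt;
	return true;
}

/* Work handler: runs later, outside the fast path. */
void toy_process_backlog(struct toy_port *port)
{
	for (size_t i = 0; i < port->backlog_len; i++)
		port->deferred_pkts++;	/* per-link cloning would go here */
	port->backlog_len = 0;
}
```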
David S. Miller [Tue, 5 May 2015 23:24:43 +0000 (19:24 -0400)]
Merge branch 'eth_proto_is_802_3'
Alexander Duyck says:
====================
Add eth_proto_is_802_3 to provide improved means of checking Ethertype
This patch series implements and makes use of eth_proto_is_802_3(). The
idea behind the function is to provide an optimized means of testing to
determine if a given Ethertype value is a length or 802.3 protocol number.
The standard path for this was to use ntohs(proto) and then perform a
comparison. This adds a slight cost as it usually requires either a 16b
rotate or byte swap which can cost 1 cycle or more depending on the
processor.
I had previously addressed this for eth_type_trans, however in doing so I had
overlooked checking with sparse and had introduced a couple sparse warnings.
The first patch in this series fixes those sparse warnings as well as does
some additional optimization for big endian systems. In addition it pushes
the code out into a separate function which can then be used in the other
patches to reduce the instruction count/processing time in those functions
as well.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Mon, 4 May 2015 21:33:48 +0000 (14:33 -0700)]
etherdev: Fix sparse error, make test usable by other functions
This change does two things. First it fixes a sparse error for the fact
that the __be16 degrades to an integer. Since that is actually what I am
kind of doing I am simply working around that by forcing both sides of the
comparison to u16.
Also I realized on some compilers I was generating another instruction for
big endian systems such as PowerPC since it was masking the value before
doing the comparison. So to resolve that I have simply pulled the mask out
and wrapped it in an #ifndef __BIG_ENDIAN.
Lastly I pulled this all out into its own function. I noticed there are
similar checks in a number of other places so this function can be reused
there to help reduce overhead in these paths as well.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
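The idea behind the test can be reproduced in user-space C. The sketch below is not the kernel function: `proto_is_802_3_sketch()` and the `_SKETCH` constant are illustrative names, and the mask is applied unconditionally for simplicity, whereas the commit wraps it in #ifndef __BIG_ENDIAN to save an instruction on big endian systems.

```c
#include <arpa/inet.h>	/* htons() */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ETH_P_802_3_MIN_SKETCH 0x0600	/* values below this are lengths */

/*
 * proto_be holds the 16-bit field exactly as it appears on the wire
 * (network byte order). The input itself is never byte swapped: the
 * threshold has a zero low byte, so comparing the byte that carries
 * the most significant 8 bits is sufficient.
 */
bool proto_is_802_3_sketch(uint16_t proto_be)
{
	/* Drop the byte holding the least significant 8 bits. */
	proto_be &= htons(0xFF00);
	return proto_be >= htons(ETH_P_802_3_MIN_SKETCH);
}
```

The two htons() calls act on constants, so a compiler can fold them at build time; that is where the saving over ntohs(proto) on the packet field comes from.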
Bernhard Thaler [Mon, 4 May 2015 20:47:13 +0000 (22:47 +0200)]
bridge: change BR_GROUPFWD_RESTRICTED to allow forwarding of LLDP frames
BR_GROUPFWD_RESTRICTED bitmask restricts users from setting values to
/sys/class/net/brX/bridge/group_fwd_mask that allow forwarding of
some IEEE 802.1D Table 7-10 Reserved addresses:
Change BR_GROUPFWD_RESTRICTED to allow forwarding of LLDP frames and document
group_fwd_mask.
e.g.
echo 16384 > /sys/class/net/brX/bridge/group_fwd_mask
allows LLDP frames to be forwarded.
This may be needed for bridge setups used for network troubleshooting or
any other scenario where forwarding of LLDP frames is desired (e.g. bridge
connecting a virtual machine to real switch transmitting LLDP frames that
virtual machine needs to receive).
Tested on a simple bridge setup with two interfaces and host transmitting
LLDP frames on one side of this bridge (used lldpd). Setting group_fwd_mask
as described above lets LLDP frames traverse bridge.
Signed-off-by: Bernhard Thaler <bernhard.thaler@wvnet.at> Signed-off-by: David S. Miller <davem@davemloft.net>
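The value 16384 in the example above is bit 14 of the mask. A small C sketch of how a mask bit relates to a reserved address follows, assuming each bit n corresponds to the address 01-80-C2-00-00-0n and that LLDP uses the address ending in 0E; `group_fwd_bit()` is a hypothetical helper, not a bridge API.

```c
#include <assert.h>
#include <stdint.h>

#define LLDP_LAST_OCTET 0x0E	/* assumed: LLDP address 01-80-C2-00-00-0E */

/* Bit in group_fwd_mask for reserved address 01-80-C2-00-00-0X. */
uint16_t group_fwd_bit(uint8_t last_octet)
{
	assert(last_octet <= 0x0F);
	return (uint16_t)(1u << last_octet);
}
```

`group_fwd_bit(LLDP_LAST_OCTET)` evaluates to 1 << 14, which is the 16384 written to group_fwd_mask in the commit message.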
Eric Dumazet [Mon, 4 May 2015 04:34:46 +0000 (21:34 -0700)]
tcp: provide SYN headers for passive connections
This patch allows a server application to get the TCP SYN headers for
its passive connections. This is useful if the server is doing
fingerprinting of clients based on SYN packet contents.
Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN.
The first is used on a socket to enable saving the SYN headers
for child connections. This can be set before or after the listen()
call.
The latter is used to retrieve the SYN headers for passive connections,
if the parent listener has enabled TCP_SAVE_SYN.
TCP_SAVED_SYN is read once; reading it frees the saved SYN headers.
The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP
headers.
Original patch was written by Tom Herbert, I changed it to not hold
a full skb (and associated dst and conntracking reference).
We have used such a patch for about 3 years at Google.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Tested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
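A server might use the two options roughly as sketched below. This is a minimal illustration, not code from the patch: `enable_save_syn()` and `read_saved_syn()` are hypothetical wrappers, and the fallback option numbers are an assumption about what the Linux UAPI header defines, so the header's own definitions should be preferred where available.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Assumed fallback values for libc headers that predate the options. */
#ifndef TCP_SAVE_SYN
#define TCP_SAVE_SYN 27
#endif
#ifndef TCP_SAVED_SYN
#define TCP_SAVED_SYN 28
#endif

/* Ask the kernel to keep SYN headers for children of this listener. */
int enable_save_syn(int listen_fd)
{
	int one = 1;

	assert(listen_fd >= 0);
	return setsockopt(listen_fd, IPPROTO_TCP, TCP_SAVE_SYN,
			  &one, sizeof(one));
}

/*
 * Fetch the saved network and TCP headers of an accepted connection.
 * Returns the number of bytes copied, or -1 on error. Per the commit
 * text, the saved headers are freed once they have been read.
 */
ssize_t read_saved_syn(int conn_fd, void *buf, size_t len)
{
	socklen_t optlen = (socklen_t)len;

	if (getsockopt(conn_fd, IPPROTO_TCP, TCP_SAVED_SYN, buf, &optlen) < 0)
		return -1;
	return (ssize_t)optlen;
}
```

Typical use would be to call `enable_save_syn()` on the listening socket (before or after listen()) and `read_saved_syn()` once on each socket returned by accept().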
Linus Lüssing [Mon, 4 May 2015 22:19:35 +0000 (00:19 +0200)]
net: fix two sparse warnings introduced by IGMP/MLD parsing exports
> net/core/skbuff.c:4108:13: sparse: incorrect type in assignment (different base types)
> net/ipv6/mcast_snoop.c:63 ipv6_mc_check_exthdrs() warn: unsigned 'offset' is never less than zero.
Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 4 May 2015 19:37:08 +0000 (15:37 -0400)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2015-05-04
This series contains updates to igb, e100, e1000e and ixgbe.
Todd cleans up igb_enable_mas() since it should only be called for the
82575 silicon and has no clear return, so modify the function to void.
Jean Sacren found upon inspection that 'err' did not need to be
initialized, since it is immediately overwritten.
Alex Duyck provides two patches for e1000e, the first cleans up the
handling of VLAN_HLEN as a part of max frame size. Fixes the issue: c751a3d58cf2d ("e1000e: Correctly include VLAN_HLEN when changing
interface MTU"). The second fixes an issue where the driver was not
allowing jumbo frames to be enabled when CRC stripping was disabled,
however it was allowing CRC stripping to be disabled while jumbo frames
were enabled.
Jeff (me) fixes a warning found on PPC where the use of do_div() needed
to use u64 arg and not s64.
Mark provides three ixgbe patches, first to fix the Intel On-chip System
Fabric (IOSF) Sideband message interfaces, to serialize access using both
PHY bits in the SWFW_SEMAPHORE register. Then fixes how semaphore bits
were released, since they should be released in reverse of the order that
they were taken. Lastly updates ixgbe to use a signed type to hold
error codes, since error codes are negative, so consistently use signed
types when handling them.
v2: dropped the previous #6-#8 patches by Hiroshi Shimanoto based on
feedback from Or Gerlitz (and David Miller) that it appears there
needs to be further discussion on how this gets implemented.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 4 May 2015 19:04:02 +0000 (15:04 -0400)]
Merge branch 'tipc-topology-cleanup'
Ying Xue says:
====================
tipc: cleanup topology server
Not only are the function names declared in subscr.c very confusing, but
the topology server's locking policy is also not well designed, for
instance leading to a panic in some special corner cases.
In this series, we attempt to eliminate the confusion of function names
and simplify the topology server's locking policy to solve the above-mentioned
issues. More importantly, the change will make the relevant code easier to
understand and maintain.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Mon, 4 May 2015 02:36:47 +0000 (10:36 +0800)]
tipc: adjust locking policy of subscription
Currently subscriber's lock protects not only subscriber's subscription
list but also all subscriptions linked into the list. However, as all
members of subscription are never changed after they are initialized,
it's unnecessary for subscription to be protected under subscriber's
lock. If the lock is used to only protect subscriber's subscription
list, the adjustment not only makes the locking policy simpler, but
also helps to avoid a deadlock which may happen when creating a
subscription fails.
Signed-off-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Mon, 4 May 2015 02:36:46 +0000 (10:36 +0800)]
tipc: involve reference counter for subscriber
At present subscriber's lock is used to protect the subscription list
of subscriber as well as subscriptions linked into the list. While one
or all subscriptions are deleted through iterating the list, the
subscriber's lock must be held. Meanwhile, as deletion of subscription
may happen in subscription timer's handler, the lock must be grabbed
in the function as well. When subscription's timer is terminated with
del_timer_sync() during above iteration, subscriber's lock has to be
temporarily released, otherwise, deadlock may occur. However, the
temporary release may cause the double free of a subscription as the
subscription is not disconnected from the subscription list.
Now if a reference counter is introduced to subscriber, subscription's
timer can be asynchronously stopped with del_timer(). As a result, not
only is the issue fixed, but the relevant code also becomes more
readable and understandable.
Signed-off-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
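The reference-counting idea can be sketched in single-threaded user-space C. The names are hypothetical, the counter is a plain integer rather than the atomic type kernel code would need, and `freed_flag` exists only so a caller can observe when the last reference is dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical subscriber; the real structure has more members. */
struct toy_subscriber {
	unsigned int refcnt;
	bool *freed_flag;	/* set when the object is released */
};

struct toy_subscriber *subscriber_create(bool *freed_flag)
{
	struct toy_subscriber *s = malloc(sizeof(*s));

	if (!s)
		return NULL;
	s->refcnt = 1;		/* reference owned by the connection */
	s->freed_flag = freed_flag;
	*freed_flag = false;
	return s;
}

/* Taken for each user that may outlive the caller, e.g. a pending timer. */
void subscriber_get(struct toy_subscriber *s)
{
	assert(s->refcnt > 0);
	s->refcnt++;
}

/* The object is freed only when the last holder lets go. */
void subscriber_put(struct toy_subscriber *s)
{
	assert(s->refcnt > 0);
	if (--s->refcnt == 0) {
		*s->freed_flag = true;
		free(s);
	}
}
```

Because a timer handler that still runs holds its own reference, the code tearing down the subscriber does not have to wait for it synchronously; whichever side calls `subscriber_put()` last frees the object.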
Ying Xue [Mon, 4 May 2015 02:36:44 +0000 (10:36 +0800)]
tipc: rename functions defined in subscr.c
When a topology server accepts a connection request from a client,
it allocates a connection instance and a tipc_subscriber structure
object. The former is used to communicate with the client, and the
latter is treated as a subscriber which manages all subscription
events requested by the same client. When a topology server receives
a request to subscribe to name services from a client through the
connection, it creates a tipc_subscription structure instance, which
is a subscription recording which name services are subscribed to. To
manage all subscriptions from the same client, the topology server
links them into the subscrp_list of the subscriber. So subscriber and
subscription have completely different meanings, but the function
names associated with them are so confusing that it is hard to tell
which functions operate on a subscriber and which on a subscription.
Eliminate the confusion by renaming them.
Signed-off-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 4 May 2015 18:49:23 +0000 (14:49 -0400)]
Merge branch 'igmp_mld_export'
Linus Lüssing says:
====================
Exporting IGMP/MLD checking from bridge code
The multicast optimizations in batman-adv are so far only usable and
enabled in non-bridged scenarios. To be able to support bridged setups
batman-adv needs to be able to detect IGMP/MLD queriers and reports on
mesh nodes without bridges, too. See the following link for details:
To avoid duplicate code between the bridge and batman-adv, the IGMP/MLD
message validation code is moved from the bridge to the IPv4/IPv6 stack.
On the way, some refactoring to increase readability and to iron out
some subtle differences between the IGMP and MLD parsing code is done.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Lüssing [Sat, 2 May 2015 12:01:07 +0000 (14:01 +0200)]
net: Export IGMP/MLD message validation code
With this patch, the IGMP and MLD message validation functions are moved
from the bridge code to IPv4/IPv6 multicast files. Some small
refactoring was done to enhance readability and to iron out some
differences in behaviour between the IGMP and MLD parsing code (e.g. the
skb-cloning of MLD messages is now only done if necessary, just like the
IGMP part always did).
Finally, these IGMP and MLD message validation functions are exported so
that not only the bridge but later also batman-adv can use them.
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Mark Rustad [Fri, 10 Apr 2015 17:36:36 +0000 (10:36 -0700)]
ixgbe: Use a signed type to hold error codes
Because error codes are negative, it only makes sense to
consistently use signed types when handling them. Also remove
some explicit comparisons with 0 on these variables.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Fri, 10 Apr 2015 17:36:31 +0000 (10:36 -0700)]
ixgbe: Release semaphore bits in the right order
The global semaphore bits should be released in the reverse of the
order that they were taken, so correct that.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
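The ordering rule in the commit above (release in the reverse of the order taken) can be modelled as a small stack. The sketch below is a user-space illustration only; SEM_BIT_A, SEM_BIT_B and struct sem_model are hypothetical and do not correspond to ixgbe register definitions.

```c
#include <stdint.h>

#define SEM_BIT_A 0x1u  /* hypothetical semaphore bit */
#define SEM_BIT_B 0x2u  /* hypothetical semaphore bit */
#define SEM_MAX   4

/* Tracks which bits are held and the order in which they were taken. */
struct sem_model {
	uint32_t held;
	uint32_t order[SEM_MAX];
	int depth;
};

/* Take a bit; fails with -1 if it is already held or the stack is full. */
static int sem_acquire(struct sem_model *m, uint32_t bit)
{
	if ((m->held & bit) || m->depth >= SEM_MAX)
		return -1;
	m->held |= bit;
	m->order[m->depth++] = bit;
	return 0;
}

/* Release the most recently taken bit; returns it, or 0 if none is held. */
static uint32_t sem_release_last(struct sem_model *m)
{
	uint32_t bit;

	if (m->depth == 0)
		return 0;
	bit = m->order[--m->depth];
	m->held &= ~bit;
	return bit;
}
```

Taking A then B and calling sem_release_last() twice releases B first and A second, which is the nesting the commit restores.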
Mark Rustad [Fri, 10 Apr 2015 17:36:26 +0000 (10:36 -0700)]
ixgbe: Fix IOSF SB access issues
IOSF is the Intel On-chip System Fabric used in SOCs. IOSF SB is
the IOSF SideBand message interface. This patch serializes IOSF SB
access using both phy bits in the SWFW_SEMAPHORE register. It also
adds a helper function to wait for IOSF SB accesses to complete.
Use the new function to perform this wait before each access, as
specified in the datasheet, in addition to using it to wait for
IOSF SB read/write completion.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jeff Kirsher [Sat, 2 May 2015 08:20:04 +0000 (01:20 -0700)]
e1000e: fix call to do_div() to use u64 arg
We were using s64 for lat_ns (the latency value in nanoseconds) because
our calculations can produce a negative result. For negative
values we then set lat_ns to zero, so the value passed to
do_div() was never negative. However, do_div() expects its argument
to be of type u64, so add a cast to resolve a compile warning seen on
PowerPC.
CC: Yanjiang Jin <yanjiang.jin@windriver.com> CC: Yanir Lubetkin <yanirx.lubetkin@intel.com> Reported-by: Yanjiang Jin <yanjiang.jin@windriver.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com>
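The clamp-then-cast pattern described above can be sketched in user space as follows. This is an illustration, not the e1000e code: div_u64_inplace() is a hypothetical stand-in that mimics the kernel's do_div() convention (quotient stored in place, remainder returned), and scale_latency() is an invented name.

```c
#include <stdint.h>

/*
 * User-space stand-in for do_div(): divides *n in place and returns the
 * remainder. The dividend must be an unsigned 64-bit value.
 */
static uint32_t div_u64_inplace(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/* Clamp a possibly negative latency to zero, then divide as unsigned. */
static uint64_t scale_latency(int64_t lat_ns, uint32_t base)
{
	uint64_t n;

	if (lat_ns < 0)
		lat_ns = 0;
	n = (uint64_t)lat_ns;  /* safe: the value is known to be >= 0 here */
	div_u64_inplace(&n, base);
	return n;
}
```

The cast changes no value because the clamp runs first; it only gives the division helper the unsigned type it expects.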
Alexander Duyck [Sat, 2 May 2015 08:09:59 +0000 (01:09 -0700)]
e1000e: Do not allow CRC stripping to be disabled on 82579 w/ jumbo frames
The driver wasn't allowing jumbo frames to be
enabled when CRC stripping was disabled, however it was allowing CRC
stripping to be disabled while jumbo frames were enabled. This fixes that by
making it so that the NETIF_F_RXFCS flag cannot be set when jumbo frames are
enabled on 82579 and newer parts.
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 2 May 2015 07:52:00 +0000 (00:52 -0700)]
e1000e: Cleanup handling of VLAN_HLEN as a part of max frame size
When VLAN_HLEN was added to the calculation for the maximum frame size,
a number of issues seem to have been introduced into the driver.
The first issue is that in some cases the maximum frame size for a device
never really reached the actual maximum frame size as the VLAN header
length was not included in the calculation for that value. As a result some
parts only supported a maximum frame size of either 1496 in the case of
parts that didn't support jumbo frames, or 8996 in the case of the parts
that do.
The second issue is the fact that there were several checks that weren't
updated so as a result setting an MTU of 1500 was treated as enabling jumbo
frames as the calculated value was 1522 instead of 1518. I have addressed
those by replacing ETH_FRAME_LEN with VLAN_ETH_FRAME_LEN where appropriate.
The final issue was the fact that lowering the MTU below 1500 would cause
the driver to allocate 2K buffers for the rings. This is an old issue that
was fixed several years ago in igb/ixgbe and I am addressing it now by just
replacing == with a <= so that we always just round up to 1522 for anything
that isn't a jumbo frame.
Fixes: c751a3d58cf2d ("e1000e: Correctly include VLAN_HLEN when changing interface MTU") Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
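The arithmetic behind the numbers quoted above can be checked with a short sketch. The constants carry the standard Ethernet values (14-byte header, 4-byte VLAN tag, 4-byte FCS); the helper names are hypothetical and the jumbo tests are a reading of the message, not the driver source.

```c
#define ETH_HLEN            14    /* Ethernet header */
#define VLAN_HLEN           4     /* 802.1Q tag */
#define ETH_FCS_LEN         4     /* frame check sequence */
#define ETH_FRAME_LEN       1514  /* 1500 + ETH_HLEN */
#define VLAN_ETH_FRAME_LEN  1518  /* ETH_FRAME_LEN + VLAN_HLEN */

/* Maximum frame size for a given MTU, VLAN header included. */
static int max_frame_size(int mtu)
{
	return mtu + ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN;
}

/* Check without the VLAN header: an MTU of 1500 looks like a jumbo frame. */
static int is_jumbo_old(int mtu)
{
	return max_frame_size(mtu) > ETH_FRAME_LEN + ETH_FCS_LEN;
}

/* Check against VLAN_ETH_FRAME_LEN: an MTU of 1500 is a normal frame. */
static int is_jumbo_fixed(int mtu)
{
	return max_frame_size(mtu) > VLAN_ETH_FRAME_LEN + ETH_FCS_LEN;
}
```

With these definitions max_frame_size(1500) is 1522, which exceeds 1518 in the old check but not 1522 in the corrected one, and max_frame_size(1496) is 1518, which accounts for the 1496 limit mentioned in the first issue.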
Jean Sacren [Sat, 2 May 2015 07:49:26 +0000 (00:49 -0700)]
e100: don't initialize int object to zero
'err' will be overwritten so no need to initialize it to zero.
Signed-off-by: Jean Sacren <sakiwit@gmail.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Patches #6 and #7 are fairly simple barrier stuff.
Patch #8 closes some SMP transmit races - not that anyone really
complained about these but it's a bit hard to handwave that they
can be safely ignored. Some testing, especially SMP testing of
course, would be welcome.
. Changes since #2:
- added dma_rmb barrier in vlan related patch 6.
- s/wmb/dma_wmb/ in (*new*) patch 7 of 8.
- added explicit SMP barriers in (*new*) patch 8 of 8.
. Changes since #1:
- turned wmb() into dma_wmb() as suggested by davem and Alexander Duyck
in patch 1 of 6.
- forgot to reset rx_head_desc in rhine_reset_rbufs in patch 4 of 6.
- removed rx_head_desc altogether in (*new*) patch 5 of 6
- removed some vlan receive ugliness in (*new*) patch 6 of 6.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
françois romieu [Fri, 1 May 2015 20:14:45 +0000 (22:14 +0200)]
via-rhine: close SMP transmit races.
7ab87ff4c770eed71e3777936299292739fcd0fe ("via-rhine: move work from
irq handler to softirq and beyond") forgot to explicitly control the
lifespan of the tx_dirty and tx_cur pointers.
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com> Signed-off-by: David S. Miller <davem@davemloft.net>
françois romieu [Fri, 1 May 2015 20:14:41 +0000 (22:14 +0200)]
via-rhine: forbid holes in the receive descriptor ring.
Rationale:
- throttle work under memory pressure
- lower receive descriptor recycling latency for the network adapter
- lower the maintenance burden of uncommon paths
The patch is twofold:
- it fails early if the receive ring can't be completely initialized
at dev->open() time
- it drops packets on the floor in the napi receive handler so as to
keep the receive ring full
Signed-off-by: Francois Romieu <romieu@fr.zoreil.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 4 May 2015 04:09:09 +0000 (00:09 -0400)]
Merge branch 'flow_keys_digest'
Tom Herbert says:
====================
net: Eliminate calls to flow_dissector and introduce flow_keys_digest
In this patch set we add skb_get_hash_perturb which gets the skbuff
hash for a packet and perturbs it using a provided key and jhash1.
This function is used in several qdiscs and eliminates many calls
to flow_dissector and jhash3 to get a perturbed hash for a packet.
To handle the sch_choke issue (passes flow_keys in skbuff cb) we
add flow_keys_digest which is a digest of a flow constructed
from a flow_keys structure.
This is the second version of these patches I posted a while ago,
and is prerequisite work to increasing the size of the flow_keys
structure and hashing over it (full IPv6 address, flow label, VLAN ID,
etc.).
Version 2:
- Add keyval parameter to __flow_hash_from_keys which allows caller to
set the initval for jhash
- Perturb always does flow dissection and creates hash based on
input perturb value which acts as the keyval to __flow_hash_from_keys
- Added a _flow_keys_digest_data which is used in make_flow_keys_digest.
This fills out the digest by populating individual fields instead
of copying the whole structure.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Fri, 1 May 2015 18:30:17 +0000 (11:30 -0700)]
net: Add flow_keys digest
Some users of flow keys (well just sch_choke now) need to pass
flow_keys in skbuff cb, and use them for exact comparisons of flows
so that skb->hash is not sufficient. In order to increase size of
the flow_keys structure, we introduce another structure for
the purpose of passing flow keys in skbuff cb. We limit this structure
to sixteen bytes, and we will technically treat this as a digest of
flow_keys struct hence its name flow_keys_digest. In the first
incarnation we just copy the flow_keys structure up to 16 bytes--
this is the same information previously passed in the cb. In the
future, we'll adapt this for larger flow_keys and could use something
like SHA-1 over the whole flow_keys to improve the quality of the
digest.
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
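A rough user-space sketch of the "first incarnation" described above: a fixed 16-byte digest filled by copying the leading bytes of a larger key structure, then compared exactly. struct demo_flow_keys and its layout are hypothetical and are not the kernel's flow_keys; only the 16-byte limit comes from the message.

```c
#include <stdint.h>
#include <string.h>

#define DIGEST_LEN 16  /* size limit stated in the commit message */

/* Hypothetical key layout (20 bytes, no padding); not the kernel struct. */
struct demo_flow_keys {
	uint32_t src;
	uint32_t dst;
	uint16_t sport;
	uint16_t dport;
	uint16_t thoff;
	uint8_t  ip_proto;
	uint8_t  reserved;
	uint32_t extra;  /* lies beyond the first 16 bytes */
};

struct demo_digest {
	uint8_t data[DIGEST_LEN];
};

/* Fill the digest with at most DIGEST_LEN leading bytes of the keys. */
static void make_digest(struct demo_digest *d, const struct demo_flow_keys *k)
{
	size_t n = sizeof(*k) < DIGEST_LEN ? sizeof(*k) : DIGEST_LEN;

	memset(d, 0, sizeof(*d));
	memcpy(d->data, k, n);
}

/* Exact comparison of two digests; returns nonzero when they match. */
static int digest_equal(const struct demo_digest *a,
			const struct demo_digest *b)
{
	return memcmp(a->data, b->data, DIGEST_LEN) == 0;
}
```

In this layout two keys that differ only in extra produce equal digests, which shows why growing the key structure later calls for a smarter digest, as the message anticipates.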
Tom Herbert [Fri, 1 May 2015 18:30:12 +0000 (11:30 -0700)]
net: Add skb_get_hash_perturb
This calls flow_dissect and __skb_get_hash to procure a hash for a
packet. Input includes a key to initialize jhash. This function
does not set skb->hash.
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Fri, 1 May 2015 14:39:54 +0000 (16:39 +0200)]
net: ipv4: route: Fix sending IGMP messages with link address
In setups with a global scope address on one interface and a lesser
scope address on the interface sending IGMP reports, the reports can be
sent using the other interface's global scope address rather than the
local interface address. RFC 2236 suggests:
Ignore the Report if you cannot identify the source address of
the packet as belonging to a subnet assigned to the interface on
which the packet was received.
since such reports could be forged.
Look at the protocol when deciding if a RT_SCOPE_LINK address should
be used for the packet.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
TC classifiers/actions were converted to RCU by John in the series:
http://thread.gmane.org/gmane.linux.network/329739/focus=329739
and many follow on patches.
This is the last patch from that series that finally drops
ingress spin_lock.
Single cpu ingress+u32 performance goes from 22.9 Mpps to 24.5 Mpps.
In two cpu case when both cores are receiving traffic on the same
device and go into the same ingress+u32 the performance jumps
from 4.5 + 4.5 Mpps to 23.5 + 23.5 Mpps
Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 4 May 2015 03:18:02 +0000 (23:18 -0400)]
Merge branch 'tcp_sack_rttm'
Kenneth Klette Jonassen says:
====================
tcp: SACK RTTM changes for congestion control
This patch series improves SACK RTT measurements for congestion control:
o Picks the latest sequence SACKed for RTT, i.e. most accurate delay
signal.
o Calls the congestion control's pkts_acked hook with SACK RTTMs
even when not sequentially ACKing new data.
V2: amend misleading comment
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp_sacktag_one() always picks the earliest sequence SACKed for RTT.
This might not make sense for congestion control in cases where:
1. ACKs are lost, i.e. a SACK following a lost SACK covers both
new and old segments at the receiver.
2. The receiver disregards the RFC 5681 recommendation to immediately
ACK out-of-order segments.
Give congestion control an RTT for the latest segment SACKed, which is the
most accurate RTT estimate, but preserve the conservative RTT for RTO.
Remove the call to skb_mstamp_get() in tcp_sacktag_one().
Cc: Yuchung Cheng <ycheng@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Kenneth Klette Jonassen <kennetkl@ifi.uio.no> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
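The distinction drawn above between the two RTT samples can be sketched as follows. This is a simplified model, not the kernel code: struct sack_rtt and sack_rtts() are hypothetical, and the send timestamps of the newly SACKed segments are assumed to be given in sequence order, in microseconds.

```c
/* Two RTT samples derived from one SACK: one for RTO, one for congestion control. */
struct sack_rtt {
	long rto_rtt_us;  /* from the earliest SACKed segment (conservative) */
	long ca_rtt_us;   /* from the latest SACKed segment (most recent delay) */
};

/*
 * sent_us[] holds send times of the SACKed segments in sequence order.
 * Returns -1 in both fields when nothing was SACKed.
 */
static struct sack_rtt sack_rtts(const long *sent_us, int n, long now_us)
{
	struct sack_rtt r = { -1, -1 };

	if (n <= 0)
		return r;
	r.rto_rtt_us = now_us - sent_us[0];
	r.ca_rtt_us = now_us - sent_us[n - 1];
	return r;
}
```

When a SACK covers both old and new segments, for example after an earlier SACK was lost, the first value stays large and safe for the retransmission timer, while the second reflects the current path delay.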