Roopa Prabhu [Thu, 15 Jan 2015 04:02:25 +0000 (20:02 -0800)]
bridge: fix setlink/dellink notifications
Problems with bridge getlink/setlink notifications today:
- bridge setlink generates two notifications to userspace
    - one from the bridge driver
    - one from rtnetlink.c (rtnl_bridge_notify)
- dellink generates one notification from rtnetlink.c, which
  means bridge setlink and dellink notifications are not
  consistent
- Looking at the code, it appears that if both BRIDGE_FLAGS_MASTER
  and BRIDGE_FLAGS_SELF are set, the size calculation in
  rtnl_bridge_notify can be wrong.
  Example: if you set both BRIDGE_FLAGS_MASTER and BRIDGE_FLAGS_SELF
  in a setlink request to a rocker dev, rtnl_bridge_notify will
  allocate an skb for one set of bridge attributes, but both the
  bridge driver and the rocker dev will try to add attributes,
  resulting in twice the number of attributes being added to the
  skb (the rocker dev calls ndo_dflt_bridge_getlink).
There are multiple options:
1) Generate one notification including all attributes from master and self:
   I don't think this will work, because both master and self may use
   the same attributes/policy. The same set of attributes cannot be packed
   into a single notification from both master and self (duplicate attributes).
2) Generate one notification from master and another notification from
   self (this seems to be ideal):
     For master: the master driver will send the notification (bridge in this
     example).
     For self: the self driver will send the notification (rocker in the above
     example). It can use helpers from rtnetlink.c to do so, like the
     ndo_dflt_bridge_getlink api.
This patch implements 2) (leaving the 'rtnl_bridge_notify' around to be used
with 'self').
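The duplicate-attribute problem of option 1 and the shape of option 2 can be modelled in a few lines. This is a minimal userspace sketch in Python, not kernel code: a notification is just a list of (attribute, value) pairs, the function names are hypothetical, and the attribute names are used only as illustrative labels.

```python
# Userspace model only; every function name here is hypothetical.

def fill_bridge_attrs(source):
    """Master (bridge) and self (e.g. rocker) both fill the same attribute set."""
    return [("IFLA_MASTER", source), ("IFLA_PROTINFO", source), ("IFLA_AF_SPEC", source)]

def one_combined_notification():
    """Option 1: pack master and self attributes into a single message."""
    return fill_bridge_attrs("master") + fill_bridge_attrs("self")

def per_source_notifications():
    """Option 2: one message sent by master, a second message sent by self."""
    return [fill_bridge_attrs("master"), fill_bridge_attrs("self")]

def has_duplicate_attrs(msg):
    """True if any attribute name appears more than once in the message."""
    names = [name for name, _ in msg]
    return len(names) != len(set(names))

combined = one_combined_notification()
separate = per_source_notifications()

# A buffer sized for one attribute set is too small for the combined message.
budget = len(fill_bridge_attrs("master"))
print(len(combined) > budget, has_duplicate_attrs(combined))
print([has_duplicate_attrs(m) for m in separate])
```

With separate messages, each one fits a buffer sized for a single attribute set and carries every attribute at most once.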
v1->v2 :
- rtnl_bridge_notify is now called only for self,
so, remove 'BRIDGE_FLAGS_SELF' check and cleanup a few things
- rtnl_bridge_dellink used to always send a RTM_NEWLINK msg
earlier. So, I have changed the notification from br_dellink to
go as RTM_NEWLINK
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 18 Jan 2015 01:34:14 +0000 (20:34 -0500)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2015-01-16
This series contains updates to i40e and i40evf.
This series is a little bit larger than normal because two of the patches are
version bumps.
Shannon provides tweaks to i40e and i40evf to keep the firmware, software
and silicon validation in line together by removing unused and
deprecated code, adding a define for iSCSI and fixing the queue mask size.
Fixes i40e so we do not give up in the reset/rebuild process if DCB setup
fails, and just handle it the same as in the probe setup. Cleans up PTP
log messages by removing the use of __func__, as we are not using that
any longer, and removes the netdev name, since that can change and can
be misleading. Adds struct size checks to indirect and command
structs that were left out previously. Adds admin queue API updates
(LLDP control, OEM OCSD and OCBB commands).
Kevin increases ASQ timeout for scenarios with multi-function devices.
Carolyn fixes a problem where the interrupt descriptions from the MSI-X
configuration were truncating the needed bus info, which makes it hard
to distinguish configurations from port to port. Increases the string
buffer size in order to allow the full data to be displayed.
Sravanthi cleans up the dump stats string from debugfs.
Jacob updates i40e to only enable the PTP interrupt in PFs which have PTP
enabled, instead of blindly enabling the PTP interrupt flags for all PFs.
Also updates i40e so that we do not do Tx or Rx timestamps if we do not
have PTP enabled. Adds the same check against pf->ptp_rx as we have
in the Rx timestamp code path, because it is possible that the user
configures only Tx hardware timestamping, in which case we do not want to
check for Rx timestamp hang since the software won't be handling them.
Neerav updates the driver to disable the firmware LLDP agent for NICs with
a firmware version lower than v4.3 and adds a message when this happens.
Adds parsing and reporting of iSCSI capability for a given device or
function, as well as support for the iSCSI partition type with DCB
in NPAR mode.
v2:
- Dropped patch 10 "i40e: clean up PTP log messages" based on feedback
from David Laight and David Miller
- Split up the original patch 13 "i40e: AQ API updates for new commands"
into 2 patches (now #12 & #13) based on feedback from Or Gerlitz
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The following series of patches includes functional updates to the
driver as well as some trivial changes.
- Fix checks/warnings from checkpatch in the amd-xgbe driver
- Fix checks/warnings from checkpatch in the amd-xgbe-phy driver
- Add a check to be sure that the amd-xgbe driver is using the
amd-xgbe-phy driver
- Use a saved control register value when bringing the PCS out of
suspend
- Clear all device state during a device restart
- Simplify the Rx descriptor ring tracking
- Remove the need for Tx path spinlocks
- Update the auto-negotiation logic to make use of the auto-negotiation
interrupt
- Properly support/advertise the FEC capability of the device
- Use the proper page registers during auto-negotiation extended next
page exchange
- Add ACPI support to the amd-xgbe and amd-xgbe-phy drivers
- Allow platform specific phy settings to be supplied by UEFI
This patch series is based on net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Fri, 16 Jan 2015 18:47:21 +0000 (12:47 -0600)]
amd-xgbe-phy: Allow certain PHY settings to be set by UEFI
Certain PHY settings need to be configurable by UEFI depending on the
platform being used. Add new device tree / ACPI properties that, if
present, will override the pre-determined values currently used.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Fri, 16 Jan 2015 18:47:10 +0000 (12:47 -0600)]
amd-xgbe-phy: Use the proper auto-negotiation XNP registers
When receiving and processing extended next pages the base registers
were used instead of the XNP registers. Update the code to use the
device XNP and link partner XNP registers.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Fri, 16 Jan 2015 18:47:05 +0000 (12:47 -0600)]
amd-xgbe-phy: Properly support the FEC auto-negotiation
Advertise and apply the Forward Error Correction capabilities of the
device based on the FEC ability of the device. Also, remove the use
of some hard coded values related to KR and FEC in preference of some
#defines.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Fri, 16 Jan 2015 18:47:00 +0000 (12:47 -0600)]
amd-xgbe-phy: Change auto-negotiation logic
The auto negotiation logic was geared to being the initiator of the
auto negotiation. This presented problems when auto negotiation was
initiated by the remote end. Change the auto negotiation logic to
make use of the auto negotiation event interrupt thus allowing the
auto negotiation state machine to function properly in either scenario.
This also removes the polling during auto-negotiation.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Fri, 16 Jan 2015 18:46:55 +0000 (12:46 -0600)]
amd-xgbe: Remove need for Tx path spinlock
Since the Tx ring cleanup can run at the same time that data is being
transmitted, a spin lock was used to protect the ring. This patch
eliminates the need for Tx spinlocks by updating the current ring
position only after all ownership bits for data being transmitted have
been set. This will ensure that ring operations in the Tx cleanup path
do not interfere with the ring operations in the Tx transmit path.
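The ordering rule described above (set all ownership bits first, publish the new ring position last) can be illustrated with a small model. This is a single-threaded Python sketch under stated assumptions, not the driver's code: the class and method names are hypothetical, and `hw_complete` merely stands in for the hardware handing descriptors back.

```python
class TxRing:
    """Toy descriptor ring: 'cur' is written only by the transmit path,
    'dirty' only by the cleanup path, so no shared lock is modelled."""

    def __init__(self, size):
        self.size = size
        self.own = [False] * size  # True while hardware owns the descriptor
        self.cur = 0               # next free slot (free-running counter)
        self.dirty = 0             # next slot to clean (free-running counter)

    def transmit(self, count):
        """Queue 'count' descriptors; returns False if the ring lacks room."""
        if self.cur - self.dirty + count > self.size:
            return False
        for i in range(count):
            self.own[(self.cur + i) % self.size] = True
        # Publish the new position only after every ownership bit is set,
        # so cleanup never sees a half-prepared descriptor below 'cur'.
        self.cur += count
        return True

    def hw_complete(self, count):
        """Stand-in for hardware: release up to 'count' oldest descriptors."""
        idx = self.dirty
        while count > 0 and idx != self.cur:
            self.own[idx % self.size] = False
            idx += 1
            count -= 1

    def cleanup(self):
        """Reclaim descriptors between 'dirty' and 'cur' that hardware released."""
        cleaned = 0
        while self.dirty != self.cur and not self.own[self.dirty % self.size]:
            self.dirty += 1
            cleaned += 1
        return cleaned
```

Because cleanup only walks the range between `dirty` and the published `cur`, it never touches a slot the transmit path is still filling.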
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Fri, 16 Jan 2015 18:46:50 +0000 (12:46 -0600)]
amd-xgbe: Simplify the Rx desciptor ring tracking
Make the Rx descriptor ring processing similar to the Tx descriptor
ring processing. Remove the realloc_index and realloc_threshold
variables and base everything on the current index counter and the
dirty index counter.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Fri, 16 Jan 2015 18:46:45 +0000 (12:46 -0600)]
amd-xgbe: Clear all state during a device restart
When performing a device restart, like during an MTU change, sometimes
the device queues still have data and get hung up trying to flush
resulting in the device becoming unresponsive until brought down and
back up. To prevent this, always perform a device reset during a
restart.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Fri, 16 Jan 2015 18:46:39 +0000 (12:46 -0600)]
amd-xgbe-phy: On suspend, save CTRL1 reg for use on resume
Reads to registers are undefined when the PCS is powered down. To be
safe, save the CTRL1 register used for power down during suspend and
restore that value during resume.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lendacky, Thomas [Fri, 16 Jan 2015 18:46:34 +0000 (12:46 -0600)]
amd-xgbe: Add check to be sure amd-xgbe-phy driver is used
The amd-xgbe driver relies on the amd-xgbe-phy phylib driver. Add a
check to be sure that if any errors occur during probing of the
amd-xgbe-phy driver then the amd-xgbe driver returns an error.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Change-ID: Ice127eee3a5a5d1b8765d83cff8c30f9f3b1bc32 Signed-off-by: Sravanthi Tangeda <sravanthi.tangeda@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Neerav Parikh [Sun, 14 Dec 2014 01:55:16 +0000 (01:55 +0000)]
i40e: Support for NPAR iSCSI partition with DCB
Add parsing and reporting of iSCSI capability for a given device or
function.
Also add support for iSCSI partition type with DCB in NPAR mode.
In this mode it is expected that software would configure both the LAN
and iSCSI traffic classes for the iSCSI partition; whereas all the NIC
type partitions will use LAN TC (TC0) only.
Hence, the patch enables querying of DCB configuration in MFP mode and
configures TCs for iSCSI partition type.
Though NIC type partitions may not have more than 1 TC enabled for them
the port may have multiple TCs enabled and hence I40E_FLAG_DCB_ENABLED
will be set/reset on all the partitions based on number of TCs on the
port. This is required as in DCB environment it is expected that all
traffic will be priority tagged.
Change-ID: I8c6e1cfd46c46d8a39c57d9020d9ff8d42ed8a7d Signed-off-by: Neerav Parikh <neerav.parikh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Sun, 14 Dec 2014 01:55:15 +0000 (01:55 +0000)]
i40e: when Rx timestamps disabled set specific mode
Instead of leaving the Rx timestamps in the same mode as before if we
disable the Rx logic, we can set it into a mode that has the fewest
possible timestamps generated. To do this, select only V1 mode, but do
not enable UDP packet recognition. This should eliminate all (or at
least almost all) Rx timestamps, since V1 packets are always over UDP.
Change-ID: If847288e0030a716e059c4c33ab114f2cf038f05 Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Sun, 14 Dec 2014 01:55:14 +0000 (01:55 +0000)]
i40e: use same check for Rx hang as for Rx timestamps
It's possible that the user configured only Tx hardware timestamping,
and thus we might be receiving PTP traffic which we timestamp but which
software never reads. In this case we don't want to check for Rx
timestamp hang, because we already know that software won't be handling
them. Thus, we add the same check against pf->ptp_rx as we have in the
Rx timestamp code path.
Change-ID: I66486c8dba307facbff8eace4e52e2f083789d1b Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Acked-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Sun, 14 Dec 2014 01:55:12 +0000 (01:55 +0000)]
i40e: add more struct size checks
Add struct size checks to many of the indirect structs and a few
command structs that were left out previously.
Change-ID: I7810b9af0f04e3ced670639f8671daf7df9b3f4d Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Acked-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Sun, 14 Dec 2014 01:55:09 +0000 (01:55 +0000)]
i40e: check I40E_FLAG_PTP before handling Tx or Rx timestamps
We should not be doing Tx or Rx timestamps if we do not have PTP
enabled. Add checks to ensure that we don't attempt to handle any PTP
related timestamping code if we have not enabled PTP on that PF.
Change-ID: I4335942ae2d5c5f91abfdbeeea02bcace49e7677 Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Sun, 14 Dec 2014 01:55:08 +0000 (01:55 +0000)]
i40e: only enable PTP interrupt cause if PTP is enabled
We should not blindly enable the PTP interrupt flags for all PFs. We
should only enable the PTP interrupt in PFs which have enabled
PTP.
Change-ID: I051a17cae4c199a2f3cf7852266e27eda6630525 Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Thu, 11 Dec 2014 07:06:38 +0000 (07:06 +0000)]
i40e: don't give up on DCB error after reset
We don't need to give up in the reset/rebuild process if the DCB setup failed,
so handle it here the same as in the probe setup. Also adjust the log strings
a little to look less scary.
Change-ID: I57308d703047e61d3f1a5e471ea77be232444ca0 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Carolyn Wyborny [Thu, 11 Dec 2014 07:06:37 +0000 (07:06 +0000)]
i40e: fix proc/int descriptions
This patch fixes a problem where the /proc/interrupts descriptions
from the MSI-X configuration were truncating the needed bus info,
making it hard to distinguish configuration from port to port.
This patch increases the string buffer size in order to allow the
full data to be displayed and syncs the text formatting of the misc
and fdir interrupt names.
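The effect of an undersized name buffer can be shown with a short model. This is a hedged Python sketch: the name layout, the buffer sizes and the function `format_irq_name` are assumptions made for illustration and are not taken from the i40e driver.

```python
def format_irq_name(driver, bus_info, queue, bufsize):
    """Model of formatting into a fixed-size C buffer: at most bufsize - 1
    characters survive, the last byte being reserved for the terminator."""
    full = f"{driver}-{bus_info}-TxRx-{queue}"  # hypothetical name layout
    return full[: bufsize - 1]

# Two ports of the same adapter differ only in the PCI function number.
small_a = format_irq_name("i40e", "0000:01:00.0", 0, 16)
small_b = format_irq_name("i40e", "0000:01:00.1", 0, 16)
large_a = format_irq_name("i40e", "0000:01:00.0", 0, 32)
large_b = format_irq_name("i40e", "0000:01:00.1", 0, 32)

print(small_a == small_b)  # the truncated names collide
print(large_a == large_b)  # the full names stay distinct
```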
Change-ID: Ib01d6c61fb3f4ac70fbdf5bcc520b22638ea54b7 Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Kevin Scott [Thu, 11 Dec 2014 07:06:36 +0000 (07:06 +0000)]
i40e/i40evf: Increase ASQ timeout
Increase ASQ timeout for some scenarios with multi-function devices
Change-ID: I2d7655b19e6c6f9a7ad04deacb106ca8d53886db Signed-off-by: Kevin Scott <kevin.c.scott@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Fri, 12 Dec 2014 07:50:07 +0000 (07:50 +0000)]
i40e/i40evf: AdminQ updates ww36
Several little tweaks to keep FW, SV, and SW in line together
- Remove the unused and deprecated
i40e_aqc_opc_debug_modify_internals
- Add define for iSCSI capability
- Fix queue mask size
- Adjust i40e_aqc_oem_param_change for ease-of-use
Change-ID: I51f250b367912968a7cec61b3a68110d9796e914 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Kamil Kacperski <kamil.kacperski@intel.com> Acked-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Herbert Xu [Fri, 16 Jan 2015 06:23:48 +0000 (17:23 +1100)]
netlink: Fix netlink_insert EADDRINUSE error
The patch c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink:
eliminate nl_sk_hash_lock") introduced a bug where the EADDRINUSE
error has been replaced by ENOMEM. This patch rectifies that
problem.
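Why the distinction matters to callers can be sketched as follows. This is a simplified Python model, not the real netlink_insert(): the function `insert_port`, the table layout and the `max_entries` limit are hypothetical, and only the two error codes come from the text above.

```python
import errno

def insert_port(table, portid, sock, max_entries):
    """Insert sock under portid, returning 0 or a negative errno value."""
    if portid in table:
        # The port ID is already bound: report this distinctly so the
        # caller can pick another ID instead of treating it as fatal.
        return -errno.EADDRINUSE
    if len(table) >= max_entries:
        # Genuine resource exhaustion is a different failure.
        return -errno.ENOMEM
    table[portid] = sock
    return 0

table = {}
results = [
    insert_port(table, 1, "sock-a", max_entries=2),
    insert_port(table, 1, "sock-b", max_entries=2),
    insert_port(table, 2, "sock-c", max_entries=2),
    insert_port(table, 3, "sock-d", max_entries=2),
]
```

A caller that retries with a different port ID on EADDRINUSE would give up for no reason if the duplicate were reported as ENOMEM.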
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Fri, 16 Jan 2015 03:13:09 +0000 (11:13 +0800)]
rhashtable: Fix race in rhashtable_destroy() and use regular work_struct
When we put our declared work task in the global workqueue with
schedule_delayed_work(), its delay parameter is always zero.
Therefore, we should define a regular work in the rhashtable structure
instead of a delayed work.
By the way, we add a condition to check whether the resizing functions
are NULL before cancelling the work, to avoid cancelling an
uninitialized work.
Lastly, while we wait with cancel_delayed_work() for all previously
submitted work items to run to completion, ht->mutex is already held
in rhashtable_destroy(). Moreover, cancel_delayed_work() doesn't return
until all work items have completed, and when a work item is
scheduled, the work's function, rht_deferred_worker(), will be called.
However, as rht_deferred_worker() also needs to acquire the lock,
a deadlock might happen because the lock is already held.
So moving the cancel work call out of the scope covered by the lock
avoids the deadlock.
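The lock ordering problem can be reproduced in userspace. This is a minimal Python threading sketch, not the rhashtable code: `destroy` and `deferred_worker` are hypothetical stand-ins, a thread join with a timeout plays the role of waiting for the work item, and the timeout exists only so that the bad ordering is reported instead of hanging.

```python
import threading

def destroy(wait_inside_lock, timeout):
    """Return True if the pending worker finished while we waited for it."""
    mutex = threading.Lock()  # stands in for ht->mutex
    log = []

    def deferred_worker():
        with mutex:           # the worker needs the same lock
            log.append("worker ran")

    worker = threading.Thread(target=deferred_worker)
    if wait_inside_lock:
        with mutex:
            worker.start()
            worker.join(timeout)      # waiting while holding the lock:
            finished = not worker.is_alive()  # the worker cannot progress
        worker.join()                 # lock released, worker can now finish
    else:
        worker.start()
        worker.join(timeout)          # wait first, without the lock
        finished = not worker.is_alive()
        with mutex:
            log.append("table freed")
    return finished

stuck = destroy(wait_inside_lock=True, timeout=0.1)
clean = destroy(wait_inside_lock=False, timeout=5)
```

In the first call the wait can only end by timing out, which is the model's equivalent of the deadlock; in the second the worker completes before the lock is taken.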
Fixes: 97defe1 ("rhashtable: Per bucket locks & deferred expansion/shrinking") Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Thomas Graf <tgraf@suug.ch> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 16 Jan 2015 06:07:02 +0000 (01:07 -0500)]
Merge branch 'iw_cxgb4-next'
Hariprasad Shenai says:
====================
Refactor macros to conform to uniform standards
This patch series cleans up macros/register defines defined in t4.h and
t4fw_ri_api.h, and all the affected files.
This patch series is created against the net-next tree and includes patches on
the iw_cxgb4 tree. Since the patches are dependent on previous cleanup patches, we
would like to get this series merged through the net-next tree.
We have included all the maintainers of the respective drivers. Kindly review the
change and let us know in case of any review comments.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Xander Huff [Thu, 15 Jan 2015 21:55:20 +0000 (15:55 -0600)]
net/macb: Create gem_ethtool_ops for new statistics functions
10/100 MACB does not have the same statistics possibilities as GEM. Separate
macb_ethtool_ops to make a new GEM-specific struct with the new statistics
functions included.
Signed-off-by: Xander Huff <xander.huff@ni.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Gao Zhenyu <gzhenyu@vmware.com> Signed-off-by: Shrikrishna Khare <skhare@vmware.com> Reviewed-by: Shreyas N Bhatewara <sbhatewara@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Schmitz [Thu, 15 Jan 2015 13:06:15 +0000 (14:06 +0100)]
net: smc91x: Add Atari EtherNAT support
Add Atari specific code to the smc91x Ethernet driver. This code is used
on the EtherNAT adapter card for the Atari Falcon extension port.
Signed-off-by: Michael Schmitz <schmitz@debian.org> Tested-by: Christian Steigies <cts@debian.org>
[geert: Sort Kconfig entries, split in hard and soft dependencies] Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Nicolas Pitre <nico@fluxnic.net> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 16 Jan 2015 00:16:56 +0000 (19:16 -0500)]
Merge tag 'mac80211-next-for-davem-2015-01-15' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next
Here's a big pile of changes for this round.
We have
* a lot of regulatory code changes to deal with the
way newer Intel devices handle this
* a change to drop packets while disconnecting from
an AP instead of trying to wait for them
* a new attempt at improving the tailroom accounting
to not kick in too much for performance reasons
* improvements in wireless link statistics
* many other small improvements and small fixes that
didn't seem necessary for 3.19 (e.g. in hwsim which
is testing only code)
Anish Bhatt [Wed, 14 Jan 2015 23:17:34 +0000 (15:17 -0800)]
cxgb4 : Update ipv6 address handling api
This patch improves on previously added support for ipv6 addresses. The code
is consolidated to a single file and adds an api for use by dependent upper
level drivers such as cxgb4i/iw_cxgb4 etc.
Signed-off-by: Anish Bhatt <anish@chelsio.com> Signed-off-by: Deepak Singh <deepak.s@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 14 Jan 2015 23:17:06 +0000 (15:17 -0800)]
ipv4: per cpu uncached list
RAW sockets with hdrinc suffer from contention on rt_uncached_lock
spinlock.
One solution is to use percpu lists, since most routes are destroyed
by the cpu that created them.
It is unclear why we even have to put these routes in uncached_list,
as all outgoing packets should be freed when a device is dismantled.
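The per-CPU idea can be sketched in userspace. This is a simplified Python model, not the ipv4 routing code: the class and method names are hypothetical, routes are plain dictionaries, and the device flush simply drops matching entries, which may differ from what the kernel does with them.

```python
import threading

class PerCpuUncachedList:
    """One list and one lock per CPU instead of a single global pair."""

    def __init__(self, nr_cpus):
        self.lists = [[] for _ in range(nr_cpus)]
        self.locks = [threading.Lock() for _ in range(nr_cpus)]

    def add(self, cpu, route):
        """Fast path: only the creating CPU's lock is taken."""
        with self.locks[cpu]:
            self.lists[cpu].append(route)
        return cpu  # the route remembers its home list for later removal

    def remove(self, home_cpu, route):
        """Usually runs on the creating CPU, so the lock is rarely contended."""
        with self.locks[home_cpu]:
            self.lists[home_cpu] = [r for r in self.lists[home_cpu] if r is not route]

    def flush_dev(self, dev):
        """Rare slow path: walk every CPU's list when a device goes away."""
        flushed = 0
        for cpu, lock in enumerate(self.locks):
            with lock:
                keep = [r for r in self.lists[cpu] if r["dev"] != dev]
                flushed += len(self.lists[cpu]) - len(keep)
                self.lists[cpu] = keep
        return flushed
```

The cost moves from the frequent add/remove path to the infrequent device teardown, which has to visit every per-CPU list.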
Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: caacf05e5ad1 ("ipv4: Properly purge netdev references on uncached routes.") Signed-off-by: David S. Miller <davem@davemloft.net>
In boards, the dm9000 chip's power and reset can be controlled by gpio.
It makes sense to add them to the dm9000 driver and let dt be used to
enable power and reset the phy.
Signed-off-by: Zubair Lutfullah Kakakhel <Zubair.Kakakhel@imgtec.com> Signed-off-by: Paul Burton <paul.burton@imgtec.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
netfilter updates for net-next
The following patchset contains netfilter updates for net-next, just a
bunch of cleanups and small enhancement to selectively flush conntracks
in ctnetlink, more specifically the patches are:
1) Raise default number of buckets in conntrack from 16384 to 65536 in
systems with >= 4 GBytes of memory, patch from Marcelo Leitner.
2) Small refactor to save one level on indentation in xt_osf, from
Joe Perches.
3) Remove unnecessary sizeof(char) in nf_log, from Fabian Frederick.
4) Another small cleanup to remove redundant variable in nfnetlink,
from Duan Jiong.
5) Fix compilation warning in nfnetlink_cthelper on parisc, from
Chen Gang.
6) Fix wrong format in debugging for ctseqadj, from Gao feng.
7) Selective conntrack flushing through the mark for ctnetlink, patch
from Kristian Evensen.
8) Remove nf_ct_conntrack_flush_report() exported symbol now that it is
not required anymore after the selective flushing patch, again from
Kristian.
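Item 7 can be illustrated with a small model of flushing by mark. This is a hedged Python sketch: the function `flush_by_mark`, the entry layout and the mark/mask comparison are assumptions made for illustration, and the actual ctnetlink matching rule is not spelled out in the text above.

```python
def flush_by_mark(entries, mark, mask):
    """Remove entries whose masked mark equals the masked filter mark.
    Returns the number of entries removed; 'entries' is edited in place."""
    kept = [e for e in entries if (e["mark"] & mask) != (mark & mask)]
    removed = len(entries) - len(kept)
    entries[:] = kept
    return removed

conntracks = [
    {"id": 1, "mark": 0x200},
    {"id": 2, "mark": 0x201},
    {"id": 3, "mark": 0x300},
]
removed = flush_by_mark(conntracks, mark=0x200, mask=0xFF00)
```

Entries outside the selected mark survive, which is the point of flushing selectively instead of clearing the whole table.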
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 15 Jan 2015 06:12:13 +0000 (01:12 -0500)]
Merge branch 'vxlan_group_policy_extension'
Thomas Graf says:
====================
VXLAN Group Policy Extension
Implements support for the Group Policy VXLAN extension [0] to provide
a lightweight and simple security label mechanism across network peers
based on VXLAN. The security context and associated metadata are mapped
to/from skb->mark. This allows further mapping to a SELinux context
using SECMARK, to implement ACLs directly with nftables, iptables, OVS,
tc, etc.
The extension is disabled by default and should be run on a distinct
port in mixed Linux VXLAN VTEP environments. Liberal VXLAN VTEPs
which ignore unknown reserved bits will be able to receive VXLAN-GBP
frames.
Simple usage example:
10.1.1.1:
# ip link add vxlan0 type vxlan id 10 remote 10.1.1.2 gbp
# iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200
10.1.1.2:
# ip link add vxlan0 type vxlan id 10 remote 10.1.1.1 gbp
# iptables -I INPUT -m mark --mark 0x200 -j DROP
iproute2 [1] and OVS [2] support will be provided in separate patches.
Thomas Graf [Thu, 15 Jan 2015 02:53:59 +0000 (03:53 +0100)]
openvswitch: Support VXLAN Group Policy extension
Introduces support for the group policy extension to the VXLAN virtual
port. The extension is disabled by default and only enabled if the user
has provided the respective configuration.
The configuration interface to enable the extension is based on a new
attribute OVS_VXLAN_EXT_GBP nested inside OVS_TUNNEL_ATTR_EXTENSION
which can carry additional extensions as needed in the future.
The group policy metadata is stored as binary blob (struct ovs_vxlan_opts)
internally just like Geneve options but transported as nested Netlink
attributes to user space.
Renames the existing TUNNEL_OPTIONS_PRESENT to TUNNEL_GENEVE_OPT with the
binary value kept intact, a new flag TUNNEL_VXLAN_OPT is introduced.
The attributes OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS and existing
OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS are implemented mutually exclusive.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Graf [Thu, 15 Jan 2015 02:53:58 +0000 (03:53 +0100)]
openvswitch: Allow for any level of nesting in flow attributes
nlattr_set() is currently hardcoded to two levels of nesting. This change
introduces struct ovs_len_tbl to define minimal length requirements plus
next level nesting tables to traverse the key attributes to arbitrary depth.
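The idea of a length table that points at the next level's table can be sketched as a recursive walk. This is an illustrative Python model, not the openvswitch code: the table contents, the attribute names and the function `validate` are hypothetical, and attributes are modelled as dictionaries of byte strings.

```python
# Each table maps an attribute name to (minimum length, nested table or None).
# All names below are made up for the example.
OPTS_TBL = {"GBP": (4, None)}
TUNNEL_TBL = {"ID": (8, None), "OPTS": (0, OPTS_TBL)}
KEY_TBL = {"PRIORITY": (4, None), "TUNNEL": (0, TUNNEL_TBL)}

def validate(attrs, table):
    """Check minimum lengths, descending into nested tables to any depth."""
    for name, payload in attrs.items():
        if name not in table:
            return False
        min_len, nested = table[name]
        if nested is not None:
            # Nested attribute: recurse with the next level's table.
            if not isinstance(payload, dict) or not validate(payload, nested):
                return False
        elif len(payload) < min_len:
            return False
    return True

good = {"PRIORITY": bytes(4), "TUNNEL": {"ID": bytes(8), "OPTS": {"GBP": bytes(4)}}}
short = {"TUNNEL": {"OPTS": {"GBP": bytes(2)}}}
```

Adding a further level of nesting only needs another table entry, not another hardcoded loop.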
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Graf [Thu, 15 Jan 2015 02:53:57 +0000 (03:53 +0100)]
openvswitch: Rename GENEVE_TUN_OPTS() to TUN_METADATA_OPTS()
Also factors out Geneve validation code into a new separate function
validate_and_copy_geneve_opts().
A subsequent patch will introduce VXLAN options. Rename the existing
GENEVE_TUN_OPTS() to reflect its extended purpose of carrying generic
tunnel metadata options.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Graf [Thu, 15 Jan 2015 02:53:56 +0000 (03:53 +0100)]
vxlan: Only bind to sockets with compatible flags enabled
A VXLAN net_device looking for an appropriate socket may only consider
a socket which has a matching set of flags/extensions enabled. If
incompatible flags are enabled, return a conflict to have the caller
create a distinct socket with distinct port.
The OVS VXLAN port is kept unaware of extensions at this point.
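The lookup rule can be modelled as a table keyed by port. This is a minimal Python sketch under stated assumptions: the function `bind_socket`, the string results and the port numbers are hypothetical, and flags are represented as a frozenset of extension names.

```python
def bind_socket(sockets, port, flags):
    """Share an existing socket only if its flags match exactly."""
    existing = sockets.get(port)
    if existing is None:
        sockets[port] = flags
        return "created"
    if existing == flags:
        return "shared"
    # Incompatible extensions: the caller must use a distinct port.
    return "conflict"

sockets = {}
plain = frozenset()
gbp = frozenset({"gbp"})

outcomes = [
    bind_socket(sockets, 4789, plain),
    bind_socket(sockets, 4789, plain),
    bind_socket(sockets, 4789, gbp),
    bind_socket(sockets, 4790, gbp),
]
```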
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Graf [Thu, 15 Jan 2015 02:53:55 +0000 (03:53 +0100)]
vxlan: Group Policy extension
Implements support for the Group Policy VXLAN extension [0] to provide
a lightweight and simple security label mechanism across network peers
based on VXLAN. The security context and associated metadata are mapped
to/from skb->mark. This allows further mapping to a SELinux context
using SECMARK, to implement ACLs directly with nftables, iptables, OVS,
tc, etc.
The group membership is defined by the lower 16 bits of skb->mark, the
upper 16 bits are used for flags.
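The 16/16 split of the 32-bit mark can be written out as bit arithmetic. This is a small Python sketch for illustration: the helper names are hypothetical and the flag value in the example is arbitrary, only the bit layout comes from the text above.

```python
GROUP_MASK = 0xFFFF   # lower 16 bits: group membership
FLAGS_SHIFT = 16      # upper 16 bits: flags

def make_mark(flags, group):
    """Pack flags and group into one 32-bit mark value."""
    return ((flags & 0xFFFF) << FLAGS_SHIFT) | (group & GROUP_MASK)

def split_mark(mark):
    """Return (flags, group) from a 32-bit mark value."""
    return (mark >> FLAGS_SHIFT) & 0xFFFF, mark & GROUP_MASK

flags, group = split_mark(0x200)   # a mark with no flags set
packed = make_mark(0x8, 0x200)     # arbitrary example flag value
```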
SELinux allows managing labels to secure local resources. However,
distributed applications require ACLs to be implemented across hosts. This
is typically achieved by matching on L2-L4 fields to identify the
original sending host and process on the receiver. On top of that,
netlabel and specifically CIPSO [1] allow mapping security contexts to
universal labels. However, netlabel and CIPSO are relatively complex.
This patch provides a lightweight alternative for overlay network
environments with a trusted underlay. No additional control protocol
is required.
 Host 1:                      Host 2:
 Group A        Group B        Group B     Group A
 +-----+   +-------------+    +-------+   +-----+
 | lxc |   | SELinux CTX |    | httpd |   | VM  |
 +--+--+   +--+----------+    +---+---+   +--+--+
     \---+---/                     \----+---/
         |                              |
     +---+---+                      +---+---+
     | vxlan |                      | vxlan |
     +---+---+                      +---+---+
         +------------------------------+
Backwards compatibility:
A VXLAN-GBP socket can receive standard VXLAN frames and will assign
the default group 0x0000 to such frames. A Linux VXLAN socket will
drop VXLAN-GBP frames. The extension is therefore disabled by default
and needs to be specifically enabled:
ip link add [...] type vxlan [...] gbp
In a mixed environment with VXLAN and VXLAN-GBP sockets, the GBP socket
must run on a separate port number.
Examples:
iptables:
host1# iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200
host2# iptables -I INPUT -m mark --mark 0x200 -j DROP
1) Don't use uninitialized data in IPVS, from Dan Carpenter.
2) conntrack race fixes from Pablo Neira Ayuso.
3) Fix TX hangs with i40e, from Jesse Brandeburg.
4) Fix budget return from poll calls in dnet and alx, from Eric
Dumazet.
5) Fix bogus "if (unlikely(x) < 0)" test in AF_PACKET, from Christoph
Jaeger.
6) Fix bug introduced by conversion to list_head in TIPC retransmit
code, from Jon Paul Maloy.
7) Don't use GFP_NOIO under spinlock in USB kaweth driver, from Alexey
Khoroshilov.
8) Fix bridge build with INET disabled, from Arnd Bergmann.
9) Fix netlink array overrun for PROBE attributes in openvswitch, from
Thomas Graf.
10) Don't hold spinlock across synchronize_irq() in tg3 driver, from
Prashant Sreedharan.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
tg3: Release tp->lock before invoking synchronize_irq()
tg3: tg3_reset_task() needs to use rtnl_lock to synchronize
tg3: tg3_timer() should grab tp->lock before checking for tp->irq_sync
team: avoid possible underflow of count_pending value for notify_peers and mcast_rejoin
openvswitch: packet messages need their own probe attribtue
i40e: adds FCoE configure option
cxgb4vf: Fix queue allocation for 40G adapter
netdevice: Add missing parentheses in macro
bridge: only provide proxy ARP when CONFIG_INET is enabled
neighbour: fix base_reachable_time(_ms) not effective immediatly when changed
net: fec: fix MDIO bus assignement for dual fec SoC's
xen-netfront: use different locks for Rx and Tx stats
drivers: net: cpsw: fix multicast flush in dual emac mode
cxgb4vf: Initialize mdio_addr before using it
net: Corrected the comment describing the ndo operations to reflect the actual prototype for couple of operations
usb/kaweth: use GFP_ATOMIC under spin_lock in usb_start_wait_urb()
MAINTAINERS: add me as ibmveth maintainer
tipc: fix bug in broadcast retransmit code
update ip-sysctl.txt documentation (v2)
net/at91_ether: prepare and unprepare clock
...
tg3: Release tp->lock before invoking synchronize_irq()
synchronize_irq() can sleep waiting for pending IRQ handlers, so the driver
should release the tp->lock spin lock before invoking synchronize_irq().
Reported-by: Peter Hurley <peter@hurleysoftware.com> Tested-by: Peter Hurley <peter@hurleysoftware.com> Signed-off-by: Prashant Sreedharan <prashant@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tg3: tg3_reset_task() needs to use rtnl_lock to synchronize
Currently tg3_reset_task() uses only tp->lock for synchronizing with code
paths like tg3_open() etc. But since tp->lock is released before doing
synchronize_irq(), rtnl_lock should be taken in tg3_reset_task() to
synchronize it with other code paths.
Reported-by: Peter Hurley <peter@hurleysoftware.com> Tested-by: Peter Hurley <peter@hurleysoftware.com> Signed-off-by: Prashant Sreedharan <prashant@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tg3: tg3_timer() should grab tp->lock before checking for tp->irq_sync
This is to avoid the race between tg3_timer() and the execution paths
which do not invoke tg3_timer_stop() and release tp->lock before
calling synchronize_irq().
Reported-by: Peter Hurley <peter@hurleysoftware.com> Tested-by: Peter Hurley <peter@hurleysoftware.com> Signed-off-by: Prashant Sreedharan <prashant@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 14 Jan 2015 21:54:30 +0000 (10:54 +1300)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Two bugfixes for arm64. I will have another pull request next week,
but otherwise things are calm"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
arm64: KVM: Fix HCR setting for 32bit guests
arm64: KVM: Fix TLB invalidation by IPA/VMID
Jiri Pirko [Wed, 14 Jan 2015 17:15:30 +0000 (18:15 +0100)]
team: avoid possible underflow of count_pending value for notify_peers and mcast_rejoin
This patch fixes a race condition that may set count_pending to -1,
which results in an unwanted large burst of ARP messages
(in the case of "notify peers").
Consider the following scenario:
count_pending == 2
   CPU0                                        CPU1
                                               team_notify_peers_work
                                                 atomic_dec_and_test (dec count_pending to 1)
                                                 schedule_delayed_work
   team_notify_peers
     atomic_add (adding 1 to count_pending)
                                               team_notify_peers_work
                                                 atomic_dec_and_test (dec count_pending to 1)
                                                 schedule_delayed_work
                                               team_notify_peers_work
                                                 atomic_dec_and_test (dec count_pending to 0)
     schedule_delayed_work
                                               team_notify_peers_work
                                                 atomic_dec_and_test (dec count_pending to -1)
Fix this race by using atomic_dec_if_positive, which prevents
count_pending from dropping below 0.
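The difference between the two primitives can be illustrated with a toy, single-threaded model. This is a sketch in Python rather than kernel code: the class PendingCounter and the functions old_work and new_work are hypothetical names, and the methods only mimic the return-value conventions of atomic_dec_and_test (true when the result is 0) and atomic_dec_if_positive (returns the old value minus 1, storing it only when it is not negative). The shape of the fixed work function is an assumption, not a copy of the driver.

```python
class PendingCounter:
    """Toy stand-in for an atomic_t; not actually atomic."""

    def __init__(self, value=0):
        self.value = value

    def dec_and_test(self):
        # Mimics atomic_dec_and_test(): always decrements,
        # returns True only when the result is exactly 0.
        self.value -= 1
        return self.value == 0

    def dec_if_positive(self):
        # Mimics atomic_dec_if_positive(): computes old - 1 and
        # stores it only when the result is not negative.
        result = self.value - 1
        if result >= 0:
            self.value = result
        return result


def old_work(counter):
    """Pre-fix logic: returns True if the work would reschedule itself."""
    return not counter.dec_and_test()


def new_work(counter):
    """Post-fix logic (assumed shape): bail out when nothing is pending."""
    val = counter.dec_if_positive()
    if val < 0:
        return False  # spurious run: nothing sent, nothing rescheduled
    return val > 0    # reschedule only while notifications remain


# A work item that runs when count_pending is already 0:
stale = PendingCounter(0)
old_reschedules = old_work(stale)   # counter underflows to -1
fixed = PendingCounter(0)
new_reschedules = new_work(fixed)   # counter stays at 0
```

In the old logic the counter reaches -1 and the work keeps rescheduling itself, which is where the burst of messages comes from; in the new logic the spurious run is a no-op.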
Fixes: fc423ff00df3a1955441 ("team: add peer notification") Fixes: 492b200efdd20b8fcfd ("team: add support for sending multicast rejoins") Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 14 Jan 2015 21:50:29 +0000 (10:50 +1300)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Martin Schwidefsky:
"Two small performance tweaks, the plumbing for the execveat system
call and a couple of bug fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/uprobes: fix user space PER events
s390/bpf: Fix JMP_JGE_X (A > X) and JMP_JGT_X (A >= X)
s390/bpf: Fix ALU_NEG (A = -A)
s390/mm: avoid using pmd_to_page for !USE_SPLIT_PMD_PTLOCKS
s390/timex: fix get_tod_clock_ext() inline assembly
s390: wire up execveat syscall
s390/kernel: use stnsm 255 instead of stosm 0
s390/vtime: Get rid of redundant WARN_ON
s390/zcrypt: kernel oops at insmod of the z90crypt device driver
Thomas Graf [Wed, 14 Jan 2015 13:56:19 +0000 (13:56 +0000)]
openvswitch: packet messages need their own probe attribute
User space is currently sending an OVS_FLOW_ATTR_PROBE for both flow
and packet messages. This leads to an out-of-bounds access in
ovs_packet_cmd_execute() because OVS_FLOW_ATTR_PROBE >
OVS_PACKET_ATTR_MAX.
Introduce a new OVS_PACKET_ATTR_PROBE with the same numeric value
as OVS_FLOW_ATTR_PROBE to grow the range of accepted packet attributes
while remaining binary compatible with existing OVS binaries.
Fixes: 05da589 ("openvswitch: Add support for OVS_FLOW_ATTR_PROBE.") Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Tracked-down-by: Florian Westphal <fw@strlen.de> Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Jesse Gross <jesse@nicira.com> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vasu Dev [Wed, 14 Jan 2015 13:14:07 +0000 (05:14 -0800)]
i40e: adds FCoE configure option
Adds the FCoE config option I40E_FCOE, so that FCoE can be enabled
as needed but is otherwise disabled by default.
This also eliminates multiple FCoE config checks; there is now just
one config check for CONFIG_I40E_FCOE.
I40E FCoE support was added in the 3.17 kernel, so this patch
should also be applied to the stable 3.17 kernel.
CC: <stable@vger.kernel.org> Signed-off-by: Vasu Dev <vasu.dev@intel.com> Tested-by: Jim Young <jamesx.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 14 Jan 2015 21:38:07 +0000 (10:38 +1300)]
Merge tag 'locks-v3.19-1' of git://git.samba.org/jlayton/linux
Pull file locking fix from Jeff Layton:
"Just a simple bugfix for a regression that I introduced into v3.18
with the internal lease API overhaul -- mea culpa. Kudos to Linda and
Neil for tracking this down and fixing it"
* tag 'locks-v3.19-1' of git://git.samba.org/jlayton/linux:
locks: fix NULL-deref in generic_delete_lease
Linus Torvalds [Wed, 14 Jan 2015 21:27:56 +0000 (10:27 +1300)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
"The major part is an update to the NVMe driver, fixing various issues
around surprise removal and hung controllers. Most of that is from
Keith, and parts are simple blk-mq fixes or exports/additions of minor
functions to aid this effort, and parts are changes directly to the
NVMe driver.
Apart from the above, this contains:
- Small blk-mq change from me, killing an unused member of the
hardware queue structure.
- Small fix from Ming Lei, fixing up a few drivers that didn't
properly check for ERR_PTR() returns from blk_mq_init_queue()"
* 'for-linus' of git://git.kernel.dk/linux-block:
NVMe: Fix locking on abort handling
NVMe: Start and stop h/w queues on reset
NVMe: Command abort handling fixes
NVMe: Admin queue removal handling
NVMe: Reference count admin queue usage
NVMe: Start all requests
blk-mq: End unstarted requests on a dying queue
blk-mq: Allow requests to never expire
blk-mq: Add helper to abort requeued requests
blk-mq: Let drivers cancel requeue_work
blk-mq: Export if requests were started
blk-mq: Wake tasks entering queue on dying
blk-mq: get rid of ->cmd_size in the hardware queue
block: fix checking return value of blk_mq_init_queue
block: wake up waiters when a queue is marked dying
NVMe: Fix double free irq
blk-mq: Export freeze/unfreeze functions
blk-mq: Exit queue on alloc failure
David S. Miller [Wed, 14 Jan 2015 20:20:11 +0000 (15:20 -0500)]
Merge branch 'vxlan_rco'
Tom Herbert says:
====================
net: Remote checksum offload for VXLAN
This patch set adds support for remote checksum offload in VXLAN.
The remote checksum offload is generalized by creating a common
function (remcsum_adjust) that does the work of modifying the
checksum in remote checksum offload. This function can be called
from normal or GRO path. GUE was modified to use this function.
To support RCO in VXLAN we use the 9th bit in the reserved
flags to indicate remote checksum offload. The start and offset
values are encoded in a compressed form in the low order (reserved)
byte of the vni field.
Remote checksum offload is described in
https://tools.ietf.org/html/draft-herbert-remotecsumoffload-01
Changes in v2:
- Add udp_offload_callbacks which has GRO functions that take a
udp_offload pointer argument. This argument can be used to retrieve
a per port structure of the encapsulation for use in gro processing
(mostly by doing container_of on the structure).
- Use the 10th bit in VXLAN flags for RCO which does not seem to
conflict with other proposals at this time (i.e. VXLAN-GPE and
VXLAN-GBP)
- Require that RCO must be explicitly enabled on the receiver
as well as the sender.
Tested by running 200 TCP_STREAM connections with VXLAN (over IPv4).
With UDP checksums and Remote Checksum Offload
  IPv4
      Client
        11.84% CPU utilization
      Server
        12.96% CPU utilization
      9197 Mbps
  IPv6
      Client
        12.46% CPU utilization
      Server
        14.48% CPU utilization
      8963 Mbps
With UDP checksums, no remote checksum offload
  IPv4
      Client
        15.67% CPU utilization
      Server
        14.83% CPU utilization
      9094 Mbps
  IPv6
      Client
        16.21% CPU utilization
      Server
        14.32% CPU utilization
      9058 Mbps
No UDP checksums
  IPv4
      Client
        15.03% CPU utilization
      Server
        23.09% CPU utilization
      9089 Mbps
  IPv6
      Client
        16.18% CPU utilization
      Server
        26.57% CPU utilization
      8954 Mbps
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 13 Jan 2015 01:00:38 +0000 (17:00 -0800)]
vxlan: Remote checksum offload
Add support for remote checksum offload in VXLAN. This uses a
reserved bit to indicate that RCO is being done, and uses the low order
reserved eight bits of the VNI to hold the start and offset values in a
compressed manner.
Start is encoded in the low order seven bits of VNI. This is start >> 1
so that the checksum start offset is 0-254 using even values only.
Checksum offset (transport checksum field) is indicated in the high
order bit in the low order byte of the VNI. If the bit is set, the
checksum field is for UDP (so offset = start + 6), else checksum
field is for TCP (so offset = start + 16). Only TCP and UDP are
supported in this implementation.
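The encoding described above can be sketched as a pair of helper functions. This is an illustrative model in Python, not the driver code: the constant and function names are hypothetical, and only the bit layout (start >> 1 in the low seven bits, a UDP/TCP flag in the high bit of the low byte, checksum field at start + 6 or start + 16) is taken from the description.

```python
# Hypothetical constant names; only the bit layout comes from the text above.
RCO_START_MASK = 0x7F   # low seven bits: start >> 1
RCO_UDP_FLAG = 0x80     # high bit of the low byte: checksum field is UDP
UDP_CSUM_OFFSET = 6     # offset of the checksum field in a UDP header
TCP_CSUM_OFFSET = 16    # offset of the checksum field in a TCP header


def encode_rco(start, is_udp):
    """Pack start and the protocol flag into the low byte of the VNI field."""
    if not 0 <= start <= 254 or start % 2:
        raise ValueError("start must be an even value in 0..254")
    byte = (start >> 1) & RCO_START_MASK
    if is_udp:
        byte |= RCO_UDP_FLAG
    return byte


def decode_rco(byte):
    """Recover (start, offset) from the low byte of the VNI field."""
    start = (byte & RCO_START_MASK) << 1
    if byte & RCO_UDP_FLAG:
        offset = start + UDP_CSUM_OFFSET
    else:
        offset = start + TCP_CSUM_OFFSET
    return start, offset
```

For example, a UDP inner packet whose checksum starts at byte 40 would encode to 0x94 and decode back to start 40 with the checksum field at offset 46. Odd start values cannot be represented, which is why the sketch rejects them.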
Remote checksum offload for VXLAN is described in:
Tested by running 200 TCP_STREAM connections with VXLAN (over IPv4).
With UDP checksums and Remote Checksum Offload
  IPv4
      Client
        11.84% CPU utilization
      Server
        12.96% CPU utilization
      9197 Mbps
  IPv6
      Client
        12.46% CPU utilization
      Server
        14.48% CPU utilization
      8963 Mbps
With UDP checksums, no remote checksum offload
  IPv4
      Client
        15.67% CPU utilization
      Server
        14.83% CPU utilization
      9094 Mbps
  IPv6
      Client
        16.21% CPU utilization
      Server
        14.32% CPU utilization
      9058 Mbps
No UDP checksums
  IPv4
      Client
        15.03% CPU utilization
      Server
        23.09% CPU utilization
      9089 Mbps
  IPv6
      Client
        16.18% CPU utilization
      Server
        26.57% CPU utilization
      8954 Mbps
Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 13 Jan 2015 01:00:37 +0000 (17:00 -0800)]
udp: pass udp_offload struct to UDP gro callbacks
This patch introduces udp_offload_callbacks, which has the same
GRO functions (but not a GSO function) as offload_callbacks,
except that a pointer to a udp_offload struct is passed to the
gro_receive and gro_complete functions. This additional argument
can be used to retrieve the per-port structure of the encapsulation
for use in GRO processing (mostly by doing container_of on the
structure).
Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Tue, 13 Jan 2015 14:10:27 +0000 (15:10 +0100)]
bridge: only provide proxy ARP when CONFIG_INET is enabled
When IPV4 support is disabled, we cannot call arp_send from
the bridge code, which would result in a kernel link error:
net/built-in.o: In function `br_handle_frame_finish':
:(.text+0x59914): undefined reference to `arp_send'
:(.text+0x59a50): undefined reference to `arp_tbl'
This makes the newly added proxy ARP support in the bridge
code depend on the CONFIG_INET symbol and lets the compiler
optimize the code out to avoid the link error.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 958501163ddd ("bridge: Add support for IEEE 802.11 Proxy ARP") Cc: Kyeyoon Park <kyeyoonp@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Luciano Coelho [Fri, 9 Jan 2015 12:06:37 +0000 (14:06 +0200)]
nl80211: send netdetect configuration info in NL80211_CMD_GET_WOWLAN
Send the netdetect configuration information in the response to
NL80211_CMD_GET_WOWLAN commands. This includes the scan interval,
SSIDs to match and frequencies to scan.
Additionally, add the NL80211_WOWLAN_TRIG_NET_DETECT with
NL80211_ATTR_WOWLAN_TRIGGERS_SUPPORTED.
Signed-off-by: Luciano Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Arik Nemtsov [Wed, 7 Jan 2015 14:47:20 +0000 (16:47 +0200)]
cfg80211: avoid reg-hints in self-managed only systems
When a system contains only self-managed regulatory devices, all hints
from the regulatory core are ignored. Stop hint processing early in this
case. These systems usually don't have CRDA deployed, which results in
endless (irrelevant) logs of the form:
cfg80211: Calling CRDA to update world regulatory domain
Make sure there's at least one self-managed device before discarding a
hint, in order to prevent initial hints from disappearing on CRDA
managed systems.
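The check described above can be sketched as follows. This is a minimal Python model, assuming each registered device is reduced to a single boolean; the function name only_self_managed and that representation are hypothetical and do not reflect the cfg80211 data structures.

```python
def only_self_managed(devices):
    """Return True when regulatory hints can be discarded.

    `devices` is a list of booleans, one per registered device,
    True meaning the device manages its own regulatory domain.
    """
    self_managed_found = False
    for is_self_managed in devices:
        if not is_self_managed:
            return False  # a CRDA-managed device still needs hints
        self_managed_found = True
    # An empty list must not discard hints: this keeps initial hints
    # alive before any device has registered.
    return self_managed_found
```

The explicit self_managed_found flag is what distinguishes "no devices yet" from "only self-managed devices"; a plain all() over the list would return True for an empty list and drop the initial hints.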
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Arik Nemtsov [Wed, 7 Jan 2015 14:47:19 +0000 (16:47 +0200)]
cfg80211: introduce sync regdom set API for self-managed
A self-managed device will sometimes need to set its regdomain synchronously.
Notably it should be set before usermode has a chance to query it. Expose
a new API to accomplish this which requires the RTNL.
Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com> Reviewed-by: Ilan Peer <ilan.peer@intel.com> Reviewed-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Eliad Peller [Wed, 7 Jan 2015 15:50:09 +0000 (17:50 +0200)]
mac80211: remove local->radar_detect_enabled
local->radar_detect_enabled should tell whether
radar_detect is enabled on any interface belonging
to local.
However, it's not getting updated correctly
in many cases (actually, when testing with hwsim
it's never been set, even when the dfs master
is beaconing).
Instead of handling all the corner cases
(e.g. channel switch), simply check whether
radar detection is enabled only when needed,
instead of caching the result.
Signed-off-by: Eliad Peller <eliad@wizery.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Arik Nemtsov [Wed, 7 Jan 2015 14:45:07 +0000 (16:45 +0200)]
mac80211: add TDLS supported channels correctly
The function adding the supported channels IE during a TDLS connection had
several issues:
1. If the entire subband is usable, the function exited the loop without
adding it.
2. The function only checked chandef_usable, ignoring flags like RADAR
which would prevent TDLS off-channel communication.
3. HT20 was explicitly required in the chandef, while not a requirement
for TDLS off-channel.
When roaming / suspending, it makes no sense to wait until
the transmit queues of the device are empty. In extreme
conditions they can be starved (VO saturating the air), but
even in regular cases, it is pointless to delay the roaming
because the low level driver is trying to send packets to
an AP which is far away. We'd rather drop these packets and
let TCP retransmit if needed. This speeds up
roaming.
For suspend, the explanation is even more trivial.
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
v13:
- Fix the alignment of function parameters and a checkpatch warning.
v12:
- Following Alex's suggestion, modify the changelog and add MODULE_DEVICE_TABLE
for hip04 ethernet.
v11:
- Add ethtool support for getting and setting tx coalesce; xmit_more
is not supported by this patch, but I think it could work for hip04 and
will support it later after some performance tests.
Here are some performance test results from ping and iperf (with tx_coalesce_frames/usecs added);
performance and latency look better with tx_coalesce_frames/usecs.
v10:
- Following Arnd's suggestion, remove the skb_orphan and use the hrtimer
for the cleanup of the TX queue and add some modifications for the hip04
drivers.
1) drop the broken skb_orphan call
2) drop the workqueue
3) batch cleanup based on tx_coalesce_frames/usecs for better throughput
4) use a reasonable default tx timeout (200us, could be shortened
based on measurements) with a range timer
5) fix napi poll function return value
6) use a lockless queue for cleanup
v9:
- There are no tx completion interrupts to free DMAed Tx packets, which means that
we rely on new tx packets arriving to run the destructors of completed packets,
which opens up space in their sockets' send queues. Sometimes no such
new packets arrive, causing Tx to stall; a single UDP transmitter is a good example of
this situation. So we need a cleanup workqueue to reclaim completed packets;
the workqueue will only free the last packets, which have already been pending for several jiffies.
Also includes some formatting cleanups.
v8:
- Use poll to reclaim xmitted buffer as workaround since no tx done interrupt
v7:
- Remove select NET_CORE in 0002
v6:
- Suggested by Russell: use netdev_sent_queue & netdev_completed_queue to solve the latency issue.
Also shorten the period of the timer, which is used to wake up the queue since there is no
tx completed interrupt.
v5:
- no big change, fix typo
v4:
- Modify according to the suggestions from Arnd, Florian, Eric, David
Use of_parse_phandle_with_fixed_args & syscon_node_to_regmap get ppe info
Add skb_orphan() and tx_timer for reclaim since no tx_finished interrupt
Update timeout, and move of_phy_connect to probe to reuse open/stop
v3:
- Suggestion from Arnd: use syscon & regmap_write/read to replace static void __iomem *ppebase.
Modify hisilicon-hip04-net.txt according to suggestions from Florian and Sergei.
v2:
- Got many suggestions from Russell, Arnd, Florian, Mark and Sergei
Remove memcpy, use dma_map/unmap_single, use dma_alloc_coherent rather than dma_pool, etc.
Refer property in ethernet.txt, change ppe description, etc.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
dingtianhong [Wed, 14 Jan 2015 06:34:14 +0000 (14:34 +0800)]
net: hisilicon: new hip04 ethernet driver
Support Hisilicon hip04 ethernet driver, including 100M / 1000M controller.
The controller has no tx done interrupt, reclaim xmitted buffer in the poll.
v13: Fix the alignment of function parameters and a checkpatch warning.
v12: Following Alex's suggestion, modify the changelog and add MODULE_DEVICE_TABLE
for hip04 ethernet.
v11: Add ethtool support for getting and setting tx coalesce; xmit_more
is not supported by this patch, but I think it could work for hip04 and
will support it later after some performance tests.
Here are some performance test results from ping and iperf (with tx_coalesce_frames/usecs added);
performance and latency look better with tx_coalesce_frames/usecs.
v10: Following David Miller's and Arnd Bergmann's suggestions, make some modifications
to the v9 version
- drop the workqueue
- batch cleanup based on tx_coalesce_frames/usecs for better throughput
- use a reasonable default tx timeout (200us, could be shortened
based on measurements) with a range timer
- fix napi poll function return value
- use a lockless queue for cleanup
Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Zhangfei Gao [Wed, 14 Jan 2015 06:34:12 +0000 (14:34 +0800)]
Documentation: add Device tree bindings for Hisilicon hip04 ethernet
This patch adds the Device Tree bindings for the Hisilicon hip04
Ethernet controller, including 100M / 1000M controller.
Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
neighbour: fix base_reachable_time(_ms) not effective immediately when changed
When setting base_reachable_time or base_reachable_time_ms on a
specific interface through sysctl or netlink, the reachable_time
value is not updated.
This means that neighbour entries will continue to be updated using the
old value until it is recomputed in neigh_period_work (which
recomputes the value every 300*HZ).
On systems with HZ equal to 1000, for instance, it takes 5 minutes before
the change is effective.
This patch changes this behavior by recomputing reachable_time after
each set on base_reachable_time or base_reachable_time_ms.
The new value will become effective the next time the neighbour's timer
is triggered.
Changes are made in two places: the netlink code for set and the sysctl
handling code. For sysctl, I use a proc_handler. The ipv6 network
code does provide its own handler but it already refreshes
reachable_time correctly so it's not an issue.
Any other user of neighbour which provides its own handlers must
refresh reachable_time.
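The behavioural change can be sketched with a small model. This is a Python illustration, not the kernel code: the NeighParms class and its setter are hypothetical, and the randomisation rule (a value drawn from base/2 up to, but not including, 3*base/2) is stated here as an assumption about how reachable_time is derived from the base value.

```python
import random


def rand_reach_time(base):
    """Pick a value in [base/2, 3*base/2); 0 when base is 0 (assumed rule)."""
    if base == 0:
        return 0
    return random.randrange(base) + (base >> 1)


class NeighParms:
    """Hypothetical stand-in for per-interface neighbour parameters."""

    def __init__(self, base_reachable_time):
        self.base_reachable_time = base_reachable_time
        self.reachable_time = rand_reach_time(base_reachable_time)

    def set_base_reachable_time(self, value):
        # Recompute immediately instead of waiting for the periodic
        # work, so the next timer run already uses the new base.
        self.base_reachable_time = value
        self.reachable_time = rand_reach_time(value)
```

Without the recomputation in the setter, reachable_time in this model would keep its old value until some periodic refresh ran, which mirrors the delay described above.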
Signed-off-by: Jean-Francois Remy <jeff@melix.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Agner [Tue, 13 Jan 2015 23:20:21 +0000 (00:20 +0100)]
net: fec: fix MDIO bus assignment for dual fec SoCs
On i.MX28, the MDIO bus is shared between the two FEC instances.
The driver makes sure that the second FEC uses the MDIO bus of the
first FEC. This is done conditionally if FEC_QUIRK_ENET_MAC is set.
However, in newer designs, such as Vybrid or i.MX6SX, each FEC MAC
has its own MDIO bus. Simply removing the quirk FEC_QUIRK_ENET_MAC
is not an option since other logic, triggered by this quirk, is
still needed.
Furthermore, there are board designs which use the same MDIO bus
for both PHYs even though the second bus would be available on the
SoC side. Such layouts are popular since they save pins on the SoC side.
Due to the above quirk, those boards currently do work fine. The
boards in the mainline tree with such a layout are:
- Freescale Vybrid Tower with TWR-SER2 (vf610-twr.dts)
- Freescale i.MX6 SoloX SDB Board (imx6sx-sdb.dts)
This patch adds a new quirk FEC_QUIRK_SINGLE_MDIO for i.MX28, which
makes sure that the MDIO bus of the first FEC is used in any case.
However, the boards above do have an SoC with an MDIO bus for each FEC
instance. But the PHYs are not connected in a 1:1 configuration. A
proper device tree description is needed to allow the driver to
figure out where to find its PHY. This patch fixes that shortcoming
by adding an MDIO bus child node to the first FEC instance, along
with the two PHYs on that bus, and making use of the phy-handle
property to add a reference to the PHYs.
Acked-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Stefan Agner <stefan@agner.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Xander Huff [Tue, 13 Jan 2015 22:15:51 +0000 (16:15 -0600)]
net/macb: improved ethtool statistics support
Currently `ethtool -S` simply returns "no stats available". It
would be more useful to see what the various ethtool statistics
registers' values are. This change implements get_ethtool_stats,
get_strings, and get_sset_count functions to accomplish this.
Read all GEM statistics registers and sum them into
macb.ethtool_stats. Add the necessary infrastructure to make this
accessible via `ethtool -S`.
Update gem_update_stats to utilize ethtool_stats.
Signed-off-by: Xander Huff <xander.huff@ni.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xander Huff [Tue, 13 Jan 2015 22:15:50 +0000 (16:15 -0600)]
net/macb: Adding comments to various #defs to make interpretation easier
This change helps improve at-a-glance knowledge of the purpose of the
various Cadence MACB/GEM registers. Comments are more helpful for human
readability than short acronyms.
Describe the various #define variables for Cadence MACB/GEM registers as documented
in Xilinx's "Zynq-7000 All Programmable SoC Technical Reference Manual, v1.9.1
(UG-585)".
Signed-off-by: Xander Huff <xander.huff@ni.com> Signed-off-by: David S. Miller <davem@davemloft.net>