Mitch Williams [Thu, 3 Sep 2015 21:19:01 +0000 (17:19 -0400)]
i40evf: speed up init
Shorten up the delays in the init task, allowing the VF driver to
initialize faster. This aids performance in load/unload tests and
mitigates DMAR errors in VF enable/disable tests with absurdly short
delays. In the real world, the VF driver will come up more quickly.
The original values were set conservatively based on what we expected
from the firmware in terms of performance. Now that the driver is in use
and we know how well firmware responds to our requests, we can shorten
these delays.
Change-ID: Ibead77d34b19e8170e667c3f58bc14748bbc5bc9 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Thu, 3 Sep 2015 21:19:00 +0000 (17:19 -0400)]
i40e: remove unnecessary string copy operations
Save a little stack space and remove unnecessary strncpy() with a little
string pointer.
Change-ID: Id2719d34710bfc273d3bb445fec085cd04276e88 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Kevin Scott [Thu, 3 Sep 2015 21:18:58 +0000 (17:18 -0400)]
i40e: Store off PHY capabilities
Store off reported PHY capabilities in link_info structure.
Change-ID: Ife0f037c26983ca985dbf79abf33f8f8791369e8 Signed-off-by: Kevin Scott <kevin.c.scott@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Thu, 3 Sep 2015 21:18:57 +0000 (17:18 -0400)]
i40e/i40evf: remove redundant declarations of a variable and a function
Remove a variable declaration inside an if block hiding an existing
declaration at the start of the function.
Also remove a forward function declaration that is no longer needed due
to code re-organization.
Change-ID: I12954668b722718074949c93d74cd20eaacd93e4 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Thu, 3 Sep 2015 21:18:56 +0000 (17:18 -0400)]
i40e: remove FD atr control from debugfs
Since the flow-director-atr priv flag was added to our ethtool interface,
we don't need the on/off control in debugfs.
Change-ID: Ib3b599916434ab30ccd40074e71d7a81609b5bb5 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Thu, 3 Sep 2015 21:18:55 +0000 (17:18 -0400)]
i40e: allow FD SB if MFP mode only has 1 partition
Even though the device might be in MFP mode, if there's only one partition
enabled, then we still have plenty of interrupts for managing the Flow
Directory Sideband activity. This patch enables FD SB in this case.
This patch also reverses the sense of the conditional in order to remove
the negative logic.
Change-ID: I9edf211a6219fc8d159b4be9964f9fd7f4e00bc0 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mitch Williams [Thu, 3 Sep 2015 21:18:54 +0000 (17:18 -0400)]
i40e: remove obsolete version check
This version check only applies to very, very old firmware,
that only ran on A0 hardware, which we never shipped and don't
support in this driver anyway. Remove it, before somebody
gets hurt.
Change-ID: I3752d090ff488acf98ee76b075af961e9c968ee4 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
i40e: Change some messages from info to debug only
There are several error messages that have been printing when there is
no functional issue. These messages should be available at debug message
level only.
Change-ID: Id91e47bf942c483563995f30d8705fa53acd5aa3 Signed-off-by: Neerav Parikh <neerav.parikh@intel.com> Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Some customers wish to be able to control our hardware specific
feature called flow director, at runtime. This patch enables
ethtool priv flags to control this driver/hardware specific feature.
ethtool --set-priv-flags ethX flow-director-atr off
NOTE: the ethtool ntuple interface controls the flow-director
sideband rules.
Change-ID: Iba156350b07fa2ce66f53ded51739f9a3781fe0e Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Wed, 26 Aug 2015 21:10:22 +0000 (14:10 -0700)]
ixgbe: Fix CS4227-related semaphore error on reset failure
If the reset never completes, it is necessary to retake the
semaphore before returning, because the caller will release
the semaphore.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Tested-by: Darin Miller <darin.j.miller@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Emil Tantilov [Tue, 25 Aug 2015 01:08:31 +0000 (18:08 -0700)]
ixgbe: disable LRO by default
This patch disables LRO by default in favor of GRO.
LRO is incompatible with forwarding and is disabled when forwarding
is turned on which makes the default offloads of the driver
inconsistent. LRO can still be enabled via ethtool.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Darin Miller <darin.j.miller@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David S. Miller [Thu, 15 Oct 2015 02:14:50 +0000 (19:14 -0700)]
Merge branch 'mlx-next'
Or Gerlitz says:
====================
Mellanox driver update, Oct 14 2015
This series contains two more patches from Eli, patch from Majd
to support PCI error handlers and a fix from Jack to mlx4 VFs
when probed without a provisioned mac address.
The patch set applied on top of net-next commit bbb300e "Merge branch 'bridge-vlan'"
changes from V0:
- made the health flag int --> bool to address comment from Dave on patch #1
- fixed sparse warning noted by the 0-day build tests in patch #2
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jack Morgenstein [Wed, 14 Oct 2015 14:43:48 +0000 (17:43 +0300)]
net/mlx4_core: Replace VF zero mac with random mac in mlx4_core
By design, when no default MAC addresses are set in the Hypervisor for VFs,
the VFs are passed zero-macs. When such a MAC is received by the VF, it
generates a random MAC address and registers that MAC address
with the Hypervisor.
This random mac generation is currently done in the mlx4_en module.
There is a problem, though, if the mlx4_ib module is loaded by a VF before
the mlx4_en module. In this case, for RoCE, mlx4_ib will see the un-replaced
zero-mac and register that zero-mac as part of QP1 initialization.
Having a zero-mac in the port's MAC table creates problems for a
Baseboard Management Console. The BMC occasionally sends packets with a
zero-mac destination MAC. If there is a zero-mac present in the port's
MAC table, the FW will send such BMC packets to the host driver rather than
to the wire, and BMC will stop working.
To address this problem, we move the replacement of zero-mac addresses
with random-mac addresses to procedure mlx4_slave_cap(), which is part of the
driver startup for VFs, and is before activation of mlx4_ib and mlx4_en.
As a result, zero-mac addresses will never be registered in the port MAC table
by the driver.
In addition, when mlx4_en does initialize the net device, it needs to set
the NET_ADDR_RANDOM flag in the netdev structure if the address was
randomly generated. This is done so that udev on the VM does not create
a new device name after each VF probe (VM boot and such). To accomplish this,
we add a per-port flag in mlx4_dev which gets set whenever mlx4_core replaces
a zero-mac with a randomly-generated mac. This flag is examined when mlx4_en
initializes the net-device.
Fix was suggested by Matan Barak <matanb@mellanox.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Majd Dibbiny [Wed, 14 Oct 2015 14:43:46 +0000 (17:43 +0300)]
net/mlx5_core: Add pci error handlers to mlx5_core driver
This patch implement the pci_error_handlers for mlx5_core which allow the
driver to recover from PCI error.
Once an error is detected in the PCI, the mlx5_pci_err_detected is called
and it:
1) Marks the device to be in 'Internal Error' state.
2) Dispatches an event to the mlx5_ib to flush all the outstanding cqes
with error.
3) Returns all the on going commands with error.
4) Unloads the driver.
Afterwards, the FW is reset and mlx5_pci_slot_reset is called and it
enables the device and restore it's pci state.
If the later succeeds, mlx5_pci_resume is called, and it loads the SW
stack.
Signed-off-by: Majd Dibbiny <majd@mellanox.com> Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The detection of a fatal condition has been updated to take into account
the state reported by the device or by detecting an all ones read of the
firmware version which indicates that the device is not accessible.
Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 14 Oct 2015 13:16:49 +0000 (06:16 -0700)]
tcp: avoid spurious SYN flood detection at listen() time
At listen() time, there is a small window where listener is visible with
a zero backlog, triggering a spurious "Possible SYN flooding on port"
message.
Nothing prevents us from setting the correct backlog.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 14 Oct 2015 12:58:38 +0000 (05:58 -0700)]
tcp/dccp: fix potential NULL deref in __inet_inherit_port()
As we no longer hold listener lock in fast path, it is possible that a
child is created right after listener freed its bound port, if a close()
is done while incoming packets are processed.
__inet_inherit_port() must detect this and return an error,
so that caller can free the child earlier.
Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets") Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
lipeng [Wed, 14 Oct 2015 02:28:57 +0000 (10:28 +0800)]
net: hisilicon net: fix a bug about led
this patch fixes a bug in hns driver. the link led is on at the beginning,
but at this time the ethernet port is on down status. it needs to reset
the led status on init sequence.
Signed-off-by: lipeng <lipeng321@huawei.com> Signed-off-by: yankejian <yankejian@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
this is a pull request of 4 patches for net-next/master.
Two patches are by Gerhard Bertelsmann, fixing some problems in the
sun4i driver. The patch by Arnd Bergmann stops using timeval for the
CAN broadcast manager. The last patch by Alexandre Belloni removes the
otherwise unused struct at91_can_data from the driver.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
yankejian [Tue, 13 Oct 2015 01:53:45 +0000 (09:53 +0800)]
net: hisilicon: supports promisc mode
this patch adds support to set promisc mode. it configs the queue on
init seq when it is on promisc mode.and being enabled or disabled promisc
mode by upper level user.
Signed-off-by: yankejian <yankejian@huawei.com> Signed-off-by: Yisen Zhuang <yisen.zhuang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Wed, 14 Oct 2015 12:25:53 +0000 (14:25 +0200)]
Revert "ipv4/icmp: redirect messages can use the ingress daddr as source"
Revert the commit e2ca690b657f ("ipv4/icmp: redirect messages
can use the ingress daddr as source"), which tried to introduce a more
suitable behaviour for ICMP redirect messages generated by VRRP routers.
However RFC 5798 section 8.1.1 states:
The IPv4 source address of an ICMP redirect should be the address
that the end-host used when making its next-hop routing decision.
while said commit used the generating packet destination
address, which do not match the above and in most cases leads to
no redirect packets to be generated.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 14 Oct 2015 12:53:48 +0000 (05:53 -0700)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2015-10-13
This series contains updates to i40e, i40evf, ixgbe and fm10k.
Carolyn cleans up ndo_bridge_getlink() by flagging a parameter as
__always_unused, since it is never used. Adds a member to the nvm_info
struct to store OEM version info to be output either by OID or ethtool.
Neerav cleans up a remaining bit shift to use BIT() macro.
Mitch fixes the i40evf driver to properly handle calls to its
ndo_set_mac_address() method. It did not properly check to see if the
override would be allowed by the PF driver, and it never removed the old
address from its filter list. Cleaned up the use of
i40e_enable_vf_mappings() in i40e_alloc_vfs(), since it is just redundant
since we already call it by i40e_reset_vf(). Fixed a possible panic
in some circumstances where the firmware may fail to allocate a VSI for
a VF by checking the return value from i40e_alloc_vf_res() and don't
try to configure the device further if it failed.
Greg fixes the parsing of CEE App TLVs so the caller does not have to
consider whether the App came from a CEE or IEEE DCBx negotiation.
Shannon moves the device ids into a standalone file due to the desire
to write user-land drivers (and other requests) without needing the rest
of the include files.
Catherine adds the ability to save the module information from
get_phy_capabilities() to be used to determine which speeds the module
supports. Also cleaned up the PHY structure by removing unused members
and add the ability to store the PHY capabilities reported by the
firmware.
Emil modifies ixgbe to ensure that flow control packets initiated by the
VF are dropped and reported as spoofed.
Jacob cleans up the fm10k driver to avoid buffer overflow by using
sprintf(), so convert to using snprintf(). Also fixed the use of an
enum as a boolean, so check for the actual value of NETREG_UNINITIALIZED
in case it ever changes from the current value of zero.
v2: Dropped patch 11 of the original series, which added functions that
were never used.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jacob Keller [Tue, 25 Aug 2015 00:02:00 +0000 (17:02 -0700)]
fm10k: do not use enum as boolean
Check for actual value NETREG_UNINITIALIZED in case it ever changes from
the current value of zero.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Bruce Allan <bruce.w.allan@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jacob Keller [Tue, 25 Aug 2015 00:01:58 +0000 (17:01 -0700)]
fm10k: use snprintf() instead of sprintf() to avoid buffer overflow
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Bruce Allan <bruce.w.allan@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Emil Tantilov [Thu, 20 Aug 2015 22:31:20 +0000 (15:31 -0700)]
ixgbe: add flow control ethertype to the anti-spoofing filter
This patch makes sure that flow control packets initiated by the VF are
dropped and reported as spoofed.
Flow control packets can be used to limit the throughput or as DOS
attack when generated from a VF. Flow control is not supported per VF
hence any pause frames generated from a VF are considered malicious.
Also cleaned up indentation and some redundant comments.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
i40e/i40evf: Refactor PHY structure and add phy_capabilities enum
Remove unused members in the PHY structure and add a new member to store
all the capabilities the PHY has as reported by the FW. This information
will help us determine what speeds the device is capable of when link is
down.
Also add an enum to decode the PHY types the NVM is capable of.
Use the phy_types variable to determine what phy types are possible
when link is down instead of device id as it will be more accurate.
When on a backplane device, we do not support changing any settings,
however we should display all the phy_types we are capable of so if we
see a backplane dev ID set supported and advertised purely based on
the phy_types variable.
Change-ID: Ia75d560f1fcd30c54cbfb7458690c5867559a930 Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
i40e/i40evf: Add module_types and update_link_info
Add a module_types variable to the link_info struct to save the module
information from get_phy_capabilities. This information can be used to
determine which speeds the module supports.
Also add a new function update_link_info which updates the module_types
parameter and then calls get_link_info. This function should be called
in place of get_link_info so that the module_types variable stays
up-to-date with the rest of the link information.
The EAS table does not reflect the values that are actually returned,
so instead, basing these values on the Ethernet compliance codes
specified in table 33 of SFF-8436 as these have been accurate.
Use the new variable in ethtool to differentiate between a 10G/1G dual
speed fiber module and a 10G only module.
Change-ID: Ib7585cce321319c10ce15180054c41a6cbd41389 Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Mon, 31 Aug 2015 23:54:50 +0000 (19:54 -0400)]
i40e/i40evf: split device ids into a separate file
Due to desires to write userland drivers, and other requests, without
needing the rest of the include files, the device ids are pulled out
into a standalone file.
Change-ID: Ic0b047dbf9d4b0891892309c1f2079f56d9b60e8 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Carolyn Wyborny [Mon, 31 Aug 2015 23:54:49 +0000 (19:54 -0400)]
i40e: update fw version text string per previous product formats
This patch moves the internal fw version and fw api version info to be
output in probe. The nvm version, etrack and oem version info are now
configured for output via ethtool -i.
Change-ID: I05d490093a7137dbefcdef263d014d1e5c9e83d0 Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mitch Williams [Mon, 31 Aug 2015 23:54:48 +0000 (19:54 -0400)]
i40e: don't panic on VSI allocation failure
In some circumstances, the firmware may fail to allocate a VSI for a VF.
When this happens, the driver does not react well to the bad news and
has a panic attack.
To fix this problem, check the return value from i40e_alloc_vf_res and
don't try to configure the device further if it failed. Additionally,
explicitly clear the INIT bit when we free VF resources, so that this
bit will be in the proper state in the failure case, and won't blow up
elsewhere.
Change-ID: I6a20ce2b59c3458fd832032e88fa28cd42500189 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mitch Williams [Mon, 31 Aug 2015 23:54:47 +0000 (19:54 -0400)]
i40e: remove redundant call
This function call isn't needed here; the same function is already
called by i40e_reset_vf.
Change-ID: I96ccbf91b752965c9e28fe895d4c7d4c46e3ba44 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Greg Bowers [Mon, 31 Aug 2015 23:54:46 +0000 (19:54 -0400)]
i40e: Convert CEE App TLV selector to IEEE selector
Changes the parsing of CEE App TLVs to fill in the App selector in struct
i40e_dcbx_config with the IEEE App selector so the caller doesn't have to
consider whether the App came from a CEE or IEEE DCBX negotiation.
Change-ID: Ia7d9d664cde04d2ebcc9822fd22e4929c6edab3a Signed-off-by: Greg Bowers <gregory.j.bowers@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mitch Williams [Mon, 31 Aug 2015 23:54:44 +0000 (19:54 -0400)]
i40evf: properly handle ndo_set_mac_address calls
The driver was not correctly handling calls to its ndo_set_mac_address
method. It did not properly check to see if the override would be
allowed by the PF driver, and never removed the old address from its
filter list.
Add a new flag to the adapter struct which is set if the MAC address is
assigned by the PF. Check this flag and don't allow the MAC address to
be changed if it is set. Search for and properly remove the filter
for the old MAC address when the new one is set.
Change-ID: I817bf620c869c5a80e6a7eab65c9cbad1dc89799 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Mon, 31 Aug 2015 23:54:41 +0000 (19:54 -0400)]
i40e/i40evf: Add new link status defines
Add the new Port link status bit and rename the link status to function
link status.
Change-ID: I71289327ae62638ce967b6ad40114caf998b6dab Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Geliang Tang [Mon, 12 Oct 2015 08:19:07 +0000 (01:19 -0700)]
mISDN: use kstrdup() in dsp_pipeline_build
Use kstrdup instead of strlen-kmalloc-strcpy. Remove unneeded NULL
test, it will be tested inside kstrdup. Remove 0 length string test,
it has been tested in the caller of dsp_pipeline_build.
Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 14 Oct 2015 00:12:54 +0000 (17:12 -0700)]
tcp/dccp: fix behavior of stale SYN_RECV request sockets
When a TCP/DCCP listener is closed, its pending SYN_RECV request sockets
become stale, meaning 3WHS can not complete.
But current behavior is wrong :
incoming packets finding such stale sockets are dropped.
We need instead to cleanup the request socket and perform another
lookup :
- Incoming ACK will give a RST answer,
- SYN rtx might find another listener if available.
- We expedite cleanup of request sockets and old listener socket.
Fixes: 079096f103fa ("tcp/dccp: install syn_recv requests into ehash table") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
struct at91_can_data was used to pass a callback to the driver, allowing it
to switch the transceiver on and off. As all at91 boards are now using DT,
this is not used anymore, remove that structure.
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
The can subsystem communicates with user space using a bcm_msg_head
header, which contains two timestamps. This is problematic for
multiple reasons:
a) The structure layout is currently incompatible between 64-bit
user space and 32-bit user space, and cannot work in compat
mode (other than x32).
b) The timeval structure layout will change in 32-bit user
space when we fix the y2038 overflow problem by redefining
time_t to 64-bit, making new 32-bit user space incompatible
with the current kernel interface.
Cars last a long time and often use old kernels, so the actual
users of this code are the most likely ones to migrate to y2038
safe user space.
This tries to work around part of the problem by changing the
publicly visible user interface in the header, but not the binary
interface. Fortunately, the values passed around in the structure
are relative times and do not actually suffer from the y2038
overflow, so 32-bit is enough here.
We replace the use of 'struct timeval' with a newly defined
'struct bcm_timeval' that uses the exact same binary layout
as before and that still suffers from problem a) but not problem
b).
The downside of this approach is that any user space program
that currently assigns a timeval structure to these members
rather than writing the tv_sec/tv_usec portions individually
will suffer a compile-time error when built with an updated
kernel header. Fixing this error makes it work fine with old
and new headers though.
We could address problem a) by using '__u32' or 'int' members
rather than 'long', but that would have a more significant
downside in also breaking support for all existing 64-bit user
binaries that might be using this interface, which is likely
not acceptable.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Oliver Hartkopp <socketcan@hartkopp.net> Cc: linux-can@vger.kernel.org Cc: linux-api@vger.kernel.org Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch fixes a bug in arbitration error reporting
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Gerhard Bertelsmann <info@gerhard-bertelsmann.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Patch 01 converts the vlgrp member to use rcu as it was already used in a
similar way so better to make it official and use all the available RCU
instrumentation. Patch 02 fixes a bug where the vlan_list can be traversed
without rtnl or rcu held which could lead to using freed entries.
Patch 03 removes some redundant code that isn't needed anymore.
Patch 04 fixes a bug reported by Ido Schimmel about the vlan_flush order
and switchdevs, it moves it back.
v2: patch 03 and 04 are new, couldn't escape the second synchronize_rcu()
since the rhtable destruction can sleep
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel reported a problem with switchdev devices because of the
order change of del_nbp operations, more specifically the move of
nbp_vlan_flush() which deletes all vlans and frees vlgrp after the
rx_handler has been unregistered. So in order to fix this move
vlan_flush back where it was and make it destroy the rhtable after
NULLing vlgrp and waiting a grace period to make sure noone can see it.
Reported-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
As Ido Schimmel pointed out the vlan_vid_del() code in nbp_vlan_flush is
unnecessary (and is actually a remnant of the old vlan code) so we can
remove it.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
bridge: vlan: use rcu for vlan_list traversal in br_fill_ifinfo
br_fill_ifinfo is called by br_ifinfo_notify which can be called from
many contexts with different locks held, sometimes it relies upon
bridge's spinlock only which is a problem for the vlan code, so use
explicitly rcu for that to avoid problems.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The bridge and port's vlgrp member is already used in RCU way, currently
we rely on the fact that it cannot disappear while the port exists but
that is error-prone and we might miss places with improper locking
(either RCU or RTNL must be held to walk the vlan_list). So make it
official and use RCU for vlgrp to catch offenders. Introduce proper vlgrp
accessors and use them consistently throughout the code.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 13 Oct 2015 11:55:10 +0000 (04:55 -0700)]
Merge branch 'vrf-ipv6'
David Ahern says:
====================
net: VRF support in IPv6 stack
Initial support for VRF in IPv6 stack. Makes IPv6 functionality on par
with IPv4 -- ping, tcp client/server and udp client/server all work fine.
tcpdump on vrf device and external tap (e.g., host side tap device) shows
all packets with proper addresses. IPv6 does not need the source address
operation like IPv4. Verified vti6 works properly in my setup as does use
of an IPv6 address on the VRF device.
v3
- re-based to top of net-next (updates per net namespace changes by Eric)
- fixed dst_entry typecasts as requested by Dave
- added flags to inet6_rtm_getroute (IPv6 version of deaa0a6a930e)
v2
- fixed CONFIG_IPV6 dependency as questioned by Cong
- if IPV6 is a module, kbuild ensures VRF is a module
- if IPV6 is disabled IPV6 functionality is compiled out of VRF module
- addressed comments from Nik over IRC
- removed duplicate call to netif_is_l3_master in l3mdev_rt6_dst_by_oif
- changed allocation flag from GFP_ATOMIC to GFP_KERNEL since it is init time
- added free of rt6i_pcpu
- check_ipv6_frame returns false only if packet is NDISC type
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Mon, 12 Oct 2015 18:47:10 +0000 (11:47 -0700)]
net: Add VRF support to IPv6 stack
As with IPv4 support for VRFs added to IPv6 stack by replacing hardcoded
table ids with possibly device specific ones and manipulating the oif in
the flowi6. The flow flags are used to skip oif compare in nexthop lookups
if the device is enslaved to a VRF via the L3 master device.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
commit c62987bbd8a1 ("bridge: push bridge setting ageing_time down to
switchdev") introduced a timer race condition because the gc_timer can
get rearmed after it's supposedly stopped and flushed in br_dev_delete()
leading to a use of freed memory. So take rtnl to sync with bridge
destruction when setting ageing_timer.
Here's the trace reproduced with these two commands running in parallel:
while :; do echo 10000 > /sys/class/net/br0/bridge/ageing_timer; done;
while :; do brctl addbr br0; ip l set br0 up; ip l set br0 down;
brctl delbr br0; done;
Fixes: c62987bbd8a1 ("bridge: push bridge setting ageing_time down to switchdev") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Acked-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
We shouldn't allow BRIDGE_VLAN_INFO_PVID flag in VLAN ranges.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Elad Raz <eladr@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
DSA and its drivers currently hook the NETDEV_CHANGEUPPER net_device event in
order to configure the VLAN map of every port.
This VLAN map is a feature of these switch chips to hardcode and restrict which
output ports a given input port can egress frames to.
A Linux bridge is a simple untagged VLAN propagated by the bridge code itself.
With a proper 802.1Q support, a driver does not need this hook anymore, and
will simply program the related VLAN object.
This patchset improves the hardware bridging code in the mv88e6xxx driver with
a strict 802.1Q mode.
Ideally, the equivalent must be done for Broadcom Starfighter 2 and Rocker,
before completely getting rid of this hook.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vivien Didelot [Sun, 11 Oct 2015 22:08:38 +0000 (18:08 -0400)]
net: dsa: mv88e6xxx: fix hardware bridging
Playing with the VLAN map of every port to implement "hardware bridging"
in the 88E6352 driver was a hack until full 802.1Q was supported.
Indeed with 802.1Q port mode "Disabled" or "Fallback", this feature is
used to restrict which output ports an input port can egress frames to.
A Linux bridge is an untagged VLAN. With full 802.1Q support, we don't
need this hack anymore and can use the "Secure" strict 802.1Q port mode.
With this mode, the port-based VLAN map still needs to be configured,
but all the logic is VTU-centric. This means that the switch only cares
about rules described in its hardware VLAN table, which is exactly what
Linux bridge expects and what we want.
Note also that the hardware bridging was broken with the previous
flexible "Fallback" 802.1Q port mode. Here's an example:
Port0 and Port1 belong to the same bridge. If Port0 sends crafted tagged
frames with VID 200 to Port1, Port1 receives it. Even if Port1 is in
hardware VLAN 200, but not Port0, Port1 will still receive it, because
Fallback mode doesn't care about invalid VID or non-member source port.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Sun, 11 Oct 2015 20:49:44 +0000 (16:49 -0400)]
RDS-TCP: Reset tcp callbacks if re-using an outgoing socket in rds_tcp_accept_one()
Consider the following "duelling syn" sequence between two peers A and B:
A B
SYN1 -->
<-- SYN2
SYN2ACK -->
Note that the SYN/ACK has already been sent out by TCP before
rds_tcp_accept_one() gets invoked as part of callbacks.
If the inet_addr(A) is numerically less than inet_addr(B),
the arbitration scheme in rds_tcp_accept_one() will prefer the
TCP connection triggered by SYN1, and will send a CLOSE for the
SYN2 (just after the SYN2ACK was sent).
Since B also follows the same arbitration scheme, it will send the SYN-ACK
for SYN1 that will set up a healthy ESTABLISHED connection on both sides.
B will also get a CLOSE for SYN2, which should result in the cleanup
of the TCP state machine for SYN2, but it should not trigger any
stale RDS-TCP callbacks (such as ->writespace, ->state_change etc),
that would disrupt the progress of the SYN2 based RDS-TCP connection.
Thus the arbitration scheme in rds_tcp_accept_one() should restore
rds_tcp callbacks for the winner before setting them up for the
new accept socket, and also make sure that conn->c_outgoing
is set to 0 so that we do not trigger any reconnect attempts on the
passive side of the tcp socket in the future, in conformance with
commit c82ac7e69efe ("net/rds: RDS-TCP: only initiate reconnect attempt
on outgoing TCP socket.")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Sun, 11 Oct 2015 20:46:03 +0000 (16:46 -0400)]
RDS: Invoke ->laddr_check() in rds_bind() for explicitly bound transports.
The IP address passed to rds_bind() should be vetted by the
transport's ->laddr_check() for a previously bound transport.
This needs to be done to avoid cases where, for example,
the application has asked for an IB transport,
but the IP address passed to bind is only usable on
ethernet interfaces.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Julia Lawall [Sun, 11 Oct 2015 11:48:05 +0000 (13:48 +0200)]
qlcnic: constify qlcnic_mbx_ops structure
The only instance of a qlcnic_mbx_ops structure is never modified. Thus
the declaration of the structure and all references to the structure type
can be made const.
In the definition of the qlcnic_mailbox structure, the ops field is no
longer lined up with the other fields. This was left as is, to avoid a lot
of trivial changes on the other lines.
Done with the help of Coccinelle.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Acked-by: Sony Chacko <sony.chacko@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Currently it's possible for someone to send a vlan range to the kernel
with the pvid flag set which will result in the pvid bouncing from a
vlan to vlan and isn't correct, it also introduces problems for hardware
where it doesn't make sense having more than 1 pvid. iproute2 already
enforces this, so let's enforce it on kernel-side as well.
Reported-by: Elad Raz <eladr@mellanox.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
huangdaode [Sat, 10 Oct 2015 09:20:38 +0000 (17:20 +0800)]
net: hns: fix the unknown phy_nterface_t type error
This patch fix the building error reported by Jiri Pirko <jiri@resnulli.us>
drivers/net/ethernet/hisilicon/hns/hnae.h:465:2: error: unknown type
name 'phy_interface_t'
phy_interface_t phy_if;
^
the full build log is on https://lists.01.org/pipermail/kbuild-all.
Signed-off-by: huangdaode <huangdaode@hisilicon.com> Signed-off-by: yankejian <yankejian@huawei.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(These pseudo sockets do not have an error queue either)
Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 13 Oct 2015 02:44:22 +0000 (19:44 -0700)]
Merge branch 'netns-defrag'
Eric W. Biederman says:
====================
net: Pass net into defragmentation
This is the next installment of my work to pass struct net through the
output path so the code does not need to guess how to figure out which
network namespace it is in, and ultimately routes can have output
devices in another network namespace.
In netfilter and af_packet we defragment packets in the output path,
and there is the usual amount of confusion about how to compute which
net we are processing the packets in. This patchset clears that
confusion up by explicitly passing in struct net in ip_defrag,
ip_check_defrag, and nf_ct_frag6_gather.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The function nf_ct_frag6_gather is called on both the input and the
output paths of the networking stack. In particular ipv6_defrag which
calls nf_ct_frag6_gather is called from both the the PRE_ROUTING chain
on input and the LOCAL_OUT chain on output.
The addition of a net parameter makes it explicit which network
namespace the packets are being reassembled in, and removes the need
for nf_ct_frag6_gather to guess.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4: Pass struct net into ip_defrag and ip_check_defrag
The function ip_defrag is called on both the input and the output
paths of the networking stack. In particular conntrack when it is
tracking outbound packets from the local machine calls ip_defrag.
So add a struct net parameter and stop making ip_defrag guess which
network namespace it needs to defragment packets in.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
ip_call_ra_chain is called early in the forwarding chain from
ip_forward and ip_mr_input, which makes skb->dev the correct
expression to get the input network device and dev_net(skb->dev) a
correct expression for the network namespace the packet is being
processed in.
Compute the network namespace and store it in a variable to make the
code clearer.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 9 Oct 2015 18:29:32 +0000 (11:29 -0700)]
packet: fix match_fanout_group()
Recent TCP listener patches exposed a prior af_packet bug :
match_fanout_group() blindly assumes it is always safe
to cast sk to a packet socket to compare fanout with af_packet_priv
But SYNACK packets can be sent while attached to request_sock, which
are smaller than a "struct sock".
We can read non existent memory and crash.
Fixes: c0de08d04215 ("af_packet: don't emit packet on orig fanout group") Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead of listener") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Eric Leblond <eric@regit.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 13 Oct 2015 02:39:18 +0000 (19:39 -0700)]
Merge tag 'wireless-drivers-next-for-davem-2015-10-09' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
Kalle Valo says:
====================
Major changes:
iwlwifi
* some debugfs improvements
* fix signedness in beacon statistics
* deinline some functions to reduce size when device tracing is enabled
* filter beacons out in AP mode when no stations are associated
* deprecate firmwares version -12
* fix a runtime PM vs. legacy suspend race
* one-liner fix for a ToF bug
* clean-ups in the rx code
* small debugging improvement
* fix WoWLAN with new firmware versions
* more clean-ups towards multiple RX queues;
* some rate scaling fixes and improvements;
* some time-of-flight fixes;
* other generic improvements and clean-ups;
brcmfmac
* rework code dealing with multiple interfaces
* allow logging firmware console using debug level
* support for BCM4350, BCM4365, and BCM4366 PCIE devices
* fixed for legacy P2P and P2P device handling
* correct set and get tx-power
ath9k
* add support for Outside Context of a BSS (OCB) mode
mwifiex
* add USB multichannel feature
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 9 Oct 2015 12:34:31 +0000 (14:34 +0200)]
ipv4/icmp: redirect messages can use the ingress daddr as source
This patch allows configuring how the source address of ICMP
redirect messages is selected; by default the old behaviour is
retained, while setting icmp_redirects_use_orig_daddr force the
usage of the destination address of the packet that caused the
redirect.
The new behaviour fits closely the RFC 5798 section 8.1.1, and fix the
following scenario:
Two machines are set up with VRRP to act as routers out of a subnet,
they have IPs x.x.x.1/24 and x.x.x.2/24, with VRRP holding on to
x.x.x.254/24.
If a host in said subnet needs to get an ICMP redirect from the VRRP
router, i.e. to reach a destination behind a different gateway, the
source IP in the ICMP redirect is chosen as the primary IP on the
interface that the packet arrived at, i.e. x.x.x.1 or x.x.x.2.
The host will then ignore said redirect, due to RFC 1122 section 3.2.2.2,
and will continue to use the wrong next-op.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Fri, 9 Oct 2015 11:54:11 +0000 (13:54 +0200)]
bridge: try switchdev op first in __vlan_vid_add/del
Some drivers need to implement both switchdev vlan ops and
vid_add/kill ndos. For that to work in bridge code, we need to try
switchdev op first when adding/deleting vlan id.
Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
As promised in last patch series, we implement a better SO_REUSEPORT
strategy, based on cpu hints if given by the application.
We also moved sk_refcnt out of the cache line containing the lookup
keys, as it was considerably slowing down smp operations because
of false sharing. This was simpler than converting listen sockets
to conventional RCU (to avoid sk_refcnt dirtying)
Could process 6.0 Mpps SYN instead of 4.2 Mpps on my test server.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 9 Oct 2015 02:33:22 +0000 (19:33 -0700)]
net: align sk_refcnt on 128 bytes boundary
sk->sk_refcnt is dirtied for every TCP/UDP incoming packet.
This is a performance issue if multiple cpus hit a common socket,
or multiple sockets are chained due to SO_REUSEPORT.
By moving sk_refcnt 8 bytes further, first 128 bytes of sockets
are mostly read. As they contain the lookup keys, this has
a considerable performance impact, as cpus can cache them.
These 8 bytes are not wasted, we use them as a place holder
for various fields, depending on the socket type.
Tested:
SYN flood hitting a 16 RX queues NIC.
TCP listener using 16 sockets and SO_REUSEPORT
and SO_INCOMING_CPU for proper siloing.
Eric Dumazet [Fri, 9 Oct 2015 02:33:21 +0000 (19:33 -0700)]
net: SO_INCOMING_CPU setsockopt() support
SO_INCOMING_CPU as added in commit 2c8c56e15df3 was a getsockopt() command
to fetch incoming cpu handling a particular TCP flow after accept()
This commits adds setsockopt() support and extends SO_REUSEPORT selection
logic : If a TCP listener or UDP socket has this option set, a packet is
delivered to this socket only if CPU handling the packet matches the specified
one.
This allows to build very efficient TCP servers, using one listener per
RX queue, as the associated TCP listener should only accept flows handled
in softirq by the same cpu.
This provides optimal NUMA behavior and keep cpu caches hot.
Note that __inet_lookup_listener() still has to iterate over the list of
all listeners. Following patch puts sk_refcnt in a different cache line
to let this iteration hit only shared and read mostly cache lines.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Jee [Thu, 8 Oct 2015 21:56:49 +0000 (14:56 -0700)]
packet: support per-packet fwmark for af_packet sendmsg
Signed-off-by: Edward Hyunkoo Jee <edjee@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Jee [Thu, 8 Oct 2015 21:56:48 +0000 (14:56 -0700)]
sock: support per-packet fwmark
It's useful to allow users to set fwmark for an individual packet,
without changing the socket state. The function this patch adds in
sock layer can be used by the protocols that need such a feature.
Signed-off-by: Edward Hyunkoo Jee <edjee@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 13 Oct 2015 02:13:41 +0000 (19:13 -0700)]
Merge branch 'bpf-unprivileged'
Alexei Starovoitov says:
====================
bpf: unprivileged
v1-v2:
- this set logically depends on cb patch
"bpf: fix cb access in socket filter programs":
http://patchwork.ozlabs.org/patch/527391/
which is must have to allow unprivileged programs.
Thanks Daniel for finding that issue.
- refactored sysctl to be similar to 'modules_disabled'
- dropped bpf_trace_printk
- split tests into separate patch and added more tests
based on discussion
v1 cover letter:
I think it is time to liberate eBPF from CAP_SYS_ADMIN.
As was discussed when eBPF was first introduced two years ago
the only piece missing in eBPF verifier is 'pointer leak detection'
to make it available to non-root users.
Patch 1 adds this pointer analysis.
The eBPF programs, obviously, need to see and operate on kernel addresses,
but with these extra checks they won't be able to pass these addresses
to user space.
Patch 2 adds accounting of kernel memory used by programs and maps.
It changes behavoir for existing root users, but I think it needs
to be done consistently for both root and non-root, since today
programs and maps are only limited by number of open FDs (RLIMIT_NOFILE).
Patch 2 accounts program's and map's kernel memory as RLIMIT_MEMLOCK.
Unprivileged eBPF is only meaningful for 'socket filter'-like programs.
eBPF programs for tracing and TC classifiers/actions will stay root only.
In parallel the bpf fuzzing effort is ongoing and so far
we've found only one verifier bug and that was already fixed.
The 'constant blinding' pass also being worked on.
It will obfuscate constant-like values that are part of eBPF ISA
to make jit spraying attacks even harder.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
unpriv: return pointer
checks that pointer cannot be returned from the eBPF program
unpriv: add const to pointer
unpriv: add pointer to pointer
unpriv: neg pointer
checks that pointer arithmetic is disallowed
unpriv: cmp pointer with const
unpriv: cmp pointer with pointer
checks that comparison of pointers is disallowed
Only one case allowed 'void *value = bpf_map_lookup_elem(..); if (value == 0) ...'
unpriv: check that printk is disallowed
since bpf_trace_printk is not available to unprivileged
unpriv: pass pointer to helper function
checks that pointers cannot be passed to functions that expect integers
If function expects a pointer the verifier allows only that type of pointer.
Like 1st argument of bpf_map_lookup_elem() must be pointer to map.
(applies to non-root as well)
unpriv: indirectly pass pointer on stack to helper function
checks that pointer stored into stack cannot be used as part of key
passed into bpf_map_lookup_elem()
unpriv: mangle pointer on stack 1
unpriv: mangle pointer on stack 2
checks that writing into stack slot that already contains a pointer
is disallowed
unpriv: read pointer from stack in small chunks
checks that < 8 byte read from stack slot that contains a pointer is
disallowed
unpriv: write pointer into ctx
checks that storing pointers into skb->fields is disallowed
unpriv: write pointer into map elem value
checks that storing pointers into element values is disallowed
For example:
int bpf_prog(struct __sk_buff *skb)
{
u32 key = 0;
u64 *value = bpf_map_lookup_elem(&map, &key);
if (value)
*value = (u64) skb;
}
will be rejected.
unpriv: partial copy of pointer
checks that doing 32-bit register mov from register containing
a pointer is disallowed
unpriv: pass pointer to tail_call
checks that passing pointer as an index into bpf_tail_call
is disallowed
unpriv: cmp map pointer with zero
checks that comparing map pointer with constant is disallowed
unpriv: write into frame pointer
checks that frame pointer is read-only (applies to root too)
unpriv: cmp of frame pointer
checks that R10 cannot be using in comparison
unpriv: cmp of stack pointer
checks that Rx = R10 - imm is ok, but comparing Rx is not
unpriv: obfuscate stack pointer
checks that Rx = R10 - imm is ok, but Rx -= imm is not
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
bpf: charge user for creation of BPF maps and programs
since eBPF programs and maps use kernel memory consider it 'locked' memory
from user accounting point of view and charge it against RLIMIT_MEMLOCK limit.
This limit is typically set to 64Kbytes by distros, so almost all
bpf+tracing programs would need to increase it, since they use maps,
but kernel charges maximum map size upfront.
For example the hash map of 1024 elements will be charged as 64Kbyte.
It's inconvenient for current users and changes current behavior for root,
but probably worth doing to be consistent root vs non-root.
Similar accounting logic is done by mmap of perf_event.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
In order to let unprivileged users load and execute eBPF programs
teach verifier to prevent pointer leaks.
Verifier will prevent
- any arithmetic on pointers
(except R10+Imm which is used to compute stack addresses)
- comparison of pointers
(except if (map_value_ptr == 0) ... )
- passing pointers to helper functions
- indirectly passing pointers in stack to helper functions
- returning pointer from bpf program
- storing pointers into ctx or maps
Spill/fill of pointers into stack is allowed, but mangling
of pointers stored in the stack or reading them byte by byte is not.
Within bpf programs the pointers do exist, since programs need to
be able to access maps, pass skb pointer to LD_ABS insns, etc
but programs cannot pass such pointer values to the outside
or obfuscate them.
Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
so that socket filters (tcpdump), af_packet (quic acceleration)
and future kcm can use it.
tracing and tc cls/act program types still require root permissions,
since tracing actually needs to be able to see all kernel pointers
and tc is for root only.
For example, the following unprivileged socket filter program is allowed:
int bpf_prog1(struct __sk_buff *skb)
{
u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
u64 *value = bpf_map_lookup_elem(&my_map, &index);
if (value)
*value += skb->len;
return 0;
}
but the following program is not:
int bpf_prog1(struct __sk_buff *skb)
{
u32 index = load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol));
u64 *value = bpf_map_lookup_elem(&my_map, &index);
if (value)
*value += (u64) skb;
return 0;
}
since it would leak the kernel address into the map.
Unprivileged socket filter bpf programs have access to the
following helper functions:
- map lookup/update/delete (but they cannot store kernel pointers into them)
- get_random (it's already exposed to unprivileged user space)
- get_smp_processor_id
- tail_call into another socket filter program
- ktime_get_ns
The feature is controlled by sysctl kernel.unprivileged_bpf_disabled.
This toggle defaults to off (0), but can be set true (1). Once true,
bpf programs and maps cannot be accessed from unprivileged process,
and the toggle cannot be set back to false.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Fri, 9 Oct 2015 12:53:54 +0000 (14:53 +0200)]
net: HNS: fix MDIO dependencies
The newly introduced HNS_MDIO Kconfig symbol selects 'MDIO', but
that is the wrong symbol as the code used by this driver is
provided by PHYLIB rather than the MDIO driver. Also, there is
no need to make this driver user selectable, because it is already
selected by all drivers that need it.
This changes the Kconfig file to select the correct library, and
to make the option silent.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 5b904d39406 ("net: add Hisilicon Network Subsystem MDIO support") Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Pieczko [Fri, 9 Oct 2015 09:40:35 +0000 (10:40 +0100)]
sfc: fully reset if MC_REBOOT event received without warm_boot_count increment
On EF10, MC_CMD_VPORT_RECONFIGURE can cause a CODE_MC_REBOOT event
to be sent to a function without incrementing the (adapter-wide)
warm_boot_count. In this case, the reboot is not detected by the
loop on efx_mcdi_poll_reboot(), so prepare for recovery from an MC
reboot anyway. When this codepath is run, the MC has always just
rebooted, so this recovery is valid.
The loop on efx_mcdi_poll_reboot() is still required for other MC
reboot cases, so that actions in response to an MC reboot are
performed, such as clearing locally calculated statistics.
Siena NICs are unaffected by this change as the above scenario
does not apply.
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>