Kristian Evensen [Fri, 10 Jan 2014 20:43:20 +0000 (21:43 +0100)]
netfilter: nft_ct: fix compilation warning if NF_CONNTRACK_MARK is not set
net/netfilter/nft_ct.c: In function 'nft_ct_set_eval':
net/netfilter/nft_ct.c:136:6: warning: unused variable 'value' [-Wunused-variable]
Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Paul Gortmaker [Mon, 13 Jan 2014 18:01:10 +0000 (13:01 -0500)]
netfilter: Add dependency on IPV6 for NF_TABLES_INET
Commit 1d49144c0aa ("netfilter: nf_tables: add "inet" table for
IPv4/IPv6") allows creation of non-IPV6 enabled .config files that
will fail to configure/link as follows:
warning: (NF_TABLES_INET) selects NF_TABLES_IPV6 which has unmet direct dependencies (NET && INET && IPV6 && NETFILTER && NF_TABLES)
warning: (NF_TABLES_INET) selects NF_TABLES_IPV6 which has unmet direct dependencies (NET && INET && IPV6 && NETFILTER && NF_TABLES)
warning: (NF_TABLES_INET) selects NF_TABLES_IPV6 which has unmet direct dependencies (NET && INET && IPV6 && NETFILTER && NF_TABLES)
net/built-in.o: In function `nft_reject_eval':
nft_reject.c:(.text+0x651e8): undefined reference to `nf_ip6_checksum'
nft_reject.c:(.text+0x65270): undefined reference to `ip6_route_output'
nft_reject.c:(.text+0x656c4): undefined reference to `ip6_dst_hoplimit'
make: *** [vmlinux] Error 1
Since the feature is to allow for a mixed IPV4 and IPV6 table, it
seems sensible to make it depend on IPV6.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Paul Durrant [Wed, 8 Jan 2014 12:41:58 +0000 (12:41 +0000)]
xen-netback: stop vif thread spinning if frontend is unresponsive
The recent patch to improve guest receive side flow control (ca2f09f2) had a
slight flaw in the wait condition for the vif thread in that any remaining
skbs in the guest receive side netback internal queue would prevent the
thread from sleeping. An unresponsive frontend can lead to a permanently
non-empty internal queue and thus the thread will spin. In this case the
thread should really sleep until the frontend becomes responsive again.
This patch adds an extra flag to the vif which is set if the shared ring
is full and cleared when skbs are drained into the shared ring. Thus,
if the thread runs, finds the shared ring full and can make no progress the
flag remains set. If the flag remains set then the thread will sleep,
regardless of a non-empty queue, until the next event from the frontend.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: David Vrabel <david.vrabel@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huqiu reported current sh_sir driver doesn't
call free_irq() in spite of using request_irq().
This patch replaces request_irq() into devm_request_irq()
to solve this issue
Reported-by: Huqiu Liu<huqiuliu@gmail.com> Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huqiu reported current sh_irda driver doesn't
call free_irq() in spite of using request_irq().
This patch replaces request_irq() into devm_request_irq()
to solve this issue
Reported-by: Huqiu Liu<huqiuliu@gmail.com> Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 10 Jan 2014 02:36:01 +0000 (21:36 -0500)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables
Pablo Neira Ayuso says:
====================
nf_tables updates for net-next
The following patchset contains the following nf_tables updates,
mostly updates from Patrick McHardy, they are:
* Add the "inet" table and filter chain type for this new netfilter
family: NFPROTO_INET. This special table/chain allows IPv4 and IPv6
rules, this should help to simplify the burden in the administration
of dual stack firewalls. This also includes several patches to prepare
the infrastructure for this new table and a new meta extension to
match the layer 3 and 4 protocol numbers, from Patrick McHardy.
* Load both IPv4 and IPv6 conntrack modules in nft_ct if the rule is used
in NFPROTO_INET, as we don't certainly know which one would be used,
also from Patrick McHardy.
* Do not allow to delete a table that contains sets, otherwise these
sets become orphan, from Patrick McHardy.
* Hold a reference to the corresponding nf_tables family module when
creating a table of that family type, to avoid the module deletion
when in use, from Patrick McHardy.
* Update chain counters before setting the chain policy to ensure that
we don't leave the chain in inconsistent state in case of errors (aka.
restore chain atomicity). This also fixes a possible leak if it fails
to allocate the chain counters if no counters are passed to be restored,
from Patrick McHardy.
* Don't check for overflows in the table counter if we are just renaming
a chain, from Patrick McHardy.
* Replay the netlink request after dropping the nfnl lock to load the
module that supports provides a chain type, from Patrick.
* Fix chain type module references, from Patrick.
* Several cleanups, function renames, constification and code
refactorizations also from Patrick McHardy.
* Add support to set the connmark, this can be used to set it based on
the meta mark (similar feature to -j CONNMARK --restore), from
Kristian Evensen.
* A couple of fixes to the recently added meta/set support and nft_reject,
and fix missing chain type unregistration if we fail to register our
the family table/filter chain type, from myself.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 9 Jan 2014 20:13:12 +0000 (15:13 -0500)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates
This series contains updates to i40e only.
Anjali provides a fix where interrupts were not being re-enabled on ICR0
even though they were auto masked by hardware. Then provides a fix to
cleanup RSS initialization because it was doing some extra work, so
remove the extra work and any bugs it created when managing number of
queues. Since hardware requires a full packet template to be pointed to
when adding hardware flow filters, add the template and use it for
programming filters.
Jesse provides a fix to replace the use of driver specific defines with
kernel ETH_ALEN defines. Then disables packet split because with the
use of GRO, we do not need the extra bus overhead. Fixes spelling
error in code comment.
Kamil provides a fix for the driver where the hardware expects the MAC
address in a very specific format and the driver was filing the data
incorrectly.
Mitch provides a fix to resolve a panic on reset by adding checks to
VSI->rx_rings. Then shortens alloc_rx_buff_failed and
alloc_rx_page_failed variables since both part of an RX specific
structure so just remove the _rx part of the name. Then fixes
badly formatted lines, long lines and mis-formatted lines.
Shannon provides a fix to call AQ to release any reservation held by this
PF on the NVM resource lock on startup, in order to clear anything that
might have been left over from a previous run. Then removes interrupt on
AQ error since nearly everything we do is synchronous, using the
interrupt-on-error bit is unnecessary and causing unneeded interrupts.
Adds code to handle the ability to send messages among the physical
function interfaces by the admin queue.
Catherine sets the MFP flag earlier in software init and uses that flag
to decide if other hardware work-arounds are required which turns
off flow director in MFP mode.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Wed, 8 Jan 2014 10:13:14 +0000 (18:13 +0800)]
openvswitch: Use kmem_cache_free() instead of kfree()
memory allocated by kmem_cache_alloc() should be freed using
kmem_cache_free(), not kfree().
Fixes: e298e5057006 ('openvswitch: Per cpu flow stats.') Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Thu, 9 Jan 2014 18:42:43 +0000 (18:42 +0000)]
netfilter: nf_tables: rename nft_do_chain_pktinfo() to nft_do_chain()
We don't encode argument types into function names and since besides
nft_do_chain() there are only AF-specific versions, there is no risk
of confusion.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Patrick McHardy [Thu, 9 Jan 2014 18:42:38 +0000 (18:42 +0000)]
netfilter: nf_tables: minor nf_chain_type cleanups
Minor nf_chain_type cleanups:
- reorder struct to plug a hoe
- rename struct module member to "owner" for consistency
- rename nf_hookfn array to "hooks" for consistency
- reorder initializers for better readability
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Patrick McHardy [Thu, 9 Jan 2014 18:42:34 +0000 (18:42 +0000)]
netfilter: nf_tables: fix chain type module reference handling
The chain type module reference handling makes no sense at all: we take
a reference immediately when the module is registered, preventing the
module from ever being unloaded.
Fix by taking a reference when we're actually creating a chain of the
chain type and release the reference when destroying the chain.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Chain counter validation is performed after the chain policy has
potentially been changed. Move counter validation/setting before
changing of the chain policy to fix this.
Additionally fix a memory leak if chain counter allocation fails
for new chains, remove an unnecessary free_percpu() and move
counter allocation for new chains
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Patrick McHardy [Thu, 9 Jan 2014 18:42:31 +0000 (18:42 +0000)]
netfilter: nf_tables: split chain policy validation from actually setting it
Currently nf_tables_newchain() atomicity is broken because of having
validation of some netlink attributes performed after changing attributes
of the chain. The chain policy is (currently) fine, but split it up as
preparation for the following fixes and to avoid future mistakes.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nft_meta: fix lack of validation of the input register
We have to validate that the input register is in the range of
allowed registers, otherwise we can take a incorrect register
value as input that may lead us to a crash.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nft_ct: Add support to set the connmark
This patch adds kernel support for setting properties of tracked
connections. Currently, only connmark is supported. One use-case
for this feature is to provide the same functionality as
-j CONNMARK --save-mark in iptables.
Some restructuring was needed to implement the set op. The new
structure follows that of nft_meta.
Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jesse Brandeburg [Wed, 18 Dec 2013 13:46:02 +0000 (13:46 +0000)]
i40e: Add a dummy packet template
The hardware requires a full packet template to be pointed to when adding
hardware flow filters. This patch adds the template and uses it for
programming filters.
Mitch Williams [Wed, 18 Dec 2013 13:45:59 +0000 (13:45 +0000)]
i40e: shorten wordy fields
The alloc_rx_buff_failed and alloc_rx_page_failed variables
are both part of an rx specific structure so just remove
the _rx part of the name. No functional changes.
Change-ID: Icffa2f5d13c6f2b1e09cf45b9472b83c9dae8fc6 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Wed, 18 Dec 2013 13:45:57 +0000 (13:45 +0000)]
i40e: remove interrupt on AQ error
Since nearly everything we do is synchronous, using the interrupt-on-error bit
is unnecessary and causing unneeded interrupts. If anyone wants to use the bit
they can turn it on in individual AQ requests using the cmd_details parameter.
Change-ID: I4690a9c561d3e0836aeadb4f88f8a8702b1d1366 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Wed, 18 Dec 2013 13:45:56 +0000 (13:45 +0000)]
i40e: release NVM resource reservation on startup
Call AQ to release any reservation held by this PF on the NVM resource
lock on startup, in order to clear anything that might have been left over from
a previous run. The lock is only cleared by the requestor calling for it to be
cleared, on power-on, or firmware reset. This should help limit the need for
rebooting a customer machine if something goes wrong on a firmware update or
some other action.
Change-ID: I8c8473e601d4ef512dda7baa77a6e75f2e5fea49 Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Jesse Brandeburg [Wed, 18 Dec 2013 13:45:54 +0000 (13:45 +0000)]
i40e: disable packet split
The driver and hardware support splitting packets on headers
but with the use of GRO we don't need the extra bus
overhead, so make this driver more like igb and ixgbe and
disable packet split.
Greg Rose [Wed, 18 Dec 2013 13:45:51 +0000 (13:45 +0000)]
i40e: Fix GPL header
The GPL header included in each file in the i40e driver doesn't
need to include the "this program" text since this driver
is already part of the larger kernel.
Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
The hardware can occasionally give an interrupt on the misc
queue for which there is no driver work to do. In that case
the driver was not re-enabling interrupts even though they
were auto masked by hardware. This left interrupts disabled
on this queue.
Re-enable the interrupt whenever leaving this function.
David S. Miller [Wed, 8 Jan 2014 20:04:56 +0000 (15:04 -0500)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains three Netfilter updates, they are:
* Fix wrong usage of skb_header_pointer in the DCCP protocol helper that
has been there for quite some time. It was resulting in copying the dccp
header to a pointer allocated in the stack. Fortunately, this pointer
provides room for the dccp header is 4 bytes long, so no crashes have been
reported so far. From Daniel Borkmann.
* Use format string to print in the invocation of nf_log_packet(), again
in the DCCP helper. Also from Daniel Borkmann.
* Revert "netfilter: avoid get_random_bytes call" as prandom32 does not
guarantee enough entropy when being calling this at boot time, that may
happen when reloading the rule.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 8 Jan 2014 06:11:19 +0000 (01:11 -0500)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates
This series contains updates to i40e only.
Anjali adds more functionality to debugfs to assist development and
testing of admin queue commands.
Greg makes sure broadcast promiscuous is disabled by default, otherwise
VLAN tagged packets out of the assigned VLAN domain are received. Also
provides a fix when the 8021q driver is loaded, so that VLAN 0 tagged
packets are accepted so that upper layers can interpret the priority
bits. Then provides a fix to let the VF to request the PF to set its
already assigned MAC address without generating an error. Greg also
adds helper functions to enable or disable internal switch loopback
when VFs are created or destroyed via the sysfs interface.
Shannon provides most of the changes, where he adds code to ensure
that the hardware waits to make sure that the firmware is ready as well
after reset. Also updates the code to use the new features in the
firmware. Provides a fix while in MFP mode where resources are
reduced, so use a smaller range of test registers than when in SFP mode.
Moves the PF ID initialization code to earlier in the driver
initialization function since a few operations need the information
before the first PF reset is called. Shannon adds a check for MAC
type before reading anything from the registers to ensure we dealing
with the correct MAC type. Then reworks Shadow RAM read word/buffer
functions as to not block whole NVM resources for SR read operations.
Mitch lastly provides a fix to correctly setup ARQ descriptors in
the cleanup code.
Catherine bumps the driver version due to all the recent changes.
v2:
- Added blank lines after local variable declarations to patch 01
as suggested by David Miller
- Used the suggested memset() line in patch 15 as suggested by David
Miller
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mitch Williams [Wed, 18 Dec 2013 13:45:48 +0000 (13:45 +0000)]
i40e: correctly setup ARQ descriptors
When cleaning descriptors, we must set up ALL fields, not just the DMA
address. The initial setup does this correctly, but not the cleanup
code, so the firmware would process the ring exactly once and then fail.
Change-ID: I2930b83c76194b3016a8ac0fa693f9a573995640 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Kamil Krawczyk [Wed, 18 Dec 2013 13:45:46 +0000 (13:45 +0000)]
i40e: remove redundant AQ enable
The admin queue length register is updated in
config_a<sq|rq>_regs functions. We should not update it again,
as we will trigger firmware to init the AQ again. In this case
firmware will lose the information about the AQ Rx tail position
and will see Rx queue as full (no free desc for FW to use).
Greg Rose [Fri, 13 Dec 2013 08:38:38 +0000 (08:38 +0000)]
i40e: Enable/Disable PF switch LB on SR-IOV configure changes
The PF VSI was never updated to enable or disable internal switch loopback
when VFs were created or destroyed via the sysfs interface. Add some
helper functions to take care of that.
Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Tested-by: Sibai Li <sibai.li@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Wed, 18 Dec 2013 05:29:17 +0000 (05:29 +0000)]
i40e: whitespace paren and comment tweaks
Addresses a few code format issues that have crept in over time.
Change-Id: I1a62cbd16b29a218a933b0f7176abe748f9615e8 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Wed, 11 Dec 2013 08:17:15 +0000 (08:17 +0000)]
i40e: rework shadow ram read functions
Rework Shadow RAM read word/buffer functions to not use AQ Request
Resource commands. Requesting resource is not needed for SR read
operations which are done through SRCTL register. Access to SR through
register is controlled through DONE bit within SRCTL. With this change
we do not block whole NVM resource for SR read operations.
Change-Id: I73e96cdea39a45ee7b5bdf038e527308de2d9efe Signed-off-by: Kamil Krawczyk <kamil.krawczyk@intel.com> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Wed, 11 Dec 2013 08:17:13 +0000 (08:17 +0000)]
i40e: move PF ID init from PF reset to SC init
Move PF ID initialization code from PF reset routine to earlier driver
initialization function. There are a few operations which need the
PF ID before the first PF reset is called.
Change-Id: I7e971f7556b46a837149850ec05ce115c35db575 Signed-off-by: Kamil Krawczyk <kamil.krawczyk@intel.com> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Shannon Nelson [Wed, 11 Dec 2013 08:17:10 +0000 (08:17 +0000)]
i40e: Add code to wait for FW to complete in reset path
The RSTAT comes back with DEVICE READY to indicate that the HW is ready,
but we still have to wait for the FW to indicate that the Core and Global
modules are back up again. This needs a read of the NVM_ULD register.
Change-Id: I88276165f9cd446d2f166fb4b8cff00521af4bec Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Greg Rose [Thu, 28 Nov 2013 06:42:40 +0000 (06:42 +0000)]
i40e: Stop accepting any VLAN tag on VLAN 0 filter set
When the 8021q driver is loaded it sets VLAN 0 filters on all devices.
This does not mean that any VLAN tagged packet should be accepted.
Instead accept only VLAN 0 tagged packets so that upper layers can
interpret the priority bits.
Change-Id: I17274a540b613749612ffe23a3aef2b8ee6ff6a4 Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Sibai Li <sibai.li@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Greg Rose [Thu, 28 Nov 2013 06:42:39 +0000 (06:42 +0000)]
i40e: Do not enable broadcast promiscuous by default
Broadcast promiscuous should only be turned on when general
promiscuous mode is turned on, otherwise VLAN tagged packets out of
the assigned VLAN domain are received.
Add a broadcast MAC filter in order to continue to receive
broadcast traffic on VLANs, MAIN or VMDQ VSI.
Change-Id: I99d8e382a082ee51201228f1226af3b46452ac55 Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Sibai Li <sibai.li@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
David S. Miller [Tue, 7 Jan 2014 23:44:35 +0000 (18:44 -0500)]
Merge branch 'tipc'
Jon Maloy says:
====================
tipc: link setup and failover improvements
This series consists of four unrelated commits with different purposes.
- Commit #1 is purely cosmetic and pedagogic, hopefully making the
failover/tunneling logics slightly easier to understand.
- Commit #2 fixes a bug that has always been in the code, but was not
discovered until very recently.
- Commit #3 fixes a non-fatal race issue in the neighbour discovery
code.
- Commit #4 removes an unnecessary indirection step during link
startup.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Tue, 7 Jan 2014 22:02:44 +0000 (17:02 -0500)]
tipc: make link start event synchronous
When a link is created we delay the start event by launching it
to be executed later in a tasklet. As we hold all the
necessary locks at the moment of creation, and there is no risk
of deadlock or contention, this delay serves no purpose in the
current code.
We remove this obsolete indirection step, and the associated function
link_start(). At the same time, we rename the function tipc_link_stop()
to the more appropriate tipc_link_purge_queues().
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Tue, 7 Jan 2014 22:02:43 +0000 (17:02 -0500)]
tipc: introduce new spinlock to protect struct link_req
Currently, only 'bearer_lock' is used to protect struct link_req in
the function disc_timeout(). This is unsafe, since the member fields
'num_nodes' and 'timer_intv' might be accessed by below three different
threads simultaneously, none of them grabbing bearer_lock in the
critical region:
Without lock protection, the only symptom of a race is that discovery
messages occasionally may not be sent out. This is not fatal, since such
messages are best-effort anyway. On the other hand, since discovery
messages are not time critical, adding a protecting lock brings no
serious overhead either. So we add a new, dedicated spinlock in
order to guarantee absolute data consistency in link_req objects.
This also helps reduce the overall role of the bearer_lock, which
we want to remove completely in a later commit series.
Signed-off-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Tue, 7 Jan 2014 22:02:42 +0000 (17:02 -0500)]
tipc: remove 'has_redundant_link' flag from STATE link protocol messages
The flag 'has_redundant_link' is defined only in RESET and ACTIVATE
protocol messages. Due to an ambiguity in the protocol specification it
is currently also transferred in STATE messages. Its value is used to
initialize a link state variable, 'permit_changeover', which is used
to inhibit futile link failover attempts when it is known that the
peer node has no working links at the moment, although the local node
may still think it has one.
The fact that 'has_redundant_link' incorrectly is read from STATE
messages has the effect that 'permit_changeover' sometimes gets a wrong
value, and permanently blocks any links from being re-established. Such
failures can only occur in in dual-link systems, and are extremely rare.
This bug seems to have always been present in the code.
Furthermore, since commit b4b5610223f17790419b03eaa962b0e3ecf930d7
("tipc: Ensure both nodes recognize loss of contact between them"),
the 'permit_changeover' field serves no purpose any more. The task of
enforcing 'lost contact' cycles at both peer endpoints is now taken
by a new mechanism, using the flags WAIT_NODE_DOWN and WAIT_PEER_DOWN
in struct tipc_node to abort unnecessary failover attempts.
We therefore remove the 'has_redundant_link' flag from STATE messages,
as well as the now redundant 'permit_changeover' variable.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Tue, 7 Jan 2014 22:02:41 +0000 (17:02 -0500)]
tipc: rename functions related to link failover and improve comments
The functionality related to link addition and failover is unnecessarily
hard to understand and maintain. We try to improve this by renaming
some of the functions, at the same time adding or improving the
explanatory comments around them. Names such as "tipc_rcv()" etc. also
align better with what is used in other networking components.
The changes in this commit are purely cosmetic, no functional changes
are made.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Tue, 7 Jan 2014 22:23:44 +0000 (23:23 +0100)]
net: skbuff: const-ify casts in skb_queue_* functions
We should const-ify comparisons on skb_queue_* inline helper
functions as their parameters are const as well, so lets not
drop that.
Suggested-by: Brad Spengler <spender@grsecurity.net> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Tue, 7 Jan 2014 22:20:27 +0000 (23:20 +0100)]
net: xfrm: xfrm_policy: fix inline not at beginning of declaration
Fix three warnings related to:
net/xfrm/xfrm_policy.c:1644:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
net/xfrm/xfrm_policy.c:1656:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
net/xfrm/xfrm_policy.c:1668:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
Just removing the inline keyword is sufficient as the compiler will
decide on its own about inlining or not.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Mon, 6 Jan 2014 18:09:49 +0000 (18:09 +0000)]
netfilter: nft_ct: load both IPv4 and IPv6 conntrack modules for NFPROTO_INET
The ct expression can currently not be used in the inet family since
we don't have a conntrack module for NFPROTO_INET, so
nf_ct_l3proto_try_module_get() fails. Add some manual handling to
load the modules for both NFPROTO_IPV4 and NFPROTO_IPV6 if the
ct expression is used in the inet family.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Patrick McHardy [Fri, 3 Jan 2014 12:16:18 +0000 (12:16 +0000)]
netfilter: nft_meta: add l4proto support
For L3-proto independant rules we need to get at the L4 protocol value
directly. Add it to the nft_pktinfo struct and use the meta expression
to retrieve it.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Patrick McHardy [Fri, 3 Jan 2014 12:16:16 +0000 (12:16 +0000)]
netfilter: nf_tables: add "inet" table for IPv4/IPv6
This patch adds a new table family and a new filter chain that you can
use to attach IPv4 and IPv6 rules. This should help to simplify
rule-set maintainance in dual-stack setups.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Patrick McHardy [Fri, 3 Jan 2014 12:16:13 +0000 (12:16 +0000)]
netfilter: nf_tables: make chain types override the default AF functions
Currently the AF-specific hook functions override the chain-type specific
hook functions. That doesn't make too much sense since the chain types
are a special case of the AF-specific hooks.
Make the AF-specific hook functions the default and make the optional
chain type hooks override them.
As a side effect, the necessary code restructuring reduces the code size,
f.i. in case of nf_tables_ipv4.o:
Shawn Bohrer [Tue, 7 Jan 2014 18:49:17 +0000 (12:49 -0600)]
mlx4_en: Select PTP_1588_CLOCK
Now that mlx4_en includes a PHC driver it must select PTP_1588_CLOCK.
drivers/built-in.o: In function `mlx4_en_get_ts_info':
>> en_ethtool.c:(.text+0x391a11): undefined reference to `ptp_clock_index'
drivers/built-in.o: In function `mlx4_en_remove_timestamp':
>> (.text+0x397913): undefined reference to `ptp_clock_unregister'
drivers/built-in.o: In function `mlx4_en_init_timestamp':
>> (.text+0x397b20): undefined reference to `ptp_clock_register'
Fixes: ad7d4eaed995d ("mlx4_en: Add PTP hardware clock") Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jerry Chu [Tue, 7 Jan 2014 18:23:19 +0000 (10:23 -0800)]
net-gre-gro: Add GRE support to the GRO stack
This patch built on top of Commit 299603e8370a93dd5d8e8d800f0dff1ce2c53d36
("net-gro: Prepare GRO stack for the upcoming tunneling support") to add
the support of the standard GRE (RFC1701/RFC2784/RFC2890) to the GRO
stack. It also serves as an example for supporting other encapsulation
protocols in the GRO stack in the future.
The patch supports version 0 and all the flags (key, csum, seq#) but
will flush any pkt with the S (seq#) flag. This is because the S flag
is not support by GSO, and a GRO pkt may end up in the forwarding path,
thus requiring GSO support to break it up correctly.
Currently the "packet_offload" structure only contains L3 (ETH_P_IP/
ETH_P_IPV6) GRO offload support so the encapped pkts are limited to
IP pkts (i.e., w/o L2 hdr). But support for other protocol type can
be easily added, so is the support for GRE variations like NVGRE.
The patch also support csum offload. Specifically if the csum flag is on
and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE),
the code will take advantage of the csum computed by the h/w when
validating the GRE csum.
Note that commit 60769a5dcd8755715c7143b4571d5c44f01796f1 "ipv4: gre:
add GRO capability" already introduces GRO capability to IPv4 GRE
tunnels, using the gro_cells infrastructure. But GRO is done after
GRE hdr has been removed (i.e., decapped). The following patch applies
GRO when pkts first come in (before hitting the GRE tunnel code). There
is some performance advantage for applying GRO as early as possible.
Also this approach is transparent to other subsystem like Open vSwitch
where GRE decap is handled outside of the IP stack hence making it
harder for the gro_cells stuff to apply. On the other hand, some NICs
are still not capable of hashing on the inner hdr of a GRE pkt (RSS).
In that case the GRO processing of pkts from the same remote host will
all happen on the same CPU and the performance may be suboptimal.
I'm including some rough preliminary performance numbers below. Note
that the performance will be highly dependent on traffic load, mix as
usual. Moreover it also depends on NIC offload features hence the
following is by no means a comprehesive study. Local testing and tuning
will be needed to decide the best setting.
All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs.
(super_netperf 50 -H 192.168.1.18 -l 30)
An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1
mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123)
is configured.
The GRO support for pkts AFTER decap are controlled through the device
feature of the GRE device (e.g., ethtool -K gre1 gro on/off).
1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off
thruput: 9.16Gbps
CPU utilization: 19%
1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off
thruput: 5.9Gbps
CPU utilization: 15%
1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on
thruput: 9.26Gbps
CPU utilization: 12-13%
1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on
thruput: 9.26Gbps
CPU utilization: 10%
The following tests were performed on a different NIC that is capable of
csum offload. I.e., the h/w is capable of computing IP payload csum
(CHECKSUM_COMPLETE).
2.1 ethtool -K gre1 gro on (hence will use gro_cells)
There are many cases where this feature does not improve performance or even
reduces it.
For example, here are the results from tests that I've run using 3.12.6 on one
Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The results are
from the Xeon, but they're similar on the i7. All numbers report the
mean±stddev over 10 runs of 10s.
1) latency tests similar to what is described in "c6e1a0d net: Allow no-cache
copy from user on transmit"
There is no statistically significant difference between tx-nocache-copy
on/off.
nic irqs spread out (one queue per cpu)
200x netperf -r 1400,1
tx-nocache-copy off
692000±1000 tps
50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3
tx-nocache-copy on
693000±1000 tps
50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7
200x netperf -r 14000,14000
tx-nocache-copy off
86450±80 tps
50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40
tx-nocache-copy on
86110±60 tps
50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20
2) single stream throughput tests
tx-nocache-copy leads to higher service demand
nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send)
tx-nocache-copy off 9402±5 9.4±0.2 0.80±0.01
tx-nocache-copy on 9403±3 9.85±0.04 0.838±0.004
nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send)
tx-nocache-copy off 9401±5 5.83±0.03 5.0±0.1 0.923±0.007
tx-nocache-copy on 9404±2 5.74±0.03 5.523±0.009 0.958±0.002
As a second example, here are some results from Eric Dumazet with latest
net-next.
tx-nocache-copy also leads to higher service demand
(cpu is Intel(R) Xeon(R) CPU X5660 @ 2.80GHz)
lpq83:~# ./ethtool -K eth0 tx-nocache-copy on
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB
lpq83:~# ./ethtool -K eth0 tx-nocache-copy off
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB
Jiri Pirko [Tue, 7 Jan 2014 14:55:45 +0000 (15:55 +0100)]
ipv4: loopback device: ignore value changes after device is upped
When lo is brought up, new ifa is created. Then, devconf and neigh values
bitfield should be set so later changes of default values would not
affect lo values.
Note that the same behaviour is in ipv6. Also note that this is likely
not an issue in many distros (for example Fedora 19) because userspace
sets address to lo manually before bringing it up.
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
FX Le Bail [Tue, 7 Jan 2014 13:57:27 +0000 (14:57 +0100)]
IPv6: add the option to use anycast addresses as source addresses in echo reply
This change allows to follow a recommandation of RFC4942.
- Add "anycast_src_echo_reply" sysctl to control the use of anycast addresses
as source addresses for ICMPv6 echo reply. This sysctl is false by default
to preserve existing behavior.
- Add inline check ipv6_anycast_destination().
- Use them in icmpv6_echo_reply().
2.1.6. Anycast Traffic Identification and Security
[...]
To avoid exposing knowledge about the internal structure of the
network, it is recommended that anycast servers now take advantage of
the ability to return responses with the anycast address as the
source address if possible.
Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Tue, 7 Jan 2014 08:56:07 +0000 (16:56 +0800)]
net/mlx4_en: fix error return code in mlx4_en_get_qp()
Fix to return a negative error code from the error handling
case instead of 0.
Fixes: 837052d0ccc5 ('net/mlx4_en: Add netdev support for TCP/IP offloads of vxlan tunneling') Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
Hayes Wang [Tue, 7 Jan 2014 03:18:22 +0000 (11:18 +0800)]
r8152: correct some messages
- Replace pr_warn_ratelimited() with net_ratelimit() and netdev_warn().
- Adjust the algnment of some messages.
- Remove the peroid.
- Fix some messages don't have terminating newline.
Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 7 Jan 2014 01:37:41 +0000 (20:37 -0500)]
bna: Fix build due to missing use of dma_unmap_len_set()
> as reported for linux-next of Dec.20, 2013
> when CONFIG_NEED_DMA_MAP_STATE is not enabled:
>
> drivers/net/ethernet/brocade/bna/bnad.c: In function 'bnad_start_xmit':
> drivers/net/ethernet/brocade/bna/bnad.c:3074:26: error: 'struct bnad_tx_vector' has no member named 'dma_len'
Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Hauke Mehrtens [Mon, 6 Jan 2014 22:24:29 +0000 (23:24 +0100)]
bgmac: fix typos
This fixes some typos found by Sergei.
Reported-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 7 Jan 2014 00:48:38 +0000 (19:48 -0500)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch
Jesse Gross says:
====================
[GIT net-next] Open vSwitch
Open vSwitch changes for net-next/3.14. Highlights are:
* Performance improvements in the mechanism to get packets to userspace
using memory mapped netlink and skb zero copy where appropriate.
* Per-cpu flow stats in situations where flows are likely to be shared
across CPUs. Standard flow stats are used in other situations to save
memory and allocation time.
* A handful of code cleanups and rationalization.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Graf [Fri, 13 Dec 2013 14:22:22 +0000 (15:22 +0100)]
openvswitch: Compute checksum in skb_gso_segment() if needed
The copy & csum optimization is no longer present with zerocopy
enabled. Compute the checksum in skb_gso_segment() directly by
dropping the HW CSUM capability from the features passed in.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Jesse Gross <jesse@nicira.com>
Thomas Graf [Fri, 13 Dec 2013 14:22:21 +0000 (15:22 +0100)]
openvswitch: Use skb_zerocopy() for upcall
Use of skb_zerocopy() can avoid the expensive call to memcpy()
when copying the packet data into the Netlink skb. Completes
checksum through skb_checksum_help() if not already done in
GSO segmentation.
Zerocopy is only performed if user space supported unaligned
Netlink messages. memory mapped netlink i/o is preferred over
zerocopy if it is set up.
Thomas Graf [Fri, 13 Dec 2013 14:22:17 +0000 (15:22 +0100)]
net: Export skb_zerocopy() to zerocopy from one skb to another
Make the skb zerocopy logic written for nfnetlink queue available for
use by other modules.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jesse Gross <jesse@nicira.com>
Daniel Borkmann [Tue, 10 Dec 2013 11:02:03 +0000 (12:02 +0100)]
net: ovs: use kfree_rcu instead of rcu_free_{sw_flow_mask_cb,acts_callback}
As we're only doing a kfree() anyway in the RCU callback, we can
simply use kfree_rcu, which does the same job, and remove the
function rcu_free_sw_flow_mask_cb() and rcu_free_acts_callback().
Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
Pravin B Shelar [Wed, 30 Oct 2013 00:22:21 +0000 (17:22 -0700)]
openvswitch: Per cpu flow stats.
With mega flow implementation ovs flow can be shared between
multiple CPUs which makes stats updates highly contended
operation. This patch uses per-CPU stats in cases where a flow
is likely to be shared (if there is a wildcard in the 5-tuple
and therefore likely to be spread by RSS). In other situations,
it uses the current strategy, saving memory and allocation time.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
An insufficent ring frame size configuration can lead to an
unnecessary skb allocation for every Netlink message. Check frame
size before taking the queue lock and allocating the skb and
re-check with lock to be safe.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
Thomas Graf [Sat, 30 Nov 2013 12:21:30 +0000 (13:21 +0100)]
genl: Add genlmsg_new_unicast() for unicast message allocation
Allocates a new sk_buff large enough to cover the specified payload
plus required Netlink headers. Will check receiving socket for
memory mapped i/o capability and use it if enabled. Will fall back
to non-mapped skb if message size exceeds the frame size of the ring.
Signed-of-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
Jesse Gross [Tue, 3 Dec 2013 18:58:53 +0000 (10:58 -0800)]
openvswitch: Silence RCU lockdep checks from flow lookup.
Flow lookup can happen either in packet processing context or userspace
context but it was annotated as requiring RCU read lock to be held. This
also allows OVS mutex to be held without causing warnings.