There are many cases where this feature does not improve performance or even
reduces it.
For example, here are the results from tests that I've run using 3.12.6 on one
Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The results are
from the Xeon, but they're similar on the i7. All numbers report the
mean±stddev over 10 runs of 10s.
1) latency tests similar to what is described in "c6e1a0d net: Allow no-cache
copy from user on transmit"
There is no statistically significant difference between tx-nocache-copy
on/off.
nic irqs spread out (one queue per cpu)
200x netperf -r 1400,1
tx-nocache-copy off
692000±1000 tps
50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3
tx-nocache-copy on
693000±1000 tps
50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7
200x netperf -r 14000,14000
tx-nocache-copy off
86450±80 tps
50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40
tx-nocache-copy on
86110±60 tps
50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20
2) single stream throughput tests
tx-nocache-copy leads to higher service demand
nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send)
tx-nocache-copy off 9402±5 9.4±0.2 0.80±0.01
tx-nocache-copy on 9403±3 9.85±0.04 0.838±0.004
nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send)
tx-nocache-copy off 9401±5 5.83±0.03 5.0±0.1 0.923±0.007
tx-nocache-copy on 9404±2 5.74±0.03 5.523±0.009 0.958±0.002
As a second example, here are some results from Eric Dumazet with latest
net-next.
tx-nocache-copy also leads to higher service demand
(cpu is Intel(R) Xeon(R) CPU X5660 @ 2.80GHz)
lpq83:~# ./ethtool -K eth0 tx-nocache-copy on
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB
lpq83:~# ./ethtool -K eth0 tx-nocache-copy off
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv Send Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % U us/KB us/KB
Erik Hugne [Tue, 7 Jan 2014 20:51:36 +0000 (15:51 -0500)]
tipc: correctly unlink packets from deferred packet queue
When we pull a received packet from a link's 'deferred packets' queue
for processing, its 'next' pointer is not cleared, and still refers to
the next packet in that queue, if any. This is incorrect, but caused
no harm before commit 40ba3cdf542a469aaa9083fa041656e59b109b90 ("tipc:
message reassembly using fragment chain") was introduced. After that
commit, it may sometimes lead to the following oops:
This happens when the last fragment of a message has passed through the
the receiving link's 'deferred packets' queue, and at least one other
packet was added to that queue while it was there. After the fragment
chain with the complete message has been successfully delivered to the
receiving socket, it is released. Since 'next' pointer of the last
fragment in the released chain now is non-NULL, we get the crash shown
above.
We fix this by clearing the 'next' pointer of all received packets,
including those being pulled from the 'deferred' queue, before they
undergo any further processing.
Fixes: 40ba3cdf542a4 ("tipc: message reassembly using fragment chain") Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Reported-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Tue, 7 Jan 2014 14:55:45 +0000 (15:55 +0100)]
ipv4: loopback device: ignore value changes after device is upped
When lo is brought up, new ifa is created. Then, devconf and neigh values
bitfield should be set so later changes of default values would not
affect lo values.
Note that the same behaviour is in ipv6. Also note that this is likely
not an issue in many distros (for example Fedora 19) because userspace
sets address to lo manually before bringing it up.
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
FX Le Bail [Tue, 7 Jan 2014 13:57:27 +0000 (14:57 +0100)]
IPv6: add the option to use anycast addresses as source addresses in echo reply
This change allows to follow a recommandation of RFC4942.
- Add "anycast_src_echo_reply" sysctl to control the use of anycast addresses
as source addresses for ICMPv6 echo reply. This sysctl is false by default
to preserve existing behavior.
- Add inline check ipv6_anycast_destination().
- Use them in icmpv6_echo_reply().
2.1.6. Anycast Traffic Identification and Security
[...]
To avoid exposing knowledge about the internal structure of the
network, it is recommended that anycast servers now take advantage of
the ability to return responses with the anycast address as the
source address if possible.
Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Tue, 7 Jan 2014 08:56:07 +0000 (16:56 +0800)]
net/mlx4_en: fix error return code in mlx4_en_get_qp()
Fix to return a negative error code from the error handling
case instead of 0.
Fixes: 837052d0ccc5 ('net/mlx4_en: Add netdev support for TCP/IP offloads of vxlan tunneling') Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
Before 469bdcefdc ("ipv6: fix the use of pcpu_tstats in ip6_vti.c"),
the pcpu_tstats.syncp is not used to pretect the 64bit elements of
pcpu_tstats, so not appear this calltrace.
Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
swiotlb: Don't DoS us with 'swiotlb buffer is full' (v2)
There is no need for that so lets use ratelimiting.
Also add some extra information to be helpful.
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
[v2: s/ld/zs on the printk] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Huang Shijie [Wed, 25 Dec 2013 06:19:27 +0000 (14:19 +0800)]
ARM: dts: vf610: use the interrupt macros
This patch uses the IRQ_TYPE_LEVEL_HIGH/IRQ_TYPE_NONE to replace
the hardcode.
[shawn.guo: While at it, we also fix the typo in uart0 interrupts
property, where the 0x00 should 0x04. Hense, it should also be
IRQ_TYPE_LEVEL_HIGH just like other UART instances.]
Bin Shi [Fri, 3 Jan 2014 06:08:54 +0000 (14:08 +0800)]
apm-emulation: add hibernation APM events to support suspend2disk
Some embedded systems use hibernation for fast boot. and in it,
some software components need to handle specific things before
hibernation and after restore. So it needs to capture the apm
status about these pm events.
Currently apm just supports suspend to ram, but not suspend to disk,
so here add logic about hibernation apm events.
Signed-off-by: Bin Shi <Bin.Shi@csr.com> Signed-off-by: Barry Song <Baohua.Song@csr.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Bob Peterson [Mon, 6 Jan 2014 22:16:01 +0000 (17:16 -0500)]
GFS2: Increase i_writecount during gfs2_setattr_chown
This patch calls get_write_access in function gfs2_setattr_chown,
which merely increases inode->i_writecount for the duration of the
function. That will ensure that any file closes won't delete the
inode's multi-block reservation while the function is running.
It also ensures that a multi-block reservation exists when needed
for quota change operations during the chown.
Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Daniel Vetter [Tue, 7 Jan 2014 08:24:31 +0000 (09:24 +0100)]
MAINTAINERS: Updates for drm/i915
Jani for co-maintainer!
Jani has been a really active bug-scrubber in the past few months.
I've asked him whether he wants to do this in a more official capacity
and he agreed. I've already chatted with Dave and Jesse and they
support this.
Note that everyone can't now just relax because "Jani will do all the
bug scrubbing" - au contraire expect more nagging and poking now that
we have more bandwidth.
Longer-term the plan is to share more of the maintainer duties, but we
need to fix up the infrastructure a bit first (like moving the git
repo to a common location).
While at it also add the newly set-up patchwork instance.
Cc: Dave Airlie <airlied@gmail.com> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Hayes Wang [Tue, 7 Jan 2014 03:18:22 +0000 (11:18 +0800)]
r8152: correct some messages
- Replace pr_warn_ratelimited() with net_ratelimit() and netdev_warn().
- Adjust the algnment of some messages.
- Remove the peroid.
- Fix some messages don't have terminating newline.
Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 7 Jan 2014 01:37:41 +0000 (20:37 -0500)]
bna: Fix build due to missing use of dma_unmap_len_set()
> as reported for linux-next of Dec.20, 2013
> when CONFIG_NEED_DMA_MAP_STATE is not enabled:
>
> drivers/net/ethernet/brocade/bna/bnad.c: In function 'bnad_start_xmit':
> drivers/net/ethernet/brocade/bna/bnad.c:3074:26: error: 'struct bnad_tx_vector' has no member named 'dma_len'
Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Hauke Mehrtens [Mon, 6 Jan 2014 22:24:29 +0000 (23:24 +0100)]
bgmac: fix typos
This fixes some typos found by Sergei.
Reported-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 7 Jan 2014 00:48:38 +0000 (19:48 -0500)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch
Jesse Gross says:
====================
[GIT net-next] Open vSwitch
Open vSwitch changes for net-next/3.14. Highlights are:
* Performance improvements in the mechanism to get packets to userspace
using memory mapped netlink and skb zero copy where appropriate.
* Per-cpu flow stats in situations where flows are likely to be shared
across CPUs. Standard flow stats are used in other situations to save
memory and allocation time.
* A handful of code cleanups and rationalization.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull networking fixes from David Miller:
"I'm hoping this is the very last batch of networking fixes for 3.13,
here goes nothing:
1) Fix crashes in VLAN's header_ops passthru.
2) Bridge multicast code needs to use BH spinlocks to prevent
deadlocks with timers. From Curt Brune.
3) ipv6 tunnels lack proper synchornization when updating percpu
statistics. From Li RongQing.
4) Fixes to bnx2x driver from Yaniv Rosner, Dmitry Kravkov and Michal
Kalderon.
5) Avoid undefined operator evaluation order in llc code, from Daniel
Borkmann.
6) Error paths in various GSO offload paths do not unwind properly,
in particular they must undo any modifications they have made to
the SKB. From Wei-Chun Chao.
7) Fix RX refill races during restore in virtio-net, from Jason Wang.
8) Fix SKB use after free in LLC code, from Daniel Borkmann.
9) Missing unlock and OOPS in netpoll code when VLAN tag handling
fails.
10) Fix vxlan device attachment wrt ipv6, from Fan Du.
11) Don't allow creating infiniband links to non-infiniband devices,
from Hangbin Liu.
12) Revert FEC phy reset active low change, it breaks things. From
Fabio Estevam.
13) Fix header pointer handling in 6lowpan header building code, from
Daniel Borkmann.
14) Fix RSS handling in be2net driver, from Vasundhara Volam.
15) Fix modem port indexing in HSO driver, from Dan Williams"
* http://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (38 commits)
bridge: use spin_lock_bh() in br_multicast_set_hash_max
ipv6: don't install anycast address for /128 addresses on routers
hso: fix handling of modem port SERIAL_STATE notifications
isdn: Drop big endian cpp checks from telespci and hfc_pci drivers
be2net: fix max_evt_qs calculation for BE3 in SR-IOV config
be2net: increase the timeout value for loopback-test FW cmd
be2net: disable RSS when number of RXQs is reduced to 1 via set-channels
xen-netback: Include header for vmalloc
net: 6lowpan: fix lowpan_header_create non-compression memcpy call
fec: Revert "fec: Do not assume that PHY reset is active low"
bnx2x: fix VLAN configuration for VFs.
bnx2x: fix AFEX memory overflow
bnx2x: Clean before update RSS arrives
bnx2x: Correct number of MSI-X vectors for VFs
bnx2x: limit number of interrupt vectors for 57711
qlcnic: Fix bug in Tx completion path
infiniband: make sure the src net is infiniband when create new link
{vxlan, inet6} Mark vxlan_dev flags with VXLAN_F_IPV6 properly
cxgb4: allow large buffer size to have page size
netpoll: Fix missing TXQ unlock and and OOPS.
...
Thomas Graf [Fri, 13 Dec 2013 14:22:22 +0000 (15:22 +0100)]
openvswitch: Compute checksum in skb_gso_segment() if needed
The copy & csum optimization is no longer present with zerocopy
enabled. Compute the checksum in skb_gso_segment() directly by
dropping the HW CSUM capability from the features passed in.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Jesse Gross <jesse@nicira.com>
Thomas Graf [Fri, 13 Dec 2013 14:22:21 +0000 (15:22 +0100)]
openvswitch: Use skb_zerocopy() for upcall
Use of skb_zerocopy() can avoid the expensive call to memcpy()
when copying the packet data into the Netlink skb. Completes
checksum through skb_checksum_help() if not already done in
GSO segmentation.
Zerocopy is only performed if user space supported unaligned
Netlink messages. memory mapped netlink i/o is preferred over
zerocopy if it is set up.
Thomas Graf [Fri, 13 Dec 2013 14:22:17 +0000 (15:22 +0100)]
net: Export skb_zerocopy() to zerocopy from one skb to another
Make the skb zerocopy logic written for nfnetlink queue available for
use by other modules.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jesse Gross <jesse@nicira.com>
Daniel Borkmann [Tue, 10 Dec 2013 11:02:03 +0000 (12:02 +0100)]
net: ovs: use kfree_rcu instead of rcu_free_{sw_flow_mask_cb,acts_callback}
As we're only doing a kfree() anyway in the RCU callback, we can
simply use kfree_rcu, which does the same job, and remove the
function rcu_free_sw_flow_mask_cb() and rcu_free_acts_callback().
Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Jesse Gross <jesse@nicira.com>
Pravin B Shelar [Wed, 30 Oct 2013 00:22:21 +0000 (17:22 -0700)]
openvswitch: Per cpu flow stats.
With mega flow implementation ovs flow can be shared between
multiple CPUs which makes stats updates highly contended
operation. This patch uses per-CPU stats in cases where a flow
is likely to be shared (if there is a wildcard in the 5-tuple
and therefore likely to be spread by RSS). In other situations,
it uses the current strategy, saving memory and allocation time.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: Jesse Gross <jesse@nicira.com>