Asias He [Sun, 2 Dec 2012 01:33:53 +0000 (09:33 +0800)]
vhost-blk: Add vhost-blk support v6
vhost-blk is an in-kernel virito-blk device accelerator.
Due to lack of proper in-kernel AIO interface, this version converts
guest's I/O request to bio and use submit_bio() to submit I/O directly.
So this version any supports raw block device as guest's disk image,
e.g. /dev/sda, /dev/ram0. We can add file based image support to
vhost-blk once we have in-kernel AIO interface. There are some work in
progress for in-kernel AIO interface from Dave Kleikamp and Zach Brown:
Performance evaluation:
-----------------------------
LKVM: Fio with libaio ioengine on 1 Fusion IO device
IOPS(k) Before After Improvement
seq-read 107 121 +13.0%
seq-write 130 179 +37.6%
rnd-read 102 122 +19.6%
rnd-write 125 159 +27.0%
QEMU: Fio with libaio ioengine on 1 Fusion IO device
IOPS(k) Before After Improvement
seq-read 76 123 +61.8%
seq-write 139 173 +24.4%
rnd-read 73 120 +64.3%
rnd-write 75 156 +108.0%
QEMU: Fio with libaio ioengine on 1 Ramdisk device
IOPS(k) Before After Improvement
seq-read 138 437 +216%
seq-write 191 436 +128%
rnd-read 137 426 +210%
rnd-write 140 415 +196%
QEMU: Fio with libaio ioengine on 8 Ramdisk device
50% read + 50% write
IOPS(k) Before After Improvement
randrw 64/64 189/189 +195%/+195%
Userspace bits:
-----------------------------
1) LKVM
The latest vhost-blk userspace bits for kvm tool can be found here:
git@github.com:asias/linux-kvm.git blk.vhost-blk
2) QEMU
The latest vhost-blk userspace prototype for QEMU can be found here:
git@github.com:asias/qemu.git blk.vhost-blk
Changes in v6:
- Use inline req_page_list to reduce kmalloc
- Switch to single thread model, thanks mst!
- Wait until requests fired before vhost_blk_flush to be finished
Changes in v5:
- Do not assume the buffer layout
- Fix wakeup race
Changes in v4:
- Mark req->status as userspace pointer
- Use __copy_to_user() instead of copy_to_user() in vhost_blk_set_status()
- Add if (need_resched()) schedule() in blk thread
- Kill vhost_blk_stop_vq() and move it into vhost_blk_stop()
- Use vq_err() instead of pr_warn()
- Fail un Unsupported request
- Add flush in vhost_blk_set_features()
Changes in v3:
- Sending REQ_FLUSH bio instead of vfs_fsync, thanks Christoph!
- Check file passed by user is a raw block device file
Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Asias He <asias@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Zero copy TX has been around for a while now.
We seem to be down to eliminating theoretical bugs
and performance tuning at this point:
it's probably time to enable it by default so that
most users get the benefit.
Keep the flag around meanwhile so users can experiment
with disabling this if they experience regressions.
I expect that we will remove it in the future.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
For short packets zerocopy mode adds overhead
of managing heads which isn't necessary: we
could simly update used ring directly
same as with zerocopy disabled.
Things seem to run a bit faster if we detect
and bypass head management when zcopy isn't used.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
vhost-net: flush outstanding DMAs on memory change
When memory map changes, we need to flush outstanding
DMAs as they might in theory reference old memory addresses.
To do this simply stop initiating new DMAs
and wait for ubufs ref count to drop to 0.
Afterwards reset the count back to 1.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
vring changes already do a flush internally where appropriate, so we do
not need a second flush.
It's currently not very expensive but a follow-up patch makes flush more
heavy-weight, so remove the extra flush here to avoid regressing
performance if call or kick fds are changed on data path.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cong Wang [Thu, 6 Dec 2012 02:04:04 +0000 (10:04 +0800)]
net: fix some compiler warning in net/core/neighbour.c
net/core/neighbour.c:65:12: warning: 'zero' defined but not used [-Wunused-variable]
net/core/neighbour.c:66:12: warning: 'unres_qlen_max' defined but not used [-Wunused-variable]
These variables are only used when CONFIG_SYSCTL is defined,
so move them under #ifdef CONFIG_SYSCTL.
Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Cong Wang <amwang@redhat.com> Acked-by: Shan Wei <davidshan@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 5 Dec 2012 21:24:45 +0000 (16:24 -0500)]
bridge: implement multicast fast leave
V3: make it a flag
V2: make the toggle per-port
Fast leave allows bridge to immediately stops the multicast
traffic on the port receives IGMP Leave when IGMP snooping is enabled,
no timeouts are observed.
Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com>
Eddie Wai [Wed, 5 Dec 2012 10:10:15 +0000 (10:10 +0000)]
cnic: Fix rare race condition during iSCSI disconnect.
If the initiator and target try to close the connection at about the same
time, there is a race condition in the termination sequence for bnx2x.
Fix the problem by waiting for the remote termination to complete before
deleting the Connection ID. This will prevent the firmware assert.
Update version to 2.5.15.
Signed-off-by: Eddie Wai <eddie.wai@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Shan Wei [Tue, 4 Dec 2012 18:49:15 +0000 (18:49 +0000)]
net: neighbour: prohibit negative value for unres_qlen_bytes parameter
unres_qlen_bytes and unres_qlen are int type.
But multiple relation(unres_qlen_bytes = unres_qlen * SKB_TRUESIZE(ETH_FRAME_LEN))
will cause type overflow when seting unres_qlen. e.g.
The gutted value is not that we setting。
But user/administrator don't know this is caused by int type overflow.
what's more, it is meaningless and even dangerous that unres_qlen_bytes is set
with negative number. Because, for unresolved neighbour address, kernel will cache packets
without limit in __neigh_event_send()(e.g. (u32)-1 = 2GB).
Signed-off-by: Shan Wei <davidshan@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Gallatin [Tue, 4 Dec 2012 10:17:15 +0000 (10:17 +0000)]
myri10ge: fix most sparse warnings
- convert remaining htonl/ntohl +__raw_read/__raw_writel to
swab32 + readl/writel
- add missing __iomem qualifier in myri10ge_open()
- fix dubious: x & !y warning by switching from logical to bitwise not
The swab32 conversion fixes a bug in myri10ge_led() where
big-endian machines would write the wrong pattern.
The only remaining warning (lock context imbalance) is due to
the use of __netif_tx_trylock(), and cannot easily be fixed.
Signed-off-by: Andrew Gallatin <gallatin@myri.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Amerigo Wang [Mon, 3 Dec 2012 23:56:40 +0000 (23:56 +0000)]
bridge: implement multicast fast leave
V2: make the toggle per-port
Fast leave allows bridge to immediately stops the multicast
traffic on the port receives IGMP Leave when IGMP snooping is enabled,
no timeouts are observed.
Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Stephen Hemminger <shemminger@vyatta.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <amwang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Serge Hallyn [Mon, 3 Dec 2012 16:17:12 +0000 (16:17 +0000)]
net: dev_change_net_namespace: send a KOBJ_REMOVED/KOBJ_ADD
When a new nic is created in namespace ns1, the kernel sends a KOBJ_ADD uevent
to ns1. When the nic is moved to ns2, we only send a KOBJ_MOVE to ns2, and
nothing to ns1.
This patch changes that behavior so that when moving a nic from ns1 to ns2, we
send a KOBJ_REMOVED to ns1 and KOBJ_ADD to ns2. (The KOBJ_MOVE is still
sent to ns2).
The effects of this can be seen when starting and stopping containers in
an upstart based host. Lxc will create a pair of veth nics, the kernel
sends KOBJ_ADD, and upstart starts network-instance jobs for each. When
one nic is moved to the container, because no KOBJ_REMOVED event is
received, the network-instance job for that veth never goes away. This
was reported at https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1065589
With this patch the networ-instance jobs properly go away.
The other oddness solved here is that if a nic is passed into a running
upstart-based container, without this patch no network-instance job is
started in the container. But when the container creates a new nic
itself (ip link add new type veth) then network-interface jobs are
created. With this patch, behavior comes in line with a regular host.
v2: also send KOBJ_ADD to new netns. There will then be a
_MOVE event from the device_rename() call, but that should
be innocuous.
Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Tue, 4 Dec 2012 01:13:41 +0000 (01:13 +0000)]
ip6mr: advertise new mfc entries via rtnl
This patch allows to monitor mf6c activities via rtnetlink.
To avoid parsing two times the mf6c oifs, we use maxvif to allocate the rtnl
msg, thus we may allocate some superfluous space.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Tue, 4 Dec 2012 01:13:40 +0000 (01:13 +0000)]
ipmr: advertise new mfc entries via rtnl
This patch allows to monitor mfc activities via rtnetlink.
To avoid parsing two times the mfc oifs, we use maxvif to allocate the rtnl
msg, thus we may allocate some superfluous space.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Tue, 4 Dec 2012 01:13:39 +0000 (01:13 +0000)]
ipmr/ip6mr: allow to get unresolved cache via netlink
/proc/net/ip[6]_mr_cache allows to get all mfc entries, even if they are put in
the unresolved list (mfc[6]_unres_queue). But only the table RT_TABLE_DEFAULT is
displayed.
This patch adds the parsing of the unresolved list when the dump is made via
rtnetlink, hence each table can be checked.
In IPv6, we set rtm_type in ip6mr_fill_mroute(), because in case of unresolved
mfc __ip6mr_fill_mroute() will not set it. In IPv4, it is already done.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Tue, 4 Dec 2012 01:13:38 +0000 (01:13 +0000)]
ipmr/ip6mr: report origin of mfc entry into rtnl msg
A mfc entry can be static or not (added via the mroute_sk socket). The patch
reports MFC_STATIC flag into rtm_protocol by setting rtm_protocol to
RTPROT_STATIC or RTPROT_MROUTED.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Tue, 4 Dec 2012 01:13:37 +0000 (01:13 +0000)]
ipmr/ip6mr: advertise mfc stats via rtnetlink
These statistics can be checked only via /proc/net/ip_mr_cache or
SIOCGETSGCNT[_IN6] and thus only for the table RT_TABLE_DEFAULT.
Advertising them via rtnetlink allows to get statistics for all cache entries,
whatever the table is.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 4 Dec 2012 18:01:19 +0000 (13:01 -0500)]
Merge branch 'master' of git://1984.lsi.us.es/nf-next
Pablo Neira Ayuso says:
====================
* Remove limitation in the maximum number of supported sets in ipset.
Now ipset automagically increments the number of slots in the array
of sets by 64 new spare slots, from Jozsef Kadlecsik.
* Partially remove the generic queue infrastructure now that ip_queue
is gone. Its only client is nfnetlink_queue now, from Florian
Westphal.
* Add missing attribute policy checkings in ctnetlink, from Florian
Westphal.
* Automagically kill conntrack entries that use the wrong output
interface for the masquerading case in case of routing changes,
from Jozsef Kadlecsik.
* Two patches two improve ct object traceability. Now ct objects are
always placed in any of the existing lists. This allows us to dump
the content of unconfirmed and dying conntracks via ctnetlink as
a way to provide more instrumentation in case you suspect leaks,
from myself.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sony Chacko [Tue, 4 Dec 2012 03:33:56 +0000 (03:33 +0000)]
qlcnic: update NIC partition interface routines
Refactor 82xx driver to support new adapter
Update routines to support variable number of NIC partitions
Signed-off-by: Rajesh Borundia <rajesh.borundia@qlogic.com> Signed-off-by: Sony Chacko <sony.chacko@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sony Chacko [Tue, 4 Dec 2012 03:33:54 +0000 (03:33 +0000)]
qlcnic: modify PCI and register access routines
Refactor 82xx driver to support new adapter
Update PCI and hardware access routines
Signed-off-by: Anirban Chakraborty <anirban.chakraborty@qlogic.com> Signed-off-by: Sony Chacko <sony.chacko@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sony Chacko [Tue, 4 Dec 2012 03:33:53 +0000 (03:33 +0000)]
qlcnic: move HW specific data to seperate structure
Move HW specific data to a seperate structure as part of
refactoring 82xx adapter driver.
Signed-off-by: Anirban Chakraborty <anirban.chakraborty@qlogic.com> Signed-off-by: Sony Chacko <sony.chacko@qlogic.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Carlson [Mon, 3 Dec 2012 19:37:00 +0000 (19:37 +0000)]
tg3: PTP - Enable the timestamping feature in hardware and fill skb tx/rx timestamps
This patch implements the hardware timestamping as described in
Documentation/networking/timestamping.txt
Update version to 3.128.
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Cc: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Carlson [Mon, 3 Dec 2012 19:36:59 +0000 (19:36 +0000)]
tg3: PTP - Add the hardware timestamp ioctl
This patch implements the SIOCSHWTSTAMP ioctl as described in
Documentation/networking/timestamping.txt
[Removed HWTSTAMP_FILTER_ALL handling by returning -ERANGE based on input
from Richard Cochran.]
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Cc: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Carlson [Mon, 3 Dec 2012 19:36:58 +0000 (19:36 +0000)]
tg3: PTP - Implement the ptp api and ethtool functions
This patch adds the ptp_caps structure, ptp api implementation,
reference clock read and register/unregister functions. All the basic
clock operations as described in Documentation/ptp/ptp.txt are
supported.
Frequency adjustment is performed using hardware with a 24 bit
accumulator and a programmable correction value. On each clk, the
correction value gets added to the accumulator and when it overflows,
the time counter is incremented/decremented and the accumulator reset.
So conversion from ppb to correction value is
ppb * (1 << 24) / 1000000000
[Re-organized to put the ptp_clock_info struct declaration in one patch,
added ptp_clock_info.name, and added locking to tg3_ptp_adjtime() based
on input from Richard Cochran.]
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Cc: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds code to write the reference clock. If a chip reset is
performed, the hwclock is reinitialized with the adjusted kernel time
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Cc: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tg3: Fix inconsistent locking for tg3_netif_start().
Every caller holds tp->lock when calling tg3_netif_start() except
tg3_io_resume(). Fix it so that it is all consistent. The subsequent
PTP patches add tg3_ptp_resume() to tg3_netif_start() and the tp->lock
is required.
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com> Signed-off-by: Michael Chan <mchan@broadcom.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 3 Dec 2012 20:35:28 +0000 (15:35 -0500)]
Merge tag 'dev_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/net-next
Networking: Remove __dev* markings from the networking drivers
This is a series of patches that remove the dev* attributes for all
networking drivers, with the exception of wireless drivers, those are in
a different branch.
Use of __devinit, __devexit_p, __devinitdata, __devinitconst, and
__devexit are no longer needed since CONFIG_HOTPLUG is being removed as
an option.
Note, there are some devinit compiler section mismatch warnings due to
this series, but they are fixed up when merged with my driver-next
branch, which fixes the PCI device id warnings, and removes the modpost
detection, as it's no longer needed.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Marks [Mon, 3 Dec 2012 10:26:54 +0000 (10:26 +0000)]
ipv6: Fix default route failover when CONFIG_IPV6_ROUTER_PREF=n
I believe this commit from 2008 was incorrect:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=398bcbebb6f721ac308df1e3d658c0029bb74503
When CONFIG_IPV6_ROUTER_PREF is disabled, the kernel should follow
RFC4861 section 6.3.6: if no route is NUD_VALID, then traffic should be
sprayed across all routers (indirectly triggering NUD) until one of them
becomes NUD_VALID.
However, the following experiment demonstrates that this does not work:
1) Connect to an IPv6 network.
2) Change the router's MAC (and link-local) address.
The kernel will lock onto the first router and never try the new one, even
if the first becomes unreachable. This patch fixes the problem by
allowing rt6_check_neigh() to return 0; if all routers return 0, then
rt6_select() will fall back to round-robin behavior.
This patch should have no effect when CONFIG_IPV6_ROUTER_PREF=y.
Note that rt6_check_neigh() is only used in a boolean context, so I've
changed its return type accordingly.
Signed-off-by: Paul Marks <pmarks@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Historically tun supported two modes of operation:
- in default mode, a small number of packets would get queued
at the device, the rest would be queued in qdisc
- in one queue mode, all packets would get queued at the device
This might have made sense up to a point where we made the
queue depth for both modes the same and set it to
a huge value (500) so unless the consumer
is stuck the chance of losing packets is small.
Thus in practice both modes behave the same, but the
default mode has some problems:
- if packets are never consumed, fragments are never orphaned
which cases a DOS for sender using zero copy transmit
- overrun errors are hard to diagnose: fifo error is incremented
only once so you can not distinguish between
userspace that is stuck and a transient failure,
tcpdump on the device does not show any traffic
Userspace solves this simply by enabling IFF_ONE_QUEUE
but there seems to be little point in not doing the
right thing for everyone, by default.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Bill Pemberton [Mon, 3 Dec 2012 14:24:25 +0000 (09:24 -0500)]
net/intel: remove __dev* attributes
CONFIG_HOTPLUG is going away as an option. As result the __dev*
markings will be going away.
Remove use of __devinit, __devexit_p, __devinitdata, __devinitconst,
and __devexit.
Signed-off-by: Bill Pemberton <wfp5p@virginia.edu> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Jesse Brandeburg <jesse.brandeburg@intel.com> Cc: Bruce Allan <bruce.w.allan@intel.com> Cc: Carolyn Wyborny <carolyn.wyborny@intel.com> Cc: Don Skidmore <donald.c.skidmore@intel.com> Cc: Greg Rose <gregory.v.rose@intel.com> Cc: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Cc: Alex Duyck <alexander.h.duyck@intel.com> Cc: John Ronciak <john.ronciak@intel.com> Cc: Tushar Dave <tushar.n.dave@intel.com> Cc: e1000-devel@lists.sourceforge.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bill Pemberton [Mon, 3 Dec 2012 14:24:15 +0000 (09:24 -0500)]
virtio_net: remove __dev* attributes
CONFIG_HOTPLUG is going away as an option. As result the __dev*
markings will be going away.
Remove use of __devinit, __devexit_p, __devinitdata, __devinitconst,
and __devexit.
Signed-off-by: Bill Pemberton <wfp5p@virginia.edu> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: virtualization@lists.linux-foundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Bill Pemberton [Mon, 3 Dec 2012 14:24:10 +0000 (09:24 -0500)]
fddi: remove __dev* attributes
CONFIG_HOTPLUG is going away as an option. As result the __dev*
markings will be going away.
Remove use of __devinit, __devexit_p, __devinitdata, __devinitconst,
and __devexit.
Signed-off-by: Bill Pemberton <wfp5p@virginia.edu> Cc: Maciej W. Rozycki <macro@linux-mips.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>