Jan Engelhardt [Mon, 19 Apr 2010 14:05:10 +0000 (16:05 +0200)]
netfilter: xtables: make ip_tables reentrant
Currently, the table traverser stores return addresses in the ruleset
itself (struct ip6t_entry->comefrom). This has a well-known drawback:
the jumpstack is overwritten on reentry, making it necessary for
targets to return absolute verdicts. Also, the ruleset (which might
be heavy memory-wise) needs to be replicated for each CPU that can
possibly invoke ip6t_do_table.
This patch decouples the jumpstack from struct ip6t_entry and instead
puts it into xt_table_info. Not being restricted by 'comefrom'
anymore, we can set up a stack as needed. By default, there is room
allocated for two entries into the traverser.
arp_tables is not touched though, because there is just one/two
modules and further patches seek to collapse the table traverser
anyhow.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Patrick McHardy <kaber@trash.net>
Jan Engelhardt [Mon, 19 Apr 2010 12:17:47 +0000 (14:17 +0200)]
netfilter: xtables: inclusion of xt_TEE
xt_TEE can be used to clone and reroute a packet. This can for
example be used to copy traffic at a router for logging purposes
to another dedicated machine.
References: http://www.gossamer-threads.com/lists/iptables/devel/68781 Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Patrick McHardy <kaber@trash.net>
Patrick McHardy [Thu, 15 Apr 2010 17:09:01 +0000 (19:09 +0200)]
netfilter: ipt_LOG/ip6t_LOG: use more appropriate log level as default
Use KERN_NOTICE instead of KERN_EMERG by default. This only affects
kernel internal logging (like conntrack), user-specified logging rules
contain a seperate log level.
netfilter: bridge-netfilter: Fix MAC header handling with IP DNAT
- fix IP DNAT on vlan- or pppoe-encapsulated traffic: The functions
neigh_hh_output() or dst->neighbour->output() overwrite the complete
Ethernet header, although we only need the destination MAC address.
For encapsulated packets, they ended up overwriting the encapsulating
header. The new code copies the Ethernet source MAC address and
protocol number before calling dst->neighbour->output(). The Ethernet
source MAC and protocol number are copied back in place in
br_nf_pre_routing_finish_bridge_slow(). This also makes the IP DNAT
more transparent because in the old scheme the source MAC of the
bridge was copied into the source address in the Ethernet header. We
also let skb->protocol equal ETH_P_IP resp. ETH_P_IPV6 during the
execution of the PF_INET resp. PF_INET6 hooks.
- Speed up IP DNAT by calling neigh_hh_bridge() instead of
neigh_hh_output(): if dst->hh is available, we already know the MAC
address so we can just copy it.
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be> Signed-off-by: Patrick McHardy <kaber@trash.net>
Remove br_netfilter.c::br_nf_local_out(). The function
br_nf_local_out() was needed because the PF_BRIDGE::LOCAL_OUT hook
could be called when IP DNAT happens on to-be-bridged traffic. The
new scheme eliminates this mess.
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be> Signed-off-by: Patrick McHardy <kaber@trash.net>
Jan Engelhardt [Tue, 13 Apr 2010 13:32:16 +0000 (15:32 +0200)]
netfilter: ipv6: add IPSKB_REROUTED exclusion to NF_HOOK/POSTROUTING invocation
Similar to how IPv4's ip_output.c works, have ip6_output also check
the IPSKB_REROUTED flag. It will be set from xt_TEE for cloned packets
since Xtables can currently only deal with a single packet in flight
at a time.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Acked-by: David S. Miller <davem@davemloft.net>
[Patrick: changed to use an IP6SKB value instead of IPSKB] Signed-off-by: Patrick McHardy <kaber@trash.net>
Jan Engelhardt [Tue, 13 Apr 2010 13:28:11 +0000 (15:28 +0200)]
netfilter: ipv6: move POSTROUTING invocation before fragmentation
Patrick McHardy notes: "We used to invoke IPv4 POST_ROUTING after
fragmentation as well just to defragment the packets in conntrack
immediately afterwards, but that got changed during the
netfilter-ipsec integration. Ideally IPv6 would behave like IPv4."
This patch makes it so. Sending an oversized frame (e.g. `ping6
-s64000 -c1 ::1`) will now show up in POSTROUTING as a single skb
rather than multiple ones.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Patrick McHardy <kaber@trash.net>
- remove some of the graffiti at the head of br_netfilter.c
- remove __br_dnat_complain()
- remove KERN_INFO messages when CONFIG_NETFILTER_DEBUG is defined
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be> Signed-off-by: Patrick McHardy <kaber@trash.net>
netfilter: xtables: make XT_ALIGN() usable in exported headers by exporting __ALIGN_KERNEL()
XT_ALIGN() was rewritten through ALIGN() by commit 42107f5009da223daa800d6da6904d77297ae829
"netfilter: xtables: symmetric COMPAT_XT_ALIGN definition".
ALIGN() is not exported in userspace headers, which created compile problem for tc(8)
and will create problem for iptables(8).
We can't export generic looking name ALIGN() but we can export less generic
__ALIGN_KERNEL() (suggested by Ben Hutchings).
Google knows nothing about __ALIGN_KERNEL().
COMPAT_XT_ALIGN() changed for symmetry.
Reported-by: Andreas Henriksson <andreas@fatal.se> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
Patrick McHardy [Fri, 9 Apr 2010 14:42:15 +0000 (16:42 +0200)]
netfilter: remove invalid rcu_dereference() calls
The CONFIG_PROVE_RCU option discovered a few invalid uses of
rcu_dereference() in netfilter. In all these cases, the code code
intends to check whether a pointer is already assigned when
performing registration or whether the assigned pointer matches
when performing unregistration. The entire registration/
unregistration is protected by a mutex, so we don't need the
rcu_dereference() calls.
Herbert Xu [Thu, 8 Apr 2010 12:52:28 +0000 (14:52 +0200)]
netfilter: only do skb_checksum_help on CHECKSUM_PARTIAL in ip_queue
While doing yet another audit on ip_summed I noticed ip_queue
calling skb_checksum_help unnecessarily. As we will set ip_summed
to CHECKSUM_NONE when necessary in ipq_mangle_ipv4, there is no
need to zap CHECKSUM_COMPLETE in ipq_build_packet_message.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Patrick McHardy <kaber@trash.net>
Patrick McHardy [Thu, 8 Apr 2010 11:35:47 +0000 (13:35 +0200)]
IPVS: fix potential stack overflow with overly long protocol names
When protocols use very long names, the sprintf calls might overflow
the on-stack buffer. No protocol in the kernel does this however.
Print the protocol name in the pr_debug statement directly to avoid
this.
Based on patch by Zhitong Wang <zhitong.wangzt@alibaba-inc.com> Acked-by: Simon Horman <horms@verge.net.au> Signed-off-by: Patrick McHardy <kaber@trash.net>
When extended status codes are available, such as ENOMEM on failed
allocations, or subsequent functions (e.g. nf_ct_get_l3proto), passing
them up to userspace seems like a good idea compared to just always
EINVAL.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Fri, 19 Mar 2010 16:16:42 +0000 (17:16 +0100)]
netfilter: xtables: change xt_target.checkentry return type
Restore function signatures from bool to int so that we can report
memory allocation failures or similar using -ENOMEM rather than
always having to pass -EINVAL back.
Jan Engelhardt [Fri, 19 Mar 2010 16:16:42 +0000 (17:16 +0100)]
netfilter: xtables: change xt_match.checkentry return type
Restore function signatures from bool to int so that we can report
memory allocation failures or similar using -ENOMEM rather than
always having to pass -EINVAL back.
This semantic patch may not be too precise (checking for functions
that use xt_mtchk_param rather than functions referenced by
xt_match.checkentry), but reviewed, it produced the intended result.
Jan Engelhardt [Tue, 23 Mar 2010 03:07:21 +0000 (04:07 +0100)]
netfilter: bridge: use NFPROTO values for NF_HOOK invocation
The first argument to NF_HOOK* is an nfproto since quite some time.
Commit v2.6.27-2457-gfdc9314 was the first to practically start using
the new names. Do that now for the remaining NF_HOOK calls.
The semantic patch used was:
// <smpl>
@@
@@
(NF_HOOK
|NF_HOOK_THRESH
)(
-PF_BRIDGE,
+NFPROTO_BRIDGE,
...)
Downgrade the log level to INFO for most checkentry messages as they
are, IMO, just an extra information to the -EINVAL code that is
returned as part of a parameter "constraint violation". Leave errors
to real errors, such as being unable to create a LED trigger.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Wed, 17 Mar 2010 23:27:03 +0000 (00:27 +0100)]
netfilter: xtables: do not print any messages on ENOMEM
ENOMEM is a very obvious error code (cf. EINVAL), so I think we do not
really need a warning message. Not to mention that if the allocation
fails, the user is most likely going to get a stack trace from slab
already.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Thu, 18 Mar 2010 10:03:51 +0000 (11:03 +0100)]
netfilter: xtables: remove almost-unused xt_match_param.data member
This member is taking up a "long" per match, yet is only used by one
module out of the roughly 90 modules, ip6t_hbh. ip6t_hbh can be
restructured a little to accomodate for the lack of the .data member.
This variant uses checking the par->match address, which should avoid
having to add two extra functions, including calls, i.e.
Jan Engelhardt [Wed, 17 Mar 2010 23:44:52 +0000 (00:44 +0100)]
netfilter: xtables: make use of caller family rather than match family
The matches can have .family = NFPROTO_UNSPEC, and though that is not
the case for the touched modules, it seems better to just use the
nfproto from the caller.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Tim Gardner [Tue, 16 Mar 2010 18:53:13 +0000 (19:53 +0100)]
netfilter: xt_recent: add an entry reaper
One of the problems with the way xt_recent is implemented is that
there is no efficient way to remove expired entries. Of course,
one can write a rule '-m recent --remove', but you have to know
beforehand which entry to delete. This commit adds reaper
logic which checks the head of the LRU list when a rule
is invoked that has a '--seconds' value and XT_RECENT_REAP set. If an
entry ceases to accumulate time stamps, then it will eventually bubble
to the top of the LRU list where it is then reaped.
Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
Jan Engelhardt [Sat, 28 Feb 2009 02:23:57 +0000 (03:23 +0100)]
netfilter: xtables: merge xt_MARK into xt_mark
Two arguments for combining the two:
- xt_mark is pretty useless without xt_MARK
- the actual code is so small anyway that the kmod metadata and the module
in its loaded state totally outweighs the combined actual code size.
i586-before:
-rw-r--r-- 1 jengelh users 3821 Feb 10 01:01 xt_MARK.ko
-rw-r--r-- 1 jengelh users 2592 Feb 10 00:04 xt_MARK.o
-rw-r--r-- 1 jengelh users 3274 Feb 10 01:01 xt_mark.ko
-rw-r--r-- 1 jengelh users 2108 Feb 10 00:05 xt_mark.o
text data bss dec hex filename
354 264 0 618 26a xt_MARK.o
223 176 0 399 18f xt_mark.o
And the runtime size is like 14 KB.
i586-after:
-rw-r--r-- 1 jengelh users 3264 Feb 18 17:28 xt_mark.o
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Joe Perches [Wed, 17 Mar 2010 04:24:32 +0000 (21:24 -0700)]
drivers/net/e100.c: Use pr_<level> and netif_<level>
Convert DPRINTK, commonly used for debugging, to netif_<level>
Remove #define PFX
Use #define pr_fmt
Consistently use no periods for non-sentence logging messages
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Gunthorpe [Tue, 9 Mar 2010 09:17:42 +0000 (09:17 +0000)]
NET: Support clause 45 MDIO commands at the MDIO bus level
IEEE 802.3ae clause 45 specifies a somewhat modified MDIO protocol
for use by 10GIGE phys. The main change is a 21 bit address split into
a 5 bit device ID and a 16 bit register offset. The definition is designed
so that normal and extended devices can run on the same MDIO bus.
Extend mdio-bitbang to do the new protocol. At the MDIO bus level the
protocol is requested by or'ing MII_ADDR_C45 into the register offset.
Make phy_read/phy_write/etc pass a full 32 bit register offset.
This does not attempt to make the phy layer support C45 style PHYs, just
to provide the MDIO bus support.
Tested against a Broadcom 10GE phy with ID 0x206034, and several
Broadcom 10/100/1000 Phys in normal mode.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Use the PCI runtime power management framework to add basic PCI
runtime PM support to the e1000e driver. Namely, make the driver
suspend the device when the link is off and set it up for generating
a wakeup event after the link has been detected again. [This
feature is disabled until the user space enables it with the help of
the /sys/devices/.../power/contol device attribute.]
Based on a patch from Matthew Garrett.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
Use the PCI runtime power management framework to add basic PCI
runtime PM support to the r8169 driver. Namely, make the driver
suspend the device when the link is not present and set it up for
generating a wakeup event after the link has been detected again.
[This feature is disabled until the user space enables it with the
help of the /sys/devices/.../power/contol device attribute.]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Sat, 27 Feb 2010 14:43:51 +0000 (14:43 +0000)]
drivers/net/ks*: Use netdev_<level>, netif_<level> and pr_<level>
I'm not sure this is correct.
It changes logging macros from:
dev_<level>(&ks->spidev->dev,
to
netdev_<level>(ks->netdev,
Comments?
Use netdev_<level>
Use netif_<level>
Use pr_<level>
Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
Add missing line to message in ks8851_remove
Change kmalloc/memset(,0) to kzalloc
Remove ks_<level> macros
Consolidation code into set_media_state
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes an issue with TIPC's message retransmission logic
that prevented retransmission of clone sk_buffs. Originally intended
as a means of avoiding wasted work in retransmitting messages that
were still on the driver's outbound queue, it also prevented TIPC
from retransmitting messages through other means -- such as the
secondary bearer of the broadcast link, or another interface in a
set of bonded interfaces. This fix removes existing checks for
cloned sk_buffs that prevented such retransmission.
Origionally-Signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Origional commit message:
Increase frequency of load distribution over broadcast link
This patch enhances the behavior of TIPC's broadcast link so that it
alternates between redundant bearers (if available) after every
message sent, rather than after every 10 messages. This change helps
to speed up delivery of retransmitted messages by ensuring that
they are not sent repeatedly over a bearer that is no longer working,
but not yet recognized as failed.
Tested by myself in the latest net-2.6 tree using the tipc sanity test suite
Origionally-signed-off-by: Allan Stephens <allan.stephens@windriver.com> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
bcast.c | 35 ++++++++++++++---------------------
1 file changed, 14 insertions(+), 21 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>
Jan Engelhardt [Thu, 11 Mar 2010 09:57:29 +0000 (09:57 +0000)]
net: core: add IFLA_STATS64 support
`ip -s link` shows interface counters truncated to 32 bit. This is
because interface statistics are transported only in 32-bit quantity
to userspace. This commit adds a new IFLA_STATS64 attribute that
exports them in full 64 bit.
References: http://lkml.indiana.edu/hypermail/linux/kernel/0307.3/0215.html Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: David S. Miller <davem@davemloft.net>
The shared packet statistics are a potential source of slow down
on bridged traffic. Convert to per-cpu array, but only keep those
statistics which change per-packet.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 16 Mar 2010 08:03:29 +0000 (08:03 +0000)]
rps: Receive Packet Steering
This patch implements software receive side packet steering (RPS). RPS
distributes the load of received packet processing across multiple CPUs.
Problem statement: Protocol processing done in the NAPI context for received
packets is serialized per device queue and becomes a bottleneck under high
packet load. This substantially limits pps that can be achieved on a single
queue NIC and provides no scaling with multiple cores.
This solution queues packets early on in the receive path on the backlog queues
of other CPUs. This allows protocol processing (e.g. IP and TCP) to be
performed on packets in parallel. For each device (or each receive queue in
a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
process packets. A CPU is selected on a per packet basis by hashing contents
of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
into the CPU mask. The IPI mechanism is used to raise networking receive
softirqs between CPUs. This effectively emulates in software what a multi-queue
NIC can provide, but is generic requiring no device support.
Many devices now provide a hash over the 4-tuple on a per packet basis
(e.g. the Toeplitz hash). This patch allow drivers to set the HW reported hash
in an skb field, and that value in turn is used to index into the RPS maps.
Using the HW generated hash can avoid cache misses on the packet when
steering it to a remote CPU.
The CPU mask is set on a per device and per queue basis in the sysfs variable
/sys/class/net/<device>/queues/rx-<n>/rps_cpus. This is a set of canonical
bit maps for receive queues in the device (numbered by <n>). If a device
does not support multi-queue, a single variable is used for the device (rx-0).
Generally, we have found this technique increases pps capabilities of a single
queue device with good CPU utilization. Optimal settings for the CPU mask
seem to depend on architectures and cache hierarcy. Below are some results
running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
Results show cumulative transaction rate and system CPU utilization.
e1000e on 8 core Intel
Without RPS: 108K tps at 33% CPU
With RPS: 311K tps at 64% CPU
forcedeth on 16 core AMD
Without RPS: 156K tps at 15% CPU
With RPS: 404K tps at 49% CPU
bnx2x on 16 core AMD
Without RPS 567K tps at 61% CPU (4 HW RX queues)
Without RPS 738K tps at 96% CPU (8 HW RX queues)
With RPS: 854K tps at 76% CPU (4 HW RX queues)
Caveats:
- The benefits of this patch are dependent on architecture and cache hierarchy.
Tuning the masks to get best performance is probably necessary.
- This patch adds overhead in the path for processing a single packet. In
a lightly loaded server this overhead may eliminate the advantages of
increased parallelism, and possibly cause some relative performance degradation.
We have found that masks that are cache aware (share same caches with
the interrupting CPU) mitigate much of this.
- The RPS masks can be changed dynamically, however whenever the mask is changed
this introduces the possibility of generating out of order packets. It's
probably best not change the masks too frequently.
Signed-off-by: Tom Herbert <therbert@google.com>
include/linux/netdevice.h | 32 ++++-
include/linux/skbuff.h | 3 +
net/core/dev.c | 335 +++++++++++++++++++++++++++++++++++++--------
net/core/net-sysfs.c | 225 ++++++++++++++++++++++++++++++-
net/core/skbuff.c | 2 +
5 files changed, 538 insertions(+), 59 deletions(-) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Grover [Thu, 11 Mar 2010 13:50:06 +0000 (13:50 +0000)]
RDS: Do not call set_page_dirty() with irqs off
set_page_dirty() unconditionally re-enables interrupts, so
if we call it with irqs off, they will be on after the call,
and that's bad. This patch moves the call after we've re-enabled
interrupts in send_drop_to(), so it's safe.
Also, add BUG_ONs to let us know if we ever do call set_page_dirty
with interrupts off.
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sherman Pun [Thu, 11 Mar 2010 13:50:05 +0000 (13:50 +0000)]
RDS: Properly unmap when getting a remote access error
If the RDMA op has aborted with a remote access error,
in addition to what we already do (tell userspace it has
completed with an error) also unmap it and put() the rm.
Otherwise, hangs may occur on arches that track maps and
will not exit without proper cleanup.
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Grover [Thu, 11 Mar 2010 13:50:04 +0000 (13:50 +0000)]
RDS: only put sockets that have seen congestion on the poll_waitq
rds_poll_waitq's listeners will be awoken if we receive a congestion
notification. Bad performance may result because *all* polled sockets
contend for this single lock. However, it should not be necessary to
wake pollers when a congestion update arrives if they have never
experienced congestion, and not putting these on the waitq will
hopefully greatly reduce contention.
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tina Yang [Thu, 11 Mar 2010 13:50:03 +0000 (13:50 +0000)]
RDS: Fix locking in rds_send_drop_to()
It seems rds_send_drop_to() called
__rds_rdma_send_complete(rs, rm, RDS_RDMA_CANCELED)
with only rds_sock lock, but not rds_message lock. It raced with
other threads that is attempting to modify the rds_message as well,
such as from within rds_rdma_send_complete().
Signed-off-by: Tina Yang <tina.yang@oracle.com> Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Grover [Thu, 11 Mar 2010 13:50:02 +0000 (13:50 +0000)]
RDS: Turn down alarming reconnect messages
RDS's error messages when a connection goes down are a little
extreme. A connection may go down, and it will be re-established,
and everything is fine. This patch links these messages through
rdsdebug(), instead of to printk directly.
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Grover [Thu, 11 Mar 2010 13:50:01 +0000 (13:50 +0000)]
RDS: Workaround for in-use MRs on close causing crash
if a machine is shut down without closing sockets properly, and
freeing all MRs, then a BUG_ON will bring it down. This patch
changes these to WARN_ONs -- leaking MRs is not fatal (although
not ideal, and there is more work to do here for a proper fix.)
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tina Yang [Thu, 11 Mar 2010 13:50:00 +0000 (13:50 +0000)]
RDS: Fix send locking issue
Fix a deadlock between rds_rdma_send_complete() and
rds_send_remove_from_sock() when rds socket lock and
rds message lock are acquired out-of-order.
Signed-off-by: Tina Yang <Tina.Yang@oracle.com> Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Grover [Thu, 11 Mar 2010 13:49:59 +0000 (13:49 +0000)]
RDS: Fix congestion issues for loopback
We have two kinds of loopback: software (via loop transport)
and hardware (via IB). sw is used for 127.0.0.1, and doesn't
support rdma ops. hw is used for sends to local device IPs,
and supports rdma. Both are used in different cases.
For both of these, when there is a congestion map update, we
want to call rds_cong_map_updated() but not actually send
anything -- since loopback local and foreign congestion maps
point to the same spot, they're already in sync.
The old code never called sw loop's xmit_cong_map(),so
rds_cong_map_updated() wasn't being called for it. sw loop
ports would not work right with the congestion monitor.
Fixing that meant that hw loopback now would send congestion maps
to itself. This is also undesirable (racy), so we check for this
case in the ib-specific xmit code.
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Grover [Thu, 11 Mar 2010 13:49:58 +0000 (13:49 +0000)]
RDS/TCP: Wait to wake thread when write space available
Instead of waking the send thread whenever any send space is available,
wait until it is at least half empty. This is modeled on how
sock_def_write_space() does it, and may help to minimize context
switches.
Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Grover [Thu, 11 Mar 2010 13:49:57 +0000 (13:49 +0000)]
RDS: update copy_to_user state in tcp transport
Other transports use rds_page_copy_user, which updates our
s_copy_to_user counter. TCP doesn't, so it needs to explicity
call rds_stats_add().
Reported-by: Richard Frank <richard.frank@oracle.com> Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>