Nick Piggin [Sun, 28 Aug 2005 06:49:11 +0000 (16:49 +1000)]
[PATCH] Lazy page table copies in fork()
Defer copying of ptes until fault time when it is possible to reconstruct
the pte from backing store. Idea from Andi Kleen and Nick Piggin.
Thanks to input from Rik van Riel and Linus and to Hugh for correcting
my blundering.
Ray Fucillo <fucillo@intersystems.com> reports:
"I applied this latest patch to a 2.6.12 kernel and found that it does
resolve the problem. Prior to the patch on this machine, I was
seeing about 23ms spent in fork for ever 100MB of shared memory
segment.
After applying the patch, fork is taking about 1ms regardless of the
shared memory size."
Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[CCID3]: Call sk->sk_write_space(sk) when receiving a feedback packet
This makes the send rate calculations behave way more closely to what
is specified, with the jitter previously seen on x and x_recv
disappearing completely on non lossy setups.
This resembles the tcp_data_snd_check code, that possibly we'll end up
using in DCCP as well, perhaps moving this code to
inet_connection_sock.
For now I'm doing the simplest implementation tho.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
So that applications can set dccp_sock->dccps_pkt_size, that in turn
is used in the CCID3 half connection init routines to set
ccid3hc[tr]x_s and use it in its rate calculations.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[DCCP]: Introduce dccp_tfrc_lib module with net/dccp/ccids/lib/*.c
I'll now take a look at the other proposed TFRC DCCP CCIDs to find
more code that is now in ccid3.c and move to this module, the loss
event rate, calc_X, etc most probably will be moved there.
The main goal of these changes is to pave the way for the
implementation of more TFRC based DCCP CCIDs and to shrink ccid3.c,
reducing its complexity and helping in getting it rock solid.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[CCID3]: Move the loss interval code to loss_interval.[ch]
And put this into net/dccp/ccids/lib/, where packet_history.[ch] will also be
moved and then we'll have a tfrc_lib.ko module that will be used by
dccp_ccid3.ko and other CCIDs that are variations of TFRC (RFC 3448).
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Introducing functions to add to or subtract from a timeval variable
and renaming now_delta to timeval_new_delta that calls do_gettimeofday
and then timeval_delta, that should be used when there are several
deltas made relative to the current time or setting variables to it,
so as to avoid calling do_gettimeofday excessively.
I'm leaving these "timeval_" prefixed funcions internal to DCCP for a
while till we're sure there are no subtle bugs in it.
It also is more correct as it checks if the number of usecs added to
or subtracted from a tv_usec field is more than 2 seconds.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[DCCP]: Introduce dccp_wait_for_ccid and use it in dccp_write_xmit
This is not quite what I think we should have long term but improves
performance for now, so lets use it till we get CCID3 working well,
then we can think about using sk_write_queue, perhaps using some ideas
from Juwen Lai's old stack for 2.4.20.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Thu, 25 Aug 2005 22:39:15 +0000 (15:39 -0700)]
[BNX2]: update version and minor fixes
Update version and add 4 minor fixes, the last 2 were suggested by
Jeff Garzik:
1. check for a valid ethernet address before setting it
2. zero out bp->regview if init_one encounters an error and unmaps
the IO address. This prevents remove_one from unmapping again.
3. use netif_rx_schedule() instead of hand coding the same.
4. use IRQ_HANDLED and IRQ_NONE.
Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Thu, 25 Aug 2005 22:36:58 +0000 (15:36 -0700)]
[BNX2]: remove atomics in tx
Remove atomic operations in the fast tx path. Expensive atomic
operations were used to keep track of the number of available tx
descriptors. The new code uses the difference between the consumer
and producer index to determine the number of free tx descriptors.
As suggested by Jeff Garzik, the name of the inline function is
changed to all lower case.
Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Thu, 25 Aug 2005 22:35:24 +0000 (15:35 -0700)]
[BNX2]: speedup serdes linkup
This speeds up link-up time on 5706 SerDes if the link partner does
not autoneg, a rather common scenario in blade servers. Some blade
servers use IPMI for keyboard input and it's important to minimize
link disruptions.
The speedup is achieved by shortening the timer to (HZ / 3) during
the transient period right after initiating a SerDes autoneg. If
autoneg does not complete, parallel detect can be done sooner. After
the transient period is over, the timer goes back to its normal HZ
interval.
As suggested by Jeff Garzik, the timer initialization is moved to
bnx2_init_board() from bnx2_open().
An eeprom bit is also added to allow default forced SerDes speed for
even faster link-up time.
Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Thu, 25 Aug 2005 22:34:29 +0000 (15:34 -0700)]
[BNX2]: Fix rtnl deadlock in bnx2_close
This fixes an rtnl deadlock problem when flush_scheduled_work() is
called from bnx2_close(). In rare cases, linkwatch_event() may be on
the workqueue from a previous close of a different device and it will
try to get the rtnl lock which is already held by dev_close().
The fix is to set a flag if we are in the reset task which is run
from the workqueue. bnx2_close() will loop until the flag is cleared.
As suggested by Jeff Garzik, the loop is changed to call msleep(1)
instead of yield() in the original patch.
flush_scheduled_work() is also moved to bnx2_remove_one() before the
netdev is freed.
Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Olsson [Thu, 25 Aug 2005 20:01:29 +0000 (13:01 -0700)]
[IPV4]: Convert FIB Trie to RCU.
* Removes RW-lock
* Proteced read functions uses
rcu_dereference proteced with rcu_read_lock()
* writing of procted pointer w. rcu_assigen_pointer
* Insert/Replace atomic list_replace_rcu
* A BUG_ON condition removed.in trie_rebalance
With help from Paul E. McKenney.
Signed-off-by: Robert Olsson <Robert.Olsson@data.slu.se> Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Olsson [Thu, 25 Aug 2005 20:01:03 +0000 (13:01 -0700)]
[IPV4]: Prepare FIB core for RCU.
* RCU versions of hlist_***_rcu
* fib_alias partial rcu port just whats needed now.
Signed-off-by: Robert Olsson <Robert.Olsson@data.slu.se> Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Ralf Baechle [Wed, 24 Aug 2005 18:35:51 +0000 (11:35 -0700)]
[AX25/NETROM]: Cleanup direct calls into IP stack
Get rid of the calls to ip_rcv and arp_rcv which were layering
violations anyway. With those being replaced by netif_rx, less parts
of AX.25 and relatives depend on INET support actually being enabled.
This also will make PF_PACKET sockets work for IP and ARP packets
received over AX.25 and for IP packets over NET/ROM.
Signed-off-by: Ralf Baechle DL5RB <ralf@linux-mips.org> Signed-off-by: David S. Miller <davem@davemloft.net>
This is a redo of earlier cleanup stuff:
* replace DBG() macro with pr_debug()
* get rid of duplicate extern's that are already in fib_lookup.h
* use BUG_ON and WARN_ON
* don't use BUG checks for null pointers where next statement would
get a fault anyway
* remove debug printout when rebalance causes deep tree
* remove trailing blanks
Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: Robert Olsson <robert.olsson@its.uu.se> Signed-off-by: David S. Miller <davem@davemloft.net>
This, for instance, shows that we're not congestion controlling ACKs,
as the above output is in the ttcp receiving host, and ttcp is a one
way app, i.e. the received never calls sendmsg, so
ccid_hc_tx_send_packet is never called, so the TX half connection
stays in TFRC_SSTATE_NO_SENT state and hctx_rtt is never calculated,
stays with the value set in ccid3_hc_tx_init, 4us, as show above in
milliseconds (0.004ms), upcoming patches will fix this.
rcv_rtt seems sane tho, matching ping results :-)
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Pablo Neira Ayuso <pablo@eurodev.net> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Wetzel [Sun, 21 Aug 2005 00:15:54 +0000 (17:15 -0700)]
[NET]: Add support for getting the permanent hardware address.
This patch adds a new field to net device to hold the permanent
hardware address, and adds a new generic ethtool_op function to
get that address.
Signed-off-by: Jon Wetzel <jon_wetzel@dell.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ian McDonald [Sat, 20 Aug 2005 03:23:43 +0000 (00:23 -0300)]
[DCCP]: Fix the timestamp options
This changes timestamp, timestamp echo, and elapsed time to use units of 10
usecs as per DCCP spec. This has been tested to verify that times are correct.
Also fixed up length and used hton/ntoh more.
Still to add in later patches:
- actually use elapsed time to adjust RTT
(commented out as was prior to this patch)
- send options at times more closely following the spec
(content is now correct)
Signed-off-by: Ian McDonald <iam4@cs.waikato.ac.nz> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ian McDonald [Thu, 18 Aug 2005 23:45:29 +0000 (20:45 -0300)]
[DCCP]: Fix elapsed time option as per section 13.2 of spec v11
The elapsed time can be two bytes or four bytes only.
Signed-off-by: Ian McDonald <iam4@cs.waikato.ac.nz> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 17 Aug 2005 21:57:30 +0000 (14:57 -0700)]
[NET]: Implement SKB fast cloning.
Protocols that make extensive use of SKB cloning,
for example TCP, eat at least 2 allocations per
packet sent as a result.
To cut the kmalloc() count in half, we implement
a pre-allocation scheme wherein we allocate
2 sk_buff objects in advance, then use a simple
reference count to free up the memory at the
correct time.
Based upon an initial patch by Thomas Graf and
suggestions from Herbert Xu.
Signed-off-by: David S. Miller <davem@davemloft.net>
CHECK net/ipv6/netfilter.c
net/ipv6/netfilter.c:96:12: warning: symbol 'ipv6_netfilter_init' was not declared. Should it be static?
net/ipv6/netfilter.c:101:6: warning: symbol 'ipv6_netfilter_fini' was not declared. Should it be static?
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Mon, 15 Aug 2005 19:32:15 +0000 (12:32 -0700)]
[NETLINK]: Add set/getsockopt options to support more than 32 groups
NETLINK_ADD_MEMBERSHIP/NETLINK_DROP_MEMBERSHIP are used to join/leave
groups, NETLINK_PKTINFO is used to enable nl_pktinfo control messages
for received packets to get the extended destination group number.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Mon, 15 Aug 2005 02:31:36 +0000 (19:31 -0700)]
[NETLINK]: Return -EPROTONOSUPPORT in netlink_create() if no kernel socket is registered
This is necessary for dynamic number of netlink groups to make sure we know
the number of possible groups before bind() is called. With this change pure
userspace communication using unused netlink protocols becomes impossible.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Mon, 15 Aug 2005 02:27:50 +0000 (19:27 -0700)]
[NETLINK]: Use group numbers instead of bitmasks internally
Using the group number allows increasing the number of groups without
beeing limited by the size of the bitmask. It introduces one limitation
for netlink users: messages can't be broadcasted to multiple groups anymore,
however this feature was never used inside the kernel.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Mon, 15 Aug 2005 02:27:13 +0000 (19:27 -0700)]
[NETLINK]: Fix module refcounting problems
Use-after-free: the struct proto_ops containing the module pointer
is freed when a socket with pid=0 is released, which besides for kernel
sockets is true for all unbound sockets.
Module refcount leak: when the kernel socket is closed before all user
sockets have been closed the proto_ops struct for this family is
replaced by the generic one and the module refcount can't be dropped.
The second problem can't be solved cleanly using module refcounting in the
generic socket code, so this patch adds explicit refcounting to
netlink_create/netlink_release.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Evgeniy Polyakov [Mon, 15 Aug 2005 02:24:58 +0000 (19:24 -0700)]
[NETLINK]: w1_int.c: fix default netlink group
w1 does not need to multicast its state to several groups at once,
and upcoming netlink changes will not allow bitmask for groups anyway.
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Mon, 15 Aug 2005 00:05:53 +0000 (21:05 -0300)]
[DCCP]: Fix compiler warnings
may be a false warning if there always is something on ccid3hcrx_hist:
net/dccp/ccids/ccid3.c: In function 'ccid3_hc_rx_packet_recv':
net/dccp/ccids/ccid3.c:1634: warning: 'tstamp.tv_usec' may be used uninitialized in this function
net/dccp/ccids/ccid3.c:1634: warning: 'tstamp.tv_sec' may be used uninitialized in this function
const on inline functions doesn't have any effect:
net/dccp/dccp.h:64: warning: type qualifiers ignored on function return type
net/dccp/dccp.h:70: warning: type qualifiers ignored on function return type
net/dccp/dccp.h:76: warning: type qualifiers ignored on function return type
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
[PACKET_HISTORY]: Add dccphtx_rtt and rename the win_count fields
As requested by Ian.
Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: Ian McDonald <iam4@cs.waikato.ac.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
Gary Wayne Smith [Mon, 15 Aug 2005 00:33:24 +0000 (17:33 -0700)]
[NETFILTER]: Make NETMAP target usable in OUTPUT
Signed-off-by: Gary Wayne Smith <gary.w.smith@primeexalia.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>