Andrea Merello [Thu, 3 Oct 2013 19:18:37 +0000 (21:18 +0200)]
atl1e: enable support for NETIF_F_RXALL and NETIF_F_RXCRC features
This patch allows (optionally, via ethtool) the atl1e NIC to:
- Receive bad frames (runt, bad-fcs, etc..)
- Receive full frames without stripping the FCS.
This has been tested on my board by injecting runt and bad-fcs
frames with a FPGA-based device.
The particular scenario of receiving very short frames (<4 bytes)
without passing FCS to the upper layer has been also tested:
This could be potentially dangerous because the driver performs a
4 byte subtraction on the frame length, but I finally have NOT
added anything to avoid this because it seems the NIC always
discards frames so much short..
If someone still have some reason to worry about this, please
tell me.. I will add an explicit SW check..
Signed-off-by: Andrea Merello <andrea.merello@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
There is the rc variable on both myri10ge_ss_lock_napi and
myri10ge_ss_lock_poll functions. In both cases rc is only assigned the
values true and false. Both functions already return bool. Change rc
type to bool.
The simplified semantic patch that find this problem is as
follows (http://coccinelle.lip6.fr/):
@exists@
type T;
identifier b;
@@
- T
+ bool
b = ...;
... when any
b = \(true\|false\)
Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 3 Oct 2013 19:37:10 +0000 (15:37 -0400)]
Merge branch 'bond_hash'
Nikolay Aleksandrov says:
====================
This is a complete remake of my old patch that modified the bonding hash
functions to use skb_flow_dissect which was suggested by Eric Dumazet.
This time around I've left the old modes although using a new hash function
again suggested by Eric, which is the same for all modes. The only
difference is the way the headers are obtained. The old modes obtain them
as before in order to address concerns about speed, but the 2 new ones use
skb_flow_dissect. The unification of the hash function allows to remove a
pointer from struct bonding and also a few extra functions that dealt with
it. Two new functions are added which take care of the hashing based on
bond->params.xmit_policy only:
bond_xmit_hash() - global function, used by XOR and 3ad modes
bond_flow_dissect() - used by bond_xmit_hash() to obtain the necessary
headers and combine them according to bond->params.xmit_policy.
Also factor out the ports extraction from skb_flow_dissect and add a new
function - skb_flow_get_ports() which can be re-used.
v2: add the flow_dissector patch and use skb_flow_get_ports in patch 02
v3: fix a bug in the flow_dissector patch that caused a different thoff
by modifying the thoff argument in skb_flow_get_ports directly, most
of the users already do it anyway.
Also add the necessary export symbol for skb_flow_get_ports.
v4: integrate the thoff bug fix in patch 01
v5: disintegrate the thoff bug fix and re-base on top of Eric's fix
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
bonding: modify the old and add new xmit hash policies
This patch adds two new hash policy modes which use skb_flow_dissect:
3 - Encapsulated layer 2+3
4 - Encapsulated layer 3+4
There should be a good improvement for tunnel users in those modes.
It also changes the old hash functions to:
hash ^= (__force u32)flow.dst ^ (__force u32)flow.src;
hash ^= (hash >> 16);
hash ^= (hash >> 8);
Where hash will be initialized either to L2 hash, that is
SRCMAC[5] XOR DSTMAC[5], or to flow->ports which should be extracted
from the upper layer. Flow's dst and src are also extracted based on the
xmit policy either directly from the buffer or by using skb_flow_dissect,
but in both cases if the protocol is IPv6 then dst and src are obtained by
ipv6_addr_hash() on the real addresses. In case of a non-dissectable
packet, the algorithms fall back to L2 hashing.
The bond_set_mode_ops() function is now obsolete and thus deleted
because it was used only to set the proper hash policy. Also we trim a
pointer from struct bonding because we no longer need to keep the hash
function, now there's only a single hash function - bond_xmit_hash that
works based on bond->params.xmit_policy.
The hash function and skb_flow_dissect were suggested by Eric Dumazet.
The layer names were suggested by Andy Gospodarek, because I suck at
semantics.
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
flow_dissector: factor out the ports extraction in skb_flow_get_ports
Factor out the code that extracts the ports from skb_flow_dissect and
add a new function skb_flow_get_ports which can be re-used.
Suggested-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 2 Oct 2013 11:29:50 +0000 (04:29 -0700)]
inet: consolidate INET_TW_MATCH
TCP listener refactoring, part 2 :
We can use a generic lookup, sockets being in whatever state, if
we are sure all relevant fields are at the same place in all socket
types (ESTABLISH, TIME_WAIT, SYN_RECV)
Joe Perches [Thu, 3 Oct 2013 03:39:11 +0000 (20:39 -0700)]
ath10k: wmi: Convert use of 6 to ETH_ALEN
Use the appropriate define instead of 6.
Signed-off-by: Joe Perches <joe@perches.com> Noticed-by: Julia Lawall <julia.lawall@lip6.fr> via spatch script Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 1 Oct 2013 17:23:44 +0000 (10:23 -0700)]
tcp: sndbuf autotuning improvements
tcp_fixup_sndbuf() is underestimating initial send buffer requirements.
It was not noticed because big GSO packets were escaping the limitation,
but with smaller TSO packets (or TSO/GSO/SG off), application hits
sk_sndbuf before having a chance to fill enough packets in socket write
queue.
- initial cwnd can be bigger than 10 for specific routes
- SKB_TRUESIZE() is a bit under real needs in some cases,
because of power-of-two rounding in kmalloc()
- Extra cushion (application might react slowly to POLLOUT)
tcp_v4_conn_req_fastopen() needs to call tcp_init_metrics() before
calling tcp_init_buffer_space()
Then we realize tcp_new_space() should call tcp_fixup_sndbuf()
instead of duplicating this stuff.
Rename tcp_fixup_sndbuf() to tcp_sndbuf_expand() to be more
descriptive.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
fib_trie: avoid a redundant bit judgement in inflate
Because 'node' is the i'st child of 'oldnode',
thus, here 'i' equals
tkey_extract_bits(node->key, oldtnode->pos, oldtnode->bits)
we just get 1 more bit,
and need not care the detail value of this bits.
I apologize for the mistake.
I generated the patch on a branch version,
and did not notice the put_child has been changed.
I have redone the test on HEAD version with my patch.
two cases are used.
case 1. inflate a node which has a leaf child node.
case 2: inflate a node which has a an child node with skipped bits
test env:
ip link set eth0 up
ip a add dev eth0 192.168.11.1/32
here, we just focus on route table(MAIN),
so I use a "192.168.11.1/32" address to simplify the test case.
Test case 1: inflate a node which has a leaf child node.
===========================================================
step 1. prepare a fib trie
------------------------------------------
ip r a 192.168.0.0/24 via 192.168.11.1
ip r a 192.168.1.0/24 via 192.168.11.1
we get a fib trie.
root@baker:~# cat /proc/net/fib_trie
Main:
+-- 192.168.0.0/23 1 0 0
|-- 192.168.0.0
/24 universe UNICAST
|-- 192.168.1.0
/24 universe UNICAST
Local:
.....
step 2. Add the third route
------------------------------------------
root@baker:~# ip r a 192.168.2.0/24 via 192.168.11.1
A fib_trie leaf will be inserted in fib_insert_node before trie_rebalance.
For function 'inflate':
'inflate' is called with following trie.
+-- 192.168.0.0/22 1 1 0 <=== tn node
+-- 192.168.0.0/23 1 0 0 <== node a
|-- 192.168.0.0
/24 universe UNICAST
|-- 192.168.1.0
/24 universe UNICAST
|-- 192.168.2.0 <== leaf(node b)
When process node b, which is a leaf. here:
i is 1,
node key "192.168.2.0"
oldnode is (pos:22, bits:1)
Test case 2: inflate a node which has a an child node with skipped bits
==========================================================================
step 1. prepare a fib trie.
ip link set eth0 up
ip a add dev eth0 192.168.11.1/32
ip r a 192.168.128.0/24 via 192.168.11.1
ip r a 192.168.0.0/24 via 192.168.11.1
ip r a 192.168.16.0/24 via 192.168.11.1
ip r a 192.168.32.0/24 via 192.168.11.1
ip r a 192.168.48.0/24 via 192.168.11.1
ip r a 192.168.144.0/24 via 192.168.11.1
ip r a 192.168.160.0/24 via 192.168.11.1
ip r a 192.168.176.0/24 via 192.168.11.1
step 2. add a route to trigger inflate.
ip r a 192.168.96.0/24 via 192.168.11.1
This command will call serveral times inflate.
In the first time, the fib_trie is:
________________________
+-- 192.168.128.0/(16, 1) <== tn node
+-- 192.168.0.0/(17, 1) <== node a
+-- 192.168.0.0/(18, 2)
|-- 192.168.0.0
|-- 192.168.16.0
|-- 192.168.32.0
|-- 192.168.48.0
|-- 192.168.96.0
+-- 192.168.128.0/(18, 2) <== node b.
|-- 192.168.128.0
|-- 192.168.144.0
|-- 192.168.160.0
|-- 192.168.176.0
NOTE: node b is a interal node with skipped bits.
here,
i:1,
node->key "192.168.128.0",
oldnode:(pos:16, bits:1)
so
tkey_extract_bits(node->key, oldtnode->pos + oldtnode->bits, 1)
it equals:
tkey_extract_bits("192.168,128,0", 16 + 1, 1) <=== 0
isdn: eicon: free pointer after using it in log msg in divas_um_idi_delete_entity()
Not really a problem, but nice IMHO; the Coverity static analyzer
complains that we use the pointer 'e' after it has been freed, so move
the freeing below the final use, even if that use is just using the
value of the pointer and not actually dereferencing it.
Signed-off-by: Jesper Juhl <jj@chaosbits.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Liu [Mon, 30 Sep 2013 12:46:34 +0000 (13:46 +0100)]
xen-netfront: convert to GRO API
Anirban was seeing netfront received MTU size packets, which downgraded
throughput. The following patch makes netfront use GRO API which
improves throughput for that case.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: Anirban Chakraborty <abchak@juniper.net> Cc: Ian Campbell <ian.campbell@citrix.com> Acked-by: Konrad Wilk <konrad.wilk@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When a switch is connected as a PHY to the MAC driven by tg3, use
phylib and provide the phy address to tg3 from the sprom.
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Acked-by: Nithin Nayak Sujir <nsujir@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ssb: provide phy address for Gigabit Ethernet driver
Add a function to provide the phy address which should be used to the
Gigabit Ethernet driver connected to ssb.
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Reviewed-by: Nithin Nayak Sujir <nsujir@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tg3: add support a phy at an address different than 01
When phylib was in use tg3 only searched at address 01 on the mdio
bus and did not work with any other address. On the BCM4705 SoCs the
switch is connected as a PHY behind the MAC driven by tg3 and it is at
PHY address 30 in most cases. This is a preparation patch to allow
support for such switches.
phy_addr is set to TG3_PHY_MII_ADDR for all devices, which are using
phylib, so this should not change any behavior.
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de> Acked-by: Nithin Nayak Sujir <nsujir@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
1) Conflicts with Joe Perches's 'extern' removal from header file
function declarations. Usually it's an argument signature change
or a function being added/removed. The resolutions are trivial.
2) Some overlapping changes in qmi_wwan.c and be.h, one commit adds
a new value, another changes an existing value. That sort of
thing.
Signed-off-by: David S. Miller <davem@davemloft.net>
1) Multiply in netfilter IPVS can overflow when calculating destination
weight. From Simon Kirby.
2) Use after free fixes in IPVS from Julian Anastasov.
3) SFC driver bug fixes from Daniel Pieczko.
4) Memory leak in pcan_usb_core failure paths, from Alexey Khoroshilov.
5) Locking and encapsulation fixes to serial line CAN driver, from
Andrew Naujoks.
6) Duplex and VF handling fixes to bnx2x driver from Yaniv Rosner,
Eilon Greenstein, and Ariel Elior.
7) In lapb, if no other packets are outstanding, T1 timeouts actually
stall things and no packet gets sent. Fix from Josselin Costanzi.
8) ICMP redirects should not make it to the socket error queues, from
Duan Jiong.
9) Fix bugs in skge DMA mapping error handling, from Nikulas Patocka.
10) Fix setting of VLAN priority field on via-rhine driver, from Roget
Luethi.
11) Fix TX stalls and VLAN promisc programming in be2net driver from
Ajit Khaparde.
12) Packet padding doesn't get handled correctly in new usbnet SG
support code, from Ming Lei.
13) Fix races in netdevice teardown wrt. network namespace closing.
From Eric W. Biederman.
14) Fix potential missed initialization of net_secret if not TCP
connections are openned. From Eric Dumazet.
15) Cinterion PLXX product ID in qmi_wwan driver is wrong, from
Aleksander Morgado.
16) skb_cow_head() can change skb->data and thus packet header pointers,
don't use stale ip_hdr reference in ip_tunnel code.
17) Backend state transition handling fixes in xen-netback, from Paul
Durrant.
18) Packet offset for AH protocol is handled wrong in flow dissector,
from Eric Dumazet.
19) Taking down an fq packet scheduler instance can leave stale packets
in the queues, fix from Eric Dumazet.
20) Fix performance regressions introduced by TCP Small Queues. From
Eric Dumazet.
21) IPV6 GRE tunneling code calculates max_headroom incorrectly, from
Hannes Frederic Sowa.
22) Multicast timer handlers in ipv4 and ipv6 can be the last and final
reference to the ipv4/ipv6 specific network device state, so use the
reference put that will check and release the object if the
reference hits zero. From Salam Noureddine.
23) Fix memory corruption in ip_tunnel driver, and use skb_push()
instead of __skb_push() so that similar bugs are less hard to find.
From Steffen Klassert.
24) Add forgotten hookup of rtnl_ops in SIT and ip6tnl drivers, from
Nicolas Dichtel.
25) fq scheduler doesn't accurately rate limit in certain circumstances,
from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (103 commits)
pkt_sched: fq: rate limiting improvements
ip6tnl: allow to use rtnl ops on fb tunnel
sit: allow to use rtnl ops on fb tunnel
ip_tunnel: Remove double unregister of the fallback device
ip_tunnel_core: Change __skb_push back to skb_push
ip_tunnel: Add fallback tunnels to the hash lists
ip_tunnel: Fix a memory corruption in ip_tunnel_xmit
qlcnic: Fix SR-IOV configuration
ll_temac: Reset dma descriptors indexes on ndo_open
skbuff: size of hole is wrong in a comment
ipv6 mcast: use in6_dev_put in timer handlers instead of __in6_dev_put
ipv4 igmp: use in_dev_put in timer handlers instead of __in_dev_put
ethernet: moxa: fix incorrect placement of __initdata tag
ipv6: gre: correct calculation of max_headroom
powerpc/83xx: gianfar_ptp: select 1588 clock source through dts file
Revert "powerpc/83xx: gianfar_ptp: select 1588 clock source through dts file"
bonding: Fix broken promiscuity reference counting issue
tcp: TSQ can use a dynamic limit
dm9601: fix IFF_ALLMULTI handling
pkt_sched: fq: qdisc dismantle fixes
...
Linus Torvalds [Tue, 1 Oct 2013 17:28:11 +0000 (10:28 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs lru leak fix from Al Viro:
"The fix in "super: fix for destroy lrus" didn't - they need to be
destroyed, all right, but that's the wrong place..."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/super.c: fix lru_list leak for real
Linus Torvalds [Tue, 1 Oct 2013 17:25:10 +0000 (10:25 -0700)]
Merge git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull two KVM fixes from Gleb Natapov.
* git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: VMX: do not check bit 12 of EPT violation exit qualification when undefined
ARM: kvm: rename cpu_reset to avoid name clash
Al Viro [Tue, 1 Oct 2013 17:11:21 +0000 (13:11 -0400)]
fs/super.c: fix lru_list leak for real
Freeing ->s_{inode,dentry}_lru in deactivate_locked_super() is wrong;
the right place is destroy_super(). As it is, we leak them if sget()
decides that new superblock it has allocated (and never shown to
anybody) isn't needed and should be freed.
Eric Dumazet [Tue, 1 Oct 2013 16:10:16 +0000 (09:10 -0700)]
pkt_sched: fq: rate limiting improvements
FQ rate limiting suffers from two problems, reported
by Steinar :
1) FQ enforces a delay when flow quantum is exhausted in order
to reduce cpu overhead. But if packets are small, current
delay computation is slightly wrong, and observed rates can
be too high.
Steinar had this problem because he disabled TSO and GSO,
and default FQ quantum is 2*1514.
(Of course, I wish recent TSO auto sizing changes will help
to not having to disable TSO in the first place)
2) maxrate was not used for forwarded flows (skbs not attached
to a socket)
Reported-by: Steinar H. Gunderson <sesse@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Tue, 1 Oct 2013 16:05:00 +0000 (18:05 +0200)]
ip6tnl: allow to use rtnl ops on fb tunnel
rtnl ops where introduced by c075b13098b3 ("ip6tnl: advertise tunnel param via
rtnl"), but I forget to assign rtnl ops to fb tunnels.
Now that it is done, we must remove the explicit call to
unregister_netdevice_queue(), because the fallback tunnel is added to the queue
in ip6_tnl_destroy_tunnels() when checking rtnl_link_ops of all netdevices (this
is valid since commit 0bd8762824e7 ("ip6tnl: add x-netns support")).
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Tue, 1 Oct 2013 16:04:59 +0000 (18:04 +0200)]
sit: allow to use rtnl ops on fb tunnel
rtnl ops where introduced by ba3e3f50a0e5 ("sit: advertise tunnel param via
rtnl"), but I forget to assign rtnl ops to fb tunnels.
Now that it is done, we must remove the explicit call to
unregister_netdevice_queue(), because the fallback tunnel is added to the queue
in sit_destroy_tunnels() when checking rtnl_link_ops of all netdevices (this
is valid since commit 5e6700b3bf98 ("sit: add support of x-netns")).
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 1 Oct 2013 16:50:14 +0000 (12:50 -0400)]
Merge branch 'intel'
Jeff Kirsher says:
====================
This series contains updates to ixgbevf, ixgbe and igb.
Don provides 3 patches for ixgbevf where he cleans up a redundant
read mailbox failure check, adds a new function to wait for receive
queues to be disabled before disabling NAPI, and move the API
negotiation so that it occurs in the reset path. This will allow
the PF to be informed of the API version earlier.
Jacob provides a ixgbevf and ixgbe patch. His ixgbevf patch removes
the use of hw_dbg when the ixgbe_get_regs function is called in ethtool.
The ixgbe patch renames the LL_EXTENDED_STATS and some of the functions
required to implement busy polling in order to remove the marketing
"low latency" blurb which hides what the code actually does.
Leonardo provides a ixgbe patch to add support for DCB registers dump
using ethtool for 82599 and x540 ethernet controllers.
I (Jeff) provide a ixgbe patch to cleanup whitespace issues seen in a
code review.
Todd provides for igb to add support for i354 in the ethtool offline
tests.
Laura provides an igb patch to add the ethtool callbacks necessary to
configure the number of RSS queues.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
igb: Add ethtool support to configure number of channels
This patch adds the ethtool callbacks necessary to configure the
number of RSS queues.
The maximum number of queues is in accordance with the datasheets.
Signed-off-by: Laura Mihaela Vasilescu <laura.vasilescu@rosedu.org> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fujinaka, Todd [Tue, 1 Oct 2013 11:33:55 +0000 (04:33 -0700)]
igb: Add ethtool offline tests for i354
Add the ethtool offline tests for i354 devices.
Signed-off-by: Todd Fujinaka <todd.fujinaka@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jacob Keller [Tue, 1 Oct 2013 11:33:54 +0000 (04:33 -0700)]
ixgbe: remove marketing names from busy poll code
This patch renames the LL_EXTENDED_STATS and some of the functions required to
implement busy polling in the ixgbe driver, in order to remove the marketing
"low latency" blurb which hides what the code actually does.
This furthers work which was requested by Linus Torvalds when the initial busy
poll code was included in the kernel. The code in the ixgbe driver itself was
never properly renamed to reflect the change to busy polling as the title.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jeff Kirsher [Tue, 1 Oct 2013 11:33:53 +0000 (04:33 -0700)]
ixgbe: Cleanup the use of tabs and spaces
Cleans up the whitespace issues noticed during code review where
a mix of tabs and spaces were used for indentation.
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ixgbe: ethtool DCB registers dump for 82599 and x540
Added support for DCB registers dump using ethtool -d option both for
82599 and x540 ethernet controllers
Signed-off-by: Leonardo Potenza <leonardo.potenza@intel.com> Signed-off-by: Maryam Tahhan <maryam.tahhan@intel.com> Acked-by: John Fastabend <john.r.fastabend@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Tested-by: Jack Morgan <jack.morgan@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Don Skidmore [Tue, 1 Oct 2013 11:33:51 +0000 (04:33 -0700)]
ixgbevf: move API neg to reset path
After this patch the API negotiation will occur in the reset path. So now
the PF will be informed of the API version earlier. This will also require
the mailbox lock to be initialize sooner as well.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Tested-by: Stephen Ko <stephen.s.ko@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Don Skidmore [Tue, 1 Oct 2013 11:33:50 +0000 (04:33 -0700)]
ixgbevf: add wait for Rx queue disable
New function was added to wait for Rx queues to be disabled before
disabling NAPI. This function also allows us to modify
ixgbevf_rx_desc_queue_enable() to better match ixgbe. I also cleaned up
some msleep calls to usleep_range while I was in this code anyway.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Tested-by: Stephen Ko <stephen.s.ko@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Since we are already checking for read failure in check_link we don't need
to do it here. Instead just make sure the watchdog task gets scheduled, if
we are up, and it can be done there. This will better follow igbvf method
of handling a mailbox event and message timeout.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Tested-by: Stephen Ko <stephen.s.ko@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jacob Keller [Tue, 1 Oct 2013 11:33:48 +0000 (04:33 -0700)]
ixgbevf: do not print registers to dmesg in ixgbevf_get_regs
This patch removes the use of hw_dbg in ixgbevf when the ixgbe_get_regs function
is called from ethtool. This goes along side a patch to ethtool which enables
proper support for ixgbevf pretty-printing of registers.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sathya Perla [Tue, 1 Oct 2013 10:30:01 +0000 (16:00 +0530)]
be2net: add a counter for pkts dropped in xmit path
In the xmit path, the driver may drop some pkts due to reasons such as
DMA mapping errors, out of memory conditions or to protect HW from
unrecoverable errors. Add a counter in TX-stats for such drops.
Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sathya Perla [Tue, 1 Oct 2013 10:30:00 +0000 (16:00 +0530)]
be2net: fix adaptive interrupt coalescing
The current EQ delay calculation for AIC is based only on RX packet rate.
This fails to be effective when there's only TX and no RX.
This patch inclues:
- Calculating EQ-delay based on both RX and TX pps.
- Modifying EQ-delay of all EQs via one cmd, instead of issuing a separate
cmd for each EQ.
- A new structure to store interrupt coalescing parameters, in a separate
cache-line from the EQ-obj.
Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This cmd needs to be sent to FW when enabling VFs (currently used only
for Lancer.) Also, avoid calling the cmd when driver loads and finds that
VFs are already enabled from a previous load.
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
On BE3-R 1G ports (identified by port numbers 2 and 3) the FW cannot properly
support multiple TXQs. This also makes the number of RX and TX queues symmetric
as only a single RXQ is available on 1G ports.
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
be2net: pass if_id for v1 and V2 versions of TX_CREATE cmd
It is a required field for all TX_CREATE cmd versions > 0. Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
be2net: Call be_vf_setup() even when VFs are enbaled from previous load
Re-define the sriov_want() macro to check for number of VFs that need
to be enabled in the current load of the driver or the number of VFs that
still remain enabled from the previous load (attached VFs cannot be disabled.)
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ip_tunnel: Remove double unregister of the fallback device
When queueing the netdevices for removal, we queue the
fallback device twice in ip_tunnel_destroy(). The first
time when we queue all netdevices in the namespace and
then again explicitly. Fix this by removing the explicit
queueing of the fallback device.
Bug was introduced when network namespace support was added
with commit 6c742e714d8 ("ipip: add x-netns support").
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ip_tunnel_core: Change __skb_push back to skb_push
Git commit 0e6fbc5b ("ip_tunnels: extend iptunnel_xmit()")
moved the IP header installation to iptunnel_xmit() and
changed skb_push() to __skb_push(). This makes possible
bugs hard to track down, so change it back to skb_push().
Cc: Pravin Shelar <pshelar@nicira.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Currently we can not update the tunnel parameters of
the fallback tunnels because we don't find them in the
hash lists. Fix this by adding them on initialization.
Bug was introduced with commit c544193214
("GRE: Refactor GRE tunneling code.")
Cc: Pravin Shelar <pshelar@nicira.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ip_tunnel: Fix a memory corruption in ip_tunnel_xmit
We might extend the used aera of a skb beyond the total
headroom when we install the ipip header. Fix this by
calling skb_cow_head() unconditionally.
Bug was introduced with commit c544193214
("GRE: Refactor GRE tunneling code.")
Cc: Pravin Shelar <pshelar@nicira.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 1 Oct 2013 16:39:35 +0000 (12:39 -0400)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:
====================
The following patchset contains Netfilter/IPVS fixes for your net
tree, they are:
* Fix BUG_ON splat due to malformed TCP packets seen by synproxy, from
Patrick McHardy.
* Fix possible weight overflow in lblc and lblcr schedulers due to
32-bits arithmetics, from Simon Kirby.
* Fix possible memory access race in the lblc and lblcr schedulers,
introduced when it was converted to use RCU, two patches from
Julian Anastasov.
* Fix hard dependency on CPU 0 when reading per-cpu stats in the
rate estimator, from Julian Anastasov.
* Fix race that may lead to object use after release, when invoking
ipvsadm -C && ipvsadm -R, introduced when adding RCU, from Julian
Anastasov.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ricardo Ribalda [Tue, 1 Oct 2013 06:17:10 +0000 (08:17 +0200)]
ll_temac: Reset dma descriptors indexes on ndo_open
The dma descriptors indexes are only initialized on the probe function.
If a packet is on the buffer when temac_stop is called, the dma
descriptors indexes can be left on a incorrect state where no other
package can be sent.
So an interface could be left in an usable state after ifdow/ifup.
This patch makes sure that the descriptors indexes are in a proper
status when the device is open.
Signed-off-by: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The data structure of_match_ptr() protects is always compiled in.
Hence of_match_ptr() is not needed.
Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: linux-can@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
The data structure of_match_ptr() protects is always compiled in.
Hence of_match_ptr() is not needed.
Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Acked-by: Mugunthan V N <mugunthanvnm@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The data structure of_match_ptr() protects is always compiled in.
Hence of_match_ptr() is not needed.
Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Acked-by: Mugunthan V N <mugunthanvnm@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6 mcast: use in6_dev_put in timer handlers instead of __in6_dev_put
It is possible for the timer handlers to run after the call to
ipv6_mc_down so use in6_dev_put instead of __in6_dev_put in the
handler function in order to do proper cleanup when the refcnt
reaches 0. Otherwise, the refcnt can reach zero without the
inet6_dev being destroyed and we end up leaking a reference to
the net_device and see messages like the following,
unregister_netdevice: waiting for eth0 to become free. Usage count = 1
Tested on linux-3.4.43.
Signed-off-by: Salam Noureddine <noureddine@aristanetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4 igmp: use in_dev_put in timer handlers instead of __in_dev_put
It is possible for the timer handlers to run after the call to
ip_mc_down so use in_dev_put instead of __in_dev_put in the handler
function in order to do proper cleanup when the refcnt reaches 0.
Otherwise, the refcnt can reach zero without the in_device being
destroyed and we end up leaking a reference to the net_device and
see messages like the following,
unregister_netdevice: waiting for eth0 to become free. Usage count = 1
Tested on linux-3.4.43.
Signed-off-by: Salam Noureddine <noureddine@aristanetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Tested-by: Joe Lawrence <joe.lawrence@stratus.com> CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ethernet: moxa: fix incorrect placement of __initdata tag
__initdata tag should be placed between the variable name and equal
sign for the variable to be placed in the intended .init.data section.
In this particular case __initdata is incorrect as moxart_mac_driver
can be used after the driver gets initialized.
Also while at it static-ize moxart_mac_driver.
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: David S. Miller <davem@davemloft.net>
removes following warnings-
drivers/net/phy/marvell.c:37: WARNING: Use #include <linux/io.h> instead of <asm/io.h>
drivers/net/phy/marvell.c:39: WARNING: Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
Signed-off-by: Avinash Kumar <avi.kp.137@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 29 Sep 2013 08:21:32 +0000 (01:21 -0700)]
net: skb_is_gso_v6() requires skb_is_gso()
bnx2x makes a dangerous use of skb_is_gso_v6().
It should first make sure skb is a gso packet
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Eilon Greenstein <eilong@broadcom.com> Acked-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 29 Sep 2013 08:12:40 +0000 (01:12 -0700)]
net: add missing sk_max_pacing_rate doc
Warning(include/net/sock.h:411): No description found for parameter
'sk_max_pacing_rate'
Lets please "make htmldocs" and kbuild bot.
Reported-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
gre_hlen already accounts for sizeof(struct ipv6_hdr) + gre header,
so initialize max_headroom to zero. Otherwise the
if (encap_limit >= 0) {
max_headroom += 8;
mtu -= 8;
}
increments an uninitialized variable before max_headroom was reset.
Found with coverity: 728539
Cc: Dmitry Kozlov <xeb@mail.ru> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net ipv4: Convert ipv4.ip_local_port_range to be per netns v3
- Move sysctl_local_ports from a global variable into struct netns_ipv4.
- Modify inet_get_local_port_range to take a struct net, and update all
of the callers.
- Move the initialization of sysctl_local_ports into
sysctl_net_ipv4.c:ipv4_sysctl_init_net from inet_connection_sock.c
v2:
- Ensure indentation used tabs
- Fixed ip.h so it applies cleanly to todays net-next
v3:
- Compile fixes of strange callers of inet_get_local_port_range.
This patch now successfully passes an allmodconfig build.
Removed manual inlining of inet_get_local_port_range in ipv4_local_port_range
Originally-by: Samya <samya@twitter.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mark code path's likely/unlikely based on most common usage.
* Very few devices use dsa tags.
* Most traffic is Ethernet (not 802.2)
* No sane person uses trailer type or Novell encapsulation
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Remove old legacy comment and weird if condition.
The comment has outlived it's stay and is throwback to some
early net code (before my time). Maybe Dave remembers what it meant.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
powerpc/83xx: gianfar_ptp: select 1588 clock source through dts file
Currently IEEE 1588 timer reference clock source is determined through
hard-coded value in gianfar_ptp driver. This patch allows to select ptp
clock source by means of device tree file node.
For instance:
fsl,cksel = <0>;
for using external (TSEC_TMR_CLK input) high precision timer
reference clock.
When this attribute isn't used, eTSEC system clock will serve as
IEEE 1588 timer reference clock.
Signed-off-by: Aida Mynzhasova <aida.mynzhasova@skitlab.ru> Acked-by: Kumar Gala <galak@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Recently grabbed this report:
https://bugzilla.redhat.com/show_bug.cgi?id=1005567
Of an issue in which the bonding driver, with an attached vlan encountered the
following errors when bond0 was taken down and back up:
dummy1: promiscuity touches roof, set promiscuity failed. promiscuity feature of
device might be broken.
The error occurs because, during __bond_release_one, if we release our last
slave, we take on a random mac address and issue a NETDEV_CHANGEADDR
notification. With an attached vlan, the vlan may see that the vlan and bond
mac address were in sync, but no longer are. This triggers a call to dev_uc_add
and dev_set_rx_mode, which enables IFF_PROMISC on the bond device. Then, when
we complete __bond_release_one, we use the current state of the bond flags to
determine if we should decrement the promiscuity of the releasing slave. But
since the bond changed promiscuity state during the release operation, we
incorrectly decrement the slave promisc count when it wasn't in promiscuous mode
to begin with, causing the above error
Fix is pretty simple, just cache the bonding flags at the start of the function
and use those when determining the need to set promiscuity.
This is also needed for the ALLMULTI flag
CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: Mark Wu <wudxw@linux.vnet.ibm.com> CC: "David S. Miller" <davem@davemloft.net> Reported-by: Mark Wu <wudxw@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 27 Sep 2013 10:28:54 +0000 (03:28 -0700)]
tcp: TSQ can use a dynamic limit
When TCP Small Queues was added, we used a sysctl to limit amount of
packets queues on Qdisc/device queues for a given TCP flow.
Problem is this limit is either too big for low rates, or too small
for high rates.
Now TCP stack has rate estimation in sk->sk_pacing_rate, and TSO
auto sizing, it can better control number of packets in Qdisc/device
queues.
New limit is two packets or at least 1 to 2 ms worth of packets.
Low rates flows benefit from this patch by having even smaller
number of packets in queues, allowing for faster recovery,
better RTT estimations.
High rates flows benefit from this patch by allowing more than 2 packets
in flight as we had reports this was a limiting factor to reach line
rate. [ In particular if TX completion is delayed because of coalescing
parameters ]
Example for a single flow on 10Gbp link controlled by FQ/pacing
Note that sk_pacing_rate is currently set to twice the actual rate, but
this might be refined in the future when a flow is in congestion
avoidance.
Additional change : skb->destructor should be set to tcp_wfree().
A future patch (for linux 3.13+) might remove tcp_limit_output_bytes
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Tue, 1 Oct 2013 00:10:26 +0000 (17:10 -0700)]
Merge tag 'nfs-for-3.12-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- Stable fix for Oopses in the pNFS files layout driver
- Fix a regression when doing a non-exclusive file create on NFSv4.x
- NFSv4.1 security negotiation fixes when looking up the root
filesystem
- Fix a memory ordering issue in the pNFS files layout driver
* tag 'nfs-for-3.12-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFS: Give "flavor" an initial value to fix a compile warning
NFSv4.1: try SECINFO_NO_NAME flavs until one works
NFSv4.1: Ensure memory ordering between nfs4_ds_connect and nfs4_fl_prepare_ds
NFSv4.1: nfs4_fl_prepare_ds - fix bugs when the connect attempt fails
NFSv4: Honour the 'opened' parameter in the atomic_open() filesystem method
Peter Korsgaard [Mon, 30 Sep 2013 21:28:20 +0000 (23:28 +0200)]
dm9601: fix IFF_ALLMULTI handling
Pass-all-multicast is controlled by bit 3 in RX control, not bit 2
(pass undersized frames).
Reported-by: Joseph Chang <joseph_chang@davicom.com.tw> Signed-off-by: Peter Korsgaard <peter@korsgaard.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Liu [Sun, 22 Sep 2013 18:03:44 +0000 (19:03 +0100)]
xen-netback: improve ring effeciency for guest RX
There was a bug that netback routines netbk/xenvif_skb_count_slots and
netbk/xenvif_gop_frag_copy disagreed with each other, which caused
netback to push wrong number of responses to netfront, which caused
netfront to eventually crash. The bug was fixed in 6e43fc04a
("xen-netback: count number required slots for an skb more carefully").
Commit 6e43fc04a focused on backport-ability. The drawback with the
existing packing scheme is that the ring is not used effeciently, as
stated in 6e43fc04a.
If we can do this:
|111122222222|22223333 |
That would save one ring slot, which improves ring effeciency.
This patch effectively reverts 6e43fc04a. That patch made count_slots
agree with gop_frag_copy, while this patch goes the other way around --
make gop_frag_copy agree with count_slots. The end result is that they
still agree with each other, and the ring is now arranged like:
|111122222222|22223333 |
The patch that improves packing was first posted by Xi Xong and Matt
Wilson. I only rebase it on top of net-next and rewrite commit message,
so I retain all their SoBs. For more infomation about the original bug
please refer to email listed below and commit message of 6e43fc04a.
Original patch:
http://lists.xen.org/archives/html/xen-devel/2013-07/msg00760.html
Signed-off-by: Xi Xiong <xixiong@amazon.com> Reviewed-by: Matt Wilson <msw@amazon.com>
[ msw: minor code cleanups, rewrote commit message, adjusted code
to count RX slots instead of meta structures ] Signed-off-by: Matt Wilson <msw@amazon.com> Cc: Annie Li <annie.li@oracle.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com>
[ liuw: rebased on top of net-next tree, rewrote commit message, coding
style cleanup. ] Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: David Vrabel <david.vrabel@citrix.com> Acked-by: Ian Campbell <Ian.Campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
pidns: fix free_pid() to handle the first fork failure
ipc,msg: prevent race with rmid in msgsnd,msgrcv
ipc/sem.c: update sem_otime for all operations
mm/hwpoison: fix the lack of one reference count against poisoned page
mm/hwpoison: fix false report on 2nd attempt at page recovery
mm/hwpoison: fix test for a transparent huge page
mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood
block: change config option name for cmdline partition parsing
mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration
mm: avoid reinserting isolated balloon pages into LRU lists
arch/parisc/mm/fault.c: fix uninitialized variable usage
include/asm-generic/vtime.h: avoid zero-length file
nilfs2: fix issue with race condition of competition between segments for dirty blocks
Documentation/kernel-parameters.txt: replace kernelcore with Movable
mm/bounce.c: fix a regression where MS_SNAP_STABLE (stable pages snapshotting) was ignored
kernel/kmod.c: check for NULL in call_usermodehelper_exec()
ipc/sem.c: synchronize the proc interface
ipc/sem.c: optimize sem_lock()
ipc/sem.c: fix race in sem_lock()
mm/compaction.c: periodically schedule when freeing pages
...
This fixes a race in both msgrcv() and msgsnd() between finding the msg
and actually dealing with the queue, as another thread can delete shmid
underneath us if we are preempted before acquiring the
kern_ipc_perm.lock.
Manfred illustrates this nicely:
Assume a preemptible kernel that is preempted just after
msq = msq_obtain_object_check(ns, msqid)
in do_msgrcv(). The only lock that is held is rcu_read_lock().
Now the other thread processes IPC_RMID. When the first task is
resumed, then it will happily wait for messages on a deleted queue.
Fix this by checking for if the queue has been deleted after taking the
lock.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reported-by: Manfred Spraul <manfred@colorfullife.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: <stable@vger.kernel.org> [3.11] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 0a2b9d4c7967 ("ipc/sem.c: move wake_up_process out of the
spinlock section"), the update of semaphore's sem_otime(last semop time)
was moved to one central position (do_smart_update).
But since do_smart_update() is only called for operations that modify
the array, this means that wait-for-zero semops do not update sem_otime
anymore.
The fix is simple:
Non-alter operations must update sem_otime.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reported-by: Jia He <jiakernel@gmail.com> Tested-by: Jia He <jiakernel@gmail.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Mike Galbraith <efault@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wanpeng Li [Mon, 30 Sep 2013 20:45:24 +0000 (13:45 -0700)]
mm/hwpoison: fix the lack of one reference count against poisoned page
The lack of one reference count against poisoned page for hwpoison_inject
w/o hwpoison_filter enabled result in hwpoison detect -1 users still
referenced the page, however, the number should be 0 except the poison
handler held one after successfully unmap. This patch fix it by hold one
referenced count against poisoned page for hwpoison_inject w/ and w/o
hwpoison_filter enabled.
Before patch:
[ 71.902112] Injecting memory failure at pfn 224706
[ 71.902137] MCE 0x224706: dirty LRU page recovery: Failed
[ 71.902138] MCE 0x224706: dirty LRU page still referenced by -1 users
Wanpeng Li [Mon, 30 Sep 2013 20:45:23 +0000 (13:45 -0700)]
mm/hwpoison: fix false report on 2nd attempt at page recovery
If the page is poisoned by software injection w/ MF_COUNT_INCREASED
flag, there is a false report during the 2nd attempt at page recovery
which is not truthful.
This patch fixes it by reporting the first attempt to try free buddy
page recovery if MF_COUNT_INCREASED is set.
Wanpeng Li [Mon, 30 Sep 2013 20:45:21 +0000 (13:45 -0700)]
mm/hwpoison: fix traversal of hugetlbfs pages to avoid printk flood
madvise_hwpoison won't check if the page is small page or huge page and
traverses in small page granularity against the range unconditionally,
which result in a printk flood "MCE xxx: already hardware poisoned" if
the page is a huge page.
This patch fixes it by using compound_order(compound_head(page)) for
huge page iterator.
Paul Gortmaker [Mon, 30 Sep 2013 20:45:19 +0000 (13:45 -0700)]
block: change config option name for cmdline partition parsing
Recently commit bab55417b10c ("block: support embedded device command
line partition") introduced CONFIG_CMDLINE_PARSER. However, that name
is too generic and sounds like it enables/disables generic kernel boot
arg processing, when it really is block specific.
Before this option becomes a part of a full/final release, add the BLK_
prefix to it so that it is clear in absence of any other context that it
is block specific.
In addition, fix up the following less critical items:
- help text was not really at all helpful.
- index file for Documentation was not updated
- add the new arg to Documentation/kernel-parameters.txt
- clarify wording in source comments
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Cai Zhiyong <caizhiyong@huawei.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/mlock.c: prevent walking off the end of a pagetable in no-pmd configuration
The function __munlock_pagevec_fill() introduced in commit 7a8010cd3627
("mm: munlock: manual pte walk in fast path instead of
follow_page_mask()") uses pmd_addr_end() for restricting its operation
within current page table.
This is insufficient on architectures/configurations where pmd is folded
and pmd_addr_end() just returns the end of the full range to be walked.
In this case, it allows pte++ to walk off the end of a page table
resulting in unpredictable behaviour.
This patch fixes the function by using pgd_addr_end() and pud_addr_end()
before pmd_addr_end(), which will yield correct page table boundary on
all configurations. This is similar to what existing page walkers do
when walking each level of the page table.
Additionaly, the patch clarifies a comment for get_locked_pte() call in the
function.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Jörn Engel <joern@logfs.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rafael Aquini [Mon, 30 Sep 2013 20:45:16 +0000 (13:45 -0700)]
mm: avoid reinserting isolated balloon pages into LRU lists
Isolated balloon pages can wrongly end up in LRU lists when
migrate_pages() finishes its round without draining all the isolated
page list.
The same issue can happen when reclaim_clean_pages_from_list() tries to
reclaim pages from an isolated page list, before migration, in the CMA
path. Such balloon page leak opens a race window against LRU lists
shrinkers that leads us to the following kernel panic:
This patch fixes the issue, by assuring the proper tests are made at
putback_movable_pages() & reclaim_clean_pages_from_list() to avoid
isolated balloon pages being wrongly reinserted in LRU lists.
But such error messages are consequence of file system's issue that takes
place more earlier. Fortunately, Jerome Poulin <jeromepoulin@gmail.com>
and Anton Eliasson <devel@antoneliasson.se> were reported about another
issue not so recently. These reports describe the issue with segctor
thread's crash:
BUG: unable to handle kernel paging request at 0000000000004c83
IP: nilfs_end_page_io+0x12/0xd0 [nilfs2]
These two issues have one reason. This reason can raise third issue
too. Third issue results in hanging of segctor thread with eating of
100% CPU.
REPRODUCING PATH:
One of the possible way or the issue reproducing was described by
Jermoe me Poulin <jeromepoulin@gmail.com>:
1. init S to get to single user mode.
2. sysrq+E to make sure only my shell is running
3. start network-manager to get my wifi connection up
4. login as root and launch "screen"
5. cd /boot/log/nilfs which is a ext3 mount point and can log when NILFS dies.
6. lscp | xz -9e > lscp.txt.xz
7. mount my snapshot using mount -o cp=3360839,ro /dev/vgUbuntu/root /mnt/nilfs
8. start a screen to dump /proc/kmsg to text file since rsyslog is killed
9. start a screen and launch strace -f -o find-cat.log -t find
/mnt/nilfs -type f -exec cat {} > /dev/null \;
10. start a screen and launch strace -f -o apt-get.log -t apt-get update
11. launch the last command again as it did not crash the first time
12. apt-get crashes
13. ps aux > ps-aux-crashed.log
13. sysrq+W
14. sysrq+E wait for everything to terminate
15. sysrq+SUSB
Simplified way of the issue reproducing is starting kernel compilation
task and "apt-get update" in parallel.
REPRODUCIBILITY:
The issue is reproduced not stable [60% - 80%]. It is very important to
have proper environment for the issue reproducing. The critical
conditions for successful reproducing:
(1) It should have big modified file by mmap() way.
(2) This file should have the count of dirty blocks are greater that
several segments in size (for example, two or three) from time to time
during processing.
(3) It should be intensive background activity of files modification
in another thread.
INVESTIGATION:
First of all, it is possible to see that the reason of crash is not valid
page address:
Moreover, value of b_page (0x1a82) is 6786. This value looks like segment
number. And b_blocknr with b_size values look like block numbers. So,
buffer_head's pointer points on not proper address value.
Detailed investigation of the issue is discovered such picture:
Usually, for every segment we collect dirty files in list. Then, dirty
blocks are gathered for every dirty file, prepared for write and
submitted by means of nilfs_segbuf_submit_bh() call. Finally, it takes
place complete write phase after calling nilfs_end_bio_write() on the
block layer. Buffers/pages are marked as not dirty on final phase and
processed files removed from the list of dirty files.
It is possible to see that we had three prepare_write and submit_bio
phases before segbuf_wait and complete_write phase. Moreover, segments
compete between each other for dirty blocks because on every iteration
of segments processing dirty buffer_heads are added in several lists of
payload_buffers:
The next pointer is the same but prev pointer has changed. It means
that buffer_head has next pointer from one list but prev pointer from
another. Such modification can be made several times. And, finally, it
can be resulted in various issues: (1) segctor hanging, (2) segctor
crashing, (3) file system metadata corruption.
FIX:
This patch adds:
(1) setting of BH_Async_Write flag in nilfs_segctor_prepare_write()
for every proccessed dirty block;
(2) checking of BH_Async_Write flag in
nilfs_lookup_dirty_data_buffers() and
nilfs_lookup_dirty_node_buffers();
(3) clearing of BH_Async_Write flag in nilfs_segctor_complete_write(),
nilfs_abort_logs(), nilfs_forget_buffer(), nilfs_clear_dirty_page().
Reported-by: Jerome Poulin <jeromepoulin@gmail.com> Reported-by: Anton Eliasson <devel@antoneliasson.se> Cc: Paul Fertser <fercerpav@gmail.com> Cc: ARAI Shun-ichi <hermes@ceres.dti.ne.jp> Cc: Piotr Szymaniak <szarpaj@grubelek.pl> Cc: Juan Barry Manuel Canham <Linux@riotingpacifist.net> Cc: Zahid Chowdhury <zahid.chowdhury@starsolutions.com> Cc: Elmer Zhang <freeboy6716@gmail.com> Cc: Kenneth Langga <klangga@gmail.com> Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Darrick J. Wong [Mon, 30 Sep 2013 20:45:09 +0000 (13:45 -0700)]
mm/bounce.c: fix a regression where MS_SNAP_STABLE (stable pages snapshotting) was ignored
The "force" parameter in __blk_queue_bounce was being ignored, which
means that stable page snapshots are not always happening (on ext3).
This of course leads to DIF disks reporting checksum errors, so fix this
regression.
The regression was introduced in commit 6bc454d15004 ("bounce: Refactor
__blk_queue_bounce to not use bi_io_vec")
Reported-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: <stable@vger.kernel.org> [3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/kmod.c: check for NULL in call_usermodehelper_exec()
If /proc/sys/kernel/core_pattern contains only "|", a NULL pointer
dereference happens upon core dump because argv_split("") returns
argv[0] == NULL.
This bug was once fixed by commit 264b83c07a84 ("usermodehelper: check
subprocess_info->path != NULL") but was by error reintroduced by commit 7f57cfa4e2aa ("usermodehelper: kill the sub_info->path[0] check").
This bug seems to exist since 2.6.19 (the version which core dump to
pipe was added). Depending on kernel version and config, some side
effect might happen immediately after this oops (e.g. kernel panic with
2.6.32-358.18.1.el6).
The proc interface is not aware of sem_lock(), it instead calls
ipc_lock_object() directly. This means that simple semop() operations
can run in parallel with the proc interface. Right now, this is
uncritical, because the implementation doesn't do anything that requires
a proper synchronization.
But it is dangerous and therefore should be fixed.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Operations that need access to the whole array must guarantee that there
are no simple operations ongoing. Right now this is achieved by
spin_unlock_wait(sem->lock) on all semaphores.
If complex_count is nonzero, then this spin_unlock_wait() is not
necessary, because it was already performed in the past by the thread
that increased complex_count and even though sem_perm.lock was dropped
inbetween, no simple operation could have started, because simple
operations cannot start when complex_count is non-zero.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Mike Galbraith <bitbucket@online.de> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>