git.karo-electronics.de Git - linux-beck.git/log

tcp: snmp stats for Fast Open, SYN rtx, and data pkts

Add the following snmp stats:

TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed beacuse
the remote does not accept it or the attempts timed out.

TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down
retransmissions into SYN, fast-retransmits, timeout retransmits, etc.

TCPOrigDataSent: number of outgoing packets with original data (excluding
retransmission but including data-in-SYN). This counter is different from
TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is
more useful to track the TCP retransmission rate.

Change TCPFastOpenActive to track only successful Fast Opens to be symmetric to
TCPFastOpenPassive.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Lawrence Brakmo <brakmo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sch_tbf: Remove holes in struct tbf_sched_data.

On x86_64 we have 3 holes in struct tbf_sched_data.

The member peak_present can be replaced with peak.rate_bytes_ps,
because peak.rate_bytes_ps is set only when peak is specified in
tbf_change(). tbf_peak_present() is introduced to test
peak.rate_bytes_ps.

The member max_size is moved to fill 32bit hole.

Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ieee802154: fix at86rf212_set_txpower() exit path

The commit 9b2777d6089bc ("ieee802154: add TX power control to
wpan_phy") introduced the new function at86rf212_set_txpower() with
the questionable check of the return of __at86rf230_write() in the
exit path:

1) Both at86rf212_set_txpower() and __at86rf230_write() have the
same return type.

2) Whatever __at86rf230_write() returns becomes the return value of
at86rf212_set_txpower().

Thus, fix the exit path by getting rid of that check entirely.

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Cc: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

connector: remove duplicated code in cn_call_callback()

There were a couple of patches fixing the same bug that
results in duplicated err = 0; assignment.
The patch removes one of them.

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlx4'

Amir Vadai says:

====================
net/mlx4: Mellanox driver update 27-02-2014

This patchset contains some fixes for small trivial bugs, and
compilation/syntactic parsers warnings

Patchset was applied and tested over commit 750f679 "Merge branch '6lowpan'"

Changes from V1:
-patch 5/9:  Replace mlx4_en_mac_to_u64() with mlx4_mac_to_u64()
  - Remove unnecessary define of ETH_ALEN

Changes from V0:
-patch 3/9:  net/mlx4_en: Pad ethernet packets smaller than 17 bytes
  - Make condition more efficient
  - Didn't use canonical function to pad buffer since using bounce buffer
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_en: Change Connect-X description in kconfig

The mlx4_en driver support also 1Gbit and 40Gbit Ethernet devices,
changed the driver description in the menuconfig to reflect that.

Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_en: Use union for BlueFlame WQE

When BlueFlame is turned on, control segment of the TX WQE is changed,
and the second line of it is used for QPN.
Changed code to use a union in the mlx4_wqe_ctrl_seg instead of casting.
This makes the code clearer and solves the static checker warning:

drivers/net/ethernet/mellanox/mlx4/en_tx.c:839 mlx4_en_xmit()
warn: potential memory corrupting cast 4 vs 2 bytes

CC: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_core: Fix sparse warning

This patch force conversion to u32 to fix the following sparse warning:
drivers/net/ethernet/mellanox/mlx4/fw.c:1822:53: warning: restricted __be32
degrades to integer

Casting to u32 is safe here, because token will be returned as is
from the hardware without any modification.

Signed-off-by: Eyal Perry <eyalpe@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_en: Fix selftest failing on non 10G link speed

Connect-X devices selftest speed test shouldn't fail on 1G and 40G link
speeds.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4: Replace mlx4_en_mac_to_u64() with mlx4_mac_to_u64()

Currently, the EN driver uses a private static function
mlx4_en_mac_to_u64(). Move it to a common include file (driver.h)
for mlx4_en and mlx4_ib for further use.

Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_en: Move queue stopped/waked counters to be per ring

Give accurate counters and avoids cache misses when several rings
update the counters of stop/wake queue.

Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_en: Pad ethernet packets smaller than 17 bytes

Hardware can't accept packets smaller than 17 bytes. Therefore need to
pad with zeros.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_en: Verify mlx4_en module parameters

Verify mlx4_en module parameters.
In case they are out of range - reset to default values.

Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_en: Fix UP limit in ieee_ets->prio_tc

User priority limit has to be less than MLX4_EN_NUM_UP.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: Eugenia Emantayev <eugenia@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '6lowpan'

Alexander Aring says:

====================
6lowpan: fix issues with byte ordering types

I got some mail from a "kbuild test robot" and it detected some byte
ordering issues with the tag and datagram size value of 6LoWPAN IEEE
802.15.4 fragmentation header.

This patch series should fix the issues with the byte ordering.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

6lowpan: use memcpy to set tag value in fraghdr

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6lowpan: remove initialization of tag value

The initialization of the tag value doesn't matter at begin of
fragmentation. This patch removes the initialization to zero.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6lowpan: fix type of datagram size parameter

Datagram size value is u16 because we convert it to host byte order
and we need to read it. Only the tag value belongs to __be16 type.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'intel-next'

Aaron Brown says:

====================
Mark updates ixgbe for LER / adapter removal. He restores the HW
address in the recovery path so the device is not perpetually removed,
fixes up some removed state ethtool results and adds checks related to
config space access.

Jacob adds support for the new SIOCGHWTSTAMP ioctl.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: implement SIOCGHWTSTAMP ioctl

This patch adds support for the new SIOCGHWTSTAMP ioctl, which enables a
process to determine the current timestamp configuration. In order to
implement this, store a copy of the timestamp configuration. In
addition, we can remove the 'int cmd' parameter as the new set_ts_config
function doesn't use it. I also fixed a typo in the function
description.

-v2
* Only save the settings after validating them

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: Check config reads for removal

Configuration space reads should also be checked for removal. So
add some checks related to config space accesses.

v2:
* Fixed indent

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: Fix up some ethtool results when adapter is removed

Some ethtool tests returned apparently good results when the
adapter was in a removed state. Fix that by checking for removal.
This also fixes two paths that could return uninitialized memory
in data[4].

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: Restore hw_addr in LER recovery paths

The hw_addr needs to be restored in the pcie recovery path or
else the device will be perpetually removed. Also restore the
value in the resume path.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bonding: send arp requests even if there's no route to them

Currently we're only sending arp requests if we have a route to the target
(and, thus, can find out the source ip address).

There are some use cases, however, where we don't want/need to set an ip
address (or set up a specific route) for bonding to use arp monitoring *for
traffic generation*. We can easily send arp probes (arp requests with src
ip == 0) to generate arp broadcast responses from the target ip and use
them for determining if the target is up.

This, obviously, won't work with arp validation - because we don't have the
ip address set and, thus, will filter out the responses. So in that case -
print a warning.

CC: François CACHEREUL <f.cachereul@alphalink.fr>
CC: Zhenjie Chen <zhchen@redhat.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '6lowpan'

Alexander Aring says:

====================
6lowpan: reimplementation of fragmentation handling

this patch series reimplementation the fragmentation handling of 6lowpan
accroding to rfc4944 [1].

The first big note is, that the current fragmentation behaviour isn't rfc
complaint. The main issue is a wrong datagram_size value which needs to be:
datagram_size = ipv6_payload + ipv6 header + (maybe compressed transport header,
                currently only udp is supported)

but the current datagram_size value is calculated as:
datagram_size = ipv6_payload

Fragmentation work in a linux<->linux communication only.

Why reimplementation?

I reimplemted the reassembly side only. The current behaviour is to allocate a
skb with the reassembled size and hold all fragments in a list, protected by a
spinlock. After we received all fragments (detected by the sum of all fragments,
it begins to place all fragments into the allocated skb).

This reassembly implementation has some race condition. Additional I make it more
rfc complaint. The current implementation match on the tag value inside the frag
header only, but rfc4944 says we need to match on dst addr(mac), src addr(mac),
tag value, datagram_size value. [2]

The new reassembly handling use the inet_frag api (I mean the callback interface
of ipv6 and ipv4 reassembly). I looked into ipv6 and wanted to see how ipv6 is
dealing with reassembly, so I based my code on this implementation.

On the sending side to generate the fragments I improved the current code to use
the nearest 8 divided payload. (We can do that, because the mac layer has a
dynamic size, so it depends on mac_header how big we can do the payload).

Of course I fix also the reassembly/sending side to be rfc complaint now.

Regards
Alexander Aring

[1] http://tools.ietf.org/html/rfc4944
[2] http://tools.ietf.org/html/rfc4944#section-5.3

changes since v2:
- rework checkpatch code style issue patch.
   Merge two pr_debugs into one pr_debug.

changes since v3:
- rename 6lowpan.ko to 6lowpan_rtnl.c in commit msg of patch 5/8.

changes since v4:
- Add a new patch 2/8 to introduce lowpan_uncompress_size function. Also
   improving this function a little bit.
- Add a new patch 4/8 to change tag value to __be16.
- use skb_header_reset function on FRAG1 only, which should have the
   lowpan header. See lowpan_get_frag_info function. (slightly improving
   of fragmentation header parsing).
- changes types of variables to u16 in lowpan_skb_fragmentation.
- use lowpan_uncompress_size instead of storing necessary information
   in skb control block, this can be destroyed after dev_queue_xmit call.
   Thanks David for this hint.
- remove Tested-by: Martin Townsend <martin.townsend@xsilon.com>, because
   too many funcionality change.

changes since v5:
- handle lowpan_addr_mode_size with lookup table.

changes since v6:
- remove unnecessary parameter in lowpan_frag_queue.
- fix commit message in patch 8/8 which included a describtion of adding the
   lownpan_uncompress_size function. This was splitted in a seperate patch.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

6lowpan: handling 6lowpan fragmentation via inet_frag api

This patch drops the current way of 6lowpan fragmentation on receiving
side and replace it with a implementation which use the inet_frag api.
The old fragmentation handling has some race conditions and isn't
rfc4944 compatible. Also adding support to match fragments on
destination address, source address, tag value and datagram_size
which is missing in the current implementation.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ns: add ieee802154_6lowpan namespace

This patch adds necessary ieee802154 6lowpan namespace to provide the
inet_frag information. This is a initial support for handling 6lowpan
fragmentation with the inet_frag api.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6lowpan: fix some checkpatch issues

Detected with:

./scripts/checkpatch.pl --strict -f net/ieee802154/6lowpan_rtnl.c

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6lowpan: move 6lowpan.c to 6lowpan_rtnl.c

We have a 6lowpan.c file and 6lowpan.ko file. To avoid confusing we
should move 6lowpan.c to 6lowpan_rtnl.c. Then we can support multiple
source files for 6lowpan module.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6lowpan: change tag type to __be16

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6lowpan: fix fragmentation on sending side

This patch fix the fragmentation on sending side according to rfc4944.

Also add improvement to use the full payload of a PDU which calculate
the nearest divided to 8 payload length for the fragmentation datagram
size attribute.

The main issue is that the datagram size of fragmentation header use the
ipv6 payload length, but rfc4944 says it's the ipv6 payload length inclusive
network header size (and transport header size if compressed).

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6lowpan: add uncompress header size function

This patch add a lookup function for uncompressed 6LoWPAN header
size. This is needed to estimate the real size after uncompress the
6LoWPAN header.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6lowpan: add frag information struct

This patch adds a 6lowpan fragmentation struct into cb of skb which
is necessary to hold fragmentation information.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: w5100: Use devm_ioremap_resource()

Use devm_ioremap_resource() in order to make the code simpler.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: w5300: Use devm_ioremap_resource()

Use devm_ioremap_resource() in order to make the code simpler.

Signed-off-by: Jingoo Han <jg1.han@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

packet: allow to transmit +4 byte in TX_RING slot for VLAN case

Commit 57f89bfa2140 ("network: Allow af_packet to transmit +4 bytes
for VLAN packets.") added the possibility for non-mmaped frames to
send extra 4 byte for VLAN header so the MTU increases from 1500 to
1504 byte, for example.

Commit cbd89acb9eb2 ("af_packet: fix for sending VLAN frames via
packet_mmap") attempted to fix that for the mmap part but was
reverted as it caused regressions while using eth_type_trans()
on output path.

Lets just act analogous to 57f89bfa2140 and add a similar logic
to TX_RING. We presume size_max as overcharged with +4 bytes and
later on after skb has been built by tpacket_fill_skb() check
for ETH_P_8021Q header on packets larger than normal MTU. Can
be easily reproduced with a slightly modified trafgen in mmap(2)
mode, test cases:

{ fill(0xff, 12) const16(0x8100) fill(0xff, <1504|1505>) }
{ fill(0xff, 12) const16(0x0806) fill(0xff, <1500|1501>) }

Note that we need to do the test right after tpacket_fill_skb()
as sockets can have PACKET_LOSS set where we would not fail but
instead just continue to traverse the ring.

Reported-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Ben Greear <greearb@candelatech.com>
Cc: Phil Sutter <phil@nwl.cc>
Tested-by: Mathias Kretschmer <mathias.kretschmer@fokus.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'intel-next'

Aaron Brown says:

====================
This series contains updates to ixgbe and ixgbevf.

Don provides an update to change a hard coded timeout interval to
a system-wide timeout one, collects AUTOC register functions into
one place and fixes some firmware bit handling.

Emil resolves a tx handling error introduced in a recent commit and
adds check for CHECKSUM_PARTIAL to avoid an skb_is_gso check
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbevf: add check for CHECKSUM_PARTIAL when doing TSO

This patch adds check for CHECKSUM_PARTIAL to avoid the skb_is_gso check
in ixgbevf_tso(). It should reduce overhead for workloads that are not using
TSO or checksum offloads. It is the same as in ixgbe.

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbevf: fix handling of tx checksumming

This patch resolves an issue introduced by:
commit 7ad1a093519e37fb673579819bf6af122641c397
ixgbevf: make the first tx_buffer a repository for most of the skb info

Incorrect check for the result of ixgbevf_tso() can lead to calling
ixgbevf_tx_csum() which can spawn 2 context descriptors and result in
performance degradation and/or corrupted packets.

Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: Add check for FW veto bit

The driver will now honor the MNG FW veto bit in blocking link resets.
This patch will affect x520 and x540 systems.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: fix bit toggled for 82599 reset fix.

The current code doesn't toggle the correct bit to reset the data pipeline
on Restart_AN assertion. This patch corrects that.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: collect all 82599 AUTOC code in one function

When reading or writing to the AUTOC register on 82599 devices we need to
preform various operations that aren't needed for other MAC types. This
patch will collect all of that code into one place to minimize MAC checks
in common code paths.

While doing this I also clean up some cases where we weren't holding the
SW/FW semaphore during a read/modify/write of AUTOC.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: fix to use correct timeout interval for memory read completion

Currently we were just always polling for a hard coded 80 ms and not
respecting the system-wide timeout interval. Since up until now all
devices have been tested with this 80ms value we continue to use this
value as a hard minimum.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: addrconf: silence sparse endianness warnings

Avoid the following sparse __CHECK_ENDIAN__ warnings:

include/net/addrconf.h:318:25: warning: restricted __be64 degrades to integer
include/net/addrconf.h:318:70: warning: restricted __be64 degrades to integer
include/net/addrconf.h:330:25: warning: restricted __be64 degrades to integer
include/net/addrconf.h:330:70: warning: restricted __be64 degrades to integer
include/net/addrconf.h:347:25: warning: restricted __be64 degrades to integer
include/net/addrconf.h:348:26: warning: restricted __be64 degrades to integer
include/net/addrconf.h:349:18: warning: restricted __be64 degrades to integer

The warnings are false but they make it harder to spot real
bugs.

Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>

neigh: directly goto out after setting nud_state to NUD_FAILED

Because those following if conditions will not be matched.

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
This is the rework of the IPsec virtual tunnel interface
for ipv4 to support inter address family tunneling and
namespace crossing. The only change to the last RFC version
is a compile fix for an odd configuration where CONFIG_XFRM
is set but CONFIG_INET is not set.

1) Add and use a IPsec protocol multiplexer.

2) Add xfrm_tunnel_skb_cb to the skb common buffer
   to store a receive callback there.

3) Make vti work with i_key set by not including the i_key
   when comupting the hash for the tunnel lookup in case of
   vti tunnels.

4) Update ip_vti to use it's own receive hook.

5) Remove xfrm_tunnel_notifier, this is replaced by the IPsec
   protocol multiplexer.

6) We need to be protocol family indepenent, so use the on xfrm_lookup
   returned dst_entry instead of the ipv4 rtable in vti_tunnel_xmit().

7) Add support for inter address family tunneling.

8) Check if the tunnel endpoints of the xfrm state and the vti interface
   are matching and return an error otherwise.

8) Enable namespace crossing tor vti devices.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'kdoc'

Luis R. Rodriguez says:

====================
net: start kdoc'ifying net_device

While working on extending some functionality I felt restricted
with the amount of documentation I can add. Part of this is that
the existing style on the header files don't let me be verbose.
This starts addressing that by using kdoc for the net_device
flags, and as Ben noted, the priv_flags can be moved out from
UAPI.

Luis R. Rodriguez (2):
net: kdoc struct net_device flags and priv_flags
net: move net_device priv_flags out from UAPI
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: move net_device priv_flags out from UAPI

These are private to userspace, and they're unstable
anyway and can be shuffled at will (see 080e4130b1fb)
so any userspace application relying on them is on crack.

Test compiled with allyesconfig.

mcgrof@drvbp1 /pub/mem/mcgrof/net-next (git::master)$ make allyesconfig
mcgrof@drvbp1 /pub/mem/mcgrof/net-next (git::master)$ time make -j 20
...
  BUILD   arch/x86/boot/bzImage
Setup is 16992 bytes (padded to 17408 bytes).
System is 56153 kB
CRC 721d2751
Kernel: arch/x86/boot/bzImage is ready  (#1)
real    19m35.744s
user    280m37.984s
sys     27m54.104s

Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: kdoc struct net_device flags and priv_flags

We have documentation for these flags but they're scattered
all over the place. #defines don't allow documentation to be
written easily so to help to start bringing some documentation
together use the enums kdoc practice but keep the defines to
allow userspace to be able to #ifdef them.

I've verified the same values are assigned before and after
with a simple userspace test program [0] and checksumming the
output.

[0] http://drvbp1.linux-foundation.org/~mcgrof/kdoc/netdev_flags/

mcgrof@gnat ~/tmp $ ./check-flags | sha1sum
0ec5b6b1840aa3bb9ce464e61c564820871c92c3 -

Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

atm: nicstar: remove interruptible_sleep_on_timeout

We are trying to finally kill off interruptible_sleep_on_timeout.
the two uses in the nicstar driver can be trivially replaced
with wait_event_interruptible_lock_irq_timeout, which prevents the
wake-up race and is able to check the buffer state with scq->lock
held.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Chas Williams <chas@cmf.nrl.navy.mil>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: switch rtt estimations to usec resolution

Upcoming congestion controls for TCP require usec resolution for RTT
estimations. Millisecond resolution is simply not enough these days.

FQ/pacing in DC environments also require this change for finer control
and removal of bimodal behavior due to the current hack in
tcp_update_pacing_rate() for 'small rtt'

TCP_CONG_RTT_STAMP is no longer needed.

As Julian Anastasov pointed out, we need to keep user compatibility :
tcp_metrics used to export RTT and RTTVAR in msec resolution,
so we added RTT_US and RTTVAR_US. An iproute2 patch is needed
to use the new attributes if provided by the kernel.

In this example ss command displays a srtt of 32 usecs (10Gbit link)

lpk51:~# ./ss -i dst lpk52
Netid  State      Recv-Q Send-Q   Local Address:Port       Peer
Address:Port
tcp    ESTAB      0      1         10.246.11.51:42959
10.246.11.52:64614
         cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448
cwnd:10 send
3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559

Updated iproute2 ip command displays :

lpk51:~# ./ip tcp_metrics | grep 10.246.11.52
10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source
10.246.11.51

Old binary displays :

lpk51:~# ip tcp_metrics | grep 10.246.11.52
10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source
10.246.11.51

With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Larry Brakmo <brakmo@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add skb_mstamp infrastructure

ktime_get() is too expensive on some cases, and we'd like to get
usec resolution timestamps in TCP stack.

This patch adds a light weight facility using a combination of
local_clock() and jiffies samples.

Instead of :

        u64 t0, t1;

        t0 = ktime_get();
        // stuff
        t1 = ktime_get();
        delta_us = ktime_us_delta(t1, t0);

use :
        struct skb_mstamp t0, t1;

        skb_mstamp_get(&t0);
        // stuff
        skb_mstamp_get(&t1);
        delta_us = skb_mstamp_us_delta(&t1, &t0);

Note : local_clock() might have a (bounded) drift between cpus.

Do not use this infra in place of ktime_get() without understanding the
issues.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Larry Brakmo <brakmo@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>

phy: micrel: add of configuration for LED mode

Add support for the led-mode property for the following PHYs
which have a single LED mode configuration value.

KSZ8001 and KSZ8041 which both use register 0x1e bits 15,14 and
KSZ8021, KSZ8031 and KSZ8051 which use register 0x1f bits 5,4
to control the LED configuration.

Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

isdn: fix multiple sleep_on races

The isdn core code uses a couple of wait queues with
interruptible_sleep_on, which is racy and about to get
removed from the kernel. Fortunately, we know for each case
what we are waiting for, so they can all be converted to
the better wait_event_interruptible interface.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>

isdn: divert, hysdn: fix interruptible_sleep_on race

These two drivers use identical code for their procfs status
file handling, which contains a small race against status
data becoming available while reading the file.

This uses wait_event_interruptible instead to fix this
particular race and eventually get rid of all sleep_on
instances. There seems to be another race involving
multiple concurrent readers of the same procfs file, which
I don't try to fix here.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>

isdn: hisax/elsa: fix sleep_on race in elsa FSM

The state machine code in the elsa driver uses interruptible_sleep_on
to wait for state changes, which is racy. A closer look at the possible
states reveals that it is always used to wait for getting back into
ARCOFI_NOP, so we can use wait_event_interruptible instead.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>

isdn: pcbit: fix interruptible_sleep_on race

interruptible_sleep_on is racy and going away. In case of pcbit,
the driver would run into a timeout if the card is initialized
before we start waiting for it. This uses wait_event to fix the
race. In order to do this, the state machine handling for the
timeout case has to get trivially reorganized so we actually know
whether the timeout has occorred or not.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>

atm: firestream: fix interruptible_sleep_on race

interruptible_sleep_on is racy and going away. This replaces the one use
in the firestream driver with the appropriate wait_event_interruptible
variant.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Chas Williams <chas@cmf.nrl.navy.mil>
Cc: linux-atm-general@lists.sourceforge.net
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'intel-next'

Aaron Brown says:

====================
Intel Wired LAN Driver Updates

This series contains updates to ixgbe, igb and documentation.  The
first four have been sent up as part of other series where 1 or more
in the series were rejected and either dropped or still being worked
on for reasons unrelated to these patches.

Don makes recovery from a HW ECC error just schedule a reset as it turns
out the previous behaviour of forcing the user to reload is not necessary.
Mark adds WoL support to port 0 of a new device.  Jacob removes a magic
number from the ptp_caps.name and updates the SubmittingPatches
documentation with details on the Fixed: tag.  And Carolyn updates igb
files to remove the FSF physical mail address.

[ DaveM Note: SubmittingPatches change omitted, will go via LKML ]
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

igb: Update license text to remove FSF address and update copyright.

This patch updates the license text to remove address of Free Software
Foundation and refer users to www.gnu.org instead. This patch also updates
the copyright dates in appropriate igb driver files.

Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

igb: make local functions static and remove dead code

Based on Stephen Hemminger's original patch.
Make local functions static, and remove unused functions.

Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: Add WoL support for a new device

Add WoL support for port 0 of a new 82599-based device.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: don't use magic size number to assign ptp_caps.name

Rather than using a magic size number, just use sizeof since that will
work and is more robust to future changes.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: modify behavior on receiving a HW ECC error.

Currently when we noticed a HW ECC error we would request the use reload
the driver to force a reset of the part. This was done due to the mistaken
believe that a normal reset would not be sufficient. Well it turns out it
would be so now we just schedule a reset upon seeing the ECC.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: yet another new IPV6_MTU_DISCOVER option IPV6_PMTUDISC_OMIT

This option has the same semantic as IP_PMTUDISC_OMIT for IPv4 which
got recently introduced. It doesn't honor the path mtu discovered by the
host but in contrary to IPV6_PMTUDISC_INTERFACE allows the generation of
fragments if the packet size exceeds the MTU of the outgoing interface
MTU.

Fixes: 93b36cf3425b9b ("ipv6: support IPV6_PMTU_INTERFACE on sockets")
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: yet another new IP_MTU_DISCOVER option IP_PMTUDISC_OMIT

IP_PMTUDISC_INTERFACE has a design error: because it does not allow the
generation of fragments if the interface mtu is exceeded, it is very
hard to make use of this option in already deployed name server software
for which I introduced this option.

This patch adds yet another new IP_MTU_DISCOVER option to not honor any
path mtu information and not accepting new icmp notifications destined for
the socket this option is enabled on. But we allow outgoing fragmentation
in case the packet size exceeds the outgoing interface mtu.

As such this new option can be used as a drop-in replacement for
IP_PMTUDISC_DONT, which is currently in use by most name server software
making the adoption of this option very smooth and easy.

The original advantage of IP_PMTUDISC_INTERFACE is still maintained:
ignoring incoming path MTU updates and not honoring discovered path MTUs
in the output path.

Fixes: 482fc6094afad5 ("ipv4: introduce new IP_MTU_DISCOVER mode IP_PMTUDISC_INTERFACE")
Cc: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: use ip_skb_dst_mtu to determine mtu in ip_fragment

ip_skb_dst_mtu mostly falls back to ip_dst_mtu_maybe_forward if no socket
is attached to the skb (in case of forwarding) or determines the mtu like
we do in ip_finish_output, which actually checks if we should branch to
ip_fragment. Thus use the same function to determine the mtu here, too.

This is important for the introduction of IP_PMTUDISC_OMIT, where we
want the packets getting cut in pieces of the size of the outgoing
interface mtu. IPv6 already does this correctly.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

neigh: probe application via netlink in NUD_PROBE

iproute2 arpd seems to expect this as there's code and comments
to handle netlink probes with NUD_PROBE set. It is used to flush
the arpd cached mappings.

opennhrp instead turns off unicast probes (so it can handle all
neighbour discovery). Without this change it will not see NUD_PROBE
probes and cannot reconfirm the mapping. Thus currently neigh entry
will just fail and can cause few packets dropped until broadcast
discovery is restarted.

Earlier discussion on the subject:
http://marc.info/?t=139305877100001&r=1&w=2

Signed-off-by: Timo Teräs <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>

ieee802154: fix new function declaration

The commit 8fad346f366a7 ("eee802154: add basic support for RF212 to
at86rf230 driver") introduced the new function is_rf212() with some
minor issues in declaration:

1) Fix the function type by changing it to bool as the function
   definition returns a boolean value. Additionally both callers of
   is_rf212() are expected to return a boolean value.

2) Fix the function specifier by deleting the inline keyword as the
   compiler takes care of that.

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Cc: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: log src and dst along with "udp checksum is 0"

These info messages are rather pointless without any means to identify
the source of the bogus packets. Logging the src and dst addresses and
ports may help a bit.

Cc: Joe Perches <joe@perches.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>

vxlan: remove unused port variable in vxlan_udp_encap_recv()

Signed-off-by: Pablo Neira Ayuso <pablo@gnumonks.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlx4'

Amir Vadai says:

====================
net, net/mlx4: Add sysfs file for port number

Modern distro's are using biosdevname to rename interface to a name based on
slot/port number.
biosdevname can't get the port number of devices that have multiple ports that
share the same PCI function.
This patch adds a sysfs file under: /sys/devices/.../net/<interface>/dev_port,
that contains the port number (0 based) - to be used by biosdevname.
Also, dev_id was wrongly used in mlx4_en driver - added a patch that fix it.

This patch was tested and applied over commit 51adfcc "net: bcmgenet: remove
unused bh_lock member"
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_en: Fix bad use of dev_id

dev_id should be set for multiple netdev's sharing the same MAC, which
is not the case here.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx4_en: Expose port number through sysfs

Initialize dev_port with port number (0 based) to be accessed through
sysfs from user space.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: Add sysfs file for port number

Add a sysfs file to enable user space to query the device
port number used by a netdevice instance. This is needed for
devices that have multiple ports on the same PCI function.

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'bnx2x'

Michal Schmidt says:

====================
bnx2x: minimize RAM usage in kdump

kdump kernels usually have only a small amount of memory reserved.
bnx2x can be memory-hungry. Let's minimize its memory usage when
running in kdump.

I detect kdump by looking at the "reset_devices" flag. A couple of
storage drivers (cciss, hpsa) use it for the same purpose. I am not sure
this is the best way to solve the problem, but it works.

Should it be made more generic by, say, looking at the total amount
of lowmem instead? Not using TPA by default when lowmem is small and/or
defaulting to fewer queues would help 32bit systems where a driver for
a multi-function multi-queue NIC can consume a significant amount
of available memory. Or do we want no such heuristics?

Is this something to consider doing for other network drivers too?
====================

Acked-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: save RAM in kdump kernel by disabling TPA

When running in a kdump kernel, disable TPA. This saves memory, which
tends to be scarce in kdump.

TPA, being a receive acceleration, is unlikely to be useful for kdump,
whose purpose is to send the memory image out.

This saves additional 5 MB in the kdump environment.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: save RAM in kdump kernel by using a single queue

When running in a kdump kernel, make sure to use only a single ethernet
queue even if a num_queues option in /etc/modprobe.d/*.conf would specify
otherwise. This saves memory, which tends to be scarce in kdump.

This saves about 40 MB in the kdump environment on a setup with
num_queues=8 in the config file.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: clamp num_queues to prevent passing a negative value

Use the clamp() macro to make the calculation of the number of queues
slightly easier to understand. It also avoids a crash when someone
accidentally passes a negative value in num_queues= module parameter.

Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: tcp: add mib counters to track zero window transitions

Three counters are added:
- one to track when we went from non-zero to zero window
- one to track the reverse
- one counter incremented when we want to announce zero window,
but can't because we would shrink current window.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: order MPLS ethertypes numerically

All ethertypes other than ETH_P_MPLS_UC, ETH_P_MPLS_MC and
ETH_P_ATMMPOA were already ordered numerically. This commit moves
those three ETH_P_... values into correct numerical order too.

Signed-off-by: Neil Jerram <Neil.Jerram@metaswitch.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bnx2x: Remove hidden flow control goto from BNX2X_ALLOC macros

BNX2X_ALLOC macros use "goto alloc_mem_err"
so these labels appear unused in some functions.

Expand these macros in-place via coccinelle and
some typing.

Update the macros to use statement expressions
and remove the BNX2X_ALLOC macro.

This adds some > 80 char lines.

$ cat bnx2x_pci_alloc.cocci
@@
expression e1;
expression e2;
expression e3;
@@
- BNX2X_PCI_ALLOC(e1, e2, e3);
+ e1 = BNX2X_PCI_ALLOC(e2, e3); if (!e1) goto alloc_mem_err;

@@
expression e1;
expression e2;
expression e3;
@@
- BNX2X_PCI_FALLOC(e1, e2, e3);
+ e1 = BNX2X_PCI_FALLOC(e2, e3); if (!e1) goto alloc_mem_err;

@@
expression e1;
expression e2;
@@
- BNX2X_ALLOC(e1, e2);
+ e1 = kzalloc(e2, GFP_KERNEL); if (!e1) goto alloc_mem_err;

@@
expression e1;
expression e2;
expression e3;
@@
- kzalloc(sizeof(e1) * e2, e3)
+ kcalloc(e2, sizeof(e1), e3)

@@
expression e1;
expression e2;
expression e3;
@@
- kzalloc(e1 * sizeof(e2), e3)
+ kcalloc(e1, sizeof(e2), e3)

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

vti4: Enable namespace changing

vti4 is now fully namespace aware, so allow namespace changing
for vti devices

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

vti4: Check the tunnel endpoints of the xfrm state and the vti interface

The tunnel endpoints of the xfrm_state we got from the xfrm_lookup
must match the tunnel endpoints of the vti interface. This patch
ensures this matching.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

vti4: Support inter address family tunneling.

With this patch we can tunnel ipv6 traffic via a vti4
interface. A vti4 interface can now have an ipv6 address
and ipv6 traffic can be routed via a vti4 interface.
The resulting traffic is xfrm transformed and tunneled
throuhg ipv4 if matching IPsec policies and states are
present.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

vti4: Use the on xfrm_lookup returned dst_entry directly

We need to be protocol family indepenent to support
inter addresss family tunneling with vti. So use a
dst_entry instead of the ipv4 rtable in vti_tunnel_xmit.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

xfrm4: Remove xfrm_tunnel_notifier

This was used from vti and is replaced by the IPsec protocol
multiplexer hooks. It is now unused, so remove it.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

vti: Update the ipv4 side to use it's own receive hook.

With this patch, vti uses the IPsec protocol multiplexer to
register it's own receive side hooks for ESP, AH and IPCOMP.

Vti now does the following on receive side:

1. Do an input policy check for the IPsec packet we received.
   This is required because this packet could be already
   prosecces by IPsec, so an inbuond policy check is needed.

2. Mark the packet with the i_key. The policy and the state
   must match this key now. Policy and state belong to the outer
   namespace and policy enforcement is done at the further layers.

3. Call the generic xfrm layer to do decryption and decapsulation.

4. Wait for a callback from the xfrm layer to properly clean the
   skb to not leak informations on namespace and to update the
   device statistics.

On transmit side:

1. Mark the packet with the o_key. The policy and the state
   must match this key now.

2. Do a xfrm_lookup on the original packet with the mark applied.

3. Check if we got an IPsec route.

4. Clean the skb to not leak informations on namespace
   transitions.

5. Attach the dst_enty we got from the xfrm_lookup to the skb.

6. Call dst_output to do the IPsec processing.

7. Do the device statistics.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

ip_tunnel: Make vti work with i_key set

Vti uses the o_key to mark packets that were transmitted or received
by a vti interface. Unfortunately we can't apply different marks
to in and outbound packets with only one key availabe. Vti interfaces
typically use wildcard selectors for vti IPsec policies. On forwarding,
the same output policy will match for both directions. This generates
a loop between the IPsec gateways until the ttl of the packet is
exceeded.

The gre i_key/o_key are usually there to find the right gre tunnel
during a lookup. When vti uses the i_key to mark packets, the tunnel
lookup does not work any more because vti does not use the gre keys
as a hash key for the lookup.

This patch workarounds this my not including the i_key when comupting
the hash for the tunnel lookup in case of vti tunnels.

With this we have separate keys available for the transmitting and
receiving side of the vti interface.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

xfrm: Add xfrm_tunnel_skb_cb to the skb common buffer

IPsec vti_rcv needs to remind the tunnel pointer to
check it later at the vti_rcv_cb callback. So add
this pointer to the IPsec common buffer, initialize
it and check it to avoid transport state matching of
a tunneled packet.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

ipcomp4: Use the IPsec protocol multiplexer API

Switch ipcomp4 to use the new IPsec protocol multiplexer.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

ah4: Use the IPsec protocol multiplexer API

Switch ah4 to use the new IPsec protocol multiplexer.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

esp4: Use the IPsec protocol multiplexer API

Switch esp4 to use the new IPsec protocol multiplexer.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

xfrm4: Add IPsec protocol multiplexer

This patch add an IPsec protocol multiplexer. With this
it is possible to add alternative protocol handlers as
needed for IPsec virtual tunnel interfaces.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

net: bcmgenet: remove unused bh_lock member

bh_lock spinlock is unused, remove it from the private driver structure.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bcmgenet: remove commented code in bcmgenet_xmit()

This code is commented since it is unused, left-over from the very first
time this driver was merged.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: bcmgenet: drop checks on priv->phydev

Drop all the checks on priv->phydev since we will refuse probing the
driver if we cannot attach to a PHY device. Drop all checks on
priv->phydev. This also fixes some smatch issues reported by Dan
Carpenter.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'gianfar'

Claudiu Manoil says:

====================
gianfar: Device reset and reconfig fixes

These patches end up fixing some notable device reset & reconfig
related problems. One issue is on-the-fly (Rx/Tx on) programming
of interrupt coalescing (IC) registers on the processing path,
against HW recommendation. This is an old issue that became visible
after BQL introduction, as under certain conditions (low traffic)
one TX interrupt gets lost and BQL fires Tx timeout as a result.
Another notable issue is a race on the Tx path (xmit, clean_tx)
during device reset (i.e. during Tx timeout watchdog firing)
that leads to NULL access.
Fixing the problematic on-thy-fly register writes (i.e. the IC regs)
required the implementation of a MAC soft reset procedure.
The race leading to NULL access was addressed by fixing the
stop_gfar()/startup_gfar() pair (disable/enable napi a.s.o.)
and adding the device state DOWN to sync with the TX path.

v2: Refactored if() clauses from gfar_set_features(), PATCH 2.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

gianfar: Fix Tx int miss, dont write IC on-the-fly

Programming the interrupt coalescing (IC) registers while
the controller/DMA is on may incur the loss of one Tx
confirmation interrupt, under certain conditions.  This is
a subtle hw race because it does not occur during a burst
of Tx packets.  It has been observed on p2020 devices that,
if just one packet is being xmit'ed, the Tx confirmation
doesn't trigger and BQL evetually blocks the Tx queues,
followed by Tx timeout and an un-responsive device.
This issue was not apparent prior to introducing BQL
support, as a late Tx confirmation was not an issue back then
and the next burst of Tx frames would have triggered the
Tx confirmation/ Tx ring cleanup anyway.

Bottom line, the hw specifications state that the IC registers
should not be programmed while the Rx/Tx blocks (the DMA) are
enabled. Further more, these registers are currently re-written
with the same values on the processing path, over and over again.
To fix this, rewriting the IC registers has been removed from
the processing path (napi poll).  A complete MAC reset procedure
has been implemented for the ethtool -c option instead, to
reliably update these registers while the controller is stopped.

Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

gianfar: Fix device reset races (oops) for Tx

The device reset procedure, stop_gfar()/startup_gfar(), has
concurrency issues.
"Kernel access of bad area" oopses show up during Tx timeout
device reset or other reset cases (like changing MTU) that
happen while the interface still has traffic. The oopses
happen in start_xmit and clean_tx_ring when accessing tx_queue->
tx_skbuff which is NULL. The race comes from de-allocating the
tx_skbuff while transmission and napi processing are still
active. Though the Tx queues get temoprarily stopped when Tx
timeout occurs, they get re-enabled as a result of Tx congestion
handling inside the napi context (see clean_tx_ring()). Not
disabling the napi during reset is also a bug, because
clean_tx_ring() will try to access tx_skbuff while it is being
de-alloc'ed and re-alloc'ed.

To fix this, stop_gfar() needs to disable napi processing
after stopping the Tx queues. However, in order to prevent
clean_tx_ring() to re-enable the Tx queue before the napi
gets disabled, the device state DOWN has been introduced.
It prevents the Tx congestion management from re-enabling the
de-congested Tx queue while the device is brought down.
An additional locking state, RESETTING, has been introduced
to prevent simultaneous resets or to prevent configuring the
device while it is resetting.
The bogus 'rxlock's (for each Rx queue) have been removed since
their purpose is not justified, as they don't prevent nor are
suited to prevent device reset/reconfig races (such as this one).

Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>