git.karo-electronics.de Git - linux-beck.git/log

i40e/i40evf: Fix casting in transmit code

Simple cast to fix a sparse warning.

Fixes: commit 5453205cd097 ("i40e/i40evf: Enable support for
SKB_GSO_UDP_TUNNEL_CSUM")

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf: Add support for bulk free in Tx cleanup

This patch enables bulk Tx clean for skbs. In order to enable it we need
to pass the napi_budget value as that is used to determine if we are truly
running in NAPI mode or if we are simply calling the routine from netpoll
with a budget of 0. In order to avoid adding too many more variables I
thought it best to pass the VSI directly in a fashion similar to what we do
on igb and ixgbe with the q_vector.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf: Fix handling of boolean logic in polling routines

In the polling routines for i40e and i40evf we were using bitwise operators
to avoid the side effects of the logical operators, specifically the fact
that if the first case is true with "||" we skip the second case, or if it
is false with "&&" we skip the second case. This fixes an earlier patch
that converted the bitwise operators over to the logical operators and
instead replaces the entire thing with just an if statement since it should
be more readable what we are trying to do this way.

Fixes: 1a36d7fadd14 ("i40e/i40evf: use logical operators, not bitwise")
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40evf: remove dead code

The only error case is when the malloc fails, in which case the clean up
loop does nothing at all, so remove it

Signed-off-by: Alan Cox <alan@linux.intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e/i40evf: Allow up to 12K bytes of data per Tx descriptor instead of 8K

From what I can tell the practical limitation on the size of the Tx data
buffer is the fact that the Tx descriptor is limited to 14 bits.  As such
we cannot use 16K as is typically used on the other Intel drivers.  However
artificially limiting ourselves to 8K can be expensive as this means that
we will consume up to 10 descriptors (1 context, 1 for header, and 9 for
payload, non-8K aligned) in a single send.

I propose that we can reduce this by increasing the maximum data for a 4K
aligned block to 12K.  We can reduce the descriptors used for a 32K aligned
block by 1 by increasing the size like this.  In addition we still have the
4K - 1 of space that is still unused.  We can use this as a bit of extra
padding when dealing with data that is not aligned to 4K.

By aligning the descriptors after the first to 4K we can improve the
efficiency of PCIe accesses as we can avoid using byte enables and can fetch
full TLP transactions after the first fetch of the buffer.  This helps to
improve PCIe efficiency.  Below is the results of testing before and after
with this patch:

Recv   Send   Send                         Utilization      Service Demand
Socket Socket Message  Elapsed             Send     Recv    Send    Recv
Size   Size   Size     Time    Throughput  local    remote  local   remote
bytes  bytes  bytes    secs.   10^6bits/s  % S      % U     us/KB   us/KB
Before:
87380  16384  16384    10.00     33682.24  20.27    -1.00   0.592   -1.00
After:
87380  16384  16384    10.00     34204.08  20.54    -1.00   0.590   -1.00

So the net result of this patch is that we have a small gain in throughput
due to a reduction in overhead for putting together the frame.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

i40e: call ndo_stop() instead of dev_close() when running offline selftest

Calling dev_close() causes IFF_UP to be cleared which will remove the
interfaces routes and some addresses. That's probably not what the user
intended when running the offline selftest. Besides this does not happen
if the interface is brought down before the test, so the current
behaviour is inconsistent.
Instead call the net_device_ops ndo_stop function directly and avoid
touching IFF_UP at all.

Signed-off-by: Stefan Assmann <sassmann@kpanic.de>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Merge branch 'tcp-udp-misc'

Eric Dumazet says:

====================
net: various udp/tcp changes

First round of patches for linux-4.7

Add a generic facility for sockets to be freed after an RCU grace
period, if they need to.

Then UDP stack is changed to no longer use SLAB_DESTROY_BY_RCU,
in order to speedup rx processing for traffic encapsulated in UDP.
It gives a 17 % speedup for normal UDP reception in stress conditions.

Then TCP listeners are changed to use SOCK_RCU_FREE as well
to avoid touching sk_refcnt in synflood case :
I got up to 30 % performance increase for a mono listener.

Then three patches add SK_MEMINFO_DROPS to sock_diag
and add per socket rx drops accounting to TCP.

Last patch adds rate limiting on ACK sent on behalf of SYN_RECV
to better resist to SYNFLOOD targeting one or few flows.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: rate limit ACK sent by SYN_RECV request sockets

Attackers like to use SYNFLOOD targeting one 5-tuple, as they
hit a single RX queue (and cpu) on the victim.

If they use random sequence numbers in their SYN, we detect
they do not match the expected window and send back an ACK.

This patch adds a rate limitation, so that the effect of such
attacks is limited to ingress only.

We roughly double our ability to absorb such attacks.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: tcp: set SOCK_USE_WRITE_QUEUE for ip_send_unicast_reply()

TCP uses per cpu 'sockets' to send some packets :
- RST packets ( tcp_v4_send_reset()) )
- ACK packets for SYN_RECV and TIMEWAIT sockets

By setting SOCK_USE_WRITE_QUEUE flag, we tell sock_wfree()
to not call sk_write_space() since these internal sockets
do not care.

This gives a small performance improvement, merely by allowing
cpu to properly predict the sock_wfree() conditional branch,
and avoiding one atomic operation.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: increment sk_drops for listeners

Goal: packets dropped by a listener are accounted for.

This adds tcp_listendrop() helper, and clears sk_drops in sk_clone_lock()
so that children do not inherit their parent drop count.

Note that we no longer increment LINUX_MIB_LISTENDROPS counter when
sending a SYNCOOKIE, since the SYN packet generated a SYNACK.
We already have a separate LINUX_MIB_SYNCOOKIESSENT

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: increment sk_drops for dropped rx packets

Now ss can report sk_drops, we can instruct TCP to increment
this per socket counter when it drops an incoming frame, to refine
monitoring and debugging.

Following patch takes care of listeners drops.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sock_diag: add SK_MEMINFO_DROPS

Reporting sk_drops to user space was available for UDP
sockets using /proc interface.

Add this to sock_diag, so that we can have the same information
available to ss users, and we'll be able to add sk_drops
indications for TCP sockets as well.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp/dccp: do not touch listener sk_refcnt under synflood

When a SYNFLOOD targets a non SO_REUSEPORT listener, multiple
cpus contend on sk->sk_refcnt and sk->sk_wmem_alloc changes.

By letting listeners use SOCK_RCU_FREE infrastructure,
we can relax TCP_LISTEN lookup rules and avoid touching sk_refcnt

Note that we still use SLAB_DESTROY_BY_RCU rules for other sockets,
only listeners are impacted by this change.

Peak performance under SYNFLOOD is increased by ~33% :

On my test machine, I could process 3.2 Mpps instead of 2.4 Mpps

Most consuming functions are now skb_set_owner_w() and sock_wfree()
contending on sk->sk_wmem_alloc when cooking SYNACK and freeing them.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

inet: reqsk_alloc() needs to take care of dead listeners

We'll soon no longer take a refcount on listeners,
so reqsk_alloc() can not assume a listener refcount is not
zero. We need to use atomic_inc_not_zero()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp/dccp: use rcu locking in inet_diag_find_one_icsk()

RX packet processing holds rcu_read_lock(), so we can remove
pairs of rcu_read_lock()/rcu_read_unlock() in lookup functions
if inet_diag also holds rcu before calling them.

This is needed anyway as __inet_lookup_listener() and
inet6_lookup_listener() will soon no longer increment
refcount on the found listener.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp/dccp: remove BH disable/enable in lookup

Since linux 2.6.29, lookups only use rcu locking.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: no longer use SLAB_DESTROY_BY_RCU

Tom Herbert would like not touching UDP socket refcnt for encapsulated
traffic. For this to happen, we need to use normal RCU rules, with a grace
period before freeing a socket. UDP sockets are not short lived in the
high usage case, so the added cost of call_rcu() should not be a concern.

This actually removes a lot of complexity in UDP stack.

Multicast receives no longer need to hold a bucket spinlock.

Note that ip early demux still needs to take a reference on the socket.

Same remark for functions used by xt_socket and xt_PROXY netfilter modules,
but this might be changed later.

Performance for a single UDP socket receiving flood traffic from
many RX queues/cpus.

Simple udp_rx using simple recvfrom() loop :
438 kpps instead of 374 kpps : 17 % increase of the peak rate.

v2: Addressed Willem de Bruijn feedback in multicast handling
- keep early demux break in __udp4_lib_demux_lookup()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Willem de Bruijn <willemb@google.com>
Tested-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: add SOCK_RCU_FREE socket flag

We want a generic way to insert an RCU grace period before socket
freeing for cases where RCU_SLAB_DESTROY_BY_RCU is adding too
much overhead.

SLAB_DESTROY_BY_RCU strict rules force us to take a reference
on the socket sk_refcnt, and it is a performance problem for UDP
encapsulation, or TCP synflood behavior, as many CPUs might
attempt the atomic operations on a shared sk_refcnt

UDP sockets and TCP listeners can set SOCK_RCU_FREE so that their
lookup can use traditional RCU rules, without refcount changes.
They can set the flag only once hashed and visible by other cpus.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <tom@herbertland.com>
Tested-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
10GbE Intel Wired LAN Driver Updates 2016-04-04

This series contains updates to ixgbe and ixgbevf.

Pavel Tikhomirov fixes a typo where we were incrementing transmit stats
instead of receive stats on the receive side.

Emil updates the ixgbevf driver to use bit operations for setting and
checking the adapter state.

Chas Williams adds the new NDO trust feature check so that the VF guest
has the ability to set the unicast address of the interface, if it is a
trusted VF.

Alex cleans up the driver to that the only time we add a PF entry to the
VLVF is either for VLAN 0 or if the PF has requested a VLAN that a VF
is already using.  Also adds support for generic transmit checksums,
giving the added advantage is that we can support inner checksum offloads
for tunnels and MPLS while still being able to transparently insert
VLAN tags.  Lastly, changed ixgbe so that we can use the ethtool
rx-vlan-filter flag to toggle receive VLAN filtering on and off.

Mark cleans up the ixgbe driver by making all op structures that do not
change constants.  Also fixed flow control for Xeon D KR backplanes, since
we cannot use auto-negotiation to determine the mode, we have to use
whatever the user configured.

Sowmini Varadhan updates ixgbe to use eth_platform_get_mac_address()
instead of the arch specific solution that was added by a previous
commit.

Don fixed an issue where it was possible that a system reset could occur
when we were holding the SWFW semaphore lock, which the next time the
driver loaded would see it incorrectly as locked.

v2: updated patch 8 of the series to include a minor flags issue where
    we had lost NETIF_F_HW_TC and we were setting NETIF_F_SCTP_CRC in
    two different areas, when we only needed/wanted it in one spot.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mv88e6131-hw-bridging-6185'

Vivien Didelot says:

====================
net: dsa: mv88e6131: HW bridging support for 6185

All packets passing through a switch of the 6185 family are currently all
directed to the CPU port. This means that port bridging is software driven.

To enable hardware bridging for this switch family, we need to implement the
port mapping operations, the FDB operations, and optionally the VLAN operations
(for 802.1Q and VLAN filtering aware systems).

However this family only has 256 FDBs indexed by 8-bit identifiers, opposed to
4096 FDBs with 12-bit identifiers for other families such as 6352. It also
doesn't have dedicated FID registers for ATU and VTU operations.

This patchset fixes these differences, and enable hardware bridging for 6185.

Changes v1 -> v2:
- Describe the different numbers of databases and prefer a feature-based logic
over the current ID/family-based logic.
====================

Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6131: enable hardware bridging

By adding support for bridge operations, FDB operations, and optionally
VLAN operations (for 802.1Q and VLAN filtering aware systems), the
switch bridges ports correctly, the CPU is able to populate the hardware
address databases, and thus hardware bridging becomes functional within
the 88E6185 family of switches.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: map destination addresses for 6185

The 88E6185 switch also has a MapDA bit in its Port Control 2 register.
When this bit is cleared, all frames are sent out to the CPU port.

Set this bit to rely on address databases (ATU) hits and direct frames
out of the correct ports, and thus allow hardware bridging.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: support 256 databases

The 6185 family of devices has only 256 address databases. Their 8-bit
FID for ATU and VTU operations are split into ATU Control and ATU/VTU
Operation registers.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: variable number of databases

Marvell switch chips have different number of address databases.

The code currently only supports models with 4096 databases. Such switch
has dedicated FID registers for ATU and VTU operations. Models with
fewer databases have their FID split in several registers.

List them all but only support models with 4096 databases at the moment.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: protect FID registers access

Only switch families with 4096 address databases have dedicated FID
registers for ATU and VTU operations.

Factorize the access to the GLOBAL_ATU_FID register and introduce a
mv88e6xxx_has_fid_reg() helper function to protect the access to
GLOBAL_ATU_FID and GLOBAL_VTU_FID.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mv88e6xxx: protect SID register access

Introduce a mv88e6xxx_has_stu() helper to protect the access to the
GLOBAL_VTU_SID register, instead of checking switch families.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: Add support for toggling VLAN filtering flag via ethtool

This change makes it so that we can use the ethtool rx-vlan-filter flag to
toggle Rx VLAN filtering on and off. This is basically just an extension
of the existing VLAN promisc work in that it just adds support for the
additional ethtool flag.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Extend cls_u32 offload to support UDP headers

Added support to match on UDP fields in the transport layer.
Extended core logic to support multiple headers.

Verified with the following filters :

handle 1: u32 divisor 1
u32 ht 800: order 1 link 1: \
offset at 0 mask 0f00 shift 6 plus 0 eat match ip protocol 6 ff
u32 ht 1: order 2 \
match tcp src 1024 ffff match tcp dst 23 ffff action drop
handle 2: u32 divisor 1
u32 ht 800: order 3 link 2: \
offset at 0 mask 0f00 shift 6 plus 0 eat match ip protocol 17 ff
u32 ht 2: order 4 \
match udp src 1025 ffff match udp dst 24 ffff action drop

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Place SWFW semaphore in known valid state at probe

It is possible on some HW that a system reset could occur when we are
holding the SWFW semaphore lock. So next time the driver was loaded we
would see it incorrectly as locked. This patch will recover from that state
by: Attempting to acquire the semaphore and then regardless of whether or
not it was acquire we immediately release it. This will force us into
a known good state.

Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: add a callback to set the maximum transmit bitrate

This commit adds a callback which allows to adjust the maximum transmit
bitrate the card can output. This makes it possible to get a smooth
traffic instead of the default burst-y behaviour when trying to output
e.g. a video stream.

Much of the logic needed to get a correct bcnrc_val was taken from the
ixgbe_set_vf_rate_limit() function.

Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Fix flow control for Xeon D KR backplane

Xeon D KR backplane is different from other backplanes,
in that we can't use auto-negotiation to determine the
mode. Instead, use whatever the user configured.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbevf: Add support for generic Tx checksums

This patch adds support for generic Tx checksums to the ixgbevf driver.  It
turns out this is actually pretty easy after going over the datasheet as we
were doing a number of steps we didn't need to.

In order to perform a Tx checksum for an L4 header we need to fill in the
following fields in the Tx descriptor:
  MACLEN (maximum of 127), retrieved from:
skb_network_offset()
  IPLEN  (maximum of 511), retrieved from:
skb_checksum_start_offset() - skb_network_offset()
  TUCMD.L4T indicates offset and if checksum or crc32c, based on:
skb->csum_offset

The added advantage to doing this is that we can support inner checksum
offloads for tunnels and MPLS while still being able to transparently
insert VLAN tags.

I also took the opportunity to clean-up many of the feature flag
configuration bits to make them a bit more consistent between drivers.  In
the case of the VF drivers this meant adding support for SCTP CRCs, and
inner checksum offloads for MPLS and various tunnel types.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Add support for generic Tx checksums

This patch adds support for generic Tx checksums to the ixgbe driver.  It
turns out this is actually pretty easy after going over the datasheet as we
were doing a number of steps we didn't need to.

In order to perform a Tx checksum for an L4 header we need to fill in the
following fields in the Tx descriptor:
  MACLEN (maximum of 127), retrieved from:
skb_network_offset()
  IPLEN  (maximum of 511), retrieved from:
skb_checksum_start_offset() - skb_network_offset()
  TUCMD.L4T indicates offset and if checksum or crc32c, based on:
skb->csum_offset

The added advantage to doing this is that we can support inner checksum
offloads for tunnels and MPLS while still being able to transparently
insert VLAN tags.

I also took the opportunity to clean-up many of the feature flag
configuration bits to make them a bit more consistent between drivers.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: use eth_platform_get_mac_address()

This commit converts commit c762dff24c06 ("ixgbe: Look up MAC address in
Open Firmware or IDPROM") to use eth_platform_get_mac_address()
added by commit c7f5d105495a ("net: Add eth_platform_get_mac_address()
helper.")

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Make all unchanging ops structures const

The source for the ops structure contents are const, so make them
so. Copy them in place with structure assignments instead of memcpys.
Make the mbx_ops accessed by reference instead of making a copy of
the source structure. Update copyright date on the touched files.

Reported-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Acked-by: Julia Lawall <julia.lawall@lip6.fr>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Avoid adding VLAN 0 twice to VLVF and VFTA

We were adding VLAN 0 twice each time we restored the VLAN configuration.
Instead of doing it twice we can just start working through the active
VLANs from ID 1 on and skip the double write.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

irda: sh_irda: remove driver

Remove the sh-irda driver as it appears to be unused since
c0bb9b302769 ("ARCH: ARM: shmobile: Remove ag5evm board support").

Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'macb-coding-style'

Moritz Fischer says:

====================
macb: Codingstyle cleanups

resending almost unchanged v2 here:

Changes from v2:
* Rebased onto net-next
* Changed 5th patches commit message
* Added Nicholas' and Michal's Acked-Bys

Changes from v1:
* Backed out variable scope changes
* Separated out ether_addr_copy into it's own commit
* Fixed typo in comments as suggested by Joe
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: macb: Fix simple typo

Acked-by: Michal Simek <michal.simek@xilinx.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: Moritz Fischer <moritz.fischer@ettus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: macb: Use ether_addr_copy over memcpy

Checkpatch suggests using ether_addr_copy over memcpy
to copy the mac address.

Acked-by: Michal Simek <michal.simek@xilinx.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: Moritz Fischer <moritz.fischer@ettus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: macb: Fix coding style suggestions

This commit deals with a bunch of checkpatch suggestions
that without changing behavior make checkpatch happier.

Acked-by: Michal Simek <michal.simek@xilinx.com>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: Moritz Fischer <moritz.fischer@ettus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: macb: Fix coding style warnings

This commit takes care of the coding style warnings
that are mostly due to a different comment style and
lines over 80 chars, as well as a dangling else.

Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: Moritz Fischer <moritz.fischer@ettus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: macb: Fix coding style error message

checkpatch.pl gave the following error:

ERROR: space required before the open parenthesis '('
+ for(; p < end; p++, offset += 4)

Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Acked-by: Michal Simek <michal.simek@xilinx.com>
Signed-off-by: Moritz Fischer <moritz.fischer@ettus.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ravb: Add dma queue interrupt support

This patch supports the following interrupts.

- One interrupt for multiple (timestamp, error, gPTP)
- One interrupt for emac
- Four interrupts for dma queue (best effort rx/tx, network control rx/tx)

This patch improve efficiency of the interrupt handler by adding the
interrupt handler corresponding to each interrupt source described
above. Additionally, it reduces the number of times of the access to
EthernetAVB IF.
Also this patch prevent this driver depends on the whim of a boot loader.

[ykaneko0929@gmail.com: define bit names of registers]
[ykaneko0929@gmail.com: add comment for gen3 only registers]
[ykaneko0929@gmail.com: fix coding style]
[ykaneko0929@gmail.com: update changelog]
[ykaneko0929@gmail.com: gen3: fix initialization of interrupts]
[ykaneko0929@gmail.com: gen3: fix clearing interrupts]
[ykaneko0929@gmail.com: gen3: add helper function for request_irq()]
[ykaneko0929@gmail.com: gen3: remove IRQF_SHARED flag for request_irq()]
[ykaneko0929@gmail.com: revert ravb_close() and ravb_ptp_stop()]
[ykaneko0929@gmail.com: avoid calling free_irq() to non-hooked interrupts]
[ykaneko0929@gmail.com: make NC/BE interrupt handler a function]
[ykaneko0929@gmail.com: make timestamp interrupt handler a function]
[ykaneko0929@gmail.com: timestamp interrupt is handled in multiple
interrupt handler instead of dma queue interrupt handler]
Signed-off-by: Kazuya Mizuguchi <kazuya.mizuguchi.ks@renesas.com>
Signed-off-by: Yoshihiro Kaneko <ykaneko0929@gmail.com>
Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: Do not allow PF to add VLVF entry unless it actually needs it

While doing the work on igb I realized there were a few cases where we were
still adding VLANs to the VLVF entries for the PF when they were not
needed. This patch cleans that up so that the only time we add a PF entry
to the VLVF is either for VLAN 0 or if the PF has requested a VLAN that a VF
is already using.

Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Extend trust to allow guest to set unicast address

When running certain routing protocols like VRRP, VF guests need the
ability to set the unicast address of the interface. Extend the new ndo
trust feature to let the hypervisor trust a guest to set/update its own
unicast address.

Signed-off-by: Chas Williams <3chas3@gmail.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbevf: use bit operations for setting and checking resets

Move the reset flags to adapter->state in order to make use of bit
operations.

This is an alternative patch to the one previously submitted by
John Greene.

Suggested-by: Alexander Duyck <aduyck@mirantis.com>
Reported-by: Scott Otto <otts62@yahoo.com>
Reported-by: John Greene <jogreene@redhat.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Merge branch 'cmsg_timestamp'

Soheil Hassas Yeganeh says:

====================
add TX timestamping via cmsg

This patch series aim at enabling TX timestamping via cmsg.

Currently, to occasionally sample TX timestamping on a socket,
applications need to call setsockopt twice: first for enabling
timestamps and then for disabling them. This is an unnecessary
overhead. With cmsg, in contrast, applications can sample TX
timestamps per sendmsg().

This patch series adds the code for processing SO_TIMESTAMPING
for cmsg's of the SOL_SOCKET level, and adds the glue code for
TCP, UDP, and RAW for both IPv4 and IPv6. This implementation
supports overriding timestamp generation flags (i.e.,
SOF_TIMESTAMPING_TX_*) but not timestamp reporting flags.
Applications must still enable timestamp reporting via
setsockopt to receive timestamps.

This series does not change existing timestamping behavior for
applications that are using socket options.

I will follow up with another patch to enable timestamping for
active TFO (client-side TCP Fast Open) and also setting packet
mark via cmsgs.

Thanks!

Changes in v2:
- Replace u32 with __u32 in the documentation.

Changes in v3:
- Fix the broken build for L2TP (due to changes
in IPv6).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

sock: document timestamping via cmsg in Documentation

Update docs and add code snippet for using cmsg for timestamping.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sock: enable timestamping using control messages

Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
This is very costly when users want to sample writes to gather
tx timestamps.

Add support for enabling SO_TIMESTAMPING via control messages by
using tsflags added in `struct sockcm_cookie` (added in the previous
patches in this series) to set the tx_flags of the last skb created in
a sendmsg. With this patch, the timestamp recording bits in tx_flags
of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.

Please note that this is only effective for overriding the recording
timestamps flags. Users should enable timestamp reporting (e.g.,
SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
socket options and then should ask for SOF_TIMESTAMPING_TX_*
using control messages per sendmsg to sample timestamps for each
write.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: process socket-level control messages in IPv6

Process socket-level control messages by invoking
__sock_cmsg_send in ip6_datagram_send_ctl for control messages on
the SOL_SOCKET layer.

This makes sure whenever ip6_datagram_send_ctl is called for
udp and raw, we also process socket-level control messages.

This is a bit uglier than IPv4, since IPv6 does not have
something like ipcm_cookie. Perhaps we can later create
a control message cookie for IPv6?

Note that this commit interprets new control messages that
were ignored before. As such, this commit does not change
the behavior of IPv6 control messages.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv4: process socket-level control messages in IPv4

Process socket-level control messages by invoking
__sock_cmsg_send in ip_cmsg_send for control messages on
the SOL_SOCKET layer.

This makes sure whenever ip_cmsg_send is called in udp, icmp,
and raw, we also process socket-level control messages.

Note that this commit interprets new control messages that
were ignored before. As such, this commit does not change
the behavior of IPv4 control messages.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sock: accept SO_TIMESTAMPING flags in socket cmsg

Accept SO_TIMESTAMPING in control messages of the SOL_SOCKET level
as a basis to accept timestamping requests per write.

This implementation only accepts TX recording flags (i.e.,
SOF_TIMESTAMPING_TX_HARDWARE, SOF_TIMESTAMPING_TX_SOFTWARE,
SOF_TIMESTAMPING_TX_SCHED, and SOF_TIMESTAMPING_TX_ACK) in
control messages. Users need to set reporting flags (e.g.,
SOF_TIMESTAMPING_OPT_ID) per socket via socket options.

This commit adds a tsflags field in sockcm_cookie which is
set in __sock_cmsg_send. It only override the SOF_TIMESTAMPING_TX_*
bits in sockcm_cookie.tsflags allowing the control message
to override the recording behavior per write, yet maintaining
the value of other flags.

This patch implements validating the control message and setting
tsflags in struct sockcm_cookie. Next commits in this series will
actually implement timestamping per write for different protocols.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: use one bit in TCP_SKB_CB to mark ACK timestamps

Currently, to avoid a cache line miss for accessing skb_shinfo,
tcp_ack_tstamp skips socket that do not have
SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is
implemented based on an implicit assumption that the
SOF_TIMESTAMPING_TX_ACK is set via socket options for the
duration that ACK timestamps are needed.

To implement per-write timestamps, this check should be
removed and replaced with a per-packet alternative that
quickly skips packets missing ACK timestamps marks without
a cache-line miss.

To enable per-packet marking without a cache line miss, use
one bit in TCP_SKB_CB to mark a whether a SKB might need a
ack tx timestamp or not. Further checks in tcp_ack_tstamp are not
modified and work as before.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: accept SOF_TIMESTAMPING_OPT_ID for passive TFO

SOF_TIMESTAMPING_OPT_ID is set to get data-independent IDs
to associate timestamps with send calls. For TCP connections,
tp->snd_una is used as the starting point to calculate
relative IDs.

This socket option will fail if set before the handshake on a
passive TCP fast open connection with data in SYN or SYN/ACK,
since setsockopt requires the connection to be in the
ESTABLISHED state.

To address these, instead of limiting the option to the
ESTABLISHED state, accept the SOF_TIMESTAMPING_OPT_ID option as
long as the connection is not in LISTEN or CLOSE states.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sock: break up sock_cmsg_snd into __sock_cmsg_snd and loop

To process cmsg's of the SOL_SOCKET level in addition to
cmsgs of another level, protocols can call sock_cmsg_send().
This causes a double walk on the cmsghdr list, one for SOL_SOCKET
and one for the other level.

Extract the inner demultiplex logic from the loop that walks the list,
to allow having this called directly from a walker in the protocol
specific code.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: on recv increment rx.ring->stats.yields

It seem to be non intentionally changed to Tx in
commit adc810900a70 ("ixgbe: Refactor busy poll socket code to address
multiple issues")

Lock is taken from ixgbe_low_latency_recv, and there under this
lock we use ixgbe_clean_rx_irq so it looks wrong for me to increment
Tx counter.

Yield stats can be shown through ethtool:
ethtool -S enp129s0 | grep yield

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

Merge branch 'stmmac-GMAC4.x'

Alexandre TORGUE says:

====================
Enhance stmmac driver to support GMAC4.x IP

This is a subset of patch to enhance current stmmac driver to support
new GMAC4.x chips. New set of callbacks is defined to support this new
family: descriptors, dma, core.

One of main changes of GMAC 4.xx IP is descriptors management.
-descriptors are only used in ring mode.
-A descriptor is composed of 4 32bits registers (no more extended
  descriptors)
-descriptor mechanism (Tx for example, but it is exactly the same for RX):
-useful registers:
  -DMA_CH#_TxDesc_Ring_Len: length of transmit descriptor ring
  -DMA_CH#_TxDesc_List_Address: start address of the ring
  -DMA_CH#_TxDesc_Tail_Pointer: address of the last descriptor to send + 1.
  -DMA_CH#_TxDesc_Current_App_TxDesc: address of the current descriptor

-The descriptor Tail Pointer register contains the pointer to the
  descriptor address (N). The base address and the current
  descriptor decide the address of the current descriptor that the
  DMA can process. The descriptors up to one location less than the
  one indicated by the descriptor tail pointer (N-1) are owned by
  the DMA. The DMA continues to process the descriptors until the
  following condition occurs:
  "current descriptor pointer == Descriptor Tail pointer"

  Then the DMA goes into suspend mode. The application must perform
  a write to descriptor tail pointer register and update the tail
  pointer to have the following condition and to start a new transfer:
  "current descriptor pointer < Descriptor tail pointer"

  The DMA automatically wraps around the base address when the end
  of ring is reached.

New features are available on IP:
-TSO (TCP Segmentation Offload) for TX only
-Split header: to have header and payload in 2 different buffers (not yet implemented)

Below some throughput figures obtained on some boxes:

                        iperf (mbps)
--------------------------------------
                       tcp     udp
                    tx   rx   tx  rx
                     -----------------
    GMAC4.x         935  930  750 800

Note: There is a change in 4.10a databook on bitfield mapping of DMA_CHANx_INTR_ENA register.
This requires to have é diffrent set of callbacks between IP 4.00a and 4.10a.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: update MAINTAINERS

Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: update version to Jan_2016

This patch just updates the driver to the version fully
tested on STi platforms. This version is Jan_2016.

Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Documentation: networking: update stmmac

Update stmmac driver documentation according to new GMAC 4.x family.

Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: support new GMAC4

This patch adds the whole GMAC4 support inside the
stmmac d.d. now able to use the new HW and some new features
i.e.: TSO.
It is missing the multi-queue and split Header support at this
stage.
This patch also updates the driver version and the stmmac.txt.

Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: add new DT platform entries for GMAC4

This is to support the snps,dwmac-4.00 and snps,dwmac-4.10a
and related features on the platform driver.
See binding doc for further details.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: enhance mmc counter management

For gmac3, the MMC addr map is: 0x100 - 0x2fc
For gmac4, the MMC addr map is: 0x700 - 0x8fc

So instead of adding 0x600 to the IO address when setup the mmc,
the RMON base address is saved inside the private structure and
then used to manage the counters.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: add GMAC4 core support

This is the initial support for GMAC4 that includes
the main callbacks to setup the core module: including
Csum, basic filtering, mac address and interrupt (MMC,
MTL, PMT) No LPI added.

Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: add DMA support for GMAC 4.xx

DMA behavior is linked to descriptor management:

-descriptor mechanism (Tx for example, but it is exactly the same for RX):
-useful registers:
-DMA_CH#_TxDesc_Ring_Len: length of transmit descriptor ring
-DMA_CH#_TxDesc_List_Address: start address of the ring
-DMA_CH#_TxDesc_Tail_Pointer: address of the last
descriptor to send + 1.
-DMA_CH#_TxDesc_Current_App_TxDesc: address of the current
descriptor

-The descriptor Tail Pointer register contains the pointer to the
descriptor address (N). The base address and the current
descriptor decide the address of the current descriptor that the
DMA can process. The descriptors up to one location less than the
one indicated by the descriptor tail pointer (N-1) are owned by
the DMA. The DMA continues to process the descriptors until the
following condition occurs:
"current descriptor pointer == Descriptor Tail pointer"

Then the DMA goes into suspend mode. The application must perform
a write to descriptor tail pointer register and update the tail
pointer to have the following condition and to start a new transfer:
"current descriptor pointer < Descriptor tail pointer"

The DMA automatically wraps around the base address when the end
of ring is reached.

Up to 8 DMA could be use but currently we only use one (channel0)

Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: add GMAC4 DMA/CORE Header File

This is the main header file to define all the
macro used for GMAC4 DMA and CORE parts.

Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: add descriptors function for GMAC 4.xx

One of main changes of GMAC 4.xx IP is descriptors management.
-descriptors are only used in ring mode.
-A descriptor is composed of 4 32bits registers (no more extended
descriptors)
-descriptor mechanism (Tx for example, but it is exactly the same for RX):
-useful registers:
-DMA_CH#_TxDesc_Ring_Len: length of transmit descriptor
   ring
-DMA_CH#_TxDesc_List_Address: start address of the ring
-DMA_CH#_TxDesc_Tail_Pointer: address of the last
      descriptor to send + 1.
-DMA_CH#_TxDesc_Current_App_TxDesc: address of the current
    descriptor

-The descriptor Tail Pointer register contains the pointer to the
descriptor address (N). The base address and the current
descriptor decide the address of the current descriptor that the
DMA can process. The descriptors up to one location less than the
one indicated by the descriptor tail pointer (N-1) are owned by
the DMA. The DMA continues to process the descriptors until the
following condition occurs:
"current descriptor pointer == Descriptor Tail pointer"

  Then the DMA goes into suspend mode. The application must perform
  a write to descriptor tail pointer register and update the tail
  pointer to have the following condition and to start a new
      transfer:
  "current descriptor pointer < Descriptor tail pointer"

  The DMA automatically wraps around the base address when the end
  of ring is reached.

-New features are available on IP:
-TSO (TCP Segmentation Offload) for TX only
-Split header: to have header and payload in 2 different buffers

Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: rework synopsys id read, moved to dwmac setup

synopsys_uid is only used once after setup, to get synopsys_id
by using shitf/mask operation. It's no longer used then.
So, remove this temporary variable and directly compute
synopsys_id from setup routine.

Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: Fabrice Gasnier <fabrice.gasnier@st.com>
Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: rework the routines to show the ring status

To avoid lot of check in stmmac_main for display ring management
and support the GMAC4 chip, the display_ring function is moved
into dedicated descriptor file.

Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: rework get_hw_feature function

On next GMAC IP generation (4.xx), the way to get hw feature
is not the same than on previous 3.xx. As it is hardware
dependent, the way to get hw capabilities should be defined in dma ops of
each MAC IP. It will avoid also a huge computation of hw capabilities in
stmmac_main.

Signed-off-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: hns: add support of pause frame ctrl for HNS V2

The patch adds support of pause ctrl for HNS V2, and this feature is lost
by HNS V1:
1) service ports can disable rx pause frame,
2) debug ports can open tx/rx pause frame.

And this patch updates the REGs about the pause ctrl when updated
status function called by upper layer routine.

Signed-off-by: Lisheng <lisheng011@huawei.com>
Signed-off-by: Yisen Zhuang <Yisen.Zhuang@huawei.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netlink: use nla_get_in_addr and nla_put_in_addr for ipv4 address

Since nla_get_in_addr and nla_put_in_addr were implemented,
so use them appropriately.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: remove cwnd moderation after recovery

For non-SACK connections, cwnd is lowered to inflight plus 3 packets
when the recovery ends. This is an optional feature in the NewReno
RFC 2582 to reduce the potential burst when cwnd is "re-opened"
after recovery and inflight is low.

This feature is questionably effective because of PRR: when
the recovery ends (i.e., snd_una == high_seq) NewReno holds the
CA_Recovery state for another round trip to prevent false fast
retransmits. But if the inflight is low, PRR will overwrite the
moderated cwnd in tcp_cwnd_reduction() later regardlessly. So if a
receiver responds bogus ACKs (i.e., acking future data) to speed up
transfer after recovery, it can only induce a burst up to a window
worth of data packets by acking up to SND.NXT. A restart from (short)
idle or receiving streched ACKs can both cause such bursts as well.

On the other hand, if the recovery ends because the sender
detects the losses were spurious (e.g., reordering). This feature
unconditionally lowers a reverted cwnd even though nothing
was lost.

By principle loss recovery module should not update cwnd. Further
pacing is much more effective to reduce burst. Hence this patch
removes the cwnd moderation feature.

v2 changes: revised commit message on bogus ACKs and burst, and
missing signature

Signed-off-by: Matt Mathis <mattmathis@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Missing device reference in IPSEC input path results in crashes
    during device unregistration.  From Subash Abhinov Kasiviswanathan.

2) Per-queue ISR register writes not being done properly in macb
    driver, from Cyrille Pitchen.

3) Stats accounting bugs in bcmgenet, from Patri Gynther.

4) Lightweight tunnel's TTL and TOS were swapped in netlink dumps, from
    Quentin Armitage.

5) SXGBE driver has off-by-one in probe error paths, from Rasmus
    Villemoes.

6) Fix race in save/swap/delete options in netfilter ipset, from
    Vishwanath Pai.

7) Ageing time of bridge not set properly when not operating over a
    switchdev device.  Fix from Haishuang Yan.

8) Fix GRO regression wrt nested FOU/GUE based tunnels, from Alexander
    Duyck.

9) IPV6 UDP code bumps wrong stats, from Eric Dumazet.

10) FEC driver should only access registers that actually exist on the
    given chipset, fix from Fabio Estevam.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (73 commits)
  net: mvneta: fix changing MTU when using per-cpu processing
  stmmac: fix MDIO settings
  Revert "stmmac: Fix 'eth0: No PHY found' regression"
  stmmac: fix TX normal DESC
  net: mvneta: use cache_line_size() to get cacheline size
  net: mvpp2: use cache_line_size() to get cacheline size
  net: mvpp2: fix maybe-uninitialized warning
  tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter
  net: usb: cdc_ncm: adding Telit LE910 V2 mobile broadband card
  rtnl: fix msg size calculation in if_nlmsg_size()
  fec: Do not access unexisting register in Coldfire
  net: mvneta: replace MVNETA_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
  net: mvpp2: replace MVPP2_CPU_D_CACHE_LINE_SIZE with L1_CACHE_BYTES
  net: dsa: mv88e6xxx: Clear the PDOWN bit on setup
  net: dsa: mv88e6xxx: Introduce _mv88e6xxx_phy_page_{read, write}
  bpf: make padding in bpf_tunnel_key explicit
  ipv6: udp: fix UDP_MIB_IGNOREDMULTI updates
  bnxt_en: Fix ethtool -a reporting.
  bnxt_en: Fix typo in bnxt_hwrm_set_pause_common().
  bnxt_en: Implement proper firmware message padding.
  ...

Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk fixes from Stephen Boyd:
"A handful of const updates for reset ops and a couple fixes to the
  newly introduced IPQ4019 clock driver"

* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
  clk: qcom: ipq4019: add some fixed clocks for ddrppl and fepll
  clk: qcom: ipq4019: switch remaining defines to enums
  clk: qcom: Make reset_control_ops const
  clk: tegra: Make reset_control_ops const
  clk: sunxi: Make reset_control_ops const
  clk: atlas7: Make reset_control_ops const
  clk: rockchip: Make reset_control_ops const
  clk: mmp: Make reset_control_ops const
  clk: mediatek: Make reset_control_ops const

Merge tag 'pm+acpi-4.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management and ACPI fix from Rafael J. Wysocki:
"Just one fix for a nasty boot failure on some systems based on Intel
  Skylake that shipped with broken firmware where enabling
  hardware-coordinated P-states management (HWP) causes a faulty
  interrupt handler in SMM to be invoked and crash the system (Srinivas
  Pandruvada)"

* tag 'pm+acpi-4.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI / processor: Request native thermal interrupt handling via _OSC

Merge branch 'akpm' (patches from Andrew)

Merge fixes from Andrew Morton:
"11 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  .mailmap: add Christophe Ricard
  Make CONFIG_FHANDLE default y
  mm/page_isolation.c: fix the function comments
  oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head
  mm/page_isolation: fix tracepoint to mirror check function behavior
  mm/rmap: batched invalidations should use existing api
  x86/mm: TLB_REMOTE_SEND_IPI should count pages
  mm: fix invalid node in alloc_migrate_target()
  include/linux/huge_mm.h: return NULL instead of false for pmd_trans_huge_lock()
  mm, kasan: fix compilation for CONFIG_SLAB
  MAINTAINERS: orangefs mailing list is subscribers-only

Merge branch 'acpi-processor'

* acpi-processor:
ACPI / processor: Request native thermal interrupt handling via _OSC

Merge branch 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs

Pull btrfs fixes from Chris Mason:
"This has a few fixes Dave Sterba had queued up.  These are all pretty
  small, but since they were tested I decided against waiting for more"

* 'for-linus-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: transaction_kthread() is not freezable
  btrfs: cleaner_kthread() doesn't need explicit freeze
  btrfs: do not write corrupted metadata blocks to disk
  btrfs: csum_tree_block: return proper errno value

Merge tag 'for-linus' of git://github.com/martinbrandenburg/linux

Pull OrangeFS fixes from Martin Brandenburg:
"Two bugfixes for OrangeFS.

  One is a reference counting bug and the other is a typo in client
  minimum version"

* tag 'for-linus' of git://github.com/martinbrandenburg/linux:
  orangefs: minimum userspace version is 2.9.3
  orangefs: don't put readdir slot twice

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fixes from Will Deacon:

- fix oops when patching in alternative sequences on big-endian CPUs

- reconcile asm/perf_event.h after merge window fallout with KVM ARM

- defconfig updates

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
  arm64: defconfig: updates for 4.6
  arm64: perf: Move PMU register related defines to asm/perf_event.h
  arm64: opcodes.h: Add arm big-endian config options before including arm header

Merge tag 'sound-4.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"A collection of small fixes:

   - a fix in ALSA timer core to avoid possible BUG() trigger
   - a fix in ALSA timer core 32bit compat layer
   - a few HD-audio quirks for ASUS and HP machines
   - AMD HD-audio HDMI controller quirks
   - fixes of USB-audio double-free at some error paths
   - a fix for memory leak in DICE driver at hotunplug"

* tag 'sound-4.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: timer: Use mod_timer() for rearming the system timer
  ALSA: hda - fix front mic problem for a HP desktop
  ALSA: usb-audio: Fix double-free in error paths after snd_usb_add_audio_stream() call
  ALSA: hda: add AMD Polaris-10/11 AZ PCI IDs with proper driver caps
  ALSA: dice: fix memory leak when unplugging
  ALSA: hda - Apply fix for white noise on Asus N550JV, too
  ALSA: hda - Fix white noise on Asus N750JV headphone
  ALSA: hda - Asus N750JV external subwoofer fixup
  ALSA: timer: fix gparams ioctl compatibility for different architectures

.mailmap: add Christophe Ricard

Different computers had different settings in the mail client. Some
contributions appear as Christophe Ricard, others as Christophe RICARD.

Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Make CONFIG_FHANDLE default y

Newer Fedora and OpenSUSE didn't boot with my standard configuration.
It took me some time to figure out why, in fact I had to write a script
to try different config options systematically.

The problem is that something (systemd) in dracut depends on
CONFIG_FHANDLE, which adds open by file handle syscalls.

While it is set in defconfigs it is very easy to miss when updating
older configs because it is not default y.

Make it default y and also depend on EXPERT, as dracut use is likely
widespread.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Richard Weinberger <richard.weinberger@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/page_isolation.c: fix the function comments

Commit fea85cff11de ("mm/page_isolation.c: return last tested pfn rather
than failure indicator") changed the meaning of the return value. Let's
change the function comments as well.

Signed-off-by: Neil Zhang <neilzhang1123@hotmail.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head

Commit bb29902a7515 ("oom, oom_reaper: protect oom_reaper_list using
simpler way") has simplified the check for tasks already enqueued for
the oom reaper by checking tsk->oom_reaper_list != NULL. This check is
not sufficient because the tsk might be the head of the queue without
any other tasks queued and then we would simply lockup looping on the
same task. Fix the condition by checking for the head as well.

Fixes: bb29902a7515 ("oom, oom_reaper: protect oom_reaper_list using simpler way")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/page_isolation: fix tracepoint to mirror check function behavior

Page isolation has not failed if the fin pfn extends beyond the end pfn
and test_pages_isolated checks this correctly. Fix the tracepoint to
report the same result as the actual check function.

Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm/rmap: batched invalidations should use existing api

The recently introduced batched invalidations mechanism uses its own
mechanism for shootdown. However, it does wrong accounting of
interrupts (e.g., inc_irq_stat is called for local invalidations),
trace-points (e.g., TLB_REMOTE_SHOOTDOWN for local invalidations) and
may break some platforms as it bypasses the invalidation mechanisms of
Xen and SGI UV.

This patch reuses the existing TLB flushing mechnaisms instead. We use
NULL as mm to indicate a global invalidation is required.

Fixes 72b252aed506b8 ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

x86/mm: TLB_REMOTE_SEND_IPI should count pages

TLB_REMOTE_SEND_IPI was recently introduced, but it counts bytes instead
of pages. In addition, it does not report correctly the case in which
flush_tlb_page flushes a page. Fix it to be consistent with other TLB
counters.

Fixes: 5b74283ab251b9d ("x86, mm: trace when an IPI is about to be sent")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: fix invalid node in alloc_migrate_target()

It is incorrect to use next_node to find a target node, it will return
MAX_NUMNODES or invalid node. This will lead to crash in buddy system
allocation.

Fixes: c8721bbbdd36 ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Laura Abbott" <lauraa@codeaurora.org>
Cc: Hui Zhu <zhuhui@xiaomi.com>
Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

include/linux/huge_mm.h: return NULL instead of false for pmd_trans_huge_lock()

The return value of pmd_trans_huge_lock() is a pointer, not a boolean
value, so use NULL instead of false as the return value.

Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm, kasan: fix compilation for CONFIG_SLAB

Add the missing argument to set_track().

Fixes: cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Konstantin Serebryany <kcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

MAINTAINERS: orangefs mailing list is subscribers-only

So update MAINTAINERS to say so.

Signed-off-by: Joe Perches <joe@perches.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

net: mvneta: fix changing MTU when using per-cpu processing

After enabling per-cpu processing it appeared that under heavy load
changing MTU can result in blocking all port's interrupts and
transmitting data is not possible after the change.

This commit fixes above issue by disabling percpu interrupts for the
time, when TXQs and RXQs are reconfigured.

Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'stmmac-fixes'

Giuseppe Cavallaro says:

====================
stmmac MDIO and normal descr fixes

This patch series is to fix the problems below and recently debugged
in this mailing list:

o to fix a problem for the HW where the normal descriptor
o to fix the mdio registration according to the different
platform configurations

I am resending all the patches again: built on top of net.git repo.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: fix MDIO settings

Initially the phy_bus_name was added to manipulate the
driver name but it was recently just used to manage the
fixed-link and then to take some decision at run-time.
So the patch uses the is_pseudo_fixed_link and removes
the phy_bus_name variable not necessary anymore.

The driver can manage the mdio registration by using phy-handle,
dwmac-mdio and own parameter e.g. snps,phy-addr.
This patch takes care about all these possible configurations
and fixes the mdio registration in case of there is a real
transceiver or a switch (that needs to be managed by using
fixed-link).

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Reviewed-by: Andreas Färber <afaerber@suse.de>
Tested-by: Frank Schäfer <fschaefer.oss@googlemail.com>
Cc: Gabriel Fernandez <gabriel.fernandez@linaro.org>
Cc: Dinh Nguyen <dinh.linux@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Phil Reid <preid@electromag.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "stmmac: Fix 'eth0: No PHY found' regression"

This reverts commit 88f8b1bb41c6208f81b6a480244533ded7b59493.
due to problems on GeekBox and Banana Pi M1 board when
connected to a real transceiver instead of a switch via
fixed-link.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Gabriel Fernandez <gabriel.fernandez@linaro.org>
Cc: Andreas Färber <afaerber@suse.de>
Cc: Frank Schäfer <fschaefer.oss@googlemail.com>
Cc: Dinh Nguyen <dinh.linux@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: fix TX normal DESC

This patch fixs a regression raised when test on chips that use
the normal descriptor layout. In fact, no len bits were set for
the TDES1 and no OWN bit inside the TDES0.

Signed-off-by: Giuseppe CAVALLARO <peppe.cavallaro@st.com>
Tested-by: Andreas Färber <afaerber@suse.de>
Cc: Fabrice Gasnier <fabrice.gasnier@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvneta: use cache_line_size() to get cacheline size

L1_CACHE_BYTES may not be the real cacheline size, use cache_line_size
to determine the cacheline size in runtime.

Signed-off-by: Jisheng Zhang <jszhang@marvell.com>
Suggested-by: Marcin Wojtas <mw@semihalf.com>
Signed-off-by: David S. Miller <davem@davemloft.net>