git.karo-electronics.de Git - karo-tx-linux.git/log

net: reduce cycles spend on ICMP replies that gets rate limited

This patch split the global and per (inet)peer ICMP-reply limiter
code, and moves the global limit check to earlier in the packet
processing path.  Thus, avoid spending cycles on ICMP replies that
gets limited/suppressed anyhow.

The global ICMP rate limiter icmp_global_allow() is a good solution,
it just happens too late in the process.  The kernel goes through the
full route lookup (return path) for the ICMP message, before taking
the rate limit decision of not sending the ICMP reply.

Details: The kernels global rate limiter for ICMP messages got added
in commit 4cdf507d5452 ("icmp: add a global rate limitation").  It is
a token bucket limiter with a global lock.  It brilliantly avoids
locking congestion by only updating when 20ms (HZ/50) were elapsed. It
can then avoids taking lock when credit is exhausted (when under
pressure) and time constraint for refill is not yet meet.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "icmp: avoid allocating large struct on stack"

This reverts commit 9a99d4a50cb8 ("icmp: avoid allocating large struct
on stack"), because struct icmp_bxm no really a large struct, and
allocating and free of this small 112 bytes hurts performance.

Fixes: 9a99d4a50cb8 ("icmp: avoid allocating large struct on stack")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'rxrpc-rewrite-20170109' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
afs: Refcount afs_call struct

These patches provide some tracepoints for AFS and fix a potential leak by
adding refcounting to the afs_call struct.

The patches are:

(1) Add some tracepoints for logging incoming calls and monitoring
     notifications from AF_RXRPC and data reception.

(2) Get rid of afs_wait_mode as it didn't turn out to be as useful as
     initially expected.  It can be brought back later if needed.  This
     clears some stuff out that I don't then need to fix up in (4).

(3) Allow listen(..., 0) to be used to disable listening.  This makes
     shutting down the AFS cache manager server in the kernel much easier
     and the accounting simpler as we can then be sure that (a) all
     preallocated afs_call structs are relesed and (b) no new incoming
     calls are going to be started.

     For the moment, listening cannot be reenabled.

(4) Add refcounting to the afs_call struct to fix a potential multiple
     release detected by static checking and add a tracepoint to follow the
     lifecycle of afs_call objects.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'dsa_swqitch_ops-const'

Florian Fainelli says:

====================
net: dsa: Make dsa_switch_ops const

This patch series allows us to annotate dsa_switch_ops with a const
qualifier.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Make dsa_switch_ops const

Now that we have properly encapsulated and made drivers utilize exported
functions, we can switch dsa_switch_ops to be a annotated with const.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: Encapsulate legacy switch drivers into dsa_switch_driver

In preparation for making struct dsa_switch_ops const, encapsulate it
within a dsa_switch_driver which has a list pointer and a pointer to
dsa_switch_ops. This allows us to take the list_head pointer out of
dsa_switch_ops, which is written to by {un,}register_switch_driver.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: bcm_sf2: Declare our own dsa_switch_ops

Utilize the b53 exported functions to fill our bcm_sf2_ops structure,
also making it clear what we utilize and what we specifically override.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: b53: Export most operations to other drivers

In preparation for making dsa_switch_ops const, export b53 operations
utilized by other drivers such as bcm_sf2.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'sh_eth-csum'

Sergei Shtylyov says:

====================
sh_eth: "intgelligent checksum" related cleanups

Here's a set of 2 patches against DaveM's 'net.git' repo, as they are based
on a couple patches merged there recently; however, the patches are destined
for 'net-next.git' (once 'net.git' gets merged there next time). I'm cleaning
up the "intelligent checksum" related code (however, the driver only disables
this feature for now, theres's no proper offload supprt yet).
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

sh_eth: rename 'sh_eth_cpu_data::hw_crc'

The 'struct sh_eth_cpu_data' field indicating the "intelligent checksum"
support was misnamed 'hw_crc' -- rename it to 'hw_checksum'.

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

sh_eth: get rid of 'sh_eth_cpu_data::shift_rd0'

After checking all  the available manuals,  I have enough information to
conclude  that the 'shift_rd0' flag is only relevant  for the Ether cores
supporting so called "intelligent checksum" (and hence having CSMR) which
is indicated  by the 'hw_crc' flag.  Since  all the relevant SoCs now have
both these flags set, we can  at last  get  rid of the former flag...

Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

phy state machine: failsafe leave invalid RUNNING state

While in RUNNING state, phy_state_machine() checks for link changes by
comparing phydev->link before and after calling phy_read_status().
This works as long as it is guaranteed that phydev->link is never
changed outside the phy_state_machine().

If in some setups this happens, it causes the state machine to miss
a link loss and remain RUNNING despite phydev->link being 0.

This has been observed running a dsa setup with a process continuously
polling the link states over ethtool each second (SNMPD RFC-1213
agent). Disconnecting the link on a phy followed by a ETHTOOL_GSET
causes dsa_slave_get_settings() / dsa_slave_get_link_ksettings() to
call phy_read_status() and with that modify the link status - and
with that bricking the phy state machine.

This patch adds a fail-safe check while in RUNNING, which causes to
move to CHANGELINK when the link is gone and we are still RUNNING.

Signed-off-by: Zefir Kurtisi <zefir.kurtisi@neratec.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Fix dumping of nft_quota entries, from Pablo Neira Ayuso.

2) Fix out of bounds access in nf_tables discovered by KASAN, from
    Florian Westphal.

3) Fix IRQ enabling in dp83867 driver, from Grygorii Strashko.

4) Fix unicast filtering in be2net driver, from Ivan Vecera.

5) tg3_get_stats64() can race with driver close and ethtool
    reconfigurations, fix from Michael Chan.

6) Fix error handling when pass limit is reached in bpf code gen on
    x86. From Daniel Borkmann.

7) Don't clobber switch ops and use proper MDIO nested reads and writes
    in bcm_sf2 driver, from Florian Fainelli.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (21 commits)
  net: dsa: bcm_sf2: Utilize nested MDIO read/write
  net: dsa: bcm_sf2: Do not clobber b53_switch_ops
  net: stmmac: fix maxmtu assignment to be within valid range
  bpf: change back to orig prog on too many passes
  tg3: Fix race condition in tg3_get_stats64().
  be2net: fix unicast list filling
  be2net: fix accesses to unicast list
  netlabel: add CALIPSO to the list of built-in protocols
  vti6: fix device register to report IFLA_INFO_KIND
  net: phy: dp83867: fix irq generation
  amd-xgbe: Fix IRQ processing when running in single IRQ mode
  sh_eth: R8A7740 supports packet shecksumming
  sh_eth: fix EESIPR values for SH77{34|63}
  r8169: fix the typo in the comment
  nl80211: fix sched scan netlink socket owner destruction
  bridge: netfilter: Fix dropping packets that moving through bridge interface
  netfilter: ipt_CLUSTERIP: check duplicate config when initializing
  netfilter: nft_payload: mangle ckecksum if NFT_PAYLOAD_L4CSUM_PSEUDOHDR is set
  netfilter: nf_tables: fix oob access
  netfilter: nft_queue: use raw_smp_processor_id()
  ...

Merge branch 'dwmac-dwc-qos-eth'

Joao Pinto says:

====================
adding new glue driver dwmac-dwc-qos-eth

This patch set contains the porting of the synopsys/dwc_eth_qos.c driver
to the stmmac structure. This operation resulted in the creation of a new
platform glue driver called dwmac-dwc-qos-eth which was based in the
dwc_eth_qos as is.

dwmac-dwc-qos-eth inherited dwc_eth_qos DT bindings, to assure that current
and old users can continue to use it as before. We can see this driver as
being deprecated, since all new development will be done in stmmac.

Please check each patch for implementation details.
====================

Tested-by: Niklas Cassel <niklas.cassel@axis.com>
Reviewed-by: Lars Persson <larper@axis.com>
Acked-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: adding new glue driver dwmac-dwc-qos-eth

This patch adds a new glue driver called dwmac-dwc-qos-eth which
was based in the dwc_eth_qos as is. To assure retro-compatibility a slight
tweak was also added to stmmac_platform.

Signed-off-by: Joao Pinto <jpinto@synopsys.com>
Tested-by: Niklas Cassel <niklas.cassel@axis.com>
Reviewed-by: Lars Persson <larper@axis.com>
Acked-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: move stmmac_clk, pclk, clk_ptp_ref and stmmac_rst to platform structure

This patch moves stmmac_clk, pclk, clk_ptp_ref and stmmac_rst to the
plat_stmmacenet_data structure. It also moves these platform variables
initialization to stmmac_platform. This was done for two reasons:

a) If PCI is used, platform related code is being executed in stmmac_main
resulting in warnings that have no sense and conceptually was not right

b) stmmac as a synopsys reference ethernet driver stack will be hosting
more and more drivers to its structure like synopsys/dwc_eth_qos.c.
These drivers have their own DT bindings that are not compatible with
stmmac's. One of the most important are the clock names, and so they need
to be parsed in the glue logic and initialized there, and that is the main
reason why the clocks were passed to the platform structure.

Signed-off-by: Joao Pinto <jpinto@synopsys.com>
Tested-by: Niklas Cassel <niklas.cassel@axis.com>
Reviewed-by: Lars Persson <larper@axis.com>
Acked-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: adding DT parameter for LPI tx clock gating

This patch adds a new parameter to the stmmac DT: snps,en-tx-lpi-clockgating.
It was ported from synopsys/dwc_eth_qos.c and it is useful if lpi tx clock
gating is needed by stmmac users also.

Signed-off-by: Joao Pinto <jpinto@synopsys.com>
Tested-by: Niklas Cassel <niklas.cassel@axis.com>
Reviewed-by: Lars Persson <larper@axis.com>
Acked-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

alx: add feature flag for rx checksumming

The code to handle rx checksumming was in the driver since its introduction
but for reasons unknown the feature flag was left out. Now it is possible
to enable this feature with ethtool.

Tested on my AR8161 ethernet card, there are no regressions observed in
netperf if this feature is enabled.

Signed-off-by: Tobias Regnery <tobias.regnery@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'act_csum-sctp'

Davide Caratti says:

====================
net/sched: act_csum: add support for SCTP checksum

This series extends current act_csum functionality to allow computation of
SCTP checksums. Patch 1 ensures LIBCRC32C will be selected if NET_ACT_CSUM
is selected. Patch 2 extends act_csum to handle IPPROTO_SCTP protocol in
IPv4/IPv6 header, and eventually compute the CRC32c value.

v2:
- style fix in tc_csum.h
- avoid nested if statement in act_csum.c
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: act_csum: compute crc32c on SCTP packets

modify act_csum to compute crc32c on IPv4/IPv6 packets having SCTP in
their payload, and extend UAPI definitions accordingly.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: Kconfig: select LIBCRC32C if NET_ACT_CSUM is selected

LIBCRC32C is needed to compute crc32c on SCTP packets.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-small-driver-update'

Jiri Pirko says:

====================
mlxsw: small driver update

This patchset contains various small "non-net" fixes and enhancements.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Change ENOTSUPP to EOPNOTSUPP

As ENOTSUPP is specific to NFS, change the return error value to
EOPNOTSUPP in various places in the mlxsw driver.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Fix order of commands in port remove function

Fix the order of the free directives to match the port init function

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum: Make the add_matchall_tc_entry symmetric

Currently, the mlxsw spectrum driver only supports offloading the matchall
classifier together with the mirred action. To allow more matchall tc
offloads, make the code symmetric so that it can be easily extended later
on for other actions.

Signed-off-by: Yotam Gigi <yotamg@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: cmd: Fix API name comments for event-queues

Probably some copy-paste error from "int_msix" that caused "int_" prefix to
appear in the comments for all "eq_" APIs.

Signed-off-by: Elad Raz <eladr@mellanox.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: Fix mlxsw_i2c_write return value

The "err" variable is been checked, return always 0.

Signed-off-by: Elad Raz <eladr@mellanox.com>
Acked-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Vadim Pasternak <vadimp@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpsw: extend limits for cpsw_get/set_ringparam

Allow to set number of descs close to possible values. In case of
minimum limit it's equal to number of channels to be able to set
at least one desc per channel. For maximum limit leave enough descs
number for tx channels.

Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

cls_u32: don't bother explicitly initializing ->divisor to zero

This struct member is already initialized to zero upon root_ht's
allocation via kzalloc().

Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'siphash'

Jason A. Donenfeld says:

====================
Introduce The SipHash PRF

This patch series introduces SipHash into the kernel. SipHash is a
cryptographically secure PRF, which serves a variety of functions, and is
introduced in patch #1. The following patch #2 introduces HalfSipHash,
an optimization suitable for hash tables only. Finally, the last two patches
in this series show two usages of the introduced siphash function family.
It is expected that after this initial introduction, other usages will follow.

Please read the extensive descriptions in patch #1 and patch #2 of what these
functions do and the various levels of assurances. They're products of intense
cryptographic research, and I believe they're suitable for the uses outlined
herein.

The use of SipHash is not limited to the networking subsystem -- indeed I
would like to use it in other places too in the kernel. But after discussing
with a few on this list and at Linus' suggestion, the initial import of these
functions is coming through the networking tree. After these are merged, it
will then be easier to expand use elsewhere.

Changes v2->v3:
  - hsiphash keys now simply use an unsigned long, in order to avoid
    a cluttered ifdef and make it a bit more clear what's happening.
  - A typo in the documentation has been fixed.
  - The documentation has been augmented with an example relating to struct
    packing and passing.
  - The net_secret variable is now __read_mostly.

Hopefully this is the last of the required revisions, and v3 can be merged
into net-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

syncookies: use SipHash in place of SHA1

SHA1 is slower and less secure than SipHash, and so replacing syncookie
generation with SipHash makes natural sense. Some BSDs have been doing
this for several years in fact.

The speedup should be similar -- and even more impressive -- to the
speedup from the sequence number fix in this series.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

secure_seq: use SipHash in place of MD5

This gives a clear speed and security improvement. Siphash is both
faster and is more solid crypto than the aging MD5.

Rather than manually filling MD5 buffers, for IPv6, we simply create
a layout by a simple anonymous struct, for which gcc generates
rather efficient code. For IPv4, we pass the values directly to the
short input convenience functions.

64-bit x86_64:
[    1.683628] secure_tcpv6_sequence_number_md5# cycles: 99563527
[    1.717350] secure_tcp_sequence_number_md5# cycles: 92890502
[    1.741968] secure_tcpv6_sequence_number_siphash# cycles: 67825362
[    1.762048] secure_tcp_sequence_number_siphash# cycles: 67485526

32-bit x86:
[    1.600012] secure_tcpv6_sequence_number_md5# cycles: 103227892
[    1.634219] secure_tcp_sequence_number_md5# cycles: 94732544
[    1.669102] secure_tcpv6_sequence_number_siphash# cycles: 96299384
[    1.700165] secure_tcp_sequence_number_siphash# cycles: 86015473

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: David Laight <David.Laight@aculab.com>
Cc: Tom Herbert <tom@herbertland.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

siphash: implement HalfSipHash1-3 for hash tables

HalfSipHash, or hsiphash, is a shortened version of SipHash, which
generates 32-bit outputs using a weaker 64-bit key. It has *much* lower
security margins, and shouldn't be used for anything too sensitive, but
it could be used as a hashtable key function replacement, if the output
is never exposed, and if the security requirement is not too high.

The goal is to make this something that performance-critical jhash users
would be willing to use.

On 64-bit machines, HalfSipHash1-3 is slower than SipHash1-3, so we alias
SipHash1-3 to HalfSipHash1-3 on those systems.

64-bit x86_64:
[    0.509409] test_siphash:     SipHash2-4 cycles: 4049181
[    0.510650] test_siphash:     SipHash1-3 cycles: 2512884
[    0.512205] test_siphash: HalfSipHash1-3 cycles: 3429920
[    0.512904] test_siphash:    JenkinsHash cycles:  978267
So, we map hsiphash() -> SipHash1-3

32-bit x86:
[    0.509868] test_siphash:     SipHash2-4 cycles: 14812892
[    0.513601] test_siphash:     SipHash1-3 cycles:  9510710
[    0.515263] test_siphash: HalfSipHash1-3 cycles:  3856157
[    0.515952] test_siphash:    JenkinsHash cycles:  1148567
So, we map hsiphash() -> HalfSipHash1-3

hsiphash() is roughly 3 times slower than jhash(), but comes with a
considerable security improvement.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

siphash: add cryptographically secure PRF

SipHash is a 64-bit keyed hash function that is actually a
cryptographically secure PRF, like HMAC. Except SipHash is super fast,
and is meant to be used as a hashtable keyed lookup function, or as a
general PRF for short input use cases, such as sequence numbers or RNG
chaining.

For the first usage:

There are a variety of attacks known as "hashtable poisoning" in which an
attacker forms some data such that the hash of that data will be the
same, and then preceeds to fill up all entries of a hashbucket. This is
a realistic and well-known denial-of-service vector. Currently
hashtables use jhash, which is fast but not secure, and some kind of
rotating key scheme (or none at all, which isn't good). SipHash is meant
as a replacement for jhash in these cases.

There are a modicum of places in the kernel that are vulnerable to
hashtable poisoning attacks, either via userspace vectors or network
vectors, and there's not a reliable mechanism inside the kernel at the
moment to fix it. The first step toward fixing these issues is actually
getting a secure primitive into the kernel for developers to use. Then
we can, bit by bit, port things over to it as deemed appropriate.

While SipHash is extremely fast for a cryptographically secure function,
it is likely a bit slower than the insecure jhash, and so replacements
will be evaluated on a case-by-case basis based on whether or not the
difference in speed is negligible and whether or not the current jhash usage
poses a real security risk.

For the second usage:

A few places in the kernel are using MD5 or SHA1 for creating secure
sequence numbers, syn cookies, port numbers, or fast random numbers.
SipHash is a faster and more fitting, and more secure replacement for MD5
in those situations. Replacing MD5 and SHA1 with SipHash for these uses is
obvious and straight-forward, and so is submitted along with this patch
series. There shouldn't be much of a debate over its efficacy.

Dozens of languages are already using this internally for their hash
tables and PRFs. Some of the BSDs already use this in their kernels.
SipHash is a widely known high-speed solution to a widely known set of
problems, and it's time we catch-up.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: Jean-Philippe Aumasson <jeanphilippe.aumasson@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Eric Biggers <ebiggers3@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipv4: remove disable of bottom half in inet_rtm_getroute

Nothing about the route lookup requires bottom half to be disabled.
Remove the local_bh_disable ... local_bh_enable around ip_route_input.
This appears to be a vestige of days gone by as it has been there
since the beginning of git time.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: intel: e100: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ibm: ibmvnic: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ibm: ibmveth: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ibm: emac: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ibm: ehea: use new api ethtool_{get|set}_link_ksettings

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes <tremyfr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: change init_inodecache() return void

sock_init() call it but not check it's return value,
so change it to void return and add an internal BUG_ON() check.

Signed-off-by: yuan linyu <Linyu.Yuan@alcatel-sbell.com.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>

afs: Refcount the afs_call struct

A static checker warning occurs in the AFS filesystem:

fs/afs/cmservice.c:155 SRXAFSCB_CallBack()
error: dereferencing freed memory 'call'

due to the reply being sent before we access the server it points to.  The
act of sending the reply causes the call to be freed if an error occurs
(but not if it doesn't).

On top of this, the lifetime handling of afs_call structs is fragile
because they get passed around through workqueues without any sort of
refcounting.

Deal with the issues by:

(1) Fix the maybe/maybe not nature of the reply sending functions with
     regards to whether they release the call struct.

(2) Refcount the afs_call struct and sort out places that need to get/put
     references.

(3) Pass a ref through the work queue and release (or pass on) that ref in
     the work function.  Care has to be taken because a work queue may
     already own a ref to the call.

(4) Do the cleaning up in the put function only.

(5) Simplify module cleanup by always incrementing afs_outstanding_calls
     whenever a call is allocated.

(6) Set the backlog to 0 with kernel_listen() at the beginning of the
     process of closing the socket to prevent new incoming calls from
     occurring and to remove the contribution of preallocated calls from
     afs_outstanding_calls before we wait on it.

A tracepoint is also added to monitor the afs_call refcount and lifetime.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Fixes: 08e0e7c82eea: "[AF_RXRPC]: Make the in-kernel AFS filesystem use AF_RXRPC."

rxrpc: Allow listen(sock, 0) to be used to disable listening

Allow listen() with a backlog of 0 to be used to disable listening on an
AF_RXRPC socket. This also releases any preallocation, thereby making it
easier for a kernel service to account for all allocated call structures
when shutting down the service.

The socket cannot thereafter have listening reenabled, but must rather be
closed and reopened.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Kill afs_wait_mode

The afs_wait_mode struct isn't really necessary. Client calls only use one
of a choice of two (synchronous or the asynchronous) and incoming calls
don't use the wait at all. Replace with a boolean parameter.

Signed-off-by: David Howells <dhowells@redhat.com>

afs: Add some tracepoints

Add three tracepoints to the AFS filesystem:

(1) The afs_recv_data tracepoint logs data segments that are extracted
     from the data received from the peer through afs_extract_data().

(2) The afs_notify_call tracepoint logs notification from AF_RXRPC of data
     coming in to an asynchronous call.

(3) The afs_cb_call tracepoint logs incoming calls that have had their
     operation ID extracted and mapped into a supported cache manager
     service call.

To make (3) work, the name strings in the afs_call_type struct objects have
to be annotated with __tracepoint_string.  This is done with the CM_NAME()
macro.

Further, the AFS call state enum needs a name so that it can be used to
declare parameter types.

Signed-off-by: David Howells <dhowells@redhat.com>

Merge branch 'bcm_sf2-fixes'

Florian Fainelli says:

====================
net: dsa: bcm_sf2: Couple fixes

Here are a couple of fixes for bcm_sf2, please queue these up for -stable
as well, thank you very much!
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: bcm_sf2: Utilize nested MDIO read/write

We are implementing a MDIO bus which is behind another one, so use the
nested version of the accessors to get lockdep annotations correct.

Fixes: 461cd1b03e32 ("net: dsa: bcm_sf2: Register our slave MDIO bus")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: bcm_sf2: Do not clobber b53_switch_ops

We make the bcm_sf2 driver override ds->ops which points to
b53_switch_ops since b53_switch_alloc() did the assignent. This is all
well and good until a second b53 switch comes in, and ends up using the
bcm_sf2 operations. Make a proper local copy, substitute the ds->ops
pointer and then override the operations.

Fixes: f458995b9ad8 ("net: dsa: bcm_sf2: Utilize core B53 driver when possible")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tc-skb-diet'

Willem de Bruijn says:

====================
convert tc_verd to integer bitfields

The skb tc_verd field takes up two bytes but uses far fewer bits.
Convert the remaining use cases to bitfields that fit in existing
holes (depending on config options) and potentially save the two
bytes in struct sk_buff.

This patchset is based on an earlier set by Florian Westphal and its
discussion (http://www.spinics.net/lists/netdev/msg329181.html).

Patches 1 and 2 are low hanging fruit: removing the last traces of
  data that are no longer stored in tc_verd.

Patches 3 and 4 convert tc_verd to individual bitfields (5 bits).

Patch 5 reduces TC_AT to a single bitfield,
  as AT_STACK is not valid here (unlike in the case of TC_FROM).

Patch 6 changes TC_FROM to two bitfields with clearly defined purpose.

It may be possible to reduce storage further after this initial round.
If tc_skip_classify is set only by IFB, testing skb_iif may suffice.
The L2 header pushing/popping logic can perhaps be shared with
AF_PACKET, which currently not pkt_type for the same purpose.

Changes:
  RFC -> v1
    - (patch 3): remove no longer needed label in tfc_action_exec
    - (patch 5): set tc_at_ingress at the same points as existing
                 SET_TC_AT calls

Tested ingress mirred + netem + ifb:

  ip link set dev ifb0 up
  tc qdisc add dev eth0 ingress
  tc filter add dev eth0 parent ffff: \
    u32 match ip dport 8000 0xffff \
    action mirred egress redirect dev ifb0
  tc qdisc add dev ifb0 root netem delay 1000ms
  nc -u -l 8000 &
  ssh $otherhost nc -u $host 8000

Tested egress mirred:

  ip link add veth1 type veth peer name veth2
  ip link set dev veth1 up
  ip link set dev veth2 up
  tcpdump -n -i veth2 udp and dst port 8000 &

  tc qdisc add dev eth0 root handle 1: prio
  tc filter add dev eth0 parent 1:0 \
    u32 match ip dport 8000 0xffff \
    action mirred egress redirect dev veth1
  tc qdisc add dev veth1 root netem delay 1000ms
  nc -u $otherhost 8000

Tested ingress mirred:

  ip link add veth1 type veth peer name veth2
  ip link add veth3 type veth peer name veth4

  ip netns add ns0
  ip netns add ns1

  for i in 1 2 3 4; do \
    NS=ns$((${i}%2)); \
    ip link set dev veth${i} netns ${NS}; \
    ip netns exec ${NS} \
      ip addr add dev veth${i} 192.168.1.${i}/24; \
    ip netns exec ${NS} \
      ip link set dev veth${i} up; \
  done

  ip netns exec ns0 tc qdisc add dev veth2 ingress
  ip netns exec ns0 \
    tc filter add dev veth2 parent ffff: \
      u32 match ip dport 8000 0xffff \
      action mirred ingress redirect dev veth4

  ip netns exec ns0 \
    tcpdump -n -i veth4 udp and dst port 8000 &
  ip netns exec ns1 \
    nc -u 192.168.1.2 8000
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net-tc: convert tc_from to tc_from_ingress and tc_redirected

The tc_from field fulfills two roles. It encodes whether a packet was
redirected by an act_mirred device and, if so, whether act_mirred was
called on ingress or egress. Split it into separate fields.

The information is needed by the special IFB loop, where packets are
taken out of the normal path by act_mirred, forwarded to IFB, then
reinjected at their original location (ingress or egress) by IFB.

The IFB device cannot use skb->tc_at_ingress, because that may have
been overwritten as the packet travels from act_mirred to ifb_xmit,
when it passes through tc_classify on the IFB egress path. Cache this
value in skb->tc_from_ingress.

That field is valid only if a packet arriving at ifb_xmit came from
act_mirred. Other packets can be crafted to reach ifb_xmit. These
must be dropped. Set tc_redirected on redirection and drop all packets
that do not have this bit set.

Both fields are set only on cloned skbs in tc actions, so original
packet sources do not have to clear the bit when reusing packets
(notably, pktgen and octeon).

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-tc: convert tc_at to tc_at_ingress

Field tc_at is used only within tc actions to distinguish ingress from
egress processing. A single bit is sufficient for this purpose.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-tc: convert tc_verd to integer bitfields

Extract the remaining two fields from tc_verd and remove the __u16
completely. TC_AT and TC_FROM are converted to equivalent two-bit
integer fields tc_at and tc_from. Where possible, use existing
helper skb_at_tc_ingress when reading tc_at. Introduce helper
skb_reset_tc to clear fields.

Not documenting tc_from and tc_at, because they will be replaced
with single bit fields in follow-on patches.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-tc: extract skip classify bit from tc_verd

Packets sent by the IFB device skip subsequent tc classification.
A single bit governs this state. Move it out of tc_verd in
anticipation of removing that __u16 completely.

The new bitfield tc_skip_classify temporarily uses one bit of a
hole, until tc_verd is removed completely in a follow-up patch.

Remove the bit hole comment. It could be 2, 3, 4 or 5 bits long.
With that many options, little value in documenting it.

Introduce a helper function to deduplicate the logic in the two
sites that check this bit.

The field tc_skip_classify is set only in IFB on skbs cloned in
act_mirred, so original packet sources do not have to clear the
bit when reusing packets (notably, pktgen and octeon).

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-tc: make MAX_RECLASSIFY_LOOP local

This field is no longer kept in tc_verd. Remove it from the global
definition of that struct.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-tc: remove unused tc_verd fields

Remove the last reference to tc_verd's munge and redirect ttl bits.
These fields are no longer used.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: fix maxmtu assignment to be within valid range

There is no checking valid value of maxmtu when getting it from
device tree. This resolution added the checking condition to
ensure the assignment is made within a valid range.

Signed-off-by: Kweh, Hock Leong <hock.leong.kweh@intel.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mdio: Demote print from info to debug in mdio_device_register

While it is useful to know which MDIO device is being registered, demote
the dev_info() to a dev_dbg().

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: remove useless memset's in drivers get_stats64

In dev_get_stats() the statistic structure storage has already been
zeroed. Therefore network drivers do not need to call memset() again.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: make ndo_get_stats64 a void function

The network device operation for reading statistics is only called
in one place, and it ignores the return value. Having a structure
return value is potentially confusing because some future driver could
incorrectly assume that the return value was used.

Fix all drivers with ndo_get_stats64 to have a void function.

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue

Jeff Kirsher says:

====================
100GbE Intel Wired LAN Driver Updates 2017-01-08

This series contains updates to fm10k only.

Ngai-Mint changes the driver to use the MAC pointer in the fm10k_mac_info
structure for fm10k_get_host_state_generic().  Fixed a race condition
where the mailbox interrupt request bits can be cleared before being
handled causing certain mailbox messages from the PF to be untreated
and the PF will enter in some inactive state.

Jake removes the typecast of u8 to char, and the extra variable that was
created for the typecast.  Bumps the driver version.  Added back the
receive descriptor timestamp value so that applications built on top
of the IES API can function properly.  Cleaned up the debug statistics
flag, since debug statistics were removed and the flag was missed in
the removal.

Scott limits the DMA sync for CPU to the actual length of the packet,
instead of the entire buffer, since the DMA sync occurs every time a
packet is received.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Linux 4.10-rc3

net: ipv4: Remove flow arg from ip_mkroute_input

fl4 arg is not used; remove it.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipmr: Remove nowait arg to ipmr_get_route

ipmr_get_route has 1 caller and the nowait arg is 0. Remove the arg and
simplify ipmr_get_route accordingly.

Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

liquidio: simplify octeon_flush_iq()

Because every call to octeon_flush_iq() has a hardcoded 1 for the
pending_thresh argument, simplify that function by removing that argument.
This avoids one atomic read as well.

Signed-off-by: Derek Chickles <derek.chickles@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
Signed-off-by: Satanand Burla <satananda.burla@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bpf: change back to orig prog on too many passes

If after too many passes still no image could be emitted, then
swap back to the original program as we do in all other cases
and don't use the one with blinding.

Fixes: 959a75791603 ("bpf, x86: add support for constant blinding")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'usb-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

Pull USB fixes from Greg KH:
"Here are a bunch of USB fixes for 4.10-rc3. Yeah, it's a lot, an
  artifact of the holiday break I think.

  Lots of gadget and the usual XHCI fixups for reported issues (one day
  that driver will calm down...) Also included are a bunch of usb-serial
  driver fixes, and for good measure, a number of much-reported MUSB
  driver issues have finally been resolved.

  All of these have been in linux-next with no reported issues"

* tag 'usb-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (72 commits)
  USB: fix problems with duplicate endpoint addresses
  usb: ohci-at91: use descriptor-based gpio APIs correctly
  usb: storage: unusual_uas: Add JMicron JMS56x to unusual device
  usb: hub: Move hub_port_disable() to fix warning if PM is disabled
  usb: musb: blackfin: add bfin_fifo_offset in bfin_ops
  usb: musb: fix compilation warning on unused function
  usb: musb: Fix trying to free already-free IRQ 4
  usb: musb: dsps: implement clear_ep_rxintr() callback
  usb: musb: core: add clear_ep_rxintr() to musb_platform_ops
  USB: serial: ti_usb_3410_5052: fix NULL-deref at open
  USB: serial: spcp8x5: fix NULL-deref at open
  USB: serial: quatech2: fix sleep-while-atomic in close
  USB: serial: pl2303: fix NULL-deref at open
  USB: serial: oti6858: fix NULL-deref at open
  USB: serial: omninet: fix NULL-derefs at open and disconnect
  USB: serial: mos7840: fix misleading interrupt-URB comment
  USB: serial: mos7840: remove unused write URB
  USB: serial: mos7840: fix NULL-deref at open
  USB: serial: mos7720: remove obsolete port initialisation
  USB: serial: mos7720: fix parallel probe
  ...

Merge tag 'char-misc-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Pull char/misc fixes from Greg KH:
"Here are a few small char/misc driver fixes for 4.10-rc3.

  Two MEI driver fixes, and three NVMEM patches for reported issues, and
  a new Hyper-V driver MAINTAINER update. Nothing major at all, all have
  been in linux-next with no reported issues"

* tag 'char-misc-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  hyper-v: Add myself as additional MAINTAINER
  nvmem: fix nvmem_cell_read() return type doc
  nvmem: imx-ocotp: Fix wrong register size
  nvmem: qfprom: Allow single byte accesses for read/write
  mei: move write cb to completion on credentials failures
  mei: bus: fix mei_cldev_enable KDoc

Merge tag 'staging-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging/IIO fixes from Greg KH:
"Here are some staging and IIO driver fixes for 4.10-rc3.

  Most of these are minor IIO fixes of reported issues, along with one
  network driver fix to resolve an issue. And a MAINTAINERS update with
  a new mailing list. All of these, except the MAINTAINERS file update,
  have been in linux-next with no reported issues (the MAINTAINERS patch
  happened on Friday...)"

* tag 'staging-4.10-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  MAINTAINERS: add greybus subsystem mailing list
  staging: octeon: Call SET_NETDEV_DEV()
  iio: accel: st_accel: fix LIS3LV02 reading and scaling
  iio: common: st_sensors: fix channel data parsing
  iio: max44000: correct value in illuminance_integration_time_available
  iio: adc: TI_AM335X_ADC should depend on HAS_DMA
  iio: bmi160: Fix time needed to sleep after command execution
  iio: 104-quad-8: Fix active level mismatch for the preset enable option
  iio: 104-quad-8: Fix off-by-one errors when addressing IOR
  iio: 104-quad-8: Fix index control configuration

fm10k: remove FM10K_FLAG_DEBUG_STATS

The debug statistics were removed due to complications with the ethtool
statistics API which are not possible to resolve without a new
statistics interface. The flag was left behind, but we no longer need
it.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: report the receive timestamp in FM10K_CB(skb)->tstamp

This was accidentally removed when we defeatured the full 1588 Clock
support. We need to report the Rx descriptor timestamp value so that
applications built on top of the IES API can function properly.

Additionally, remove the FM10K_FLAG_RX_TS_ENABLED, as it is not used now
that 1588 functionality has been removed.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: Limit dma sync of RX buffers to actual packet size

On packet RX, we perform a dma sync for cpu before passing the
packet up. Here we limit that sync to the actual length of the
incoming packet, rather than always syncing the entire buffer.

Signed-off-by: Scott Peterson <scott.d.peterson@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: bump version number

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: do not clear global mailbox interrupt bits

Partially revert commit 5e93cbadd3e9 ("fm10k: Reset mailbox global
interrupts", 2016-06-07)

The register bits related to this commit are now solely being handled by
the IES API. Recent changes in the IES API will allow an automatic
recovery from improper handling of these bits.

Signed-off-by: Ngai-Mint Kwan <ngai-mint.kwan@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: request reset when mbx->state changes

Multiple IES API resets can cause a race condition where the mailbox
interrupt request bits can be cleared before being handled. This can
leave certain mailbox messages from the PF to be untreated and the PF
will enter in some inactive state. If this situation occurs, the IES API
will initiate a mailbox version reset which, then, trigger a mailbox
state change. Once this mailbox transition occurs (from OPEN to CONNECT
state), a request for reset will be returned.

This ensures that PF will undergo a reset whenever IES API encounters an
unknown global mailbox interrupt event or whenever the IES API
terminates.

Signed-off-by: Ngai-Mint Kwan <ngai-mint.kwan@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k: remove extraneous variable definition in fm10k_ethtool.c

We don't need to typecast a u8 * into a char *, so just remove the extra
variable.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

fm10k-shared: use mac-> instead of hw->mac.

Since a pointer "mac" to fm10k_mac_info structure exists, use it to
access the contents of its members.

Signed-off-by: Ngai-Mint Kwan <ngai-mint.kwan@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

net: dsa: move HWMON support to its own file

Isolate the HWMON support in DSA in its own file. Currently only the
legacy DSA code is concerned.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tg3: Fix race condition in tg3_get_stats64().

The driver's ndo_get_stats64() method is not always called under RTNL.
So it can race with driver close or ethtool reconfigurations. Fix the
race condition by taking tp->lock spinlock in tg3_free_consistent()
when freeing the tp->hw_stats memory block. tg3_get_stats64() is
already taking tp->lock.

Reported-by: Wang Yufen <wangyufen@huawei.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be2net: fix unicast list filling

The adapter->pmac_id[0] item is used for primary MAC address but
this is not true for adapter->uc_list[0] as is assumed in
be_set_uc_list(). There are N UC addresses copied first from net_device
to adapter->uc_list[1..N] and then N UC addresses from
adapter->uc_list[0..N-1] are sent to HW. So the last UC address is never
stored into HW and address 00:00:00:00;00:00 (from uc_list[0]) is used
instead.

Cc: Sathya Perla <sathya.perla@broadcom.com>
Cc: Ajit Khaparde <ajit.khaparde@broadcom.com>
Cc: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Cc: Somnath Kotur <somnath.kotur@broadcom.com>
Fixes: b717241 be2net: replace polling with sleeping in the FW completion path
Signed-off-by: Ivan Vecera <cera@cera.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>

mm: workingset: fix use-after-free in shadow node shrinker

Several people report seeing warnings about inconsistent radix tree
nodes followed by crashes in the workingset code, which all looked like
use-after-free access from the shadow node shrinker.

Dave Jones managed to reproduce the issue with a debug patch applied,
which confirmed that the radix tree shrinking indeed frees shadow nodes
while they are still linked to the shadow LRU:

  WARNING: CPU: 2 PID: 53 at lib/radix-tree.c:643 delete_node+0x1e4/0x200
  CPU: 2 PID: 53 Comm: kswapd0 Not tainted 4.10.0-rc2-think+ #3
  Call Trace:
     delete_node+0x1e4/0x200
     __radix_tree_delete_node+0xd/0x10
     shadow_lru_isolate+0xe6/0x220
     __list_lru_walk_one.isra.4+0x9b/0x190
     list_lru_walk_one+0x23/0x30
     scan_shadow_nodes+0x2e/0x40
     shrink_slab.part.44+0x23d/0x5d0
     shrink_node+0x22c/0x330
     kswapd+0x392/0x8f0

This is the WARN_ON_ONCE(!list_empty(&node->private_list)) placed in the
inlined radix_tree_shrink().

The problem is with 14b468791fa9 ("mm: workingset: move shadow entry
tracking to radix tree exceptional tracking"), which passes an update
callback into the radix tree to link and unlink shadow leaf nodes when
tree entries change, but forgot to pass the callback when reclaiming a
shadow node.

While the reclaimed shadow node itself is unlinked by the shrinker, its
deletion from the tree can cause the left-most leaf node in the tree to
be shrunk.  If that happens to be a shadow node as well, we don't unlink
it from the LRU as we should.

Consider this tree, where the s are shadow entries:

       root->rnode
            |
       [0       n]
        |       |
     [s    ] [sssss]

Now the shadow node shrinker reclaims the rightmost leaf node through
the shadow node LRU:

       root->rnode
            |
       [0        ]
        |
    [s     ]

Because the parent of the deleted node is the first level below the
root and has only one child in the left-most slot, the intermediate
level is shrunk and the node containing the single shadow is put in
its place:

       root->rnode
            |
       [s        ]

The shrinker again sees a single left-most slot in a first level node
and thus decides to store the shadow in root->rnode directly and free
the node - which is a leaf node on the shadow node LRU.

  root->rnode
       |
       s

Without the update callback, the freed node remains on the shadow LRU,
where it causes later shrinker runs to crash.

Pass the node updater callback into __radix_tree_delete_node() in case
the deletion causes the left-most branch in the tree to collapse too.

Also add warnings when linked nodes are freed right away, rather than
wait for the use-after-free when the list is scanned much later.

Fixes: 14b468791fa9 ("mm: workingset: move shadow entry tracking to radix tree exceptional tracking")
Reported-by: Dave Chinner <david@fromorbit.com>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-and-tested-by: Dave Jones <davej@codemonkey.org.uk>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Leech <cleech@redhat.com>
Cc: Lee Duncan <lduncan@suse.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'netcp-next'

Murali Karicheri says:

====================
netcp: enhancements and minor fixes

This series is for net-next. This propagates enhancements and minor
bug fixes from internal version of the driver to keep the upstream
in sync. Please review and apply if this looks good.

Tested on all of K2HK/E/L boards with nfs rootfs.
Test logs below
K2HK-EVM: http://pastebin.ubuntu.com/23754106/
k2L-EVM: http://pastebin.ubuntu.com/23754143/
K2E-EVM: http://pastebin.ubuntu.com/23754159/

History:
  v1 - dropped 1/10 amd 2/10 of v0 based on comments from Rob as
       it needs more work before submission
  v0 - Initial version
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: ale: add proper ale entry mask bits for netcp switch ALE

For NetCP NU Switch ALE, some of the mask bits are different than
defaults used in the driver. Add a new macro DEFINE_ALE_FIELD1 that use
a configurable mask bits and use it in the driver. These bits are set to
correct values by using the new variables added to cpsw_ale structure
and re-used in the macros. The parameter nu_switch_ale is configured by
the caller driver to indicate the ALE is for that switch and is used in
the ALE driver to do customization as needed.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: ale: use ale_status to size the ale table

ALE h/w on newer version of NetCP (K2E/L/G) does provide a ALE_STATUS
register for the size of the ALE Table implemented in h/w. Currently
for example we set ALE Table size to 1024 for NetCP ALE on
K2E even though the ALE Status/Documentation shows it has 8192 entries.
So take advantage of this register to read the size of ALE table supported
and use that value in the driver for the newer version of NetCP ALE.
For NetCP lite, ALE Table size is much less (64) and indicated by a size
of zero in ALE_STATUS. So use that as a default for now. While at it,
also fix the ale table size on 10G switch to 2048 per User guide
http://www.ti.com/lit/ug/spruhj5/spruhj5.pdf

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: ale: update to support unknown vlan controls for NU switch

In NU Ethernet switch used on some of the Keystone SoCs, there is
separate UNKNOWNVLAN register for membership, unreg mcast flood, reg
mcast flood and force untag egress bits in ALE. So control for these
fields require different address offset, shift and size of field.
As this ALE has the same version number as ALE in CPSW found on other
SoCs, customization based on version number is not possible. So
use a configuration parameter, nu_switch_ale, to identify the ALE
ALE found in NU Switch. Different treatment is needed for NU Switch
ALE due to difference in the ale table bits, separate unknown vlan
registers etc. The register information available in ale_controls,
needs to be updated to support the netcp NU switch h/w. So it is not
constant array any more since it needs to be updated based
on ALE type. The header of the file is also updated to indicate it
supports N port switch ALE, not just 3 port. The version mask is
3 bits in NU Switch ALE vs 8 bits on other ALE types.

While at it, change the debug print to info print so that ALE
version gets displayed in boot log.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: use hw capability to remove FCS word from rx packets

Some of the newer Ethernet switch hw (such as that on k2e/l/g) can
strip the Etherenet FCS from packet at the port 0 egress of the switch.
So use this capability instead of doing it in software.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: ethss: get phy-handle only if link interface is MAC-to-PHY

Currently to parse phy-handle, driver doesn't check if the interface is
MAC to PHY. This patch add this check for all MAC to PHY interface types
supported by the driver.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: store network statistics in 64 bits

Previously the network statistics were stored in 32 bit variable
which can cause some stats to roll over after several minutes of
high traffic. This implements 64 bit storage so larger numbers
can be stored.

Signed-off-by: Michael Scherban <m-scherban@ti.com>
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: remove the redundant memmov()

The psdata is populated with command data by netcp modules
to the tail of the buffer and set_words() copy the same
to the front of the psdata. So remove the redundant memmov
function call.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: netcp: extract eflag from desc for rx_hook handling

Extract the eflag bits from the received desc and pass it down
the rx_hook chain to be available for netcp modules. Also the
psdata and epib data has to be inspected by the netcp modules.
So the desc can be freed only after returning from the rx_hook.
So move knav_pool_desc_put() after the rx_hook processing.

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mm: stop leaking PageTables

4.10-rc loadtest (even on x86, and even without THPCache) fails with
"fork: Cannot allocate memory" or some such; and /proc/meminfo shows
PageTables growing.

Commit 953c66c2b22a ("mm: THP page cache support for ppc64") that got
merged in rc1 removed the freeing of an unused preallocated pagetable
after do_fault_around() has called map_pages().

This is usually a good optimization, so that the followup doesn't have
to reallocate one; but it's not sufficient to shift the freeing into
alloc_set_pte(), since there are failure cases (most commonly
VM_FAULT_RETRY) which never reach finish_fault().

Check and free it at the outer level in do_fault(), then we don't need
to worry in alloc_set_pte(), and can restore that to how it was (I
cannot find any reason to pte_free() under lock as it was doing).

And fix a separate pagetable leak, or crash, introduced by the same
change, that could only show up on some ppc64: why does do_set_pmd()'s
failure case attempt to withdraw a pagetable when it never deposited
one, at the same time overwriting (so leaking) the vmf->prealloc_pte?
Residue of an earlier implementation, perhaps? Delete it.

Fixes: 953c66c2b22a ("mm: THP page cache support for ppc64")
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'cpsw-cpdma-DDR'

Grygorii Strashko says:

====================
net: ethernet: ti: cpsw: support placing CPDMA descriptors into DDR

This series intended to add support for placing CPDMA descriptors into
DDR by introducing new module parameter "descs_pool_size" to specify
size of descriptor's pool. The "descs_pool_size" defines total number
of CPDMA CPPI descriptors to be used for both ingress/egress packets
processing. If not specified - the default value 256 will be used
which will allow to place descriptor's pool into the internal CPPI
RAM.

In addition, added ability to re-split CPDMA pool of descriptors
between RX and TX path via ethtool '-G' command wich will allow to
configure and fix number of descriptors used by RX and TX path, which,
then, will be split between RX/TX channels proportionally depending on
number of RX/TX channels and its weight.

This allows significantly to reduce UDP packets drop rate for
bandwidth >301 Mbits/sec (am57x).

Before enabling this feature, the am437x SoC has to be fixed as it's
proved that it's not working when CPDMA descriptors placed in DDR.
So, the patch 1 fixes this issue.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Documentation: DT: net: cpsw: remove no_bd_ram property

Even if no_bd_ram property is described in TI CPSW bindings the support for
it has never been introduced in CPSW driver, so there are no real users of
it. Hence, remove no_bd_ram property from documentation and DT files.

Cc: 'Rob Herring <robh+dt@kernel.org>'
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpsw: add support for ringparam configuration

The CPDMA uses one pool of descriptors for both RX and TX which by default
split between all channels proportionally depending on total number of
CPDMA channels and number of TX and RX channels. As result, more
descriptors will be consumed by TX path if there are more TX channels and
there is no way now to dedicate more descriptors for RX path.

So, add the ability to re-split CPDMA pool of descriptors between RX and TX
path via ethtool '-G' command wich will allow to configure and fix number
of descriptors used by RX and TX path, which, then, will be split between
RX/TX channels proportionally depending on RX/TX channels number and
weight. ethtool '-G' command will accept only number of RX entries and rest
of descriptors will be arranged for TX automatically.

Command:
  ethtool -G <devname> rx <number of descriptors>

defaults and limitations:
- minimum number of rx descriptors is 10% of total number of descriptors in
  CPDMA pool
- maximum number of rx descriptors is 90% of total number of descriptors in
  CPDMA pool
- by default, descriptors will be split equally between RX/TX path
- any values passed in "tx" parameter will be ignored

Usage:

# ethtool -g eth0
Pre-set maximums:
RX:             7372
RX Mini:        0
RX Jumbo:       0
TX:             0
Current hardware settings:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096

# ethtool -G eth0 rx 7372
# ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             7372
RX Mini:        0
RX Jumbo:       0
TX:             0
Current hardware settings:
RX:             7372
RX Mini:        0
RX Jumbo:       0
TX:             820

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpsw: add support for descs pool size configuration

The CPSW CPDMA can process buffer descriptors placed as in internal
CPPI RAM as in DDR. This patch adds support in CPSW and CPDMA for
descs_pool_size mudule parameter, which defines total number of CPDMA CPPI
descriptors to be used for both ingress/egress packets processing:
- memory size, required for CPDMA descriptor pool, is calculated basing
on number of descriptors specified by user in descs_pool_size and
CPDMA descriptor size and allocated from coherent memory (CMA area);
- CPDMA descriptor pool will be allocated in DDR if pool memory size >
internal CPPI RAM or use internal CPPI RAM otherwise;
- if descs_pool_size not specified in DT - the default value 256 will
be used which will allow to place CPDMA descriptors pool into the
internal CPPI RAM (current default behaviour);
- CPDMA will ignore descs_pool_size if descs_pool_size = 0 for
backward comaptiobility with davinci_emac.

descs_pool_size is boot time setting and can't be changed once
CPSW/CPDMA is initialized.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpdma: use devm_ioremap

Use devm_ioremap() and simplify the code.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpdma: minimize number of parameters in cpdma_desc_pool_create/destroy()

Update cpdma_desc_pool_create/destroy() to accept only one parameter
struct cpdma_ctlr*, as this structure contains all required
information for pool creation/destruction.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpdma: fix desc re-queuing

The currently processing cpdma descriptor with EOQ flag set may
contain two values in Next Descriptor Pointer field:
- valid pointer: means CPDMA missed addition of new desc in queue;
- null: no more descriptors in queue.
In the later case, it's not required to write to HDP register, but now
CPDMA does it.

Hence, add additional check for Next Descriptor Pointer != null in
cpdma_chan_process() function before writing in HDP register.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ethernet: ti: cpdma: am437x: allow descs to be plased in ddr

It's observed that cpsw/cpdma is not working properly when CPPI
descriptors are placed in DDR instead of internal CPPI RAM on am437x
SoC:
- rx/tx silently stops processing packets;
- or - after boot it's working for sometime, but stuck once Network
load is increased (ping is working, but iperf is not).
(The same issue has not been reproduced on am335x and am57xx).

It seems that write to HDP register processed faster by interconnect
than writing of descriptor memory buffer in DDR, which is probably
caused by store buffer / write buffer differences as these functions
are implemented differently across devices. So, to fix this i come up
with two minimal, required changes:

1) all accesses to the channel register HDP/CP/RXFREE registers should
be done using sync IO accessors readl()/writel(), because all previous
memory writes writes have to be completed before starting channel
(write to HDP) or completing desc processing.

2) the change 1 only doesn't work on am437x and additional reading of
desc's field is required right after the new descriptor was filled
with data and before pointer on it will be stored in
prev_desc->hw_next field or HDP register.

In addition, to above changes this patch eliminates all relaxed ordering
I/O accessors in this driver as suggested by David Miller to avoid such
kind of issues in the future, but with one exception - relaxed IO accessors
will still be used to fill desc in cpdma_chan_submit(), which is safe as
there is read barrier at the end of write sequence, and because sync IO
accessors usage here will affect on net performance.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild

Pull kbuild fix from Michal Marek:
"The asm-prototypes.h file added in the last merge window results in
  invalid code with CONFIG_KMEMCHECK=y. The net result is that genksyms
  segfaults.

  This pull request fixes the header, the genksyms fix is in my kbuild
  branch for 4.11"

* 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
  asm-prototypes: Clear any CPP defines before declaring the functions