Achiad Shochat [Tue, 29 Dec 2015 12:58:29 +0000 (14:58 +0200)]
net/mlx5e: Do not modify the TX SKB
If the SKB is cloned, or has an elevated users count, someone else
can be looking at it at the same time.
Signed-off-by: Achiad Shochat <achiad@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 5 Jan 2016 17:24:06 +0000 (12:24 -0500)]
Merge branch 'sctp-transport-rhashtable'
Xin Long says:
====================
sctp: use transport hashtable to replace association's with rhashtable
for telecom center, the usual case is that a server is connected by thousands
of clients. but if the server with only one enpoint(udp style) use the same
sport and dport to communicate with every clients, and every assoc in server
will be hashed in the same chain of global assoc hashtable due to currently we
choose dport and sport as the hash key.
when a packet is received, sctp_rcv try to find the assoc with sport and dport,
since that chain is too long to find it fast, it make the performance turn to
very low, some test data is as follow:
in server:
$./ss [start a udp style server there]
in client:
$./cc [start 2500 sockets to connect server with same port and different ip,
and use one of them to send data to server]
===== test on net-next
-- perf top
server:
55.73% [kernel] [k] sctp_assoc_is_match
6.80% [kernel] [k] sctp_assoc_lookup_paddr
4.81% [kernel] [k] sctp_v4_cmp_addr
3.12% [kernel] [k] _raw_spin_unlock_irqrestore
1.94% [kernel] [k] sctp_cmp_addr_exact
we need to change the way to calculate the hash key, to use lport +
rport + paddr as the hash key can avoid this issue.
besides, this patchset will use transport hashtable to replace
association hashtable to lookup with rhashtable api. get transport
first then get association by t->asoc. and also it will make tcp
style work better.
===== test with this patchset:
-- perf top
server:
15.98% [kernel] [k] _raw_spin_unlock_irqrestore
9.92% [kernel] [k] __pv_queued_spin_lock_slowpath
7.22% [kernel] [k] copy_user_generic_string
2.38% libpthread-2.17.so [.] __recvmsg_nocancel
1.88% [kernel] [k] sctp_recvmsg
Xin Long [Wed, 30 Dec 2015 15:50:50 +0000 (23:50 +0800)]
sctp: remove the local_bh_disable/enable in sctp_endpoint_lookup_assoc
sctp_endpoint_lookup_assoc is called in the protection of sock lock
there is no need to call local_bh_disable in this function. so remove
them.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Wed, 30 Dec 2015 15:50:49 +0000 (23:50 +0800)]
sctp: drop the old assoc hashtable of sctp
transport hashtable will replace the association hashtable,
so association hashtable is not used in sctp any more, so
drop the codes about that.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Wed, 30 Dec 2015 15:50:48 +0000 (23:50 +0800)]
sctp: apply rhashtable api to sctp procfs
Traversal the transport rhashtable, get the association only once through
the condition assoc->peer.primary_path != transport.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Wed, 30 Dec 2015 15:50:47 +0000 (23:50 +0800)]
sctp: apply rhashtable api to send/recv path
apply lookup apis to two functions, for __sctp_endpoint_lookup_assoc
and __sctp_lookup_association, it's invoked in the protection of sock
lock, it will be safe, but sctp_lookup_association need to call
rcu_read_lock() and to detect the t->dead to protect it.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Wed, 30 Dec 2015 15:50:46 +0000 (23:50 +0800)]
sctp: add the rhashtable apis for sctp global transport hashtable
tranport hashtbale will replace the association hashtable to do the
lookup for transport, and then get association by t->assoc, rhashtable
apis will be used because of it's resizable, scalable and using rcu.
lport + rport + paddr will be the base hashkey to locate the chain,
with net to protect one netns from another, then plus the laddr to
compare to get the target.
this patch will provider the lookup functions:
- sctp_epaddr_lookup_transport
- sctp_addrs_lookup_transport
Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 5 Jan 2016 03:49:59 +0000 (22:49 -0500)]
Merge branch 'faster-soreuseport'
Craig Gallek says:
====================
Faster SO_REUSEPORT
This series contains two optimizations for the SO_REUSEPORT feature:
Faster lookup when selecting a socket for an incoming packet and
the ability to select the socket from the group using a BPF program.
This series only includes the UDP path. I plan to submit a follow-up
including the TCP path if the implementation in this series is
acceptable.
Changes in v4:
- pskb_may_pull is unnecessary with pskb_pull (per Alexei Starovoitov)
Changes in v3:
- skb_pull_inline -> pskb_pull (per Alexei Starovoitov)
- reuseport_attach* -> sk_reuseport_attach* and simple return statement
syntax change (per Daniel Borkmann)
Changes in v2:
- Fix ARM build; remove unnecessary include.
- Handle case where protocol header is not in linear section (per
Alexei Starovoitov).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Mon, 4 Jan 2016 22:41:48 +0000 (17:41 -0500)]
soreuseport: BPF selection functional test
This program will build classic and extended BPF programs and
validate the socket selection logic when used with
SO_ATTACH_REUSEPORT_CBPF and SO_ATTACH_REUSEPORT_EBPF.
It also validates the re-programing flow and several edge cases.
Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Expose socket options for setting a classic or extended BPF program
for use when selecting sockets in an SO_REUSEPORT group. These options
can be used on the first socket to belong to a group before bind or
on any socket in the group after bind.
This change includes refactoring of the existing sk_filter code to
allow reuse of the existing BPF filter validation checks.
Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Mon, 4 Jan 2016 22:41:46 +0000 (17:41 -0500)]
soreuseport: fast reuseport UDP socket selection
Include a struct sock_reuseport instance when a UDP socket binds to
a specific address for the first time with the reuseport flag set.
When selecting a socket for an incoming UDP packet, use the information
available in sock_reuseport if present.
This required adding an additional field to the UDP source address
equality function to differentiate between exact and wildcard matches.
The original use case allowed wildcard matches when checking for
existing port uses during bind. The new use case of adding a socket
to a reuseport group requires exact address matching.
Performance test (using a machine with 2 CPU sockets and a total of
48 cores): Create reuseport groups of varying size. Use one socket
from this group per user thread (pinning each thread to a different
core) calling recvmmsg in a tight loop. Record number of messages
received per second while saturating a 10G link.
10 sockets: 18% increase (~2.8M -> 3.3M pkts/s)
20 sockets: 14% increase (~2.9M -> 3.3M pkts/s)
40 sockets: 13% increase (~3.0M -> 3.4M pkts/s)
This work is based off a similar implementation written by
Ying Cai <ycai@google.com> for implementing policy-based reuseport
selection.
Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Craig Gallek [Mon, 4 Jan 2016 22:41:45 +0000 (17:41 -0500)]
soreuseport: define reuseport groups
struct sock_reuseport is an optional shared structure referenced by each
socket belonging to a reuseport group. When a socket is bound to an
address/port not yet in use and the reuseport flag has been set, the
structure will be allocated and attached to the newly bound socket.
When subsequent calls to bind are made for the same address/port, the
shared structure will be updated to include the new socket and the
newly bound socket will reference the group structure.
Usually, when an incoming packet was destined for a reuseport group,
all sockets in the same group needed to be considered before a
dispatching decision was made. With this structure, an appropriate
socket can be found after looking up just one socket in the group.
This shared structure will also allow for more complicated decisions to
be made when selecting a socket (eg a BPF filter).
This work is based off a similar implementation written by
Ying Cai <ycai@google.com> for implementing policy-based reuseport
selection.
Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Mon, 4 Jan 2016 09:42:26 +0000 (10:42 +0100)]
mlxsw: spectrum: Change bridge port attributes only when bridged
Bridge port attributes are offloaded to hardware when invoked with SELF
flag set, but it really makes no sense to reflect them when port is not
bridged.
Allow a user to change these attribute only when port is bridged and
initialize them correctly when joining or leaving a bridge.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Mon, 4 Jan 2016 09:42:24 +0000 (10:42 +0100)]
mlxsw: spectrum: Return NOTIFY_BAD on bridge failure
It is possible for us to fail when joining or leaving a bridge, so let
the user know about that by returning NOTIFY_BAD, as already done for
LAG join/leave and 802.1D bridges.
Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Fri, 1 Jan 2016 13:55:24 +0000 (14:55 +0100)]
fsl/fman: allow modular build
ARM allmodconfig fails because of the addition of the FMAN driver:
drivers/built-in.o: In function `dtsec_restart_autoneg':
binder.c:(.text+0x173328): undefined reference to `mdiobus_read'
binder.c:(.text+0x173348): undefined reference to `mdiobus_write'
drivers/built-in.o: In function `dtsec_config':
binder.c:(.text+0x173d24): undefined reference to `of_phy_find_device'
drivers/built-in.o: In function `init_phy':
binder.c:(.text+0x1763b0): undefined reference to `of_phy_connect'
drivers/built-in.o: In function `stop':
binder.c:(.text+0x176014): undefined reference to `phy_stop'
drivers/built-in.o: In function `start':
binder.c:(.text+0x176078): undefined reference to `phy_start'
The reason is that the driver uses PHYLIB, but that is a loadable
module here, and fman itself is built-in.
This patch makes it possible to configure fman as a module as well
so we don't change the status of PHYLIB in an allmodconfig kernel,
and it adds a 'select PHYLIB' statement to ensure that phylib is
always built-in when fman is.
The driver uses "builtin_platform_driver(fman_driver);", which means
it cannot be unloaded, but it's still possible to have it as a loadable
module that gets loaded once and never removed.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 5adae51a64b8 ("fsl/fman: Add FMan MURAM support") Signed-off-by: David S. Miller <davem@davemloft.net>
Arnd Bergmann [Fri, 1 Jan 2016 12:18:48 +0000 (13:18 +0100)]
net: make ip6tunnel_xmit definition conditional
Moving the caller of iptunnel_xmit_stats causes a build error in
randconfig builds that disable CONFIG_INET:
In file included from ../net/xfrm/xfrm_input.c:17:0:
../include/net/ip6_tunnel.h: In function 'ip6tunnel_xmit':
../include/net/ip6_tunnel.h:93:2: error: implicit declaration of function 'iptunnel_xmit_stats' [-Werror=implicit-function-declaration]
iptunnel_xmit_stats(dev, pkt_len);
The reason is that the iptunnel_xmit_stats definition is hidden
inside #ifdef CONFIG_INET but the caller is not. We can change
one or the other to fix it, and this patch adds a second #ifdef
around ip6tunnel_xmit() to avoid seeing the invalid call.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()") Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 5 Jan 2016 02:48:15 +0000 (21:48 -0500)]
Merge tag 'nfc-next-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/nfc-next
Samuel Ortiz says:
====================
NFC 4.5 pull request
This is the first NFC pull request for 4.5 and it brings:
- A new driver for the STMicroelectronics ST95HF NFC chipset.
The ST95HF is an NFC digital transceiver with an embedded analog
front-end and as such relies on the Linux NFC digital
implementation. This is the 3rd user of the NFC digital stack.
- ACPI support for the ST st-nci and st21nfca drivers.
- A small improvement for the nfcsim driver, as we can now tune
the Rx delay through sysfs.
- A bunch of minor cleanups and small fixes from Christophe Ricard,
for a few drivers and the NFC core code.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 30 Dec 2015 13:51:12 +0000 (08:51 -0500)]
udp: properly support MSG_PEEK with truncated buffers
Backport of this upstream commit into stable kernels : 89c22d8c3b27 ("net: Fix skb csum races when peeking")
exposed a bug in udp stack vs MSG_PEEK support, when user provides
a buffer smaller than skb payload.
In this case,
skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr),
msg->msg_iov);
returns -EFAULT.
This bug does not happen in upstream kernels since Al Viro did a great
job to replace this into :
skb_copy_and_csum_datagram_msg(skb, sizeof(struct udphdr), msg);
This variant is safe vs short buffers.
For the time being, instead reverting Herbert Xu patch and add back
skb->ip_summed invalid changes, simply store the result of
udp_lib_checksum_complete() so that we avoid computing the checksum a
second time, and avoid the problematic
skb_copy_and_csum_datagram_iovec() call.
This patch can be applied on recent kernels as it avoids a double
checksumming, then backported to stable kernels as a bug fix.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Guillaume Nault [Tue, 29 Dec 2015 12:06:59 +0000 (13:06 +0100)]
l2tp: rely on ppp layer for skb scrubbing
Since 79c441ae505c ("ppp: implement x-netns support"), the PPP layer
calls skb_scrub_packet() whenever the skb is received on the PPP
device. Manually resetting packet meta-data in the L2TP layer is thus
redundant.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 4 Jan 2016 21:11:12 +0000 (16:11 -0500)]
Merge branch 'sh_eth-remove-BE-desc-support'
Sergei Shtylyov says:
====================
sh_eth: remove unused BE descriptor support
Here's a set of 2 patches against DaveM's 'net-next.git' repo plus the
recently merged to 'net.git' repo fix for the 16-bit descriptor endianness.
We get rid of ~30 LoCs and ~300 bytes of code.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Sun, 27 Dec 2015 23:10:47 +0000 (02:10 +0300)]
sh_eth: get rid of {cpu|edmac}_to_{edmac|cpu}()
Now that {cpu|edmac}_to_{edmac|cpu}() functions boiled down to the mere
{cpu|le32}_to_{le32|cpu}() calls, there's no need for these functions
anymore, so just get rid of them.
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Acked-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Sun, 27 Dec 2015 23:07:08 +0000 (02:07 +0300)]
sh_eth: remove EDMAC_BIG_ENDIAN
Commit 71557a37adb5 ("[netdrvr] sh_eth: Add SH7619 support") added support
for the big-endian EDMAC descriptors. However, it was never used and never
worked right until the recent driver fixes. I think we now can just remove
this support, it was only burdening the driver from the start. It should be
easy to do without disturbing the SH platform code, at least for now...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Acked-by: Simon Horman <horms+renesas@verge.net.au> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 4 Jan 2016 20:54:41 +0000 (15:54 -0500)]
Merge branch 'bnxt_en-combined-rx-tx-channels'
Michael Chan says:
====================
bnxt_en: Support combined and rx/tx channels.
The bnxt hardware uses a completion ring for rx and tx events. The driver
has to process the completion ring entries sequentially for the events.
The current code only supports an rx/tx ring pair for each completion ring.
This patch series add support for using a dedicated completion ring for
rx only or tx only as an option configuarble using ethtool -L.
The benefits for using dedicated completion rings are:
1. A burst of rx packets can cause delay in processing tx events if the
completion ring is shared. If tx queue is stopped by BQL, this can cause
delay in re-starting the tx queue.
2. A completion ring is sized according to the rx and tx ring size rounded
up to the nearest power of 2. When the completion ring is shared, it is
sized by adding the rx and tx ring sizes and then rounded to the next power
of 2, often with a lot of wasted space.
3. Using dedicated completion ring, we can adjust the tx and rx coalescing
parameters independently for rx and tx.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Sun, 3 Jan 2016 04:44:59 +0000 (23:44 -0500)]
bnxt_en: Separate bnxt_{rx|tx}_ring_info structs from bnxt_napi struct.
Currently, an rx and a tx ring are always paired with a completion ring.
We want to restructure it so that it is possible to have a dedicated
completion ring for tx or rx only.
The bnxt hardware uses a completion ring for rx and tx events. The driver
has to process the completion ring entries sequentially for the rx and tx
events. Using a dedicated completion ring for rx only or tx only has these
benefits:
1. A burst of rx packets can cause delay in processing tx events if the
completion ring is shared. If tx queue is stopped by BQL, this can cause
delay in re-starting the tx queue.
2. A completion ring is sized according to the rx and tx ring size rounded
up to the nearest power of 2. When the completion ring is shared, it is
sized by adding the rx and tx ring sizes and then rounded to the next power
of 2, often with a lot of wasted space.
3. Using dedicated completion ring, we can adjust the tx and rx coalescing
parameters independently for rx and tx.
The first step is to separate the rx and tx ring structures from the
bnxt_napi struct.
In this patch, an rx ring and a tx ring will point to the same bnxt_napi
struct to share the same completion ring. No change in ring assignment
and mapping yet.
Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Thu, 31 Dec 2015 22:59:21 +0000 (14:59 -0800)]
Merge tag 'pci-v4.4-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI bugfix from Bjorn Helgaas:
"Here's another fix for v4.4.
This fixes 32-bit config reads for the HiSilicon driver. Obviously
the driver is completely broken without this fix (apparently it
actually was tested internally, but got broken somehow in the process
of upstreaming it).
David S. Miller [Thu, 31 Dec 2015 05:53:11 +0000 (00:53 -0500)]
Merge branch 'ethtool-phy-stats'
Andrew Lunn says:
====================
Ethtool support for phy stats
This patchset add ethtool support for reading statistics from the PHY.
The Marvell and Micrel Phys are then extended to report receiver
packet errors and idle errors.
v2:
Fix linking when phylib is not enabled.
v3:
Inline helpers into ethtool.c, so fixing when phylib is a module.
v4:
Add missing static
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Wed, 30 Dec 2015 15:28:25 +0000 (16:28 +0100)]
ethtool: Add phy statistics
Ethernet PHYs can maintain statistics, for example errors while idle
and receive errors. Add an ethtool mechanism to retrieve these
statistics, using the same model as MAC statistics.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Tue, 29 Dec 2015 09:49:25 +0000 (17:49 +0800)]
sctp: sctp should release assoc when sctp_make_abort_user return NULL in sctp_close
In sctp_close, sctp_make_abort_user may return NULL because of memory
allocation failure. If this happens, it will bypass any state change
and never free the assoc. The assoc has no chance to be freed and it
will be kept in memory with the state it had even after the socket is
closed by sctp_close().
So if sctp_make_abort_user fails to allocate memory, we should abort
the asoc via sctp_primitive_ABORT as well. Just like the annotation in
sctp_sf_cookie_wait_prm_abort and sctp_sf_do_9_1_prm_abort said,
"Even if we can't send the ABORT due to low memory delete the TCB.
This is a departure from our typical NOMEM handling".
But then the chunk is NULL (low memory) and the SCTP_CMD_REPLY cmd would
dereference the chunk pointer, and system crash. So we should add
SCTP_CMD_REPLY cmd only when the chunk is not NULL, just like other
places where it adds SCTP_CMD_REPLY cmd.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolai Stange [Tue, 29 Dec 2015 12:29:55 +0000 (13:29 +0100)]
net, socket, socket_wq: fix missing initialization of flags
Commit ceb5d58b2170 ("net: fix sock_wake_async() rcu protection") from
the current 4.4 release cycle introduced a new flags member in
struct socket_wq and moved SOCKWQ_ASYNC_NOSPACE and SOCKWQ_ASYNC_WAITDATA
from struct socket's flags member into that new place.
Unfortunately, the new flags field is never initialized properly, at least
not for the struct socket_wq instance created in sock_alloc_inode().
One particular issue I encountered because of this is that my GNU Emacs
failed to draw anything on my desktop -- i.e. what I got is a transparent
window, including the title bar. Bisection lead to the commit mentioned
above and further investigation by means of strace told me that Emacs
is indeed speaking to my Xorg through an O_ASYNC AF_UNIX socket. This is
reproducible 100% of times and the fact that properly initializing the
struct socket_wq ->flags fixes the issue leads me to the conclusion that
somehow SOCKWQ_ASYNC_WAITDATA got set in the uninitialized ->flags,
preventing my Emacs from receiving any SIGIO's due to data becoming
available and it got stuck.
Make sock_alloc_inode() set the newly created struct socket_wq's ->flags
member to zero.
Fixes: ceb5d58b2170 ("net: fix sock_wake_async() rcu protection") Signed-off-by: Nicolai Stange <nicstange@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 30 Dec 2015 21:34:53 +0000 (16:34 -0500)]
Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
10GbE Intel Wired LAN Driver Updates 2015-12-29
This series contains updates to ixgbe and ixgbevf.
William Dauchy provides a fix for ixgbevf that was implemented for ixgbe,
commit 5d6002b7b822c7 ("ixgbe: Fix handling of NAPI budget when multiple
queues are enabled per vector"). The issue was that the polling routine
would increase the budget for receive to at least 1 per queue if multiple
queues were present, which resulted in receive packets being processed
when the budget was 0.
Emil provides minor cleanups for ixgbevf, one being that we need to
check rx_itr_setting with == and not &, since it is not a mask. Added
QSFP PHY support in ixgbe to allow for more accurate reporting of port
settings. Fixed the max RSS limit for X550 which is 63, not 64.
Veola fixes ixgbe ethtool reporting of backplane type interfaces as
1000/10000baseT link modes, instead, report the media as KR, KX or KX4
based on the backplane interface present.
Mark cleans up redundancy in the setting of hw_enc_features that makes
it appear that X550 has more encapsulation features than other devices.
Also do not set NETIF_F_SG any longer since that is set by the
register_netdev() call. Also fixed the X550EM_x revision check, which
needs to check a value, not just a bit.
Alex Duyck fixes additional bugs in ixgbe_clear_vf_vlans(), one being
that the mask was using a divide instead of a modulus, which resulted
in the mask bit being incorrectly set to 0 or 1 based on the value of
the VF being tested. Alex also found that he was not consistent in
using the "word" argument as an offset or as a register offset, so
made the code consistently use word as the offset in the array.
v2: dropped patch 8 of the original series, as it was undoing a part of
the fix Alex Duyck was doing in patch 9 of the original series.
Dropped based on feedback from Emil (the author).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 30 Dec 2015 21:33:46 +0000 (16:33 -0500)]
Merge branch 'be2net-next'
Sathya Perla says:
====================
be2net: patch set
The following patch set contains some feature additions, code re-organization
and cleanup and a few non-critical fixes. Pls consider applying this to
the net-next tree. Thanks.
v3 changes: add a default case to the switch statement in patch 5 to
satisfy the compiler (-Wswitch).
v2 changes: replaced an if/else block that checks for error values with a
switch/case statement in patch 5.
Patch 1 fixes VF link state transition from disabled to auto that did
not work due to an issue in the FW. This issue could not be fixed in FW due
to some backward compatibility issues it causes with released drivers.
The issue has been fixed by introducing a new version (v2) of the cmd
from 10.6 FW onwards. This patch adds support for v2 version of this cmd.
Patch 2 reports a EOPNOTSUPP status to ethtool when the user tries to
configure BE3/SRIOV in VEPA mode as it is not supported by the chip.
Patch 3 cleansup FW flash image related constant definitions. Many of these
definitions (such as section offset values) were defined in decimal format
rather than hexa-decimal. This makes this part of the code un-readable.
Also some defines related to BE2 are labeld "g2" and defines related to BE3
are labeled "g3". This patch cleans up all of this to make this code more
readable.
Patch 4 moves the FW cmd code to be_cmds.c. All code relating to FW cmds
has been in be_cmds.[ch], excepting FW flash cmd related code.
This patch moves these routines from be_main.c to be_cmds.c.
Patch 5 adds a log message to report digital signature errors while
flashing a FW image. From FW version 11.0 onwards, the FW supports a new
"secure mode" feature (based on a jumper setting on the adapter.) In this
mode, the FW image when flashed is authenticated with a digital signature.
Patch 6 removes a line of code that has no effect.
Patch 7 removes some unused variables.
Patch 8 fixes port resource descriptor query via the GET_PROFILE FW cmd.
An earlier commit passed a specific pf_num while issuing this cmd as FW
returns descriptors for all functions when pf_num is zero. But, when pf_num
is set to a non-zero value, FW does not return the port resource descriptor.
This patch fixes this by setting pf_num to 0 while issuing the query cmd
and adds code to pick the correct NIC resource descriptor from the list of
descriptors returned by FW.
Patch 9 adds support for ethtool get-dump feature. In the past when this
option was not yet available, this feature was supported via the
--register-dump option as a workaround. This patch removes support for
FW-dump via --register-dump option as it is now available via --get-dump
option. Even though the "ethtool --register-dump" cmd which used to work
earlier, will now fail with ENOTSUPP error, we feel it is not an issue as
this is used only for diagnostics purpose.
Patch 10 bumps up the driver version.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Suresh Reddy [Wed, 30 Dec 2015 06:29:05 +0000 (01:29 -0500)]
be2net: bump up the driver version to 11.0.0.0
Signed-off-by: Suresh Reddy <suresh.reddy@avagotech.com> Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Venkat Duvvuru [Wed, 30 Dec 2015 06:29:04 +0000 (01:29 -0500)]
be2net: support ethtool get-dump option
This patch adds support for ethtool's --get-dump option in be2net,
to retrieve FW dump. In the past when this option was not yet available,
this feature was supported via the --register-dump option as a workaround.
This patch removes support for FW-dump via --register-dump option as it is
now available via --get-dump option. Even though the
"ethtool --register-dump" cmd which used to work earlier, will now fail
with ENOTSUPP error, we feel it is not an issue as this is used only
for diagnostics purpose.
Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@avagotech.com> Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Suresh Reddy [Wed, 30 Dec 2015 06:29:03 +0000 (01:29 -0500)]
be2net: fix port-res desc query of GET_PROFILE_CONFIG FW cmd
Commit 72ef3a88fa8e ("be2net: set pci_func_num while issuing
GET_PROFILE_CONFIG cmd") passed a specific pf_num while issuing a
GET_PROFILE_CONFIG cmd as FW returns descriptors for all functions when
pf_num is zero. But, when pf_num is set to a non-zero value, FW does not
return the Port resource descriptor.
This patch fixes this by setting pf_num to 0 while issuing the query cmd
and adds code to pick the correct NIC resource descriptor from the list of
descriptors returned by FW.
Fixes: 72ef3a88fa8e ("be2net: set pci_func_num while issuing
GET_PROFILE_CONFIG cmd") Signed-off-by: Suresh Reddy <suresh.reddy@avagotech.com> Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Venkat Duvvuru [Wed, 30 Dec 2015 06:29:02 +0000 (01:29 -0500)]
be2net: remove unused error variables
eeh_error, fw_timeout, hw_error variables in the be_adapter structure are
not used anymore. An earlier patch that introduced adapter->err_flags to
store this information missed removing these variables.
Signed-off-by: Venkat Duvvuru <venkatkumar.duvvuru@avagotech.com> Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sathya Perla [Wed, 30 Dec 2015 06:29:01 +0000 (01:29 -0500)]
be2net: remove a line of code that has no effect
This patch removes a line of code that changes adapter->recommended_prio
value followed by yet another assignment.
Also, the variable is used to store the vlan priority value that is already
shifted to the PCP bits position in the vlan tag format. Hence, the name of
this variable is changed to recommended_prio_bits.
Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Suresh Reddy [Wed, 30 Dec 2015 06:29:00 +0000 (01:29 -0500)]
be2net: log digital signature errors while flashing FW image
(based on a jumper setting on the adapter.) In this mode, the FW image when
flashed is authenticated with a digital signature. This patch logs
appropriate error messages and return a status to ethtool when errors
relating to FW image authentication occur.
Signed-off-by: Suresh Reddy <suresh.reddy@avagotech.com> Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Suresh Reddy [Wed, 30 Dec 2015 06:28:59 +0000 (01:28 -0500)]
be2net: move FW flash cmd code to be_cmds.c
All code relating to FW cmds is in be_cmds.[ch] excepting FW flash cmd
related code. This patch moves these routines from be_main.c to be_cmds.c
Signed-off-by: Suresh Reddy <suresh.reddy@avagotech.com> Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Suresh Reddy [Wed, 30 Dec 2015 06:28:58 +0000 (01:28 -0500)]
be2net: cleanup FW flash image related macro defines
Many constant definitions relating to the FW-image layout
(such as section offset values) were defined in decimal format rather than
hexa-decimal. This makes this part of the code un-readable. Also some
defines related to BE2 are labeld "g2" and defines related to BE3 are
labeled "g3". This patch cleans up all of this to make this code more
readable.
Signed-off-by: Suresh Reddy <suresh.reddy@avagotech.com> Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Suresh Reddy [Wed, 30 Dec 2015 06:28:57 +0000 (01:28 -0500)]
be2net: avoid configuring VEPA mode on BE3
BE3 chip doesn't support VEPA mode.
Signed-off-by: Suresh Reddy <suresh.reddy@avagotech.com> Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Suresh Reddy [Wed, 30 Dec 2015 06:28:56 +0000 (01:28 -0500)]
be2net: fix VF link state transition from disabled to auto
The VF link state setting transition from "disable" to "auto" does not work
due to a bug in SET_LOGICAL_LINK_CONFIG_V1 cmd in FW. This issue could not
be fixed in FW due to some backward compatibility issues it causes with
some released drivers. The issue has been fixed by introducing a new
version (v2) of the cmd from 10.6 FW onwards. In v2, to set the VF link
state to auto, both PLINK_ENABLE and PLINK_TRACK bits have to be set to 1.
The VF link state setting feature now works on Lancer chips too from
FW ver 10.6.315.0 onwards.
Signed-off-by: Suresh Reddy <suresh.reddy@avagotech.com> Signed-off-by: Sathya Perla <sathya.perla@avagotech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 30 Dec 2015 18:26:20 +0000 (10:26 -0800)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Make the block layer great again.
Basically three amazing fixes in this pull request, split into 4
patches. Believe me, they should go into 4.4. Two of them fix a
regression, the third and last fixes an easy-to-trigger bug.
- Fix a bad irq enable through null_blk, for queue_mode=1 and using
timer completions. Add a block helper to restart a queue
asynchronously, and use that from null_blk. From me.
- Fix a performance issue in NVMe. Some devices (Intel Pxxxx) expose
a stripe boundary, and performance suffers if we cross it. We took
that into account for merging, but not for the newer splitting
code. Fix from Keith.
- Fix a kernel oops in lightnvm with multiple channels. From Matias"
* 'for-linus' of git://git.kernel.dk/linux-block:
lightnvm: wrong offset in bad blk lun calculation
null_blk: use async queue restart helper
block: add blk_start_queue_async()
block: Split bios on chunk boundaries
Alexander Duyck [Wed, 23 Dec 2015 17:00:35 +0000 (09:00 -0800)]
ixgbe: Fix bugs in ixgbe_clear_vf_vlans()
When I had rewritten the code for ixgbe_clear_vf_vlans() it looks like I
had transitioned back and forth between using word as an offset and using
word as a register offset. As a result I honestly don't see how the code
was working before other than the fact that resetting the VLANs on the VF
like didn't do much to clear them.
Another issue found is that the mask was using a divide instead of a
modulus. As a result the mask bit was incorrectly being set to either bit
0 or 1 based on the value of the VF being tested. As a result the wrong
VFs were having their VLANs cleared if they were enabled.
I have updated the code so that word represents the offset in the array.
This way we can use the modulus and xor operations and they will make sense
instead of being performed on a 4 byte aligned value.
I replaced the statement "(word % 2) ^ 1" with "~word % 2" in order to
reduce the line length as the line exceeded 80 characters with the register
name inserted. The two should be equivalent so the change should be safe.
Reported-by: Emil Tantilov <emil.s.tantilov@intel.com> Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Fri, 20 Nov 2015 21:12:17 +0000 (13:12 -0800)]
ixgbe: Correct X550EM_x revision check
The X550EM_x revision check needs to check a value, not just a bit.
Use a mask and check the value. Also remove the redundant check
inside the ixgbe_enter_lplu_t_x550em, because it can only be called
when both the mac type and revision check pass.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Emil Tantilov [Fri, 20 Nov 2015 21:02:16 +0000 (13:02 -0800)]
ixgbe: fix RSS limit for X550
X550 allows for up to 64 RSS queues, but the driver can have max
of 63 (-1 MSIX vector for link).
On systems with >= 64 CPUs the driver will set the redirection table
for all 64 queues which will result in packets being dropped.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mark Rustad [Wed, 18 Nov 2015 23:37:04 +0000 (15:37 -0800)]
ixgbe: Clean up redundancy in hw_enc_features
Clean up minor redundancy in the setting of hw_enc_features that
makes it appears that X550 uniquely has more encapsulation features
than other devices. The driver only supports one more feature, so
make it look that way. No longer set NETIF_F_SG since that is set
by the register_netdev call. Thanks to Alex Duyck for noticing this
slight confusion.
Reported-by: Alexander Duyck <aduyck@mirantis.com> Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Veola Nazareth [Wed, 11 Nov 2015 23:22:59 +0000 (16:22 -0700)]
ixgbe: report correct media type for KR, KX and KX4 interfaces
Ethtool reports backplane type interfaces as 1000/10000baseT link modes.
This has been corrected to report the media as KR, KX or KX4 based on the
backplane interface present.
Signed-off-by: Veola Nazareth <veola.nazareth@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Emil Tantilov [Mon, 9 Nov 2015 23:07:12 +0000 (15:07 -0800)]
ixgbe: add support for QSFP PHY types in ixgbe_get_settings()
Add missing QSFP PHY types to allow for more accurate reporting of
port settings.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Emil Tantilov [Thu, 5 Nov 2015 00:02:21 +0000 (16:02 -0800)]
ixgbevf: minor cleanups for ixgbevf_set_itr()
adapter->rx_itr_setting is not a mask so check it with == instead of &
do not default to 12K interrupts in ixgbevf_set_itr()
There should be no functional effect from these changes.
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
William Dauchy [Fri, 30 Oct 2015 17:16:30 +0000 (18:16 +0100)]
ixgbevf: Fix handling of NAPI budget when multiple queues are enabled per vector
This is the same patch as for ixgbe but applied differently according to
busy polling. See commit 5d6002b7b822c74 ("ixgbe: Fix handling of NAPI
budget when multiple queues are enabled per vector")
Signed-off-by: William Dauchy <william@gandi.net> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Linus Torvalds [Wed, 30 Dec 2015 02:04:41 +0000 (18:04 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"9 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/vmstat: fix overflow in mod_zone_page_state()
ocfs2/dlm: clear migration_pending when migration target goes down
mm/memory_hotplug.c: check for missing sections in test_pages_in_a_zone()
ocfs2: fix flock panic issue
m32r: add io*_rep helpers
m32r: fix build failure
arch/x86/xen/suspend.c: include xen/xen.h
mm: memcontrol: fix possible memcg leak due to interrupted reclaim
ocfs2: fix BUG when calculate new backup super
Heiko Carstens [Tue, 29 Dec 2015 22:54:32 +0000 (14:54 -0800)]
mm/vmstat: fix overflow in mod_zone_page_state()
mod_zone_page_state() takes a "delta" integer argument. delta contains
the number of pages that should be added or subtracted from a struct
zone's vm_stat field.
If a zone is larger than 8TB this will cause overflows. E.g. for a
zone with a size slightly larger than 8TB the line
in mm/page_alloc.c:free_area_init_core() will result in a negative
result for the NR_ALLOC_BATCH entry within the zone's vm_stat, since 8TB
contain 0x8xxxxxxx pages which will be sign extended to a negative
value.
Fix this by changing the delta argument to long type.
This could fix an early boot problem seen on s390, where we have a 9TB
system with only one node. ZONE_DMA contains 2GB and ZONE_NORMAL the
rest. The system is trying to allocate a GFP_DMA page but ZONE_DMA is
completely empty, so it tries to reclaim pages in an endless loop.
This was seen on a heavily patched 3.10 kernel. One possible
explaination seem to be the overflows caused by mod_zone_page_state().
Unfortunately I did not have the chance to verify that this patch
actually fixes the problem, since I don't have access to the system
right now. However the overflow problem does exist anyway.
Given the description that a system with slightly less than 8TB does
work, this seems to be a candidate for the observed problem.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xuejiufei [Tue, 29 Dec 2015 22:54:29 +0000 (14:54 -0800)]
ocfs2/dlm: clear migration_pending when migration target goes down
We have found a BUG on res->migration_pending when migrating lock
resources. The situation is as follows.
dlm_mark_lockres_migration
res->migration_pending = 1;
__dlm_lockres_reserve_ast
dlm_lockres_release_ast returns with res->migration_pending remains
because other threads reserve asts
wait dlm_migration_can_proceed returns 1
>>>>>>> o2hb found that target goes down and remove target
from domain_map
dlm_migration_can_proceed returns 1
dlm_mark_lockres_migrating returns -ESHOTDOWN with
res->migration_pending still remains.
When reentering dlm_mark_lockres_migrating(), it will trigger the BUG_ON
with res->migration_pending. So clear migration_pending when target is
down.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Banman [Tue, 29 Dec 2015 22:54:25 +0000 (14:54 -0800)]
mm/memory_hotplug.c: check for missing sections in test_pages_in_a_zone()
test_pages_in_a_zone() does not account for the possibility of missing
sections in the given pfn range. pfn_valid_within always returns 1 when
CONFIG_HOLES_IN_ZONE is not set, allowing invalid pfns from missing
sections to pass the test, leading to a kernel oops.
Wrap an additional pfn loop with PAGES_PER_SECTION granularity to check
for missing sections before proceeding into the zone-check code.
This also prevents a crash from offlining memory devices with missing
sections. Despite this, it may be a good idea to keep the related patch
'[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with
missing sections' because missing sections in a memory block may lead to
other problems not covered by the scope of this fix.
Signed-off-by: Andrew Banman <abanman@sgi.com> Acked-by: Alex Thorlton <athorlton@sgi.com> Cc: Russ Anderson <rja@sgi.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Greg KH <greg@kroah.com> Cc: Seth Jennings <sjennings@variantweb.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Junxiao Bi [Tue, 29 Dec 2015 22:54:22 +0000 (14:54 -0800)]
ocfs2: fix flock panic issue
Commit 4f6563677ae8 ("Move locks API users to locks_lock_inode_wait()")
move flock/posix lock indentify code to locks_lock_inode_wait(), but
missed to set fl_flags to FL_FLOCK which caused the following kernel
panic on 4.4.0_rc5.
Fixes: 4f6563677ae8 ("Move locks API users to locks_lock_inode_wait()") Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sudip Mukherjee [Tue, 29 Dec 2015 22:54:16 +0000 (14:54 -0800)]
m32r: fix build failure
m32r allmodconfig is failing with:
In file included from ../include/linux/kvm_para.h:4:0,
from ../kernel/watchdog.c:26:
../include/uapi/linux/kvm_para.h:30:26: fatal error: asm/kvm_para.h: No such file or directory
Andrew Morton [Tue, 29 Dec 2015 22:54:13 +0000 (14:54 -0800)]
arch/x86/xen/suspend.c: include xen/xen.h
Fix the build warning:
arch/x86/xen/suspend.c: In function 'xen_arch_pre_suspend':
arch/x86/xen/suspend.c:70:9: error: implicit declaration of function 'xen_pv_domain' [-Werror=implicit-function-declaration]
if (xen_pv_domain())
^
Reported-by: kbuild test robot <fengguang.wu@intel.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vladimir Davydov [Tue, 29 Dec 2015 22:54:10 +0000 (14:54 -0800)]
mm: memcontrol: fix possible memcg leak due to interrupted reclaim
Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break()
once enough pages have been reclaimed, in which case, in contrast to a
full round-trip over a cgroup sub-tree, the current position stored in
mem_cgroup_reclaim_iter of the target cgroup does not get invalidated
and so is left holding the reference to the last scanned cgroup. If the
target cgroup does not get scanned again (we might have just reclaimed
the last page or all processes might exit and free their memory
voluntary), we will leak it, because there is nobody to put the
reference held by the iterator.
The problem is easy to reproduce by running the following command
sequence in a loop:
mkdir /sys/fs/cgroup/memory/test
echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
memhog 150M
echo $$ > /sys/fs/cgroup/memory/cgroup.procs
rmdir test
The cgroups generated by it will never get freed.
This patch fixes this issue by making mem_cgroup_iter avoid taking
reference to the current position. In order not to hit use-after-free
bug while running reclaim in parallel with cgroup deletion, we make use
of ->css_released cgroup callback to clear references to the dying
cgroup in all reclaim iterators that might refer to it. This callback
is called right before scheduling rcu work which will free css, so if we
access iter->position from rcu read section, we might be sure it won't
go away under us.
[hannes@cmpxchg.org: clean up css ref handling] Fixes: 5ac8fb31ad2e ("mm: memcontrol: convert reclaim iterator to simple css refcounting") Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [3.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Joseph Qi [Tue, 29 Dec 2015 22:54:06 +0000 (14:54 -0800)]
ocfs2: fix BUG when calculate new backup super
When resizing, it firstly extends the last gd. Once it should backup
super in the gd, it calculates new backup super and update the
corresponding value.
But it currently doesn't consider the situation that the backup super is
already done. And in this case, it still sets the bit in gd bitmap and
then decrease from bg_free_bits_count, which leads to a corrupted gd and
trigger the BUG in ocfs2_block_group_set_bits:
Joe Stringer [Wed, 23 Dec 2015 22:39:27 +0000 (14:39 -0800)]
openvswitch: Fix template leak in error cases.
Commit 5b48bb8506c5 ("openvswitch: Fix helper reference leak") fixed a
reference leak on helper objects, but inadvertently introduced a leak on
the ct template.
Previously, ct_info.ct->general.use was initialized to 0 by
nf_ct_tmpl_alloc() and only incremented when ovs_ct_copy_action()
returned successful. If an error occurred while adding the helper or
adding the action to the actions buffer, the __ovs_ct_free_action()
cleanup would use nf_ct_put() to free the entry; However, this relies on
atomic_dec_and_test(ct_info.ct->general.use). This reference must be
incremented first, or nf_ct_put() will never free it.
Fix the issue by acquiring a reference to the template immediately after
allocation.
Fixes: cae3a2627520 ("openvswitch: Allow attaching helpers to ct action") Fixes: 5b48bb8506c5 ("openvswitch: Fix helper reference leak") Signed-off-by: Joe Stringer <joe@ovn.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 29 Dec 2015 20:13:45 +0000 (15:13 -0500)]
Merge branch 'bpf_hash-locking'
Ming Lei says:
====================
bpf: hash: use per-bucket spinlock
This patchset tries to optimize ebpf hash map, and follows
the idea:
Both htab_map_update_elem() and htab_map_delete_elem()
can be called from eBPF program, and they may be in kernel
hot path, it isn't efficient to use a per-hashtable lock
in this two helpers, so this patch converts the lock into
per-bucket spinlock.
With this patchset, looks the performance penalty from eBPF
decreased a lot, see the following test:
1) run 'tools/biolatency' of bcc before running block test;
2) run fio to test block throught over /dev/nullb0,
(randread, 16jobs, libaio, 4k bs) and the test box
is one 24cores(dual sockets) VM server:
- without patchset: 607K IOPS
- with this patchset: 1184K IOPS
- without running eBPF prog: 1492K IOPS
TODO:
- remove the per-hashtable atomic counter
V2:
- fix checking on buckets size
V1:
- fix the wrong 3/3 patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Both htab_map_update_elem() and htab_map_delete_elem() can be
called from eBPF program, and they may be in kernel hot path,
so it isn't efficient to use a per-hashtable lock in this two
helpers.
The per-hashtable spinlock is used for protecting bucket's
hlist, and per-bucket lock is just enough. This patch converts
the per-hashtable lock into per-bucket spinlock, so that
contention can be decreased a lot.
Signed-off-by: Ming Lei <tom.leiming@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>