John Fastabend [Wed, 18 Mar 2015 12:57:33 +0000 (14:57 +0200)]
net: Add max rate tx queue attribute
This adds a tx_maxrate attribute to the tx queue sysfs entry allowing
for max-rate limiting. Along with DCB-ETS and BQL this provides another
knob to tune queue performance. The limit units are Mbps.
By default it is disabled. To disable the rate limitation after it
has been set for a queue, it should be set to zero.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
I was trying to squeeze bucket_table->rehash in by downsizing
bucket_table->size, only to find that my spot had been taken
over by bucket_table->shift. These patches kill shift and makes
me feel better :)
v2 corrects the typo in the test_rhashtable changelog and also
notes the min_shift parameter in the tipc patch changelog.
====================
Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 18 Mar 2015 16:44:13 +0000 (12:44 -0400)]
Merge branch 'xgene-next'
Keyur Chudgar says:
====================
drivers: net: xgene: Add second SGMII based 1G interface
This patch adds support for second SGMII based 1G interface.
====================
Signed-off-by: Keyur Chudgar <kchudgar@apm.com> Signed-off-by: Iyappan Subramanian <isubramanian@apm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Keyur Chudgar [Tue, 17 Mar 2015 18:27:13 +0000 (11:27 -0700)]
drivers: net: xgene: Add second SGMII based 1G interface
- Added resource initialization based on port-id field
- Enabled second SGMII 1G interface
Signed-off-by: Keyur Chudgar <kchudgar@apm.com> Signed-off-by: Iyappan Subramanian <isubramanian@apm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Keyur Chudgar [Tue, 17 Mar 2015 18:27:12 +0000 (11:27 -0700)]
dtb: xgene: Add second SGMII based 1G interface node
- Added new SGMII node for port 1
- Added port-id field
Signed-off-by: Keyur Chudgar <kchudgar@apm.com> Signed-off-by: Iyappan Subramanian <isubramanian@apm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Keyur Chudgar [Tue, 17 Mar 2015 18:27:11 +0000 (11:27 -0700)]
Documentation: dtb: Add port-id field for APM X-Gene ethernet
Signed-off-by: Keyur Chudgar <kchudgar@apm.com> Signed-off-by: Iyappan Subramanian <isubramanian@apm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 18 Mar 2015 02:12:10 +0000 (22:12 -0400)]
Merge branch 'tipc_netns_leak'
Ying Xue says:
====================
tipc: fix netns refcnt leak
The series aims to eliminate the issue of netns refcount leak. But
during fixing it, another two additional problems are found. So all
of known issues associated with the netns refcnt leak are resolved
at the same time in the patchset.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Wed, 18 Mar 2015 01:32:59 +0000 (09:32 +0800)]
tipc: withdraw tipc topology server name when namespace is deleted
The TIPC topology server is a per namespace service associated with the
tipc name {1, 1}. When a namespace is deleted, that name must be withdrawn
before we call sk_release_kernel because the kernel socket release is
done in init_net and trying to withdraw a TIPC name published in another
namespace will fail with an error as:
[ 170.093264] Unable to remove local publication
[ 170.093264] (type=1, lower=1, ref=2184244004, key=2184244005)
We fix this by breaking the association between the topology server name
and socket before calling sk_release_kernel.
Signed-off-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Before tipc_purge_publications() calls tipc_nametbl_remove_publ() to
remove a publication with a name sequence, the name sequence's lock
is held. However, when tipc_nametbl_remove_publ() calling
tipc_nameseq_remove_publ() to remove the publication, it first tries
to query name sequence instance with the publication, and then holds
the lock of the found name sequence. But as the lock may be already
taken in tipc_purge_publications(), deadlock happens like above
scenario demonstrated. As tipc_nameseq_remove_publ() doesn't grab name
sequence's lock, the deadlock can be avoided if it's directly invoked
by tipc_purge_publications().
Fixes: 97ede29e80ee ("tipc: convert name table read-write lock to RCU") Signed-off-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Wed, 18 Mar 2015 01:32:57 +0000 (09:32 +0800)]
tipc: fix netns refcnt leak
When the TIPC module is loaded, we launch a topology server in kernel
space, which in its turn is creating TIPC sockets for communication
with topology server users. Because both the socket's creator and
provider reside in the same module, it is necessary that the TIPC
module's reference count remains zero after the server is started and
the socket created; otherwise it becomes impossible to perform "rmmod"
even on an idle module.
Currently, we achieve this by defining a separate "tipc_proto_kern"
protocol struct, that is used only for kernel space socket allocations.
This structure has the "owner" field set to NULL, which restricts the
module reference count from being be bumped when sk_alloc() for local
sockets is called. Furthermore, we have defined three kernel-specific
functions, tipc_sock_create_local(), tipc_sock_release_local() and
tipc_sock_accept_local(), to avoid the module counter being modified
when module local sockets are created or deleted. This has worked well
until we introduced name space support.
However, after name space support was introduced, we have observed that
a reference count leak occurs, because the netns counter is not
decremented in tipc_sock_delete_local().
This commit remedies this problem. But instead of just modifying
tipc_sock_delete_local(), we eliminate the whole parallel socket
handling infrastructure, and start using the regular sk_create_kern(),
kernel_accept() and sk_release_kernel() calls. Since those functions
manipulate the module counter, we must now compensate for that by
explicitly decrementing the counter after module local sockets are
created, and increment it just before calling sk_release_kernel().
Fixes: a62fbccecd62 ("tipc: make subscriber server support net namespace") Signed-off-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reported-by: Cong Wang <cwang@twopensource.com> Tested-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 18 Mar 2015 02:02:53 +0000 (22:02 -0400)]
Merge branch 'listener_refactor_part_12'
Eric Dumazet says:
====================
inet: tcp listener refactoring, part 12
By adding a pointer back to listener, we are preparing synack rtx
handling to no longer be governed by listener keepalive timer,
as this is the most problematic source of contention on listener
spinlock. Note that TCP FastOpen had such pointer anyway, so we
make it generic.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 18 Mar 2015 01:32:31 +0000 (18:32 -0700)]
inet: fix request sock refcounting
While testing last patch series, I found req sock refcounting was wrong.
We must set skc_refcnt to 1 for all request socks added in hashes,
but also on request sockets created by FastOpen or syncookies.
It is tricky because we need to defer this initialization so that
future RCU lookups do not try to take a refcount on a not yet
fully initialized request socket.
Also get rid of ireq_refcnt alias.
Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 13854e5a6046 ("inet: add proper refcounting to request sock") Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 18 Mar 2015 01:32:29 +0000 (18:32 -0700)]
tcp: rename struct tcp_request_sock listener
The listener field in struct tcp_request_sock is a pointer
back to the listener. We now have req->rsk_listener, so TCP
only needs one boolean and not a full pointer.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 18 Mar 2015 01:32:28 +0000 (18:32 -0700)]
inet: add rsk_listener field to struct request_sock
Once we'll be able to lookup request sockets in ehash table,
we'll need to get access to listener which created this request.
This avoid doing a lookup to find the listener, which benefits
for a more solid SO_REUSEPORT, and is needed once we no
longer queue request sock into a listener private queue.
Note that 'struct tcp_request_sock'->listener could be reduced
to a single bit, as TFO listener should match req->rsk_listener.
TFO will no longer need to hold a reference on the listener.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
bpf: allow BPF programs access 'protocol' and 'vlan_tci' fields
as a follow on to patch 70006af95515 ("bpf: allow eBPF access skb fields")
this patch allows 'protocol' and 'vlan_tci' fields to be accessible
from extended BPF programs.
The usage of 'protocol', 'vlan_present' and 'vlan_tci' fields is the same as
corresponding SKF_AD_PROTOCOL, SKF_AD_VLAN_TAG_PRESENT and SKF_AD_VLAN_TAG
accesses in classic BPF.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 16 Mar 2015 14:14:34 +0000 (07:14 -0700)]
tcp_metrics: fix wrong lockdep annotations
Changes in tcp_metric hash table are protected by tcp_metrics_lock
only, not by genl_mutex
While we are at it use deref_locked() instead of rcu_dereference()
in tcp_new() to avoid unnecessary barrier, as we hold tcp_metrics_lock
as well.
Reported-by: Andrew Vagin <avagin@parallels.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 098a697b497e ("tcp_metrics: Use a single hash table for all network namespaces.") Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Mon, 16 Mar 2015 11:33:32 +0000 (12:33 +0100)]
dsa: change "select" to "depends on" for NET_SWITCHDEV and for NET_DSA
This would fix randconfig compile error:
net/built-in.o: In function `netdev_switch_fib_ipv4_abort':
(.text+0xf7811): undefined reference to `fib_flush_external'
Also it fixes following warnings:
warning: (NET_DSA) selects NET_SWITCHDEV which has unmet direct dependencies (NET && INET)
warning: (NET_DSA_MV88E6060 && NET_DSA_MV88E6131 && NET_DSA_MV88E6123_61_65 && NET_DSA_MV88E6171 && NET_DSA_MV88E6352 && NET_DSA_BCM_SF2) selects NET_DSA which has unmet direct dependencies (NET && HAVE_NET_DSA && NET_SWITCHDEV)
Reported-by: Randy Dunlap <rdunlap@infradead.org> Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Shaohui Xie [Mon, 16 Mar 2015 10:56:29 +0000 (18:56 +0800)]
net/fsl: modify xgmac_mdio for little endian SoCs
MDIO controller on little endian Socs, e.g. ls2085a is similar to the
controller on big endian Socs, but the MDIO access is little endian,
we use I/O accessor function to handle endianness, so the driver can
run on little endian Socs. A property "little-endian" is used
in DTS to indicate the MDIO is little endian, if driver probes the
property, driver will access MDIO in little endian, otherwise, driver
works in big endian by default.
Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Mon, 16 Mar 2015 10:19:12 +0000 (18:19 +0800)]
net: kernel socket should be released in init_net namespace
Creating a kernel socket with sock_create_kern() happens in "init_net"
namespace, however, releasing it with sk_release_kernel() occurs in
the current namespace which may be different with "init_net" namespace.
Therefore, we should guarantee that the namespace in which a kernel
socket is created is same as the socket is created.
Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Graf [Mon, 16 Mar 2015 09:42:27 +0000 (10:42 +0100)]
rhashtable: Annotate RCU locking of walkers
Fixes the following sparse warnings:
lib/rhashtable.c:767:5: warning: context imbalance in 'rhashtable_walk_start' - wrong count at exit
lib/rhashtable.c:849:6: warning: context imbalance in 'rhashtable_walk_stop' - unexpected unlock
Fixes: f2dba9c6ff0d ("rhashtable: Introduce rhashtable_walk_*") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Mon, 16 Mar 2015 06:04:46 +0000 (23:04 -0700)]
rocker: replace fixed stack allocation with dynamic allocation
In hast to fix some sparse warning, I hard-coded a fix-sized array on the stack
which is probably too big for kernel standards. Fix this by converting array
to dynamic allocation.
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Acked-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 16 Mar 2015 19:55:47 +0000 (15:55 -0400)]
Merge branch 'listener_refactor'
Eric Dumazet says:
====================
inet: tcp listener refactoring, part 10
We are getting close to the point where request sockets will be hashed
into generic hash table. Some followups are needed for netfilter and
will be handled in next patch series.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Scott Feldman [Mon, 16 Mar 2015 04:07:14 +0000 (21:07 -0700)]
switchdev: add swdev ops
As discussed at netconf, introduce swdev_ops as first step to move switchdev
ops from ndo to swdev. This will keep switchdev from cluttering up ndo ops
space.
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Sun, 15 Mar 2015 10:12:05 +0000 (21:12 +1100)]
rhashtable: Fix rhashtable_remove failures
The commit 9d901bc05153bbf33b5da2cd6266865e531f0545 ("rhashtable:
Free bucket tables asynchronously after rehash") causes gratuitous
failures in rhashtable_remove.
The reason is that it inadvertently introduced multiple rehashing
from the perspective of readers. IOW it is now possible to see
more than two tables during a single RCU critical section.
Fortunately the other reader rhashtable_lookup already deals with
this correctly thanks to c4db8848af6af92f90462258603be844baeab44d
("rhashtable: rhashtable: Move future_tbl into struct bucket_table")
so only rhashtable_remove is broken by this change.
This patch fixes this by looping over every table from the first
one to the last or until we find the element that we were trying
to delete.
Incidentally the simple test for detecting rehashing to prevent
starting another shrinking no longer works. Since it isn't needed
anyway (the work queue and the mutex serves as a natural barrier
to unnecessary rehashes) I've simply killed the test.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Sun, 15 Mar 2015 10:12:04 +0000 (21:12 +1100)]
rhashtable: Fix use-after-free in rhashtable_walk_stop
The commit c4db8848af6af92f90462258603be844baeab44d ("rhashtable:
Move future_tbl into struct bucket_table") introduced a use-after-
free bug in rhashtable_walk_stop because it dereferences tbl after
droping the RCU read lock.
This patch fixes it by moving the RCU read unlock down to the bottom
of rhashtable_walk_stop. In fact this was how I had it originally
but it got dropped while rearranging patches because this one
depended on the async freeing of bucket_table.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
V1->V2:
- refactored field access converter into common helper convert_skb_access()
used in both classic and extended BPF
- added missing build_bug_on for field 'len'
- added comment to uapi/linux/bpf.h as suggested by Daniel
- dropped exposing 'ifindex' field for now
classic BPF has a way to access skb fields, whereas extended BPF didn't.
This patch introduces this ability.
Classic BPF can access fields via negative SKF_AD_OFF offset.
Positive bpf_ld_abs N is treated as load from packet, whereas
bpf_ld_abs -0x1000 + N is treated as skb fields access.
Many offsets were hard coded over years: SKF_AD_PROTOCOL, SKF_AD_PKTTYPE, etc.
The problem with this approach was that for every new field classic bpf
assembler had to be tweaked.
I've considered doing the same for extended, but for every new field LLVM
compiler would have to be modifed. Since it would need to add a new intrinsic.
It could be done with single intrinsic and magic offset or use of inline
assembler, but neither are clean from compiler backend point of view, since
they look like calls but shouldn't scratch caller-saved registers.
Another approach was to introduce a new helper functions like bpf_get_pkt_type()
for every field that we want to access, but that is equally ugly for kernel
and slow, since helpers are calls and they are slower then just loads.
In theory helper calls can be 'inlined' inside kernel into direct loads, but
since they were calls for user space, compiler would have to spill registers
around such calls anyway. Teaching compiler to treat such helpers differently
is even uglier.
They were few other ideas considered. At the end the best seems to be to
introduce a user accessible mirror of in-kernel sk_buff structure:
No new instructions added. LLVM doesn't need to be modified.
JITs don't change and verifier already knows when it accesses 'ctx' pointer.
The only thing needed was to convert user visible offset within __sk_buff
to kernel internal offset within sk_buff.
For 'len' and other fields conversion is trivial.
Converting 'pkt_type' takes 2 or 3 instructions depending on endianness.
More fields can be exposed by adding to the end of the 'struct __sk_buff'.
Like vlan_tci and others can be added later.
When pkt_type field is moved around, goes into different structure, removed or
its size changes, the function convert_skb_access() would need to updated and
it will cover both classic and extended.
Patch 2 updates examples to demonstrates how fields are accessed and
adds new tests for verifier, since it needs to detect a corner case when
attacker is using single bpf instruction in two branches with different
register types.
The 4 fields of __sk_buff are already exposed to user space via classic bpf and
I believe they're useful in extended as well.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
- modify sockex1 example to count number of bytes in outgoing packets
- modify sockex2 example to count number of bytes and packets per flow
- add 4 stress tests that exercise 'skb->field' code path of verifier
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Sat, 14 Mar 2015 01:27:17 +0000 (02:27 +0100)]
ebpf: add helper for obtaining current processor id
This patch adds the possibility to obtain raw_smp_processor_id() in
eBPF. Currently, this is only possible in classic BPF where commit da2033c28226 ("filter: add SKF_AD_RXHASH and SKF_AD_CPU") has added
facilities for this.
Perhaps most importantly, this would also allow us to track per CPU
statistics with eBPF maps, or to implement a poor-man's per CPU data
structure through eBPF maps.
David S. Miller [Sun, 15 Mar 2015 23:56:52 +0000 (19:56 -0400)]
Merge branch 'gianfar-next'
Claudiu Manoil says:
====================
gianfar: ARM port driver updates (2/2)
The 2nd round of driver updates to make gianfar portable on ARM,
for the ARM based SoC that integrates eTSEC - "ls1021a".
The patches address the bulk of remaining endianess issues -
handling DMA fields (BD and FCB), and device tree properties.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jingchang Lu [Fri, 13 Mar 2015 08:52:32 +0000 (10:52 +0200)]
gianfar: Consider dts property endianess on handling
Use of_property_read*() to get arch endian consistent
property values. Do some refactoring in the process.
Signed-off-by: Jingchang Lu <jingchang.lu@freescale.com> Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Sat, 14 Mar 2015 02:57:24 +0000 (13:57 +1100)]
rhashtable: Add rehash counter to bucket_table
This patch adds a rehash counter to bucket_table to indicate
the last bucket that has been rehashed. This serves two purposes:
1. Any bucket that has been rehashed can never gain a new object.
2. If the rehash counter reaches the size of the table, the table
will forever remain empty.
This patch also downsizes bucket_table->size to an unsigned int
since we do not support sizes greater than 32 bits yet.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Sat, 14 Mar 2015 02:57:23 +0000 (13:57 +1100)]
rhashtable: Free bucket tables asynchronously after rehash
There is in fact no need to wait for an RCU grace period in the
rehash function, since all insertions are guaranteed to go into
the new table through spin locks.
This patch uses call_rcu to free the old/rehashed table at our
leisure.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Sat, 14 Mar 2015 02:57:20 +0000 (13:57 +1100)]
rhashtable: Fix walker behaviour during rehash
Previously whenever the walker encountered a resize it simply
snaps back to the beginning and starts again. However, this only
works if the rehash started and completed while the walker was
idle.
If the walker attempts to restart while the rehash is still ongoing,
we may miss objects that we shouldn't have.
This patch fixes this by making the walker walk the old table
followed by the new table just like all other readers. If a
rehash is detected we will still signal our caller of the fact
so they can prepare for duplicates but we will simply continue
the walk onto the new table after the old one is finished either
by us or by the rehasher.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Sat, 14 Mar 2015 20:21:59 +0000 (13:21 -0700)]
net: dsa: do not use slave MII bus for fixed PHYs
Commit cd28a1a9baee7 ("net: dsa: fully divert PHY reads/writes if
requested") introduced a check for particular PHYs that need to be
accessed using the slave MII bus created by DSA, but this check was too
inclusive. This would prevent fixed PHYs from being successfully
registered because those should not go through the slave MII bus created
by DSA.
Make sure we check that the PHY is not a fixed PHY to prevent that from
happening.
Fixes: cd28a1a9baee7 ("net: dsa: fully divert PHY reads/writes if requested") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 14 Mar 2015 19:08:02 +0000 (15:08 -0400)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2015-03-13
This series contains updates to ixgbe and ixgbevf.
Don adds additional support for X550 MAC types, which require additional
steps around enabling and disabling Rx. Also cleans up variable type
inconsistency.
I provide a patch to allow relaxed ordering to be enabled on SPARC
architectures. Also cleans up ixgbevf whitespace and code comments to
align the driver with networking coding standard. Lastly cleaned up
uses of memcpy() where ether_addr_copy() could have been used.
Alex removes some dead code in the ixgbe cleanup patch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 13 Mar 2015 22:51:10 +0000 (15:51 -0700)]
inet: fill request sock ir_iif for IPv4
Once request socks will be in ehash table, they will need to have
a valid ir_iff field.
This is currently true only for IPv6. This patch extends support
for IPv4 as well.
This means inet_diag_fill_req() can now properly use ir_iif,
which is better for IPv6 link locals anyway, as request sockets
and established sockets will propagate consistent netlink idiag_if.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 14 Mar 2015 18:38:41 +0000 (14:38 -0400)]
Merge branch 'tipc-next'
Jon Maloy says:
====================
tipc: some optimizations and impovements
The commits in this series contain some relatively simple changes that
lead to better throughput across TIPC connections. We also make changes
to the implementation of link transmission queueing and priority
handling, in order to make the code more comprehensible and maintainable.
v2: Commit #2: Redesigned tipc_msg_validate() to use pskb_may_pull(),
as per feedback from David Miller.
Commit #3: Some cosmetic changes to tipc_msg_extract(). I tried to
replace the unconditional skb_linearize() with calls to
pskb_may_pull() at selected locations, but I gave up.
First, skb_trim() requires a fully linearized buffer.
Second, it doesn't make much sense; the whole buffer
will end up linearized, one way or another.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Fri, 13 Mar 2015 20:08:11 +0000 (16:08 -0400)]
tipc: clean up handling of message priorities
Messages transferred by TIPC are assigned an "importance priority", -an
integer value indicating how to treat the message when there is link or
destination socket congestion.
There is no separate header field for this value. Instead, the message
user values have been chosen in ascending order according to perceived
importance, so that the message user field can be used for this.
This is not a good solution. First, we have many more users than the
needed priority levels, so we end up with treating more priority
levels than necessary. Second, the user field cannot always
accurately reflect the priority of the message. E.g., a message
fragment packet should really have the priority of the enveloped
user data message, and not the priority of the MSG_FRAGMENTER user.
Until now, we have been working around this problem in different ways,
but it is now time to implement a consistent way of handling such
priorities, although still within the constraint that we cannot
allocate any more bits in the regular data message header for this.
In this commit, we define a new priority level, TIPC_SYSTEM_IMPORTANCE,
that will be the only one used apart from the four (lower) user data
levels. All non-data messages map down to this priority. Furthermore,
we take some free bits from the MSG_FRAGMENTER header and allocate
them to store the priority of the enveloped message. We then adjust
the functions msg_importance()/msg_set_importance() so that they
read/set the correct header fields depending on user type.
This small protocol change is fully compatible, because the code at
the receiving end of a link currently reads the importance level
only from user data messages, where there is no change.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Fri, 13 Mar 2015 20:08:10 +0000 (16:08 -0400)]
tipc: split link outqueue
struct tipc_link contains one single queue for outgoing packets,
where both transmitted and waiting packets are queued.
This infrastructure is hard to maintain, because we need
to keep a number of fields to keep track of which packets are
sent or unsent, and the number of packets in each category.
A lot of code becomes simpler if we split this queue into a transmission
queue, where sent/unacknowledged packets are kept, and a backlog queue,
where we keep the not yet sent packets.
In this commit we do this separation.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Fri, 13 Mar 2015 20:08:09 +0000 (16:08 -0400)]
tipc: eliminate unnecessary call to broadcast ack function
The unicast packet header contains a broadcast acknowledge sequence
number, that may need to be conveyed to the broadcast link for proper
treatment. Currently, the function tipc_rcv(), which is on the most
critical data path, calls the function tipc_bclink_acknowledge() to
have this done. This call is made for each received packet, and results
in the unconditional grabbing of the broadcast link spinlock.
This is unnecessary, since we can see directly from tipc_rcv() if
the acknowledged number differs from what has been previously acked
from the node in question. In the vast majority of cases the numbers
won't differ, and there is nothing to update.
We now make the call to tipc_bclink_acknowledge() conditional
to that the two ack values differ.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Fri, 13 Mar 2015 20:08:08 +0000 (16:08 -0400)]
tipc: extract bundled buffers by cloning instead of copying
When we currently extract a bundled buffer from a message bundle in
the function tipc_msg_extract(), we allocate a new buffer and explicitly
copy the linear data area.
This is unnecessary, since we can just clone the buffer and do
skb_pull() on the clone to move the data pointer to the correct
position.
This is what we do in this commit.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Fri, 13 Mar 2015 20:08:07 +0000 (16:08 -0400)]
tipc: eliminate unnecessary linearization of incoming buffers
Currently, TIPC linearizes all incoming buffers directly at reception
before passing them upwards in the stack. This is clearly a waste of
CPU resources, and must be avoided.
In this commit, we eliminate this unnecessary linearization. We still
ensure that at least the message header is linear, and that the buffer
is linearized where this is still needed, i.e. when unbundling and when
reversing messages.
In addition, we ensure that fragmented messages are validated after
reassembly before delivering them upwards in the stack.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Fri, 13 Mar 2015 20:08:06 +0000 (16:08 -0400)]
tipc: move message validation function to msg.c
The function link_buf_validate() is in reality re-entrant and context
independent, and will in later commits be called from several locations.
Therefore, we move it to msg.c, make it outline and rename the it to
tipc_msg_validate().
We also redesign the function to make proper use of pskb_may_pull()
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>