Jozsef Kadlecsik [Fri, 18 Oct 2013 12:03:41 +0000 (14:03 +0200)]
netfilter: ipset: Fix memory allocation for bitmap:port
When the creation of the bitmap types was restructured in ipset, the
wrong (too large) memory allocation was copied for the bitmap:port type
(netfilter bugzilla id #859).
Jozsef Kadlecsik [Fri, 18 Oct 2013 09:41:57 +0000 (11:41 +0200)]
netfilter: ipset: The unnamed union initialization may lead to compilation error
It should be possible to initialize the unnamed union directly, but
unfortunately it is not:
/usr/src/ipset/kernel/net/netfilter/ipset/ip_set_hash_netnet.c: In
function 'hash_netnet4_kadt':
/usr/src/ipset/kernel/net/netfilter/ipset/ip_set_hash_netnet.c:141:
error: unknown field 'cidr' specified in initializer
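For illustration, a minimal standalone example of the failure mode and the
usual workaround (hypothetical struct, not the ipset code): older gcc
rejects a designated initializer that names a member of an unnamed union,
so the member has to be assigned after the aggregate initialization.
    struct elem {
        int ip;
        union {             /* unnamed union */
            int cidr;
            int port;
        };
    };

    static void init_elem(struct elem *e)
    {
        /* struct elem tmp = { .ip = 1, .cidr = 24 }; fails on older gcc:
         * "unknown field 'cidr' specified in initializer"
         */
        struct elem tmp = { .ip = 1 };

        tmp.cidr = 24;      /* set the unnamed-union member separately */
        *e = tmp;
    }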
Reported-by: Husnu Demir <hdemir@metu.edu.tr> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: ipt_CLUSTERIP: make clusterip_list per net namespace
clusterip_configs should be per net namespace, so that operating on
clusterip in one net namespace won't affect other net
namespaces. Right now, only the clusterip_configs of the
init net namespace can be operated on.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Improve the SH fallback realserver selection strategy.
With sh and sh-fallback, if a realserver is down, this attempts to
distribute the traffic that would have gone to that server evenly
among the remaining servers.
Signed-off-by: Alexander Frolkin <avf@eldamar.org.uk> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
commit 578bc3ef1e473a ("ipvs: reorganize dest trash") added
rcu_barrier() on cleanup to wait for dest users and schedulers
like LBLC and LBLCR to put their last dest reference.
Using rcu_barrier with many namespaces is problematic.
Trying to fix it by freeing dest with kfree_rcu is not
a solution: RCU callbacks can run in parallel and their execution
order is random.
Fix it by creating a new function, ip_vs_dest_put_and_free(),
which is heavier than ip_vs_dest_put(). We will use it just
for schedulers like LBLC and LBLCR that can delay their dest
release.
By default, a dest's reference count is above 0 while it is present in
a service and 0 when it has been deleted but is still on the trash list.
Change the dest trash code to use ip_vs_dest_put_and_free(),
so that a refcnt of -1 can be used for freeing. As a result,
such checks remain in the slow path and the rcu_barrier() from
netns cleanup can be removed.
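A minimal sketch of the new helper, assuming the refcount is an atomic_t
and that the trash code lets it drop to -1 once the entry may finally be
freed (simplified, not necessarily the exact ipvs code):
    /* Sketch: used by schedulers (LBLC, LBLCR) that may hold the last
     * reference; the entry is freed once the count goes below zero.
     */
    static inline void ip_vs_dest_put_and_free(struct ip_vs_dest *dest)
    {
        if (atomic_dec_return(&dest->refcnt) < 0)
            kfree(dest);
    }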
Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
David S. Miller [Thu, 10 Oct 2013 19:29:44 +0000 (15:29 -0400)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next
Jeff Kirsher says:
====================
This series contains updates to i40e only.
Alex provides the majority of the patches against i40e, where he
cleans up the Tx and Rx queues and aligns the code with the known
good Tx/Rx queue code in the ixgbe driver.
Anjali provides an i40e patch to update link events to not print to
the log until the device is administratively up.
Catherine provides a patch to update the driver version.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 10 Oct 2013 07:04:37 +0000 (00:04 -0700)]
inet: rename ir_loc_port to ir_num
In commit 634fb979e8f ("inet: includes a sock_common in request_sock")
I forgot that the two ports in sock_common do not have the same byte order:
skc_dport is __be16 (network order), but skc_num is __u16 (host order).
So sparse complains because ir_loc_port (mapped into skc_num) is
considered __u16 while it should be __be16.
Let's rename ir_loc_port to ireq->ir_num (by analogy with inet->inet_num),
and perform the appropriate htons/ntohs conversions.
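A small sketch of the resulting convention (in the spirit of
tcp_make_synack(); simplified):
    /* ir_num is host order (like inet->inet_num), ir_rmt_port is
     * already network order, so only the local port needs htons().
     */
    static void fill_ports(struct tcphdr *th,
                           const struct inet_request_sock *ireq)
    {
        th->source = htons(ireq->ir_num);   /* host -> network */
        th->dest   = ireq->ir_rmt_port;     /* no conversion needed */
    }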
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Sat, 28 Sep 2013 06:01:03 +0000 (06:01 +0000)]
i40e: Add support for 64 bit netstats
This change brings support for 64 bit netstats to the driver. Previously
the stats were 64 bit but highly racy due to the fact that 64 bit
transactions are not atomic on 32 bit systems. This change makes it so
that the 64 bit byte and packet stats are reliable on all architectures.
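The usual mechanism for this is the u64_stats_sync seqcount; a generic
sketch (field names are illustrative, not the i40e layout):
    #include <linux/u64_stats_sync.h>

    struct ring_stats {
        u64 packets;
        u64 bytes;
        struct u64_stats_sync syncp;
    };

    /* writer: update under the seqcount so a 32 bit reader never sees
     * a torn 64 bit value
     */
    static void ring_stats_add(struct ring_stats *s, unsigned int len)
    {
        u64_stats_update_begin(&s->syncp);
        s->packets++;
        s->bytes += len;
        u64_stats_update_end(&s->syncp);
    }

    /* reader: retry the copy if a writer was active meanwhile */
    static void ring_stats_read(struct ring_stats *s, u64 *packets, u64 *bytes)
    {
        unsigned int start;

        do {
            start = u64_stats_fetch_begin(&s->syncp);
            *packets = s->packets;
            *bytes   = s->bytes;
        } while (u64_stats_fetch_retry(&s->syncp, start));
    }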
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 06:00:58 +0000 (06:00 +0000)]
i40e: Move rings from pointer to array to array of pointers
Allocate the queue pairs individually instead of as a group. This
allows for much easier queue management as it is possible to dynamically
resize the queues without having to free and allocate the entire block.
Ease statistic collection by treating Tx/Rx queue pairs as a single
unit. Each pair is allocated together and starts with a Tx queue and
ends with an Rx queue. By ordering them this way it is possible to know
the Rx offset based on a pointer to the Tx queue.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 06:00:53 +0000 (06:00 +0000)]
i40e: Replace ring container array with linked list
This replaces the ring container array with a linked list. The idea is
to make the logic much easier to deal with since this will allow us to
call a simple helper function from the q_vectors to go through the
entire list.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 07:01:44 +0000 (07:01 +0000)]
i40e: Move q_vectors from pointer to array to array of pointers
Allocate the q_vectors individually. The advantage to this is that it
allows for easier freeing and allocation. In addition it makes it so
that we could do node specific allocations at some point in the future.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 06:00:43 +0000 (06:00 +0000)]
i40e: Split bytes and packets from Rx/Tx stats
This makes it so that the Tx and Rx byte and packet counts are
separated from the rest of the statistics. This allows for better
isolation of these stats when we move them into the 64 bit statistics.
Simplify things by re-ordering how the stats display in ethtool.
Instead of displaying all of the Tx queues as a block, followed by all
the Rx queues, the new order is Tx[0], Rx[0], Tx[1], Rx[1], ..., Tx[n],
Rx[n]. This reduces the loops and cleans up the display for testing
purposes, since it is very easy to verify whether flow director is doing
the right thing when the Tx and Rx queues are shown in pairs.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 06:00:32 +0000 (06:00 +0000)]
i40e: Drop dead code and flags from Tx hotpath
Drop the Tx flag TXSW, which is tested but never set.
As a result of this change we can drop a complicated check that always
resulted in the final result of i40e_tx_csum being equal to the
CHECKSUM_PARTIAL value. As such we can replace the entire function call
with just a check for skb->ip_summed == CHECKSUM_PARTIAL.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 06:00:27 +0000 (06:00 +0000)]
i40e: clean up Tx fast path
Sync the fast path for i40e_tx_map and i40e_clean_tx_irq so that they
are similar to igb and ixgbe.
- Only update the Tx descriptor ring in tx_map
- Make skb mapping always on the first buffer in the chain
- Drop the use of MAPPED_AS_PAGE Tx flag
- Only store flags on the first buffer_info structure
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 06:00:22 +0000 (06:00 +0000)]
i40e: Do not directly increment Tx next_to_use
Avoid directly incrementing next_to_use for multiple reasons. The main
reason is that if we directly increment it, it can reach a state
where it is equal to the ring count. Technically this is a state it
should not be able to reach, but the way this is written it now can.
This patch pulls the value off into a register and then increments it
and writes back either the value or 0 depending on if the value is equal
to the ring count.
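The pattern reads roughly as follows (a sketch; ring field names are
assumptions):
    /* Sketch: next_to_use never stores a value equal to the ring count,
     * because the wrap to 0 happens before the write-back.
     */
    static inline void ring_bump_next_to_use(struct tx_ring *ring)
    {
        u16 i = ring->next_to_use;  /* pull the value into a register */

        i++;
        ring->next_to_use = (i < ring->count) ? i : 0;
    }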
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Sat, 28 Sep 2013 06:00:17 +0000 (06:00 +0000)]
i40e: Cleanup Tx buffer info layout
- drop the mapped_as_page u8 from the Tx buffer info as it was unused
- use the DMA unmap accessors for Tx DMA
- replace checks of DMA with checks of the unmap length to verify if an
unmap is needed
- update the Tx buffer layout to make it consistent with igb, ixgbe
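For reference, a buffer-info layout using the DMA unmap accessors looks
roughly like this (a sketch, not the i40e structure):
    #include <linux/dma-mapping.h>
    #include <linux/skbuff.h>

    struct tx_buffer {
        struct sk_buff *skb;
        DEFINE_DMA_UNMAP_ADDR(dma);  /* DMA address of the mapping */
        DEFINE_DMA_UNMAP_LEN(len);   /* length; 0 means nothing to unmap */
    };

    static void tx_buffer_unmap(struct device *dev, struct tx_buffer *tb)
    {
        /* check the unmap length instead of the DMA address */
        if (dma_unmap_len(tb, len)) {
            dma_unmap_single(dev, dma_unmap_addr(tb, dma),
                             dma_unmap_len(tb, len), DMA_TO_DEVICE);
            dma_unmap_len_set(tb, len, 0);
        }
    }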
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Eric Dumazet [Thu, 10 Oct 2013 00:14:52 +0000 (17:14 -0700)]
tcp: use ACCESS_ONCE() in tcp_update_pacing_rate()
sk_pacing_rate is read by sch_fq packet scheduler at any time,
with no synchronization, so make sure we update it in a
sensible way. ACCESS_ONCE() is how we instruct the compiler
not to do stupid things, like using the memory location
as a temporary variable.
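A sketch of the intent, with the rate computation omitted (ACCESS_ONCE()
was the idiom at the time; today READ_ONCE()/WRITE_ONCE() would be used):
    /* writer: publish the new rate with a single store so the lockless
     * reader in sch_fq never sees a compiler-generated partial update
     */
    static void set_pacing_rate(struct sock *sk, u32 rate)
    {
        ACCESS_ONCE(sk->sk_pacing_rate) = rate;
    }

    /* reader (sch_fq side): take one stable snapshot of the rate */
    static u32 get_pacing_rate(struct sock *sk)
    {
        return ACCESS_ONCE(sk->sk_pacing_rate);
    }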
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 9 Oct 2013 22:21:29 +0000 (15:21 -0700)]
inet: includes a sock_common in request_sock
TCP listener refactoring, part 5 :
We want to be able to insert request sockets (SYN_RECV) into main
ehash table instead of the per listener hash table to allow RCU
lookups and remove listener lock contention.
This patch includes the needed struct sock_common in front
of struct request_sock
This means there is no more inet6_request_sock IPv6 specific
structure.
The following inet_request_sock fields were renamed, as they became
macros referencing fields from struct sock_common.
The prefix ir_ was chosen to avoid name collisions.
Eric Dumazet [Tue, 8 Oct 2013 16:02:23 +0000 (09:02 -0700)]
net: gro: allow to build full sized skb
skb_gro_receive() is currently limited to 16 or 17 MSS per GRO skb,
typically 24616 bytes, because it fills up to MAX_SKB_FRAGS frags.
It's relatively easy to extend the skb using frag_list to allow
more frags to be appended into the last sk_buff.
This still builds very efficient skbs, and allows reaching 45 MSS per
skb.
(45 MSS GRO packet uses one skb plus a frag_list containing 2 additional
sk_buff)
High speed TCP flows benefit from this extension by lowering TCP stack
cpu usage (less packets stored in receive queue, less ACK packets
processed)
Forwarding setups could be hurt, as such skbs will need to be
linearized, although it's not a new problem, as GRO could already
provide skbs with a frag_list.
We could make the 65536 bytes threshold a tunable to mitigate this.
(First time we need to linearize skb in skb_needs_linearize(), we could
lower the tunable to ~16*1460 so that following skb_gro_receive() calls
build smaller skbs)
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Sat, 28 Sep 2013 06:00:12 +0000 (06:00 +0000)]
i40e: Drop unused completed stat
The Tx "completed" stat was part of the original rewrite for detecting
Tx hangs. However some time ago in ixgbe I determined that we could
just use the packets stat instead. Since then this stat was
removed from ixgbe and it serves no purpose in i40e so it can be
dropped.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Eric Dumazet [Wed, 9 Oct 2013 10:05:48 +0000 (03:05 -0700)]
net: fix build errors if ipv6 is disabled
CONFIG_IPV6=n is still a valid choice ;)
It appears we can remove dead code.
Reported-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 3 Oct 2013 22:42:29 +0000 (15:42 -0700)]
ipv6: make lookups simpler and faster
TCP listener refactoring, part 4 :
To speed up inet lookups, we moved IPv4 addresses from inet to struct
sock_common.
Now it is time to do the same for IPv6, because it permits us to have fast
lookups for all kinds of sockets, including the upcoming SYN_RECV.
Getting IPv6 addresses in TCP lookups currently requires two extra cache
lines, plus a dereference (and memory stall).
inet6_sk(sk) does the dereference of inet_sk(__sk)->pinet6
This patch is way bigger than its IPv4 counterpart, because for IPv4
we could add aliases (inet_daddr, inet_rcv_saddr), while for IPv6
it's not easily doable.
Eric Dumazet [Thu, 3 Oct 2013 07:22:02 +0000 (00:22 -0700)]
tcp/dccp: remove twchain
TCP listener refactoring, part 3 :
Our goal is to hash SYN_RECV sockets into main ehash for fast lookup,
and parallel SYN processing.
The current inet_ehash_bucket contains two chains, one for ESTABLISHED (and
friend states) sockets, another for TIME_WAIT sockets only.
As the hash table is sized to get at most one socket per bucket, it
makes little sense to have a separate twchain, as it makes the lookup
slightly more complicated and doubles hash table memory usage.
If we make sure all socket types have the lookup keys at the same
offsets, we can use a generic and faster lookup. It turns out TIME_WAIT
and ESTABLISHED sockets already have common lookup fields for IPv4.
[ INET_TW_MATCH() is no longer needed ]
I'll provide a follow-up to factorize IPv6 lookup as well, to remove
INET6_TW_MATCH()
This way, SYN_RECV pseudo sockets will be supported the same way.
A new sock_gen_put() helper is added, doing either a sock_put() or
inet_twsk_put() [ and will support SYN_RECV later ].
Note this helper should only be called in the real slow path, when the RCU
lookup found a socket that was moved to another identity (freed/reused
immediately), but it could eventually be used in other contexts, like
sock_edemux().
Before patch :
dmesg | grep "TCP established"
TCP established hash table entries: 524288 (order: 11, 8388608 bytes)
After patch :
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
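A rough sketch of the dispatch the new helper performs, based on the
description above (refcounting details and the later SYN_RECV case
omitted):
    /* Sketch: drop a reference on a socket found by an RCU lookup,
     * which may turn out to be a timewait socket.
     */
    static void sock_gen_put_sketch(struct sock *sk)
    {
        if (sk->sk_state == TCP_TIME_WAIT)
            inet_twsk_put(inet_twsk(sk));
        else
            sock_put(sk);
    }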
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 9 Oct 2013 01:56:09 +0000 (21:56 -0400)]
Merge branch 'sfc-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bwh/sfc
Ben Hutchings says:
====================
Some more fixes for EF10 support; hopefully the last lot:
1. Fixes for reading statistics, from Edward Cree and Jon Cooper.
2. Addition of ethtool statistics for packets dropped by the hardware
before they were associated with a specific function, from Edward Cree.
3. Only bind to functions that are in control of their associated port,
as the driver currently assumes this is the case.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 8 Oct 2013 22:16:00 +0000 (15:16 -0700)]
pkt_sched: fq: fix non TCP flows pacing
Steinar reported FQ pacing was not working for UDP flows.
It looks like the initial sk->sk_pacing_rate value of 0 was
a wrong choice. We should init it to ~0U (unlimited).
Then, TCA_FQ_FLOW_DEFAULT_RATE should be removed because it makes
no real sense. The default rate is really unlimited, and we
need to avoid a zero divide.
Reported-by: Steinar H. Gunderson <sesse@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Tue, 8 Oct 2013 03:19:19 +0000 (11:19 +0800)]
moxa: fix the error handling in moxart_mac_probe()
This patch fixes the error handling in moxart_mac_probe():
- return -ENOMEM in some memory alloc fail cases
- add missing free_netdev() in the error handling case
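For illustration, the general shape of such a probe error path
(hypothetical driver names and sizes, not the moxart code):
    #include <linux/etherdevice.h>
    #include <linux/platform_device.h>

    struct example_priv {
        void *rx_buf;
    };

    #define RX_BUF_SIZE 2048    /* illustrative only */

    static int example_mac_probe(struct platform_device *pdev)
    {
        struct net_device *ndev;
        void *buf;
        int ret;

        ndev = alloc_etherdev(sizeof(struct example_priv));
        if (!ndev)
            return -ENOMEM;         /* report the real error code */

        buf = devm_kzalloc(&pdev->dev, RX_BUF_SIZE, GFP_KERNEL);
        if (!buf) {
            ret = -ENOMEM;
            goto err_free_netdev;   /* don't leak the net_device */
        }

        ret = register_netdev(ndev);
        if (ret)
            goto err_free_netdev;

        return 0;

    err_free_netdev:
        free_netdev(ndev);
        return ret;
    }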
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
Oussama Ghorbel [Mon, 7 Oct 2013 17:50:05 +0000 (18:50 +0100)]
ipv6: Fix the upper MTU limit in GRE tunnel
Unlike ipv4, the struct member hlen holds the length of the GRE and ipv6
headers. This length is also counted in dev->hard_header_len.
Perhaps it would be cleaner to modify hlen to count only the GRE header
without the ipv6 header, as the variable name suggests, but the simple way
to fix this without regression risk is simply to modify the calculation of
the limit in the ip6gre_tunnel_change_mtu function.
Verified in kernel version v3.11.
Signed-off-by: Oussama Ghorbel <ou.ghorbel@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Shawn Bohrer [Mon, 7 Oct 2013 16:01:40 +0000 (11:01 -0500)]
net: ipv4 only populate IP_PKTINFO when needed
Since the removal of the routing cache, computing
fib_compute_spec_dst() does a fib_table lookup for each UDP multicast
packet received. This has introduced a performance regression for some
UDP workloads.
This change skips populating the packet info for sockets that do not have
IP_PKTINFO set.
Benchmark results from a netperf UDP_RR test:
Before 89789.68 transactions/s
After 90587.62 transactions/s
Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.63us RTT
After 12.48us RTT
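The gist of the change described above, as a hedged sketch (simplified
from the ipv4 pktinfo preparation path):
    /* Sketch: only pay for the fib lookup when the receiving socket
     * actually asked for IP_PKTINFO ancillary data.
     */
    static void pktinfo_prepare_sketch(const struct sock *sk,
                                       struct sk_buff *skb)
    {
        struct in_pktinfo *pktinfo = PKTINFO_SKB_CB(skb);

        if (inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO) {
            pktinfo->ipi_ifindex = skb->dev->ifindex;
            pktinfo->ipi_spec_dst.s_addr = fib_compute_spec_dst(skb);
        } else {
            pktinfo->ipi_ifindex = 0;
            pktinfo->ipi_spec_dst.s_addr = 0;
        }
    }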
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Shawn Bohrer [Mon, 7 Oct 2013 16:01:39 +0000 (11:01 -0500)]
udp: ipv4: Add udp early demux
The removal of the routing cache introduced a performance regression for
some UDP workloads since a dst lookup must be done for each packet.
This change caches the dst per socket in a similar manner to what we do
for TCP by implementing early_demux.
For UDP multicast we can only cache the dst if there is only one
receiving socket on the host. Since caching only works when there is
one receiving socket we do the multicast socket lookup using RCU.
For UDP unicast we only demux sockets with an exact match in order to
not break forwarding setups. Additionally, since the hash chains may be
long, we only check the first socket to see if it is a match, and do not
waste extra time searching the whole chain when we might not find an
exact match.
Benchmark results from a netperf UDP_RR test:
Before 87961.22 transactions/s
After 89789.68 transactions/s
Benchmark results from a fio 1 byte UDP multicast pingpong test
(Multicast one way unicast response):
Before 12.97us RTT
After 12.63us RTT
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Shawn Bohrer [Mon, 7 Oct 2013 16:01:38 +0000 (11:01 -0500)]
udp: Only allow busy read/poll on connected sockets
UDP sockets can receive packets from multiple endpoints and thus may be
received on multiple receive queues. Since packets can arrive
on multiple receive queues we should not mark the napi_id for all
packets. This makes busy read/poll only work for connected UDP sockets.
This additionally enables busy read/poll for UDP multicast packets as
long as the socket is connected by moving the check into
__udp_queue_rcv_skb().
Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Suggested-by: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Cc: Paul Durrant <Paul.Durrant@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 8 Oct 2013 20:10:10 +0000 (16:10 -0400)]
Merge branch 'mlx4'
Amir Vadai says:
====================
net/mlx4_en: Fix pages never dma unmapped on rx
This patchset fixes a bug introduced by commit 51151a16 (mlx4: allow order-0
memory allocations in RX path), where dma_unmap_page wasn't called.
Changes from V0:
- Added "Rename name of mlx4_en_rx_alloc members". Old names were confusing.
- Last frag in page calculation was wrong. Since all frags in page are of the
same size, need to add this frag_stride to end of frag offset, and not the
size of next frag in skb.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Amir Vadai [Mon, 7 Oct 2013 11:38:13 +0000 (13:38 +0200)]
net/mlx4_en: Fix pages never dma unmapped on rx
This patch fixes a bug introduced by commit 51151a16 (mlx4: allow
order-0 memory allocations in RX path).
dma_unmap_page is never reached because the condition to detect the last
fragment in the page is wrong: offset + frag_stride can't be greater than
size. To make sure no additional frag will fit in the page, compare
offset + frag_stride + next_frag_size instead.
next_frag_size is the same as the current one, since the page is shared only
with frags of the same size.
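In other words (a sketch with illustrative names, not the driver code):
    /* Sketch: the page may only be unmapped and released when no further
     * fragment of the same size fits after the current one.
     */
    static bool last_frag_in_page(u32 frag_offset, u32 frag_stride,
                                  u32 page_size)
    {
        /* the current frag ends at frag_offset + frag_stride; the next
         * one would need another frag_stride bytes (same size)
         */
        return frag_offset + frag_stride + frag_stride > page_size;
    }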
CC: Eric Dumazet <edumazet@google.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Amir Vadai [Mon, 7 Oct 2013 11:38:12 +0000 (13:38 +0200)]
net/mlx4_en: Rename name of mlx4_en_rx_alloc members
Add a page prefix to the page-related members: rename @size and @offset to
@page_size and @page_offset.
CC: Eric Dumazet <edumazet@google.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
bonding: ensure that TLB mode's active slave has correct mac filter
Currently, in TLB mode we change mac addresses only by memcpy-ing them to
net_device->dev_addr, without actually setting them via
dev_set_mac_address(). This permits us to always receive all the traffic
on one mac address.
However, in case the interface flips, some drivers might enforce
mac filtering for their FW/HW based on the current ->dev_addr, and thus we
won't be able to receive traffic on that interface if it is selected
as active in TLB mode.
Fix it by setting the mac address forcefully on every new active slave that
we select in TLB mode.
CC: Jay Vosburgh <fubar@us.ibm.com> CC: Andy Gospodarek <andy@greyhouse.net> CC: Yuval Mintz <yuvalmin@broadcom.com> Reported-by: Yuval Mintz <yuvalmin@broadcom.com> Tested-by: Yuval Mintz <yuvalmin@broadcom.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nguyen Hong Ky [Mon, 7 Oct 2013 06:29:25 +0000 (15:29 +0900)]
net: sh_eth: Fix RX packets errors on R8A7740
This patch fixes RX packet errors when receiving large amounts of
data, by setting bit RNC = 1.
RNC - Receive Enable Control
0: Upon completion of reception of one frame, the E-DMAC writes
the receive status to the descriptor and clears the RR bit in
EDRRR to 0.
1: Upon completion of reception of one frame, the E-DMAC writes
(writes back) the receive status to the descriptor. In addition,
the E-DMAC reads the next descriptor and prepares for reception
of the next frame.
In addition, to make packet reception more stable, I set the
maximum size for the transmit/receive FIFO and insert padding
into the receive data.
Signed-off-by: Nguyen Hong Ky <nh-ky@jinso.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
We play with a wait queue even if the socket is
non-blocking. This is an obvious waste.
Besides, it will prevent calling the non-blocking
variant when current is not valid.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 8 Oct 2013 19:32:19 +0000 (15:32 -0400)]
Merge branch 'mrf24j40'
Alan Ott says:
====================
Fix race conditions in mrf24j40 interrupts
After testing with the betas of this patchset, it's been rebased and is
ready for inclusion.
David Hauweele noticed that the mrf24j40 would hang arbitrarily after some
period of heavy traffic. Two race conditions were discovered, and the
driver was changed to use threaded interrupts, since the enable/disable of
interrupts in the driver has recently been a lightning rod whenever issues
arise related to interrupts (costing engineering time), and since threaded
interrupts are the right way to do it.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alan Ott [Sun, 6 Oct 2013 03:52:24 +0000 (23:52 -0400)]
mrf24j40: Use level-triggered interrupts
The mrf24j40 generates level interrupts. There are rare cases where it
appears that the interrupt line never gets de-asserted between interrupts,
causing interrupts to be lost, and causing a hung device from the driver's
perspective. Switching the driver to interpret these interrupts as
level-triggered fixes this issue.
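Requesting a level-triggered, threaded interrupt typically looks like this
(a generic sketch; the function names and the active-low polarity are
assumptions):
    #include <linux/interrupt.h>
    #include <linux/spi/spi.h>

    static irqreturn_t mrf_isr_thread(int irq, void *dev_id)
    {
        /* talk to the chip over SPI here; sleeping is allowed in the
         * threaded handler
         */
        return IRQ_HANDLED;
    }

    static int mrf_request_irq(struct spi_device *spi, void *priv)
    {
        /* IRQF_ONESHOT keeps the level-triggered line masked until the
         * threaded handler has cleared the interrupt source on the chip.
         */
        return request_threaded_irq(spi->irq, NULL, mrf_isr_thread,
                                    IRQF_TRIGGER_LOW | IRQF_ONESHOT,
                                    dev_name(&spi->dev), priv);
    }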
Signed-off-by: Alan Ott <alan@signal11.us> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 8 Oct 2013 19:28:53 +0000 (15:28 -0400)]
Merge branch '6lowpan'
Alan Ott says:
====================
Alexander Aring suggested that devices desired to be linked to 6lowpan
be checked for actually being of type IEEE802154, since IEEE802154 devices
are all that are supported by 6lowpan at present.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Masatake YAMATO [Fri, 4 Oct 2013 02:34:21 +0000 (11:34 +0900)]
veth: Showing peer of veth type dev in ip link (kernel side)
ip link has the ability to show extra information about a network device if
the kernel provides such information. With this patch the veth driver can
provide its peer ifindex information to the ip command via the netlink
interface.
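Such information is typically exported through the rtnl_link_ops
get_size/fill_info callbacks; a hedged sketch (the VETH_PEER_IFINDEX
attribute name is hypothetical, used for illustration only):
    static size_t veth_get_size_sketch(const struct net_device *dev)
    {
        return nla_total_size(sizeof(u32));  /* peer ifindex */
    }

    static int veth_fill_info_sketch(struct sk_buff *skb,
                                     const struct net_device *dev)
    {
        struct veth_priv *priv = netdev_priv(dev);
        struct net_device *peer = rtnl_dereference(priv->peer);

        /* VETH_PEER_IFINDEX is a placeholder attribute name */
        if (peer && nla_put_u32(skb, VETH_PEER_IFINDEX, peer->ifindex))
            return -EMSGSIZE;
        return 0;
    }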
Signed-off-by: Masatake YAMATO <yamato@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The named changeset is causing problems. Let's aim to make this part less
fragile before trying to improve things.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Annie Li <annie.li@oracle.com> Cc: Matt Wilson <msw@amazon.com> Cc: Xi Xiong <xixiong@amazon.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Paul Durrant <paul.durrant@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: Update the sysctl permissions handler to test effective uid/gid
On Tue, 20 Aug 2013 11:40:04 -0500 Eric Sandeen <sandeen@redhat.com> wrote:
> This was brought up in a Red Hat bug (which may be marked private, I'm sorry):
>
> Bug 987055 - open O_WRONLY succeeds on some root owned files in /proc for process running with unprivileged EUID
>
> "On RHEL7 some of the files in /proc can be opened for writing by an unprivileged EUID."
>
> The flaw existed upstream as well last I checked.
>
> This commit in kernel v3.8 caused the regression:
>
> commit cff109768b2d9c03095848f4cd4b0754117262aa
> Author: Eric W. Biederman <ebiederm@xmission.com>
> Date: Fri Nov 16 03:03:01 2012 +0000
>
> net: Update the per network namespace sysctls to be available to the network namespace owner
>
> - Allow anyone with CAP_NET_ADMIN rights in the user namespace of the
> network namespace to change sysctls.
> - Allow anyone the uid of the user namespace root the same
> permissions over the network namespace sysctls as the global root.
> - Allow anyone with gid of the user namespace root group the same
> permissions over the network namespace sysctl as the global root group.
>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
>
> because it changed /sys/net's special permission handler to test current_uid, not
> current_euid; same for current_gid/current_egid.
>
> So in this case, root cannot drop privs via set[ug]id, and retains all privs
> in this codepath.
Modify the code to use current_euid() and in_egroup_p(), as is done
in fs/proc/proc_sysctl.c:test_perm().
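A sketch of what the permissions handler looks like with effective ids
(in the spirit of test_perm(); structure wiring abridged):
    static int net_ctl_permissions_sketch(struct user_namespace *user_ns,
                                          struct ctl_table *table)
    {
        kuid_t root_uid = make_kuid(user_ns, 0);
        kgid_t root_gid = make_kgid(user_ns, 0);

        /* effective uid/capability: dropping privs via set[ug]id now
         * really drops write access to the net sysctls
         */
        if (ns_capable(user_ns, CAP_NET_ADMIN) ||
            uid_eq(current_euid(), root_uid)) {
            int mode = (table->mode >> 6) & 7;
            return (mode << 6) | (mode << 3) | mode;
        }
        /* effective group membership for the namespace root group */
        if (in_egroup_p(root_gid)) {
            int mode = (table->mode >> 3) & 7;
            return (mode << 3) | mode;
        }
        return table->mode;
    }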
Cc: stable@vger.kernel.org Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Just some minor conflicts between the wireless-next changes
and Joe Perches's "extern" removal from function prototypes
in header files.
John W. Linville says:
====================
Regarding the Bluetooth bits, Gustavo says:
"The big work here is from Marcel and Johan. They did a lot of work
in the L2CAP, HCI and MGMT layers. The most important ones are the
addition of a new MGMT command to enable/disable LE advertisement
and the introduction of the HCI user channel to allow applications
to get directly and exclusive access to Bluetooth devices."
As to the ath10k bits, Kalle says:
"Bartosz dropped support for qca98xx hw1.0 hardware from ath10k, it's
just too much to support it. Michal added support for the new firmware
interface. Marek fixed WEP in AP and IBSS mode. Rest of the changes are
minor fixes or cleanups."
And also:
"Major changes are:
* throughput improvements including aligning the RX frames correctly and
optimising HTT layer (Michal)
* remove qca98xx hw1.0 support (Bartosz)
* add support for firmware version 999.999.0.636 (Michal)
* firmware htt statistics support (Kalle)
* fix WEP in AP and IBSS mode (Marek)
* fix a mutex unlock balance in debugfs file (Shafi)
And of course there's a lot of smaller fixes and cleanup."
For the wl12xx bits, Luca says:
"Here are some patches intended for 3.13. Eliad is upstreaming a bunch
of patches that have been pending in the internal tree. Mostly bugfixes
and other small improvements."
Along with that...
Arend and friends bring us a batch of brcmfmac updates, Larry Finger
offers some rtlwifi refactoring, and Sujith sends the usual batch of
ath9k updates. As usual, there are a number of other small updates
from a variety of players as well.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Fri, 4 Oct 2013 15:04:48 +0000 (17:04 +0200)]
ipv4: fix ineffective source address selection
When sending out multicast messages, the source address in inet->mc_addr is
ignored and rewritten by an autoselected one. This is caused by a typo in
commit 813b3b5db831 ("ipv4: Use caller's on-stack flowi as-is in output
route lookups").
Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: Separate the close_list and the unreg_list v2
Separate the unreg_list and the close_list in dev_close_many, preventing
dev_close_many from permuting the unreg_list. The permutations of the
unreg_list have resulted in cases where the loopback device is accessed
after it has been freed, in code such as dst_ifdown, resulting in subtle
memory corruption.
This is the second bug from sharing the storage between the close_list
and the unreg_list. The issues that crop up with sharing are
apparently too subtle to show up in normal testing or usage, so let's
forget about being clever and use two separate lists.
v2: Make all callers pass in a close_list to dev_close_many
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Pargmann [Fri, 4 Oct 2013 12:44:40 +0000 (14:44 +0200)]
net/ethernet: cpsw: DT read bool dual_emac
Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Acked-by: Mugunthan V N <mugunthanvnm@ti.com> Acked-by: Peter Korsgaard <jacmet@sunsite.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Pargmann [Fri, 4 Oct 2013 12:44:39 +0000 (14:44 +0200)]
net: ethernet: cpsw: Search childs for slave nodes
The current implementation searches the whole DT for nodes named
"slave".
This patch changes it to search only child nodes for slaves.
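The usual OF idiom for that (a sketch; the parsing of each slave is
elided):
    #include <linux/of.h>

    /* Sketch: walk only the direct children of the cpsw node and pick
     * the ones named "slave", instead of scanning the whole device tree.
     */
    static void cpsw_probe_slaves_sketch(struct device_node *node)
    {
        struct device_node *slave_node;

        for_each_child_of_node(node, slave_node) {
            if (of_node_cmp(slave_node->name, "slave"))
                continue;
            /* parse this slave's properties here */
        }
    }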
Signed-off-by: Markus Pargmann <mpa@pengutronix.de> Acked-by: Mugunthan V N <mugunthanvnm@ti.com> Acked-by: Peter Korsgaard <jacmet@sunsite.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
Ben Hutchings [Mon, 7 Oct 2013 19:10:11 +0000 (20:10 +0100)]
sfc: Only bind to EF10 functions with the LinkCtrl and Trusted flags
Although we do not yet enable multiple PFs per port, it is possible
that a board will be reconfigured to enable them while the driver has
not yet been updated to fully support this.
The most obvious problem is that multiple functions may try to set
conflicting link settings. But we will also run into trouble if the
firmware doesn't consider us fully trusted. So, abort probing unless
both the LinkCtrl and Trusted flags are set for this function.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Oussama Ghorbel [Thu, 3 Oct 2013 13:49:26 +0000 (14:49 +0100)]
ipv6: Allow the MTU of ipip6 tunnel to be set below 1280
The (inner) MTU of an ipip6 (IPv4-in-IPv6) tunnel cannot be set below 1280, which is the minimum MTU in IPv6.
However, there should be no IPv6 on the tunnel interface at all, so the IPv6 rules should not apply.
More info at https://bugzilla.kernel.org/show_bug.cgi?id=15530
This patch checks the minimum MTU for the ipv6 tunnel according to these rules:
- If the tunnel is configured in ipip6 mode, the minimum MTU is 68.
- If the tunnel is configured in ip6ip6 or any mode, the minimum MTU is 1280.
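As a sketch, the rule boils down to (illustrative helper, not the tunnel
code):
    #include <linux/in.h>

    /* Sketch: the minimum MTU depends on what the tunnel carries.
     * Pure IPv4-in-IPv6 may go down to the IPv4 minimum of 68; any mode
     * that can carry IPv6 keeps the IPv6 minimum of 1280.
     */
    static int ip6_tnl_min_mtu_sketch(u8 inner_proto)
    {
        return inner_proto == IPPROTO_IPIP ? 68 : 1280;
    }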
Signed-off-by: Oussama Ghorbel <ou.ghorbel@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Wed, 25 Sep 2013 16:32:09 +0000 (17:32 +0100)]
sfc: Add PM and RXDP drop counters to ethtool stats
Recognise the new Packet Memory and RX Data Path counters.
The following counters are added:
rx_pm_{trunc,discard}_bb_overflow - burst buffer overflowed. This should not
occur if the BB is correctly configured.
rx_pm_{trunc,discard}_vfifo_full - not enough space in packet memory. May
indicate RX performance problems.
rx_pm_{trunc,discard}_qbb - dropped by 802.1Qbb early discard mechanism.
Since Qbb is not supported at present, this should not occur.
rx_pm_discard_mapping - 802.1p priority configured to be dropped. This should
not occur in normal operation.
rx_dp_q_disabled_packets - packet was to be delivered to a queue but queue is
disabled. May indicate misconfiguration by the driver.
rx_dp_di_dropped_packets - parser-dispatcher indicated that a packet should be
dropped.
rx_dp_streaming_packets - packet was sent to the RXDP streaming bus, ie. a
filter directed the packet to the MCPU.
rx_dp_emerg_{fetch,wait} - RX datapath had to wait for descriptors to be
loaded. Indicates performance problems but not drops.
These are only provided if the MC firmware has the
PM_AND_RXDP_COUNTERS capability. Otherwise, mask them out.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Edward Cree [Fri, 27 Sep 2013 17:52:49 +0000 (18:52 +0100)]
sfc: Refactor EF10 stat mask code to allow for more conditional stats
Previously, efx_ef10_stat_mask returned a static const unsigned long[], which
meant that each possible mask had to be declared statically with
STAT_MASK_BITMAP. Since adding a condition would double the size of the
decision tree, we now create the bitmask dynamically.
To do this, we have two functions efx_ef10_raw_stat_mask, which returns a u64,
and efx_ef10_get_stat_mask, which fills in an unsigned long * argument.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Edward Cree [Wed, 25 Sep 2013 16:34:12 +0000 (17:34 +0100)]
sfc: Fix internal indices of ethtool stats for EF10
The indices in nic_data->stats need to match the EF10_STAT_whatever
enum values. In efx_nic_update_stats, only mask; gaps are removed in
efx_ef10_update_stats.
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
Eric Dumazet [Fri, 4 Oct 2013 17:31:41 +0000 (10:31 -0700)]
tcp: do not forget FIN in tcp_shifted_skb()
Yuchung found the following problem:
There are bugs in the SACK processing code, merging part in
tcp_shift_skb_data(), that incorrectly reset or ignore the sacked
skbs' FIN flag. When a receiver first SACKs the FIN sequence, and later
throws away the ofo queue (e.g., sack-reneging), the sender will stop
retransmitting the FIN flag, and hangs forever.
The following packetdrill test can be used to reproduce the bug.
First, a typo inverted left/right of one OR operation, then the
code forgot to advance end_seq if the merged skb carried FIN.
Bug was added in 2.6.29 by commit 832d11c5cd076ab
("tcp: Try to restore large SKBs while SACK processing")
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 4 Oct 2013 17:26:38 +0000 (13:26 -0400)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:
====================
The following patchset contains Netfilter updates for your net-next tree,
mostly ipset improvements and enhancements, they are:
* Don't call ip_nest_end needlessly in the error path, suggested
by Pablo Neira Ayuso, from Jozsef Kadlecsik.
* Fixed sparse warnings about shadowed variable and missing rcu annotation
and fix of "may be used uninitialized" warnings, also from Jozsef.
* Renamed simple macro names to avoid namespace issues, reported by David
Laight, again from Jozsef.
* Use a fixed-size type for the timeout in the extension part, and cosmetic
ordering of matches and targets separately in xt_set.c, from Jozsef.
* Support packet fragments for IPv4 protos without ports from Anders K.
Pedersen. For example this allows a hash:ip,port ipset containing the
entry 192.168.0.1,gre:0 to match all packet fragments for PPTP VPN
tunnels to/from the host. Without this patch only the first packet
fragment (with fragment offset 0) was matched.
* Introduced a new operation to get both setname and family, from Jozsef.
ip[6]tables set match and SET target need to know the family of the set
in order to reject adding rules which refer to a set with a non-matching
family. Currently such rules are silently accepted and then ignored
instead of generating an error message to the user.
* Reworked extensions support in ipset types from Jozsef. The approach of
defining structures with all variations is not manageable as the
number of extensions grows. Therefore a blob for the extensions is
introduced, somewhat similar to conntrack. The support of extensions
which need a per data destroy function is added as well.
* When an element timed out in a list:set type of set, the garbage
collector skipped the checking of the next element. So the purging
was delayed to the next run of the gc, fixed by Jozsef.
* A small Kconfig fix: NETFILTER_NETLINK cannot be selected and
ipset requires it.
* hash:net,net type from Oliver Smith. The type provides the ability to
store pairs of subnets in a set.
* Comment support for ipset entries, from Oliver Smith. This makes it
possible to annotate entries in a set with comments, for example:
ipset n foo hash:net,net comment
ipset a foo 10.0.0.0/21,192.168.1.0/24 comment "office nets A and B"
* Fix of hash types resizing with comment extension from Jozsef.
* Fix of new extensions for list:set type when an element is added
into a slot from where another element was pushed away from Jozsef.
* Introduction of a common function for the listing of the element
extensions from Jozsef.
* Net namespace support for ipset from Vitaly Lavrov.
* hash:net,port,net type from Oliver Smith, which makes it possible
to store triples of two subnets and a protocol, port pair in
a set.
* Get xt_TCPMSS working with net namespace, by Gao feng.
* Use the proper net netnamespace to allocate skbs, also by Gao feng.
* A couple of cleanups for the conntrack SIP helper, by Holger
Eitzenberger.
* Extend cttimeout to allow setting default conntrack timeouts via
nfnetlink, so we can get rid of all our sysctl/proc interfaces in
the future for timeout tuning, from me.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 3 Oct 2013 21:27:25 +0000 (14:27 -0700)]
tcp: shrink tcp6_timewait_sock by one cache line
While working on tcp listener refactoring, I found that it
would really make things easier if sock_common could include
the IPv6 addresses needed in the lookups, instead of doing
very complex games to get their values (depending on sock
being SYN_RECV, ESTABLISHED, TIME_WAIT)
For this to happen, I need to be sure that tcp6_timewait_sock
and tcp_timewait_sock consume same number of cache lines.
This is possible if we only use 32 bits for tw_ttd, as we remove
one 32-bit hole in inet_timewait_sock.
inet_tw_time_stamp() is defined and used, even if its current
implementation looks like tcp_time_stamp: we might need finer
resolution for tcp_time_stamp in the future.
Before patch : sizeof(struct tcp6_timewait_sock) = 0xc8
After patch : sizeof(struct tcp6_timewait_sock) = 0xc0
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrea Merello [Thu, 3 Oct 2013 19:18:37 +0000 (21:18 +0200)]
atl1e: enable support for NETIF_F_RXALL and NETIF_F_RXCRC features
This patch allows (optionally, via ethtool) the atl1e NIC to:
- Receive bad frames (runt, bad-fcs, etc..)
- Receive full frames without stripping the FCS.
This has been tested on my board by injecting runt and bad-fcs
frames with a FPGA-based device.
The particular scenario of receiving very short frames (<4 bytes)
without passing the FCS to the upper layer has also been tested:
this could be potentially dangerous because the driver performs a
4 byte subtraction on the frame length, but I finally have NOT
added anything to avoid this because it seems the NIC always
discards frames that short.
If someone still has some reason to worry about this, please
tell me; I will add an explicit SW check.
Signed-off-by: Andrea Merello <andrea.merello@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>