David S. Miller [Tue, 1 Sep 2015 22:06:24 +0000 (15:06 -0700)]
Merge branch 'flow-dissector-features'
Tom Herbert says:
====================
flow_dissector: Parameterize dissection and other features
This patch set adds some new capabilities to flow_dissector:
- Add flags to flow dissector functions to control dissection
- Flag to stop dissection when L3 header is seen (don't
dissect L4)
- Flag to stop dissection when encapsulation is detected
- Flag to parse first fragment of fragmented packet. This
may provide L4 ports
- Added new reporting in key_control
- Packet is a fragment
- Packet is a first fragment
- Packet has encapsulation
Also:
- Make __skb_set_sw_hash a general function
- Create functions to get a flow hash based on flowi4 or flowi6
structures without a reference to an skbuff
- Ignore flow dissector return value from ___skb_get_hash. Just
use whatever key fields are found to make a hash
Tested:
Ran 200 netperf TCP_RR instances for IPv6 and IPv4. Did not see any
regression. Ran UDP_RR with 10000 byte request and response size
for IPv4 and IPv6, no regression observed however I did see better
performance with IPv6 flow labels due to use of flow labels for L4
hash.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 1 Sep 2015 16:24:33 +0000 (09:24 -0700)]
flow_dissector: Ignore flow dissector return value from ___skb_get_hash
In ___skb_get_hash ignore return value from skb_flow_dissect_flow_keys.
A failure in that function likely means that there was a parse error,
so we may as well use whatever fields were found before the error was
hit. This is also good because it means we won't keep trying to derive
the hash on subsequent calls to skb_get_hash for the same packet.
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 1 Sep 2015 16:24:32 +0000 (09:24 -0700)]
flow_dissector: Add control/reporting of encapsulation
Add an input flag to flow dissector on whether dissection should stop
when encapsulation is detected (IP/IP or GRE). Also, add a key_control
flag that indicates encapsulation was encountered during the
dissection.
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 1 Sep 2015 16:24:31 +0000 (09:24 -0700)]
flow_dissector: Add flag to stop parsing when an IPv6 flow label is seen
Add an input flag to flow dissector on whether dissection should be
stopped when a flow label is encountered. Presumably, the flow label
is derived from a sufficient hash of an inner transport packet so
further dissection is not needed (that is ports are not included in
the flow hash). Using the flow label instead of ports has the additional
benefit that packet fragments should hash to same value as non-fragments
for a flow (assuming that the same flow label is used).
We set this flag by default for skb_get_hash.
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 1 Sep 2015 16:24:30 +0000 (09:24 -0700)]
flow_dissector: Add flag to stop parsing at L3
Add an input flag to flow dissector on whether dissection should be
stopped when an L3 packet is encountered. This would be useful if a
caller just wanted to get IP addresses of the outermost header (e.g.
to do an L3 hash).
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 1 Sep 2015 16:24:28 +0000 (09:24 -0700)]
flow_dissector: Add control/reporting of fragmentation
Add an input flag to flow dissector on whether dissection should be
attempted on a first fragment. Also add key_control flags to indicate
that a packet is a fragment or first fragment.
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
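The input flags and key_control reporting described in the flow_dissector
entries above can be sketched roughly as follows. This is a toy model, not
the kernel code: the TOY_* names, struct toy_pkt, struct toy_keys and
toy_dissect() are all hypothetical, and the packet is assumed to be
pre-parsed so that only the flag handling is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical input flags (illustrative names, not the kernel's). */
#define TOY_F_PARSE_1ST_FRAG 0x1u /* try to reach L4 in a first fragment */
#define TOY_F_STOP_AT_L3     0x2u /* stop once the L3 header is seen */
#define TOY_F_STOP_AT_ENCAP  0x4u /* stop when encapsulation is detected */

/* Hypothetical key_control-style report bits. */
#define TOY_IS_FRAGMENT   0x1u
#define TOY_FIRST_FRAG    0x2u
#define TOY_ENCAPSULATION 0x4u

/* Pre-parsed packet summary standing in for real header parsing. */
struct toy_pkt {
	uint32_t saddr, daddr;
	uint16_t frag_offset;  /* non-zero for later fragments */
	bool more_frags;       /* set on every fragment but the last */
	bool has_encap;        /* IP/IP or GRE follows the L3 header */
	uint16_t sport, dport; /* innermost transport ports */
};

struct toy_keys {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
	uint32_t control;      /* TOY_* report bits */
};

void toy_dissect(const struct toy_pkt *pkt, struct toy_keys *keys,
		 uint32_t flags)
{
	memset(keys, 0, sizeof(*keys));
	keys->saddr = pkt->saddr;
	keys->daddr = pkt->daddr;

	if (pkt->more_frags || pkt->frag_offset != 0) {
		keys->control |= TOY_IS_FRAGMENT;
		if (pkt->frag_offset != 0)
			return; /* later fragment: no L4 header present */
		keys->control |= TOY_FIRST_FRAG;
		if (!(flags & TOY_F_PARSE_1ST_FRAG))
			return; /* caller did not ask for first-fragment parsing */
	}

	if (flags & TOY_F_STOP_AT_L3)
		return; /* addresses only, e.g. for an L3 hash */

	if (pkt->has_encap) {
		keys->control |= TOY_ENCAPSULATION;
		if (flags & TOY_F_STOP_AT_ENCAP)
			return;
	}

	keys->sport = pkt->sport;
	keys->dport = pkt->dport;
}
```

In this sketch the report bits are filled in even when dissection stops
early, so a caller can still tell why the ports are missing.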
Tom Herbert [Tue, 1 Sep 2015 16:24:26 +0000 (09:24 -0700)]
flow_dissector: Jump to exit code in __skb_flow_dissect
Instead of returning immediately (on a parsing failure for instance) we
jump to cleanup code. This always sets protocol values in key_control
(even on a failure there is still valid information in the key_tags that
was set before the problem was hit).
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 1 Sep 2015 16:24:25 +0000 (09:24 -0700)]
flowi: Abstract out functions to get flow hash based on flowi
Create __get_hash_from_flowi6 and __get_hash_from_flowi4 to get the
flow keys and hash based on flowi structures. These are called by
__skb_get_hash_flowi6 and __skb_get_hash_flowi4. Also, created
get_hash_from_flowi6 and get_hash_from_flowi4 which can be called
when just the hash value for a flowi is needed.
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 1 Sep 2015 16:24:24 +0000 (09:24 -0700)]
skbuff: Make __skb_set_sw_hash a general function
Move __skb_set_sw_hash to skbuff.h and add __skb_set_hash which is
a common method (between __skb_set_sw_hash and skb_set_hash) to set
the hash in an skbuff.
Also, move skb_clear_hash to be closer to __skb_set_hash.
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 1 Sep 2015 16:24:23 +0000 (09:24 -0700)]
flow_dissector: Move skb related functions to skbuff.h
Move the flow dissector functions that are specific to skbuffs into
skbuff.h out of flow_dissector.h. This makes flow_dissector.h have
no dependencies on skbuff.h.
Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jean Delvare [Tue, 1 Sep 2015 16:07:41 +0000 (18:07 +0200)]
tg3: Fix temperature reporting
The temperature registers appear to report values in degrees Celsius
while the hwmon API mandates values to be exposed in millidegrees
Celsius. Do the conversion so that the values reported by "sensors"
are correct.
Fixes: aed93e0bf493 ("tg3: Add hwmon support for temperature") Signed-off-by: Jean Delvare <jdelvare@suse.de> Cc: Prashant Sreedharan <prashant@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Cc: stable@vger.kernel.org [v3.6+] Signed-off-by: David S. Miller <davem@davemloft.net>
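The unit conversion described in the tg3 entry above amounts to a scale by
1000. A minimal sketch, assuming the register yields whole degrees Celsius;
celsius_to_millicelsius() is a hypothetical helper name, not a tg3 function:

```c
#include <assert.h>

/*
 * Convert whole degrees Celsius (what the register is assumed to report)
 * to millidegrees Celsius (what the hwmon API expects).
 */
long celsius_to_millicelsius(unsigned int reg_celsius)
{
	return (long)reg_celsius * 1000;
}
```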
Mark Salter [Tue, 1 Sep 2015 13:36:05 +0000 (09:36 -0400)]
phylib: fix device deletion order in mdiobus_unregister()
commit 8b63ec1837fa ("phylib: Make PHYs children of their MDIO bus, not
the bus' parent.") uncovered a problem in mdiobus_unregister() which
leads to this warning when I reboot an APM Mustang (arm64) platform:
The problem is that mdiobus_unregister() deletes the bus device before
unregistering the phy devices on the bus. This wasn't a problem before
because the phys were not children of the bus:
when mdiobus_unregister deletes the bus device, the phy subdirs are
removed from sysfs also. So when the phys are unregistered afterward,
we get the warning. This patch changes the order so that phys are
unregistered before the bus device is deleted.
Fixes: 8b63ec1837fa ("phylib: Make PHYs children of their MDIO bus, not the bus' parent.") Signed-off-by: Mark Salter <msalter@redhat.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Mark Langsdorf <mlangsdo@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 1 Sep 2015 20:26:35 +0000 (14:26 -0600)]
net: Make table id type u32
A number of VRF patches used 'int' for table id. It should be u32 to be
consistent with the rest of the stack.
Fixes: 4e3c89920cd3a ("net: Introduce VRF related flags and helpers") 15be405eb2ea9 ("net: Add inet_addr lookup by table") 30bbaa1950055 ("net: Fix up inet_addr_type checks") 021dd3b8a142d ("net: Add routes to the table associated with the device") dc028da54ed35 ("inet: Move VRF table lookup to inlined function") f6d3c19274c74 ("net: FIB tracepoints")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6: send NEWLINK on RA managed/otherconf changes
The kernel is applying the RA managed/otherconf flags silently and
forgets to send ifinfo notify to inform about their change when the
router provides a zero reachable_time and retrans_timer as dnsmasq
and many routers send it, which just means unspecified by this router
and the host should continue using whatever value it is already using.
Userspace may monitor the ifinfo notifications to activate dhcpv6.
Signed-off-by: Marius Tomaschewski <mt@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Mon, 31 Aug 2015 13:56:54 +0000 (15:56 +0200)]
net: phy: fixed_phy: Set phy capabilities even when link down.
What features a phy supports is masked in genphy_config_init() by
looking at the PHYs BMSR register.
If the link is down, fixed_phy_update_regs() will only set the auto-
negotiation capable bit in BMSR. Thus genphy_config_init() comes to
the conclusion the PHY can only perform 10/Half, and masks out the
higher speed features. If however the link is up, BMSR is set to
indicate the speed the PHY is capable of auto-negotiating, and
genphy_config_init() does not mask out the high speed features.
To fix this, when the link is down, have fixed_phy_update_regs() leave
the link status, auto-negotiation complete, and link partner
capabilities unset, but set all the local capabilities depending on
the fixed phy speed.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
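The fix described in the fixed_phy entry above can be sketched as below. The
BMSR bit values are the standard MII status register bits; the function
toy_fixed_phy_bmsr() and its parameters are hypothetical, and 1000 Mb/s
(which is reported through the extended status register) is left out to keep
the sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Basic Mode Status Register bits (standard MII register 1). */
#define BMSR_LSTATUS      0x0004 /* link status */
#define BMSR_ANEGCAPABLE  0x0008 /* able to auto-negotiate */
#define BMSR_ANEGCOMPLETE 0x0020 /* auto-negotiation complete */
#define BMSR_10HALF       0x0800
#define BMSR_10FULL       0x1000
#define BMSR_100HALF      0x2000
#define BMSR_100FULL      0x4000

uint16_t toy_fixed_phy_bmsr(int speed, bool full_duplex, bool link_up)
{
	uint16_t bmsr = BMSR_ANEGCAPABLE;

	/* Local capabilities follow the fixed speed, link up or not. */
	if (speed == 100)
		bmsr |= full_duplex ? BMSR_100FULL : BMSR_100HALF;
	else if (speed == 10)
		bmsr |= full_duplex ? BMSR_10FULL : BMSR_10HALF;

	/* Link status and autoneg-complete are only set when the link is up. */
	if (link_up)
		bmsr |= BMSR_LSTATUS | BMSR_ANEGCOMPLETE;

	return bmsr;
}
```

With the capability bits present while the link is down, a reader of BMSR no
longer concludes that the PHY can only do 10/Half.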
Andrew Lunn [Mon, 31 Aug 2015 13:56:53 +0000 (15:56 +0200)]
phy: fixed_phy: Add gpio to determine link up/down.
An SFP module may have a link up/down status pin which can be
connected to a GPIO line of the host. Add support for reading such a
GPIO in the fixed_phy driver.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Mon, 31 Aug 2015 13:56:51 +0000 (15:56 +0200)]
dsa: mv88e6xxx: Set the RGMII delay based on phy interface
Some Marvell switches allow the RGMII Rx and Tx clock to be delayed
when the port is using RGMII. Have the adjust_link function look at
the phy interface type and enable this delay as requested.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Mon, 31 Aug 2015 13:56:50 +0000 (15:56 +0200)]
net: dsa: Allow DSA and CPU ports to have a phy-mode property
It can be useful for DSA and CPU ports to have a phy-mode property, in
particular to specify RGMII delays. Parse the property and set it in
the fixed-link phydev.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Mon, 31 Aug 2015 13:56:49 +0000 (15:56 +0200)]
net: dsa: Allow configuration of CPU & DSA port speeds/duplex
By default, DSA and CPU ports are configured to the maximum speed the
switch supports. However there can be use cases where the peer device's
port is slower. Allow a fixed-link property to be used with the DSA
and CPU port in the device tree, and use this information to configure
the port.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Mon, 31 Aug 2015 13:56:48 +0000 (15:56 +0200)]
phy: fixed_phy: Set supported speed in phydev
Set the supported field of the phydev to indicate the speed features
of the phy. If the phy is never attached to a netdev, but used in an
adjust_link() function, the speed will be incorrectly evaluated to
10/half rather than the correct speed/duplex.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrew Lunn [Mon, 31 Aug 2015 13:56:47 +0000 (15:56 +0200)]
dsa: mv88e6xxx: Allow speed/duplex of port to be configured
The current code sets user ports to perform auto negotiation using the
phy. CPU and DSA ports are configured to full duplex and maximum speed
the switch supports.
There are however use cases where the CPU has a slower port, and when
user ports have SFP modules with fixed speed. In these cases, port
settings need to be read from a fixed_phy device. The switch driver then
needs to implement the adjust_link op, so the port settings can be
set.
Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 31 Aug 2015 13:56:46 +0000 (15:56 +0200)]
net: phy: Allow PHY devices to identify themselves as Ethernet switches, etc.
Some Ethernet MAC drivers using the PHY library require the hardcoding
of link parameters when interfaced to a switch device, SFP module,
switch to switch port, etc. This has typically led to various ad-hoc
implementations looking like this:
- using a "fixed PHY" emulated device, which will provide link
indication towards the Ethernet MAC driver and hardware
- pretend there is no PHY and hardcode link parameters, ala mv643x_eth
Based on that, it is desirable to have the PHY drivers advertise the
correct link parameters, just like regular Ethernet PHYs towards their
CPU Ethernet MAC drivers, however, Ethernet MAC drivers should be able
to tell whether this link should be monitored or not. In the context
of an Ethernet switch, SFP module, switch to switch link, we do not
need to monitor this link since it should be always up.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Fix a memory leak in the mpls netns init function in case of failure. If
register_net_sysctl fails then we need to free the ctl_table.
Fixes: 7720c01f3f59 ("mpls: Add a sysctl to control the size of the mpls label table") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
To fix build errors:
kernel/built-in.o: In function `bpf_trace_printk':
bpf_trace.c:(.text+0x11a254): undefined reference to `strncpy_from_unsafe'
kernel/built-in.o: In function `fetch_memory_string':
trace_kprobe.c:(.text+0x11acf8): undefined reference to `strncpy_from_unsafe'
move strncpy_from_unsafe() next to probe_kernel_read/write()
which use the same memory access style.
Reported-by: Fengguang Wu <fengguang.wu@intel.com> Reported-by: Guenter Roeck <linux@roeck-us.net> Fixes: 1a6877b9c0c2 ("lib: introduce strncpy_from_unsafe()") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 31 Aug 2015 19:34:00 +0000 (12:34 -0700)]
Merge branch 'per-route-dctcp-receive-side'
Daniel Borkmann says:
====================
tcp: receive-side per route dctcp handling
Original cover letter:
Currently, the following case doesn't use DCTCP, even if it should:
- responder has f.e. cubic as system wide default
- 'ip route congctl dctcp $src' was set
Then, DCTCP is NOT used if a DCTCP sender attempts to connect from a
host in the $src range: ECT(0) is set, but listen_sk is not dctcp, so
we fail the INET_ECN_is_not_ect sanity check.
We also have to examine the dst used for the SYN/ACK reply to make
this case work.
In order to minimize additional cost, store the 'ecn is must have'
information in the dst_features field.
The set targets -next instead of -net since this doesn't seem to be a
serious bug and to give the change more soak time until it hits Linus'
tree.
v1 -> v2:
- Addressed Dave's feedback, not exposing any bits to user space
- Added patch 3 to reject incorrect configurations
- Rest as is, rebased and retested
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Mon, 31 Aug 2015 13:58:47 +0000 (15:58 +0200)]
tcp: use dctcp if enabled on the route to the initiator
Currently, the following case doesn't use DCTCP, even if it should:
A responder has f.e. Cubic as system wide default, but for a specific
route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The
initiating host then uses DCTCP as congestion control, but since the
initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok,
and we have to fall back to Reno after 3WHS completes.
We were thinking on how to solve this in a minimal, non-intrusive
way without bloating tcp_ecn_create_request() needlessly: let's cache
the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0)
is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES
contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows
to only do a single metric feature lookup inside tcp_ecn_create_request().
Joint work with Florian Westphal.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Mon, 31 Aug 2015 13:58:46 +0000 (15:58 +0200)]
fib, fib6: reject invalid feature bits
Feature bits that are invalid should not be accepted by the kernel,
only the lower 4 bits may be configured, but not the remaining ones.
Even from these 4, 2 of them are unused.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
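The validation described in the "reject invalid feature bits" entry above
boils down to a mask check. A minimal sketch, taking the "lower 4 bits"
statement from the commit text at face value; TOY_FEATURE_MASK and
toy_check_features() are hypothetical names, and returning -EINVAL is an
assumption about how the rejection is reported:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Per the commit text, only the lower four bits may be configured. */
#define TOY_FEATURE_MASK 0xFu

/* Return 0 if every set bit is configurable, -EINVAL otherwise. */
int toy_check_features(uint32_t val)
{
	if (val & ~TOY_FEATURE_MASK)
		return -EINVAL;
	return 0;
}
```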
Florian Westphal [Mon, 31 Aug 2015 13:58:44 +0000 (15:58 +0200)]
net: fib: move metrics parsing to a helper
fib_create_info() is already quite large, so before adding more
code to the metrics section move that to a helper, similar to
ip6_convert_metrics.
Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Philip Downey [Mon, 31 Aug 2015 10:30:38 +0000 (11:30 +0100)]
IGMP: Document igmp_link_local_mcast_reports
Document the addition of a new sysctl variable which controls the
generation of IGMP reports for link local multicast groups in the
224.0.0.X range.
IGMP reports for local multicast groups can now be optionally
inhibited by setting the value to zero e.g.:
echo 0 > /proc/sys/net/ipv4/igmp_link_local_mcast_reports
To retain backwards compatibility the previous behaviour is retained
by default on system boot or reverted by setting the value back to
non-zero.
Signed-off-by: Philip Downey <pdowney@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Pravin B Shelar [Mon, 31 Aug 2015 01:09:38 +0000 (18:09 -0700)]
ip-tunnel: Use API to access tunnel metadata options.
Currently the tun-info options pointer is used in a few cases to
pass options around. But tunnel options can be accessed using the
ip_tunnel_info_opts() API without using the pointer. The following
patch removes the redundant pointer and consistently makes use
of the API.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Acked-by: Thomas Graf <tgraf@suug.ch> Reviewed-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 31 Aug 2015 05:40:44 +0000 (22:40 -0700)]
ipv4: Fix 32-bit build.
net/ipv4/af_inet.c: In function 'snmp_get_cpu_field64':
>> net/ipv4/af_inet.c:1486:26: error: 'offt' undeclared (first use in this function)
v = *(((u64 *)bhptr) + offt);
^
net/ipv4/af_inet.c:1486:26: note: each undeclared identifier is reported only once for each function it appears in
net/ipv4/af_inet.c: In function 'snmp_fold_field64':
>> net/ipv4/af_inet.c:1499:39: error: 'offct' undeclared (first use in this function)
res += snmp_get_cpu_field(mib, cpu, offct, syncp_offset);
^
>> net/ipv4/af_inet.c:1499:10: error: too many arguments to function 'snmp_get_cpu_field'
res += snmp_get_cpu_field(mib, cpu, offct, syncp_offset);
^
net/ipv4/af_inet.c:1455:5: note: declared here
u64 snmp_get_cpu_field(void __percpu *mib, int cpu, int offt)
^
Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
poll() returns immediately after user space sets the kernel current frame
(ring->head) to SKIP, even though there is no new frame. Conversely, when
all frames are VALID and a user space program unintentionally sets (only)
the kernel current frame to UNUSED and then calls poll(), poll() does not
return immediately even though there are VALID frames.
To avoid situations like the above, I think we need to scan all frames
to find VALID frames at poll(), like netlink_alloc_skb() and
netlink_forward_ring() do when finding an UNUSED frame at skb allocation.
Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
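The two poll() cases described above can be illustrated with a toy ring.
This is a sketch under assumptions: the frame status names, the "head frame
is not UNUSED" readiness rule used for the old behaviour, and both functions
are hypothetical stand-ins rather than the netlink code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_status { TOY_UNUSED, TOY_RESERVED, TOY_VALID, TOY_SKIP };

/* Assumed old behaviour: readiness is judged from the head frame alone. */
bool toy_ready_head_only(const enum toy_status *frames, size_t n, size_t head)
{
	return n > 0 && frames[head % n] != TOY_UNUSED;
}

/* Proposed behaviour: scan the whole ring for a VALID frame. */
bool toy_ready_scan(const enum toy_status *frames, size_t n, size_t head)
{
	for (size_t i = 0; i < n; i++) {
		if (frames[(head + i) % n] == TOY_VALID)
			return true;
	}
	return false;
}
```

A SKIP frame at the head makes the head-only check report readiness with
nothing to read, while an UNUSED head hides VALID frames further along; the
scan gives the expected answer in both cases.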
David S. Miller [Mon, 31 Aug 2015 04:54:13 +0000 (21:54 -0700)]
Merge branch 'thunderx-features-fixes'
Aleksey Makarov says:
====================
net: thunderx: New features and fixes
v2:
- The unused affinity_mask field of the structure cmp_queue
has been deleted. (thanks to David Miller)
- The unneeded initializers have been dropped. (thanks to Alexey Klimov)
- The commit message "net: thunderx: Rework interrupt handling"
has been fixed. (thanks to Alexey Klimov)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Sun, 30 Aug 2015 09:29:16 +0000 (12:29 +0300)]
net: thunderx: Support for internal loopback mode
Support for setting VF's corresponding BGX LMAC in internal
loopback mode. This mode can be used for verifying basic HW
functionality such as packet I/O, RX checksum validation,
CQ/RBDR interrupts, stats etc. Useful when DUT has no external
network connectivity.
'loopback' mode can be enabled or disabled via ethtool.
Note: This feature is not supported when the number of VFs enabled is
more than the number of physical interfaces, i.e. active BGX LMACs.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com> Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Sun, 30 Aug 2015 09:29:15 +0000 (12:29 +0300)]
net: thunderx: Support for up to 96 queues for a VF
This patch adds support for handling multiple qsets assigned to a
single VF, thereby increasing the number of queues from the earlier 8
to the max number of CPUs in the system, i.e. 48 queues on a single
node and 96 on a dual node system. The user doesn't have the option to
choose which Qsets/VFs are merged. Upon request from a VF, the PF
assigns the next free Qsets as secondary qsets. To maintain current
behavior the number of queues is kept at 8 by default, which can be
increased via ethtool.
If user wants to unbind NICVF driver from a secondary Qset then it
should be done after tearing down primary VF's interface.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com> Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com> Signed-off-by: Robert Richter <rrichter@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Sun, 30 Aug 2015 09:29:14 +0000 (12:29 +0300)]
net: thunderx: Rework interrupt handling
Rework interrupt handler to avoid checking IRQ affinity of
CQ interrupts. Now separate handlers are registered for each IRQ
including RBDR. Register interrupt handlers for only those
which are being used. Add nicvf_dump_intr_status() and use it
in irq handlers.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com> Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Sun, 30 Aug 2015 09:29:13 +0000 (12:29 +0300)]
net: thunderx: Support for HW VLAN stripping
This patch configures HW to strip 802.1Q header if found in a
receiving packet. The stripped VLAN ID and TCI information is
passed on to software via CQE_RX. Also sets netdev's 'vlan_features'
so that other HW offload features can be used for tagged packets.
This offload feature can be enabled or disabled via ethtool.
The network stack normally ignores RPS for 802.1Q packets, hence the low
throughput. With this offload enabled, throughput for tagged packets
will be almost the same as for normal packets.
Note: This patch doesn't enable HW VLAN insertion for transmit packets.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com> Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Sun, 30 Aug 2015 09:29:12 +0000 (12:29 +0300)]
net: thunderx: Receive hashing HW offload support
Adding support for receive hashing HW offload by using RSS_ALG
and RSS_TAG fields of CQE_RX descriptor. Also removed dependency
on minimum receive queue count to configure RSS so that hash is
always generated.
This hash is used by RPS logic to distribute flows across multiple
CPUs. Offload can be disabled via ethtool.
Signed-off-by: Robert Richter <rrichter@cavium.com> Signed-off-by: Sunil Goutham <sgoutham@cavium.com> Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Sun, 30 Aug 2015 09:29:11 +0000 (12:29 +0300)]
net: thunderx: mailboxes: remove code duplication
Use the nicvf_send_msg_to_pf() function in the mailbox code.
Signed-off-by: Sunil Goutham <sgoutham@cavium.com> Signed-off-by: Robert Richter <rrichter@cavium.com> Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Goutham [Sun, 30 Aug 2015 09:29:10 +0000 (12:29 +0300)]
net: thunderx: Add receive error stats reporting via ethtool
Added ethtool support to dump receive packet error statistics reported
in CQE. Also made some small fixes
Signed-off-by: Sunil Goutham <sgoutham@cavium.com> Signed-off-by: Aleksey Makarov <aleksey.makarov@caviumnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 31 Aug 2015 04:48:59 +0000 (21:48 -0700)]
Merge branch 'snmp-stat-aggregation'
Raghavendra K T says:
====================
Optimize the snmp stat aggregation for large cpus
While creating 1000 containers, perf is showing a lot of time spent in
snmp_fold_field on a large cpu system.
The current patch tries to improve by reordering the statistics gathering.
Please note that similar overhead was also reported while creating
veth pairs https://lkml.org/lkml/2013/3/19/556
Changes in V4:
- remove 'item' variable and use IPSTATS_MIB_MAX to avoid sparse
warning (Eric) also remove 'item' parameter (Joe)
- add missing memset of padding.
Changes in V3:
- use memset to initialize temp buffer in leaf function. (David)
- use memcpy to copy the buffer data to stat instead of unalign_pu (Joe)
- Move buffer definition to leaf function __snmp6_fill_stats64() (Eric)
Changes in V2:
- Allocate the stat calculation buffer in stack. (Eric)
Setup:
160 cpu (20 core) baremetal powerpc system with 1TB memory
1000 docker containers were created with command
docker run -itd ubuntu:15.04 /bin/bash in loop
observation:
Docker container creation linearly increased from around 1.6 sec to 7.5 sec
(at 1000 containers) perf data showed, creating veth interfaces resulting in
the below code path was taking more time.
proposed idea:
currently __snmp6_fill_stats64 calls snmp_fold_field that walks
through per cpu data of an item (iteratively for around 36 items).
The patch tries to aggregate the statistics by going through
all the items of each cpu sequentially which is reducing cache
misses.
Performance of docker creation improved by more than 2x
after the patch.
changes/ideas suggested:
Using buffer in stack (Eric), Usage of memset (David), Using memcpy in
place of unaligned_put (Joe).
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
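The reordering described in the snmp-stat-aggregation cover letter can be
sketched as a change of loop order. The flat array layout, the sizes and
both function names below are assumptions for illustration; real per-cpu
data is not a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NR_CPUS  4
#define TOY_NR_ITEMS 36

/* stats is laid out as [cpu][item]: one contiguous block per cpu. */

/* Item-major: one pass over every cpu for a single item. */
uint64_t toy_fold_field(const uint64_t *stats, size_t item)
{
	uint64_t sum = 0;

	for (size_t cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		sum += stats[cpu * TOY_NR_ITEMS + item];
	return sum;
}

/* Cpu-major: walk each cpu's block once, accumulating all items. */
void toy_fill_stats(const uint64_t *stats, uint64_t out[TOY_NR_ITEMS])
{
	uint64_t buff[TOY_NR_ITEMS]; /* temporary buffer on the stack */

	memset(buff, 0, sizeof(buff));
	for (size_t cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		for (size_t item = 0; item < TOY_NR_ITEMS; item++)
			buff[item] += stats[cpu * TOY_NR_ITEMS + item];
	}
	memcpy(out, buff, sizeof(buff));
}
```

Both orders produce the same totals; the cpu-major walk touches each cpu's
data sequentially instead of revisiting every cpu once per item.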
Pravin B Shelar [Sun, 30 Aug 2015 00:44:06 +0000 (17:44 -0700)]
openvswitch: Remove egress_tun_info.
tun info is passed using the skb-dst pointer. Now that we have
converted all vports to a netdev based implementation, we can
remove the redundant pointer to tun-info from OVS_CB.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jesse Gross [Fri, 28 Aug 2015 23:54:40 +0000 (16:54 -0700)]
geneve: Use GRO cells infrastructure.
Geneve can benefit from GRO at the device level in a manner similar
to other tunnels, especially as hardware offloads are still emerging.
After this patch, aggregated frames are seen on the tunnel interface.
Single stream throughput nearly doubles in ideal circumstances (on
old hardware).
Signed-off-by: Jesse Gross <jesse@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Simon Horman [Sat, 29 Aug 2015 00:02:21 +0000 (09:02 +0900)]
openvswitch: retain parsed IPv6 header fields in flow on error skipping extension headers
When an error occurs skipping IPv6 extension headers retain the already
parsed IP protocol and IPv6 addresses in the flow. Also assume that the
packet is not a fragment in the absence of information to the contrary;
that is always use the frag_off value set by ipv6_skip_exthdr().
This allows matching on the IP protocol and IPv6 addresses of packets
with malformed extension headers.
Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
- Crash fix for hci_bcm driver
- Enhancements to hci_intel driver (e.g. baudrate configuration)
- Fix for SCO link type after multiple connect attempts
- Cleanups & minor fixes in a few other places
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Lindgren [Fri, 28 Aug 2015 18:50:15 +0000 (11:50 -0700)]
net/smsc911x: Fix deferred probe for interrupt
The interrupt handler may not be available when smsc911x probes if the
interrupt handler is a GPIO controller for example. Let's fix that
by adding handling for -EPROBE_DEFER.
Cc: Steve Glendinning <steve.glendinning@shawell.net> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: David S. Miller <davem@davemloft.net>
With tunneling, it is currently possible to get an IPv6 header and interpret
it as an IPv4 header, or to interpret an IPv6 address as an IPv4 address
(and vice versa). This leads to things like sending packets to incorrect
address, IPv6 flow label being interpreted as IP packet length, etc.
Fix several places where this can happen.
Most of this is net-next only. The third patch affects net, too, but it
doesn't seem there's anything in user space that sets the attribute at all
currently, thus net-next is fine.
Changelog:
v2: fixed geneve after incorrect rebase on top of Pravin's patches
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Fri, 28 Aug 2015 18:48:22 +0000 (20:48 +0200)]
vxlan: do not receive IPv4 packets on IPv6 socket
By default (subject to the sysctl settings), IPv6 sockets listen also for
IPv4 traffic. Vxlan is not prepared for that and expects IPv6 header in
packets received through an IPv6 socket.
In addition, it's currently not possible to have both IPv4 and IPv6 vxlan
tunnel on the same port (unless bindv6only sysctl is enabled), as it's not
possible to create and bind both IPv4 and IPv6 vxlan interfaces and there's
no way to specify both IPv4 and IPv6 remote/group IP addresses.
Set IPV6_V6ONLY on vxlan sockets to fix both of these issues. This is not
done globally in udp_tunnel, as l2tp and tipc seem to work okay when
receiving IPv4 packets on IPv6 socket and people may rely on this behavior.
The other tunnels (geneve and fou) do not support IPv6.
Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
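For readers unfamiliar with IPV6_V6ONLY, the user space equivalent of the
option named in the vxlan entry above looks like this. It is only an
analogue using the standard sockets API; toy_open_v6only_udp() is a
hypothetical helper and says nothing about how the vxlan code sets the
option inside the kernel.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Open a UDP socket that accepts IPv6 traffic only.
 * Returns the socket fd, or -1 on failure.
 */
int toy_open_v6only_udp(void)
{
	int one = 1;
	int fd = socket(AF_INET6, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	/* Without this, an IPv6 socket may also receive IPv4 traffic. */
	if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &one, sizeof(one)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

With the option set, a separate IPv4 socket can be bound to the same port.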
Jiri Benc [Fri, 28 Aug 2015 18:48:21 +0000 (20:48 +0200)]
fou: reject IPv6 config
fou does not really support IPv6 encapsulation. After an UDP socket is
created in fou_create, the encap_rcv callback is set either to fou_udp_recv
or to gue_udp_recv. Both of those unconditionally assume that the received
packet has an IPv4 header and access the data at network_header as if it were an
IPv4 header. This leads to IPv6 flow label being interpreted as IP packet
length, etc.
Disallow fou tunnel to be configured as IPv6 until real IPv6 support is
added to fou.
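The header confusion can be reproduced in a few lines. The sketch below builds the first 32-bit word of an IPv6 header and then reads bytes 2-3 the way an IPv4 parser reads Total Length. Both function names are made up for this illustration; only the header layouts come from the IP specifications.

```c
#include <stdint.h>

/* Build the first 32-bit word of an IPv6 header in network byte order:
 * 4-bit version, 8-bit traffic class, 20-bit flow label. */
void ipv6_first_word(uint8_t out[4], uint8_t tclass, uint32_t flow_label)
{
    uint32_t w = (6u << 28) | ((uint32_t)tclass << 20) |
                 (flow_label & 0xFFFFFu);

    out[0] = (uint8_t)(w >> 24);
    out[1] = (uint8_t)(w >> 16);
    out[2] = (uint8_t)(w >> 8);
    out[3] = (uint8_t)w;
}

/* Read bytes 2-3 the way an IPv4 parser reads the Total Length field. */
uint16_t ipv4_total_length(const uint8_t hdr[4])
{
    return (uint16_t)((hdr[2] << 8) | hdr[3]);
}
```

For a flow label of 0x12345, the IPv4 view of the same bytes yields a total length of 0x2345, i.e. the low 16 bits of the flow label.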
CC: Tom Herbert <tom@herbertland.com> Signed-off-by: Jiri Benc <jbenc@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Fri, 28 Aug 2015 18:48:20 +0000 (20:48 +0200)]
ip_tunnels: record IP version in tunnel info
There's currently nothing preventing directing packets with IPv6
encapsulation data to IPv4 tunnels (and vice versa). If this happens,
IPv6 addresses are incorrectly interpreted as IPv4 ones.
Track whether the given ip_tunnel_key contains IPv4 or IPv6 data. Store this
in ip_tunnel_info. Reject packets at appropriate places if they are supposed
to be encapsulated into an incompatible protocol.
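A rough sketch of the idea follows. It assumes a simplified stand-in struct (tun_info is not the kernel's ip_tunnel_info) and an error code chosen only for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for ip_tunnel_info. */
struct tun_info {
    bool is_ipv6;     /* which family the key's addresses belong to */
    uint8_t src[16];  /* an IPv4 key would use only the first 4 bytes */
    uint8_t dst[16];
};

/* Reject metadata whose family does not match the tunnel that is about
 * to encapsulate the packet. */
int tun_check_family(const struct tun_info *info, bool tunnel_is_ipv6)
{
    if (info->is_ipv6 != tunnel_is_ipv6)
        return -EINVAL;  /* error code assumed for this sketch */
    return 0;
}
```

Because the family travels with the key, the check needs no guessing from the address bytes themselves.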
Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Benc [Fri, 28 Aug 2015 18:48:19 +0000 (20:48 +0200)]
ip_tunnels: convert the mode field of ip_tunnel_info to flags
The mode field holds a single bit of information only (whether the
ip_tunnel_info struct is for rx or tx). Change the mode field to bit flags.
This allows more mode flags to be added.
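As an illustration of the conversion, the sketch below stores the direction as one bit in a flags byte. The flag and struct names are illustrative, not taken from the patch; the second flag only shows that further bits can be added without touching the first.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag names; each occupies one bit so they can be combined. */
#define TUN_INFO_TX    0x01u  /* info describes the transmit direction */
#define TUN_INFO_IPV6  0x02u  /* example of a flag that could be added later */

struct tun_info_flags {
    uint8_t mode;             /* bitwise OR of TUN_INFO_* values */
};

/* Test the direction bit without disturbing any other flag. */
bool tun_info_is_tx(const struct tun_info_flags *info)
{
    return (info->mode & TUN_INFO_TX) != 0;
}

/* Set one or more flags, leaving the others as they were. */
void tun_info_set(struct tun_info_flags *info, uint8_t flag)
{
    info->mode |= flag;
}
```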
Signed-off-by: Jiri Benc <jbenc@redhat.com> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vlad Yasevich [Sat, 29 Aug 2015 01:23:39 +0000 (21:23 -0400)]
sctp: Do not try to search for the transport twice
When removing a non-primary transport during ASCONF
processing, we end up traversing the transport list
twice: once in sctp_cmd_del_non_primary, and once in
sctp_assoc_del_peer. We can avoid the second
search and call sctp_assoc_rm_peer() instead.
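The difference between the two removal paths can be sketched with a plain doubly linked list. The types and function names below are hypothetical and do not use the kernel's list_head API.

```c
#include <stddef.h>

/* Hypothetical, simplified transport list. */
struct transport {
    int addr;                       /* stand-in for a peer address */
    struct transport *prev, *next;
};

/* Search by address: walks the list once. */
struct transport *find_transport(struct transport *head, int addr)
{
    for (struct transport *t = head; t; t = t->next)
        if (t->addr == addr)
            return t;
    return NULL;
}

/* Remove a transport the caller has already found, so no second walk
 * is needed. Returns the (possibly new) list head. */
struct transport *rm_transport(struct transport *head, struct transport *t)
{
    if (t->prev)
        t->prev->next = t->next;
    else
        head = t->next;
    if (t->next)
        t->next->prev = t->prev;
    t->prev = t->next = NULL;
    return head;
}
```

A caller that already holds the pointer from an earlier search passes it straight to the removal function instead of looking it up by address again.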
Found by code inspection during code reviews.
Signed-off-by: Vladislav Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The problem is rcu_read_unlock_bh() which triggers a warning when irqs are
disabled. ndo_poll_controller should run with irqs disabled always so we
can drop the rcu_read_lock_bh.
Fixes: 616f45416ca0 ("bonding: implement bond_poll_controller()") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sergei Shtylyov [Fri, 28 Aug 2015 13:55:10 +0000 (16:55 +0300)]
ravb: propagate platform_get_irq() error upstream
The driver overrides the error returned by platform_get_irq() with -ENODEV
which e.g. precludes the deferred probing from working. Propagate the real
error code to the driver core instead.
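The two behaviours can be compared in a small userspace sketch. The parameter stands in for the return value of platform_get_irq(); both function names are hypothetical. EPROBE_DEFER is kernel-internal, so it is defined here only to let the sketch compile.

```c
#include <errno.h>

/* EPROBE_DEFER is kernel-internal and absent from userspace <errno.h>;
 * it is defined here only so the sketch compiles. */
#ifndef EPROBE_DEFER
#define EPROBE_DEFER 517
#endif

/* Old behaviour: every failure collapses into -ENODEV. */
int probe_override(int irq_or_err)
{
    if (irq_or_err < 0)
        return -ENODEV;
    return 0;
}

/* New behaviour: hand the original error back to the caller. */
int probe_propagate(int irq_or_err)
{
    if (irq_or_err < 0)
        return irq_or_err;
    return 0;
}
```

Only the second variant lets the caller see -EPROBE_DEFER, which it needs in order to retry the probe later.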
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Acked-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: David S. Miller <davem@davemloft.net>
lucien [Fri, 28 Aug 2015 09:45:58 +0000 (17:45 +0800)]
sctp: ASCONF-ACK with Unresolvable Address should be sent
RFC 5061:
This is an opaque integer assigned by the sender to identify each
request parameter. The receiver of the ASCONF Chunk will copy this
32-bit value into the ASCONF Response Correlation ID field of the
ASCONF-ACK response parameter. The sender of the ASCONF can use this
same value in the ASCONF-ACK to find which request the response is
for. Note that the receiver MUST NOT change this 32-bit value.
Address Parameter: TLV
This field contains an IPv4 or IPv6 address parameter, as described
in Section 3.3.2.1 of [RFC4960].
ASCONF-ACK chunk with Error Cause Indication Parameter (Unresolvable Address)
should be sent if the Delete IP Address is not part of the association.
Endpoint A                           Endpoint B
(ESTABLISHED)                        (ESTABLISHED)
ASCONF      ----------------->
(Delete IP Address)
            <-----------------       ASCONF-ACK
                                     (Unresolvable Address)
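The exchange above can be sketched as follows. The structs are simplified stand-ins and not SCTP wire formats, and the function name is hypothetical. The parameter types are the values assigned by RFC 5061, and the cause code is the one from RFC 4960.

```c
#include <stddef.h>
#include <stdint.h>

/* Parameter types from RFC 5061 and the cause code from RFC 4960. */
#define PARAM_ERROR_CAUSE_IND    0xC003u
#define PARAM_SUCCESS_IND        0xC005u
#define CAUSE_UNRESOLVABLE_ADDR  5u

/* Simplified stand-ins, not wire formats. */
struct asconf_req  { uint32_t correlation_id; uint32_t addr; };
struct asconf_resp { uint16_t type; uint32_t correlation_id; uint16_t cause; };

/* Answer a Delete IP Address request. The correlation ID is copied
 * unchanged; an address that is not part of the association yields an
 * Unresolvable Address error. A real implementation would also remove
 * the address on success. */
struct asconf_resp handle_delete_ip(const uint32_t *assoc_addrs, size_t n,
                                    const struct asconf_req *req)
{
    struct asconf_resp resp = { PARAM_SUCCESS_IND, req->correlation_id, 0 };

    for (size_t i = 0; i < n; i++)
        if (assoc_addrs[i] == req->addr)
            return resp;

    resp.type = PARAM_ERROR_CAUSE_IND;
    resp.cause = CAUSE_UNRESOLVABLE_ADDR;
    return resp;
}
```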
Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
__netlink_lookup_frame() was always called with the same "pos"
value in netlink_forward_ring(). It looked at the same ring entry
header over and over again, every time through the loop. It then cycled
through the whole ring, advancing ring->head rather than "pos", until the
"ring->head != head" loop test failed.
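A simplified model of the corrected loop is sketched below. The ring type, status names and function names are hypothetical and much reduced compared with the kernel's mmap ring. The point is that the frame inspected on each iteration is the one at the current head, not at a position captured before the loop.

```c
#include <stddef.h>

enum frame_status { FRAME_UNUSED, FRAME_SKIP, FRAME_VALID };

/* Hypothetical, much simplified ring. */
struct ring {
    enum frame_status *frames;
    unsigned int nframes;
    unsigned int head;
};

static void ring_increment_head(struct ring *r)
{
    r->head = (r->head + 1) % r->nframes;
}

/* Advance head past frames marked SKIP. The lookup uses r->head, which
 * moves on every iteration; looking up a fixed position instead would
 * re-examine the same frame until head wrapped around. */
void ring_forward(struct ring *r)
{
    unsigned int start = r->head;

    do {
        if (r->frames[r->head] != FRAME_SKIP)
            break;
        ring_increment_head(r);
    } while (r->head != start);
}
```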
Signed-off-by: Ken-ichirou MATSUZAWA <chamas@h4.dion.ne.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit c05cdb1b864f ("netlink: allow large data transfers from
user-space"), the kernel may fail to allocate the necessary room for the
acknowledgment message back to userspace. This patch introduces a new
socket option that trims off the payload of the original netlink message.
The netlink message header is still included, so the user can tell from
the sequence number which message triggered the acknowledgment.
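The saving can be estimated with the standard netlink macros. This sketch assumes an acknowledgment that would otherwise echo the whole request; the function name ack_len and its cap_ack parameter are made up for the illustration.

```c
#include <stddef.h>
#include <linux/netlink.h>

/* Length of the acknowledgment for a request of req_len bytes (header
 * plus payload). struct nlmsgerr already holds the error code and a copy
 * of the request header; capping drops only the request payload. */
size_t ack_len(size_t req_len, int cap_ack)
{
    size_t payload = sizeof(struct nlmsgerr);

    if (!cap_ack && req_len > sizeof(struct nlmsghdr))
        payload += req_len - sizeof(struct nlmsghdr);

    return NLMSG_LENGTH(payload);
}
```

With capping enabled the acknowledgment has a fixed size regardless of how large the original request was, so the allocation no longer scales with user input.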
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Christophe Ricard <christophe-h.ricard@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Stringer [Sat, 29 Aug 2015 02:22:11 +0000 (19:22 -0700)]
openvswitch: Fix conntrack compilation without mark.
Fix build with !CONFIG_NF_CONNTRACK_MARK && CONFIG_OPENVSWITCH_CONNTRACK
Fixes: 182e304 ("openvswitch: Allow matching on conntrack mark") Reported-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Joe Stringer <joestringer@nicira.com> Tested-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
Netfilter updates for net-next
The following patchset contains Netfilter/IPVS updates for your net-next tree.
In sum, patches to address fallout from the previous round plus updates from
the IPVS folks via Simon Horman, they are:
1) Add a new scheduler to IPVS: The weighted overflow scheduling algorithm
directs network connections to the server with the highest weight that is
currently available and overflows to the next when active connections exceed
the node's weight. From Raducu Deaconu.
2) Fix lock ordering in IPVS: always take rtnl_lock first. Patch
from Julian Anastasov.
3) Allow the MTU to be specified for the IPVS in-kernel state sync daemon.
From Julian Anastasov.
4) Enhance multicast configuration for the IPVS state sync daemon. Also from
Julian.
5) Resolve sparse warnings in the nf_dup modules.
6) Fix a linking problem when CONFIG_NF_DUP_IPV6 is not set.
7) Add ICMP codes 5 and 6 to IPv6 REJECT target, they are more informative
subsets of code 1. From Andreas Herz.
8) Revert the jumpstack size calculation from mark_source_chains due to chain
depth miscalculations, from Florian Westphal.
9) Calm down more sparse warnings around the Netfilter tree, again from Florian
Westphal.
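The weighted overflow selection described in item 1 can be sketched as below. This is a simplified reading of the description, with hypothetical types and names, and it is not the IPVS scheduler code: a server stays eligible until its active connections exceed its weight, and the highest-weight eligible server wins.

```c
#include <stddef.h>

struct server {
    int weight;
    int active_conns;
};

/* Return the index of the chosen server, or -1 if none is available. */
int ovf_pick(const struct server *s, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        /* Skip disabled servers and those already over their weight. */
        if (s[i].weight <= 0 || s[i].active_conns > s[i].weight)
            continue;
        if (best < 0 || s[i].weight > s[best].weight)
            best = (int)i;
    }
    return best;
}
```

Traffic therefore fills the heaviest server first and spills over to the next one only once the first is over its limit.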
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
bpf: add support for %s specifier to bpf_trace_printk()
%s specifier makes bpf program and kernel debugging easier.
To make sure that trace_printk won't crash the unsafe string
is copied into stack and unsafe pointer is substituted.
The following C program:
#include <linux/fs.h>
int foo(struct pt_regs *ctx, struct filename *filename)
{
void *name = 0;
Suggested-by: Brendan Gregg <brendan.d.gregg@gmail.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 28 Aug 2015 21:15:25 +0000 (14:15 -0700)]
Merge branch 'phylib-simplifications'
Sergei Shtylyov says:
====================
Some phylib simplifications
Here are 2 patches against DaveM's 'net-next.git' repo. We simplify a bogus
string of type casts in the 1st patch and make the code respect some coding
standards of the networking code in the 2nd one. I may follow up with fixes
for checkpatch.pl's complaints, if I have time.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Fri, 28 Aug 2015 16:46:39 +0000 (18:46 +0200)]
net: sched: don't break line in tc_classify loop notification
Just some minor noise follow-up to address some stylistic issues of
commit 3b3ae880266d ("net: sched: consolidate tc_classify{,_compat}").
Accidentally v1 instead of v2 of that commit got applied, so this
patch adds the relative diff.
Suggested-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Shradha Shah [Fri, 28 Aug 2015 09:55:42 +0000 (10:55 +0100)]
sfc: Allow driver to cope with a lower number of VIs than it needs for RSS
Previously, the driver would refuse to load if it couldn't secure
enough VIs from the MC to fulfill its RSS requirements.
This was causing probe to fail on later functions in
configurations where we'd run out of VIs, such as having many
VFs.
This change allows the driver to load with fewer VIs, down to a
minimum of 2. A warning will be printed saying that RSS
requirements were not met, possibly affecting performance.
efx->max_tx_channels needs to be set to avoid going down the
failure path in efx_probe_nic() immediately in the loop after the
probe() NIC-type function.
Also, set rc=ENOSPC when bombing out of efx_probe_nic due to lack
of VIs.
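The decision described above can be sketched as a single function. The function and macro names are hypothetical; only the minimum of 2, the warning and the ENOSPC error come from the description, and the negative errno follows the usual kernel convention.

```c
#include <errno.h>
#include <stdio.h>

#define MIN_VIS 2  /* minimum stated in the commit message */

/* Returns the number of VIs to use, or -ENOSPC if even the minimum
 * cannot be met. */
int choose_vi_count(int wanted_for_rss, int granted)
{
    if (granted < MIN_VIS)
        return -ENOSPC;

    if (granted < wanted_for_rss) {
        /* Load anyway, but tell the user RSS is short of VIs. */
        fprintf(stderr,
                "RSS wanted %d VIs, only %d available; "
                "performance may be affected\n",
                wanted_for_rss, granted);
        return granted;
    }
    return wanted_for_rss;
}
```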
Signed-off-by: Shradha Shah <sshah@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 28 Aug 2015 20:43:33 +0000 (13:43 -0700)]
Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge
Antonio Quartulli says:
====================
Included changes:
- code beautification
- remove obsolete 'deleted' attribute for bat-gw node
- increase internal version number
- prevent potential access to netdev object after deregistration
- set needed_head/tail_room for batman virtual interface
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 28 Aug 2015 20:32:37 +0000 (13:32 -0700)]
Merge branch 'vrf-inetpeer'
David Ahern says:
====================
net: Refactor inetpeer cache and add support for VRFs
Per Dave's comment on the version 1 patch, add VRF support to the inetpeer
cache by explicitly making the address + index a key. The inetpeer code is
refactored in the process; this mostly impacts its use by tcp_metrics.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Thu, 27 Aug 2015 23:07:03 +0000 (16:07 -0700)]
net: Add support for VRFs to inetpeer cache
inetpeer caches based on address only, so duplicate IP addresses within
a namespace return the same cached entry. Enhance the ipv4 address key
to contain both the IPv4 address and VRF device index.
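A minimal sketch of such a key follows, with hypothetical names. Two entries match only when both the address and the VRF device index agree.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key: IPv4 address plus VRF device index. */
struct peer_key_v4 {
    uint32_t addr;
    int vif;
};

/* The same address in a different VRF is a different peer. */
bool peer_key_v4_equal(const struct peer_key_v4 *a,
                       const struct peer_key_v4 *b)
{
    return a->addr == b->addr && a->vif == b->vif;
}
```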
Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Thu, 27 Aug 2015 23:07:02 +0000 (16:07 -0700)]
net: Refactor inetpeer address struct
Move the inetpeer_addr_base union to inetpeer_addr and drop
inetpeer_addr_base.
The a6 and in6_addr overlays are not both needed; drop the __be32 version
and rename in6 to a6 for consistency with ipv4. Add a new u32 array to
the union, which removes the need for the typecast in the compare function
and allows a consistent arg for both ipv4 and ipv6 addresses, making the
compare function more readable.
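The layout can be sketched in userspace as below. The type and function names are hypothetical and the fields are simplified; the u32 array overlays both address forms, so the compare loop needs no casts and differs between families only in how many words it visits.

```c
#include <stdint.h>
#include <sys/socket.h>  /* AF_INET, AF_INET6 */

#define PEER_MAXKEYSZ 4  /* a 128-bit IPv6 address as four 32-bit words */

/* Hypothetical, simplified address type. */
struct peer_addr {
    union {
        uint32_t a4;                  /* IPv4 address */
        uint8_t  a6[16];              /* IPv6 address */
        uint32_t key[PEER_MAXKEYSZ];  /* word view used for comparison */
    };
    uint16_t family;
};

/* Compare word by word; returns -1, 0 or 1. */
int peer_addr_cmp(const struct peer_addr *a, const struct peer_addr *b)
{
    int n = (a->family == AF_INET) ? 1 : PEER_MAXKEYSZ;

    for (int i = 0; i < n; i++) {
        if (a->key[i] == b->key[i])
            continue;
        return a->key[i] < b->key[i] ? -1 : 1;
    }
    return 0;
}
```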
Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Local packets going through the VRF device are missing an ethernet header.
Fix by adding one and then stripping it off before pushing back to the IP
stack. With this patch you get the expected dumps:
...
05:36:15.713944 IP 10.2.1.254 > 10.2.1.2: ICMP echo request, id 23795, seq 1, length 64
05:36:15.714160 IP 10.2.1.2 > 10.2.1.254: ICMP echo reply, id 23795, seq 1, length 64
...
Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>