Paul Durrant [Wed, 15 Jan 2014 17:30:33 +0000 (17:30 +0000)]
xen-netfront: add support for IPv6 offloads
This patch adds support for IPv6 checksum offload and GSO when those
features are available in the backend.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Wed, 15 Jan 2014 16:58:06 +0000 (08:58 -0800)]
net: Check skb->rxhash in gro_receive
When initializing a gro_list for a packet, first check the rxhash of
the incoming skb against that of the skb's in the list. This should be
a very strong inidicator of whether the flow is going to be matched,
and potentially allows a lot of other checks to be short circuited.
Use skb_hash_raw so that we don't force the hash to be calculated.
Tested by running netperf 200 TCP_STREAMs between two machines with
GRO, HW rxhash, and 1G. Saw no performance degration, slight reduction
of time in dev_gro_receive.
Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 15 Jan 2014 15:25:36 +0000 (16:25 +0100)]
packet: use percpu mmap tx frame pending refcount
In PF_PACKET's packet mmap(), we can avoid using one atomic_inc()
and one atomic_dec() call in skb destructor and use a percpu
reference count instead in order to determine if packets are
still pending to be sent out. Micro-benchmark with [1] that has
been slightly modified (that is, protcol = 0 in socket(2) and
bind(2)), example on a rather crappy testing machine; I expect
it to scale and have even better results on bigger machines:
./packet_mm_tx -s7000 -m7200 -z700000 em1, avg over 2500 runs:
With patch: 4,022,015 cyc
Without patch: 4,812,994 cyc
time ./packet_mm_tx -s64 -c10000000 em1 > /dev/null, stable:
With patch:
real 1m32.241s
user 0m0.287s
sys 1m29.316s
Without patch:
real 1m38.386s
user 0m0.265s
sys 1m35.572s
In function tpacket_snd(), it is okay to use packet_read_pending()
since in fast-path we short-circuit the condition already with
ph != NULL, since we have next frames to process. In case we have
MSG_DONTWAIT, we also do not execute this path as need_wait is
false here anyway, and in case of _no_ MSG_DONTWAIT flag, it is
okay to call a packet_read_pending(), because when we ever reach
that path, we're done processing outgoing frames anyway and only
look if there are skbs still outstanding to be orphaned. We can
stay lockless in this percpu counter since it's acceptable when we
reach this path for the sum to be imprecise first, but we'll level
out at 0 after all pending frames have reached the skb destructor
eventually through tx reclaim. When people pin a tx process to
particular CPUs, we expect overflows to happen in the reference
counter as on one CPU we expect heavy increase; and distributed
through ksoftirqd on all CPUs a decrease, for example. As
David Laight points out, since the C language doesn't define the
result of signed int overflow (i.e. rather than wrap, it is
allowed to saturate as a possible outcome), we have to use
unsigned int as reference count. The sum over all CPUs when tx
is complete will result in 0 again.
The BUG_ON() in tpacket_destruct_skb() we can remove as well. It
can _only_ be set from inside tpacket_snd() path and we made sure
to increase tx_ring.pending in any case before we called po->xmit(skb).
So testing for tx_ring.pending == 0 is not too useful. Instead, it
would rather have been useful to test if lower layers didn't orphan
the skb so that we're missing ring slots being put back to
TP_STATUS_AVAILABLE. But such a bug will be caught in user space
already as we end up realizing that we do not have any
TP_STATUS_AVAILABLE slots left anymore. Therefore, we're all set.
Btw, in case of RX_RING path, we do not make use of the pending
member, therefore we also don't need to use up any percpu memory
here. Also note that __alloc_percpu() already returns a zero-filled
percpu area, so initialization is done already.
Daniel Borkmann [Wed, 15 Jan 2014 15:25:35 +0000 (16:25 +0100)]
packet: don't unconditionally schedule() in case of MSG_DONTWAIT
In tpacket_snd(), when we've discovered a first frame that is
not in status TP_STATUS_SEND_REQUEST, and return a NULL buffer,
we exit the send routine in case of MSG_DONTWAIT, since we've
finished traversing the mmaped send ring buffer and don't care
about pending frames.
While doing so, we still unconditionally call an expensive
schedule() in the packet_current_frame() "error" path, which
is unnecessary in this case since it's enough to just quit
the function.
Also, in case MSG_DONTWAIT is not set, we should rather test
for need_resched() first and do schedule() only if necessary
since meanwhile pending frames could already have finished
processing and called skb destructor.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Wed, 15 Jan 2014 15:25:34 +0000 (16:25 +0100)]
packet: improve socket create/bind latency in some cases
Most people acquire PF_PACKET sockets with a protocol argument in
the socket call, e.g. libpcap does so with htons(ETH_P_ALL) for
all its sockets. Most likely, at some point in time a subsequent
bind() call will follow, e.g. in libpcap with ...
... as arguments. What happens in the kernel is that already
in socket() syscall, we install a proto hook via register_prot_hook()
if our protocol argument is != 0. Yet, in bind() we're almost
doing the same work by doing a unregister_prot_hook() with an
expensive synchronize_net() call in case during socket() the proto
was != 0, plus follow-up register_prot_hook() with a bound device
to it this time, in order to limit traffic we get.
In the case when the protocol and user supplied device index (== 0)
does not change from socket() to bind(), we can spare us doing
the same work twice. Similarly for re-binding to the same device
and protocol. For these scenarios, we can decrease create/bind
latency from ~7447us (sock-bind-2 case) to ~89us (sock-bind-1 case)
with this patch.
Alternatively, for the first case, if people care, they should
simply create their sockets with proto == 0 argument and define
the protocol during bind() as this saves a call to synchronize_net()
as well (sock-bind-3 case).
In all other cases, we're tied to user space behaviour we must not
change, also since a bind() is not strictly required. Thus, we need
the synchronize_net() to make sure no asynchronous packet processing
paths still refer to the previous elements of po->prot_hook.
In case of mmap()ed sockets, the workflow that includes bind() is
socket() -> setsockopt(<ring>) -> bind(). In that case, a pair of
{__unregister, register}_prot_hook is being called from setsockopt()
in order to install the new protocol receive handler. Thus, when
we call bind and can skip a re-hook, we have already previously
installed the new handler. For fanout, this is handled different
entirely, so we should be good.
Timings on an i7-3520M machine:
* sock-bind-1: 89 us
* sock-bind-2: 7447 us
* sock-bind-3: 75 us
David S. Miller [Fri, 17 Jan 2014 00:12:45 +0000 (16:12 -0800)]
i40e: Remove autogenerated Module.symvers file.
Fixes: 9d8bf54 ("i40e: associate VMDq queue with VM type") Reported-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Gortmaker [Wed, 15 Jan 2014 16:19:55 +0000 (11:19 -0500)]
net/ipv4: don't use module_init in non-modular gre_offload
Recent commit 438e38fadca2f6e57eeecc08326c8a95758594d4
("gre_offload: statically build GRE offloading support") added
new module_init/module_exit calls to the gre_offload.c file.
The file is obj-y and can't be anything other than built-in.
Currently it can never be built modular, so using module_init
as an alias for __initcall can be somewhat misleading.
Fix this up now, so that we can relocate module_init from
init.h into module.h in the future. If we don't do this, we'd
have to add module.h to obviously non-modular code, and that
would be a worse thing. We also make the inclusion explicit.
Note that direct use of __initcall is discouraged, vs. one
of the priority categorized subgroups. As __initcall gets
mapped onto device_initcall, our use of device_initcall
directly in this change means that the runtime impact is
zero -- it will remain at level 6 in initcall ordering.
As for the module_exit, rather than replace it with __exitcall,
we simply remove it, since it appears only UML does anything
with those, and even for UML, there is no relevant cleanup
to be done here.
Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Bolle [Tue, 14 Jan 2014 19:46:52 +0000 (20:46 +0100)]
net/mlx4_core: clean up srq_res_start_move_to()
Building resource_tracker.o triggers a GCC warning:
drivers/net/ethernet/mellanox/mlx4/resource_tracker.c: In function 'mlx4_HW2SW_SRQ_wrapper':
drivers/net/ethernet/mellanox/mlx4/resource_tracker.c:3202:17: warning: 'srq' may be used uninitialized in this function [-Wmaybe-uninitialized]
atomic_dec(&srq->mtt->ref_count);
^
This is a false positive. But a cleanup of srq_res_start_move_to() can
help GCC here. The code currently uses a switch statement where a plain
if/else would do, since only two of the switch's four cases can ever
occur. Dropping that switch makes the warning go away.
While we're at it, add some missing braces, and convert state to the
correct type.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Bolle [Tue, 14 Jan 2014 19:45:36 +0000 (20:45 +0100)]
net/mlx4_core: clean up cq_res_start_move_to()
Building resource_tracker.o triggers a GCC warning:
drivers/net/ethernet/mellanox/mlx4/resource_tracker.c: In function 'mlx4_HW2SW_CQ_wrapper':
drivers/net/ethernet/mellanox/mlx4/resource_tracker.c:3019:16: warning: 'cq' may be used uninitialized in this function [-Wmaybe-uninitialized]
atomic_dec(&cq->mtt->ref_count);
^
This is a false positive. But a cleanup of cq_res_start_move_to() can
help GCC here. The code currently uses a switch statement where an
if/else construct would do too, since only two of the switch's four
cases can ever occur. Dropping that switch makes the warning go away.
While we're at it, add some missing braces.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 16 Jan 2014 23:35:08 +0000 (15:35 -0800)]
Merge branch 'ixgbe-next'
Aaron Brown says:
====================
Intel Wired LAN Driver Updates
This series contains updates to ixgbe and ixgbevf.
John adds rtnl lock / unlock semantics for ixgbe_reinit_locked()
which was being called without the rtnl lock being held.
Jacob corrects an issue where ixgbevf_qv_disable function does not
set the disabled bit correctly.
From the community, Wei uses a type of struct for pci driver-specific
data in ixgbevf_suspend()
Don changes the way we store ring arrays in a manner that allows
support of multiple queues on multiple nodes and creates new ring
initialization functions for work previously done across multiple
functions - making the code closer to ixgbe and hopefully more readable.
He also fixes incorrect fiber eeprom write logic.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Don Skidmore [Thu, 16 Jan 2014 10:30:10 +0000 (02:30 -0800)]
ixgbe: Fix incorrect logic for fixed fiber eeprom write
In this code we wanted to set the bit in IXGBE_SFF_SOFT_RS_SELECT_MASK to
the value in rs. So we really needed a logical or rather than an and, this
patch makes that change.
Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Don Skidmore [Thu, 16 Jan 2014 10:30:09 +0000 (02:30 -0800)]
ixgbevf: create function for all of ring init
This patch creates new functions for ring initialization,
ixgbevf_configure_tx_ring() and ixgbevf_configure_rx_ring(). The work done
in these function previously was spread between several other functions and
this change should hopefully lead to greater readability and make the code
more like ixgbe. This patch also moves the placement of some older functions
to avoid having to write prototypes. It also promotes a couple of debug
messages to errors.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Don Skidmore [Thu, 16 Jan 2014 10:30:08 +0000 (02:30 -0800)]
ixgbevf: Convert ring storage form pointer to an array to array of pointers
This will change how we store rings arrays in the adapter sturct.
We use to have a pointer to an array now we will be using an array
of pointers. This will allow us to support multiple queues on
muliple nodes at some point we would be able to reallocate the rings
so that each is on a local node if needed.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Yongjun [Thu, 16 Jan 2014 10:30:07 +0000 (02:30 -0800)]
ixgbevf: use pci drvdata correctly in ixgbevf_suspend()
We had set the pci driver-specific data in ixgbevf_probe() as a type of
struct net_device, so we should use it as netdev in ixgbevf_suspend().
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jacob Keller [Thu, 16 Jan 2014 10:30:06 +0000 (02:30 -0800)]
ixgbevf: set the disable state when ixgbevf_qv_disable is called
The ixgbevf_qv_disable function used by CONFIG_NET_RX_BUSY_POLL is broken,
because it does not properly set the IXGBEVF_QV_STATE_DISABLED bit, indicating
that the q_vector should be disabled (and preventing future locks from
obtaining the vector). This patch corrects the issue by setting the disable
state.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Thu, 16 Jan 2014 10:30:05 +0000 (02:30 -0800)]
ixgbe: reinit_locked() should be called with rtnl_lock
ixgbe_service_task() is calling ixgbe_reinit_locked() without
the rtnl_lock being held. This is because it is being called
from a worker thread and not a rtnl netlink or dcbnl path.
Add rtnl_{un}lock() semantics. I found this during code review.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 16 Jan 2014 23:03:31 +0000 (15:03 -0800)]
net: eth_type_trans() should use skb_header_pointer()
eth_type_trans() can read uninitialized memory as drivers
do not necessarily pull more than 14 bytes in skb->head before
calling it.
As David suggested, we can use skb_header_pointer() to
fix this without breaking some drivers that might not expect
eth_type_trans() pulling 2 additional bytes.
Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 16 Jan 2014 23:24:00 +0000 (15:24 -0800)]
Merge branch 'stmmac_pm'
Srinivas Kandagatla says:
====================
net: stmmac PM related fixes.
During PM_SUSPEND_FREEZE testing, I have noticed that PM support in STMMAC is
partly broken. I had to re-arrange the code to do PM correctly. There were lot
of things I did not like personally and some bits did not work in the first
place. I thought this is the nice opportunity to clean the mess up.
Here is what I did:
any
1> Test PM suspend freeze via pm_test
It did not work for following reasons.
- If the power to gmac is removed when it enters in low power state.
stmmac_resume could not cope up with such behaviour, it was expecting the ip
register contents to be still same as before entering low power, This
assumption is wrong. So I started to add some code to do Hardware
initialization, thats when I started to re-arrange the code. stmmac_open
contains both resource and memory allocations and hardware initialization. I
had to separate these two things in two different functions.
These two patches do that
net: stmmac: move dma allocation to new function
net: stmmac: move hardware setup for stmmac_open to new function
And rest of the other patches are fixing the loose ends, things like mdio
reset, which might be necessary in cases likes hibernation(I did not test).
In hibernation cases the driver was just unregistering with subsystems and
releasing resources which I did not like and its not necessary to do this as
part of PM. So using the same stmmac_suspend/resume made more sense for
hibernation cases than using stmmac_open/release.
Also fixed a NULL pointer dereference bug too.
2> Test WOL via PM_SUSPEND_FREEZE
Did get an wakeup interrupt, but could not wakeup a freeze system.
So I had to add pm_wakeup_event to the driver.
net: stmmac: notify the PM core of a wakeup event. patch.
Also few patches like
net: stmmac: make stmmac_mdio_reset non-static
net: stmmac: restore pinstate in pm resume.
helps the resume function to reset the phy and put back the pins in default
state.
Changes since RFC:
- Rebased to net-next on Dave's suggestion.
All these patches are Acked by Peppe.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
net: stmmac: notify the PM core of a wakeup event.
In PM_SUSPEND_FREEZE and WOL(Wakeup On Lan) case, when the driver gets a
wakeup event, either the driver or platform specific PM code should notify
the pm core about it, so that the system can wakeup from low power.
In cases where there is no involvement of platform specific PM, it
becomes driver responsibility to notify the PM core to wakeup the
system.
Without this WOL with PM_SUSPEND_FREEZE does not work on STi based SOCs.
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com> Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds code to restore default pinstate of the pins when it
comes back from low power state. Without this patch the state of the
pins would be unknown and the driver would not work.
This patch also adds code to put the pins in to sleep state when the
driver enters low power state.
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com> Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: stmmac: use suspend functions for hibernation
In hibernation freeze case the driver just releases the resources like
dma buffers, irqs, unregisters the drivers and during restore it does
register, request the resources. This is not really necessary, as part
of power management all the data structures are intact, all the
previously allocated resources can be used after coming out of low
power.
This patch uses the suspend and resume callbacks for freeze and
restore which initializes the hardware correctly without unregistering
or releasing the resources, this should also help in reducing the time
to restore.
Also this patch fixes a bug in stmmac_pltfr_restore and
stmmac_pltfr_freeze where it tries to get hold of platform data via
dev_get_platdata call, which would return NULL in device tree cases and
the next if statement would crash as there is no NULL check.
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com> Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: stmmac: fix power management suspend-resume case
The driver PM resume assumes that the IP is still powered up and the
all the register contents are not disturbed when it comes out of low
power suspend case. This assumption is wrong, basically the driver
should not consider any state of registers after it comes out of low
power. However driver can keep the part of the IP powered up if its a
wake up source. But it can not assume the register state of the IP. Also
its possible that SOC glue layer can take the power off the IP if its
not wake-up source to reduce the power consumption.
This patch re initializes hardware by calling stmmac_hw_setup function in
resume case.
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com> Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch promotes stmmac_mdio_reset function from static to
non-static, so that power management functions can decide to reset if
the IP comes out from lowe power state specially hibernation cases.
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com> Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: stmmac: move hardware setup for stmmac_open to new function
This patch moves hardware setup part of the code in stmmac_open to a new
function stmmac_hw_setup, the reason for doing this is to make hw
initialization independent function so that PM functions can re-use it to
re-initialize the IP after returning from low power state.
This will also avoid code duplication across stmmac_resume/restore and
stmmac_open.
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com> Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch moves dma resource allocation to a new function
alloc_dma_desc_resources, the reason for moving this to a new function
is to keep the memory allocations in a separate function. One more reason
it to get suspend and hibernation cases working without releasing and
allocating these resources during suspend-resume and freeze-restore
cases.
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com> Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes gpio_free for reset line of the phy, driver stores
the gpio number in its private data-structure to use in future. As the
driver uses this pin in future this pin should not be freed.
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com> Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: stmmac: support max-speed device tree property
This patch adds support to "max-speed" property which is a standard
Ethernet device tree property. max-speed specifies maximum speed
(specified in megabits per second) supported the device.
Depending on the clocking schemes some of the boards can only support
few link speeds, so having a way to limit the link speed in the mac
driver would allow such setups to work reliably.
Without this patch there is no way to tell the driver to limit the
link speed.
Signed-off-by: Srinivas Kandagatla <srinivas.kandagatla@st.com> Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 16 Jan 2014 23:15:50 +0000 (15:15 -0800)]
Merge branch 'mvneta'
Willy Tarreau says:
====================
Assorted mvneta fixes and improvements
this series provides some fixes for a number of issues met with the
mvneta driver, then adds some improvements. Patches 1-5 are fixes
and would be needed in 3.13 and likely -stable. The next ones are
performance improvements and cleanups :
- driver lockup when reading stats while sending traffic from multiple
CPUs : this obviously only happens on SMP and is the result of missing
locking on the driver. The problem was present since the introduction
of the driver in 3.8. The first patch performs some changes that are
needed for the second one which actually fixes the issue by using
per-cpu counters. It could make sense to backport this to the relevant
stable versions.
- mvneta_tx_timeout calls various functions to reset the NIC, and these
functions sleep, which is not allowed here, resulting in a panic.
Better completely disable this Tx timeout handler for now since it is
never called. The problem was encountered while developing some new
features, it's uncertain whether it's possible to reproduce it with
regular usage, so maybe a backport to stable is not needed.
- replace the Tx timer with a real Tx IRQ. As first reported by Arnaud
Ebalard and explained by Eric Dumazet, there is no way this driver
can work correctly if it uses a driver to recycle the Tx descriptors.
If too many packets are sent at once, the driver quickly ends up with
no descriptors (which happens twice as easily in GSO) and has to wait
10ms for recycling its descriptors and being able to send again. Eric
has worked around this in the core GSO code. But still when routing
traffic or sending UDP packets, the limitation is very visible. Using
Tx IRQs allows Tx descriptors to be recycled when sent. The coalesce
value is still configurable using ethtool. This fix turns the UDP
send bitrate from 134 Mbps to 987 Mbps (ie: line rate). It's made of
two patches, one to add the relevant bits from the original Marvell's
driver, and another one to implement the change. I don't know if it
should be backported to stable, as the bug only causes poor performance.
- Patches 6..8 are essentially cleanups, code deduplication and minor
optimizations for not re-fetching a value we already have (status).
- patch 9 changes the prefetch of Rx descriptor from current one to
next one. In benchmarks, it results in about 1% general performance
increase on HTTP traffic, probably because prefetching the current
descriptor does not leave enough time between the start of prefetch
and its usage.
- patch 10 implements support for build_skb() on Rx path. The driver
now preallocates frags instead of skbs and builds an skb just before
delivering it. This results in a 2% performance increase on HTTP
traffic, and up to 5% on small packet Rx rate.
- patch 11 implements rx_copybreak for small packets (256 bytes). It
avoids a dma_map_single()/dma_unmap_single() and increases the Rx
rate by 16.4%, from 486kpps to 573kpps. Further improvements up to
711kpps are possible depending how the DMA is used.
- patches 12 and 13 are extra cleanups made possible by some of the
simplifications above.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Arnaud Ebalard [Thu, 16 Jan 2014 07:20:18 +0000 (08:20 +0100)]
net: mvneta: mvneta_tx_done_gbe() cleanups
mvneta_tx_done_gbe() return value and third parameter are no more
used. This patch changes the function prototype and removes a useless
variable where the function is called.
Reviewed-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Arnaud Ebalard <arno@natisbad.org> Signed-off-by: David S. Miller <davem@davemloft.net>
willy tarreau [Thu, 16 Jan 2014 07:20:17 +0000 (08:20 +0100)]
net: mvneta: implement rx_copybreak
calling dma_map_single()/dma_unmap_single() is quite expensive compared
to copying a small packet. So let's copy short frames and keep the buffers
mapped. We set the limit to 256 bytes which seems to give good results both
on the XP-GP board and on the AX3/4.
The Rx small packet rate increased by 16.4% doing this, from 486kpps to
573kpps. It is worth noting that even the call to the function
dma_sync_single_range_for_cpu() is expensive (300 ns) although less
than dma_unmap_single(). Without it, the packet rate raises to 711kpps
(+24% more). Thus on systems where coherency from device to CPU is
guaranteed by a snoop control unit, this patch should provide even more
gains, and probably rx_copybreak could be increased.
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Gregory CLEMENT <gregory.clement@free-electrons.com> Tested-by: Arnaud Ebalard <arno@natisbad.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
willy tarreau [Thu, 16 Jan 2014 07:20:16 +0000 (08:20 +0100)]
net: mvneta: convert to build_skb()
Make use of build_skb() to allocate frags on the RX path. When frag size
is lower than a page size, we can use netdev_alloc_frag(), and we fall back
to kmalloc() for larger sizes. The frag size is stored into the mvneta_port
struct. The alloc/free functions check the frag size to decide what alloc/
free method to use. MTU changes are safe because the MTU change function
stops the device and clears the queues before applying the change.
With this patch, I observed a reproducible 2% performance improvement on
HTTP-based benchmarks, and 5% on small packet RX rate.
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Gregory CLEMENT <gregory.clement@free-electrons.com> Tested-by: Arnaud Ebalard <arno@natisbad.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
willy tarreau [Thu, 16 Jan 2014 07:20:15 +0000 (08:20 +0100)]
net: mvneta: prefetch next rx descriptor instead of current one
Currently, the mvneta driver tries to prefetch the current Rx
descriptor during read. Tests have shown that prefetching the
next one instead increases general performance by about 1% on
HTTP traffic.
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Gregory CLEMENT <gregory.clement@free-electrons.com> Tested-by: Arnaud Ebalard <arno@natisbad.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
willy tarreau [Thu, 16 Jan 2014 07:20:14 +0000 (08:20 +0100)]
net: mvneta: simplify access to the rx descriptor status
At several places, we already know the value of the rx status but
we call functions which dereference the pointer again to get it
and don't need the descriptor for anything else. Simplify this
task by replacing the rx desc pointer by the status word itself.
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Gregory CLEMENT <gregory.clement@free-electrons.com> Tested-by: Arnaud Ebalard <arno@natisbad.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
willy tarreau [Thu, 16 Jan 2014 07:20:12 +0000 (08:20 +0100)]
net: mvneta: remove tests for impossible cases in the tx_done path
Currently, mvneta_txq_bufs_free() calls mvneta_tx_done_policy() with
a non-null cause to retrieve the pointer to the next queue to process.
There are useless tests on the return queue number and on the pointer,
all of which are well defined within a known limited set. This code
path is fast, although not critical. Removing 3 tests here that the
compiler could not optimize (verified) is always desirable.
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Gregory CLEMENT <gregory.clement@free-electrons.com> Tested-by: Arnaud Ebalard <arno@natisbad.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
willy tarreau [Thu, 16 Jan 2014 07:20:11 +0000 (08:20 +0100)]
net: mvneta: replace Tx timer with a real interrupt
Right now the mvneta driver doesn't handle Tx IRQ, and relies on two
mechanisms to flush Tx descriptors : a flush at the end of mvneta_tx()
and a timer. If a burst of packets is emitted faster than the device
can send them, then the queue is stopped until next wake-up of the
timer 10ms later. This causes jerky output traffic with bursts and
pauses, making it difficult to reach line rate with very few streams.
A test on UDP traffic shows that it's not possible to go beyond 134
Mbps / 12 kpps of outgoing traffic with 1500-bytes IP packets. Routed
traffic tends to observe pauses as well if the traffic is bursty,
making it even burstier after the wake-up.
It seems that this feature was inherited from the original driver but
nothing there mentions any reason for not using the interrupt instead,
which the chip supports.
Thus, this patch enables Tx interrupts and removes the timer. It does
the two at once because it's not really possible to make the two
mechanisms coexist, so a split patch doesn't make sense.
First tests performed on a Mirabox (Armada 370) show that less CPU
seems to be used when sending traffic. One reason might be that we now
call the mvneta_tx_done_gbe() with a mask indicating which queues have
been done instead of looping over all of them.
The same UDP test above now happily reaches 987 Mbps / 87.7 kpps.
Single-stream TCP traffic can now more easily reach line rate. HTTP
transfers of 1 MB objects over a single connection went from 730 to
840 Mbps. It is even possible to go significantly higher (>900 Mbps)
by tweaking tcp_tso_win_divisor.
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Gregory CLEMENT <gregory.clement@free-electrons.com> Cc: Arnaud Ebalard <arno@natisbad.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Arnaud Ebalard <arno@natisbad.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
willy tarreau [Thu, 16 Jan 2014 07:20:10 +0000 (08:20 +0100)]
net: mvneta: add missing bit descriptions for interrupt masks and causes
Marvell has not published the chip's datasheet yet, so it's very hard
to find the relevant bits to manipulate to change the IRQ behaviour.
Fortunately, these bits are described in the proprietary LSP patch set
which is publicly available here :
http://www.plugcomputer.org/downloads/mirabox/
So let's put them back in the driver in order to reduce the burden of
current and future maintenance.
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Gregory CLEMENT <gregory.clement@free-electrons.com> Tested-by: Arnaud Ebalard <arno@natisbad.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
willy tarreau [Thu, 16 Jan 2014 07:20:09 +0000 (08:20 +0100)]
net: mvneta: do not schedule in mvneta_tx_timeout
If a queue timeout is reported, we can oops because of some
schedules while the caller is atomic, as shown below :
mvneta d0070000.ethernet eth0: tx timeout
BUG: scheduling while atomic: bash/1528/0x00000100
Modules linked in: slhttp_ethdiv(C) [last unloaded: slhttp_ethdiv]
CPU: 2 PID: 1528 Comm: bash Tainted: G WC 3.13.0-rc4-mvebu-nf #180
[<c0011bd9>] (unwind_backtrace+0x1/0x98) from [<c000f1ab>] (show_stack+0xb/0xc)
[<c000f1ab>] (show_stack+0xb/0xc) from [<c02ad323>] (dump_stack+0x4f/0x64)
[<c02ad323>] (dump_stack+0x4f/0x64) from [<c02abe67>] (__schedule_bug+0x37/0x4c)
[<c02abe67>] (__schedule_bug+0x37/0x4c) from [<c02ae261>] (__schedule+0x325/0x3ec)
[<c02ae261>] (__schedule+0x325/0x3ec) from [<c02adb97>] (schedule_timeout+0xb7/0x118)
[<c02adb97>] (schedule_timeout+0xb7/0x118) from [<c0020a67>] (msleep+0xf/0x14)
[<c0020a67>] (msleep+0xf/0x14) from [<c01dcbe5>] (mvneta_stop_dev+0x21/0x194)
[<c01dcbe5>] (mvneta_stop_dev+0x21/0x194) from [<c01dcfe9>] (mvneta_tx_timeout+0x19/0x24)
[<c01dcfe9>] (mvneta_tx_timeout+0x19/0x24) from [<c024afc7>] (dev_watchdog+0x18b/0x1c4)
[<c024afc7>] (dev_watchdog+0x18b/0x1c4) from [<c0020b53>] (call_timer_fn.isra.27+0x17/0x5c)
[<c0020b53>] (call_timer_fn.isra.27+0x17/0x5c) from [<c0020cad>] (run_timer_softirq+0x115/0x170)
[<c0020cad>] (run_timer_softirq+0x115/0x170) from [<c001ccb9>] (__do_softirq+0xbd/0x1a8)
[<c001ccb9>] (__do_softirq+0xbd/0x1a8) from [<c001cfad>] (irq_exit+0x61/0x98)
[<c001cfad>] (irq_exit+0x61/0x98) from [<c000d4bf>] (handle_IRQ+0x27/0x60)
[<c000d4bf>] (handle_IRQ+0x27/0x60) from [<c000843b>] (armada_370_xp_handle_irq+0x33/0xc8)
[<c000843b>] (armada_370_xp_handle_irq+0x33/0xc8) from [<c000fba9>] (__irq_usr+0x49/0x60)
Ben Hutchings attempted to propose a better fix consisting in using a
scheduled work for this, but while it fixed this panic, it caused other
random freezes and panics proving that the reset sequence in the driver
is unreliable and that additional fixes should be investigated.
When sending multiple streams over a link limited to 100 Mbps, Tx timeouts
happen from time to time, and the driver correctly recovers only when the
function is disabled.
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Gregory CLEMENT <gregory.clement@free-electrons.com> Cc: Ben Hutchings <ben@decadent.org.uk> Tested-by: Arnaud Ebalard <arno@natisbad.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
willy tarreau [Thu, 16 Jan 2014 07:20:08 +0000 (08:20 +0100)]
net: mvneta: use per_cpu stats to fix an SMP lock up
Stats writers are mvneta_rx() and mvneta_tx(). They don't lock anything
when they update the stats, and as a result, it randomly happens that
the stats freeze on SMP if two updates happen during stats retrieval.
This is very easily reproducible by starting two HTTP servers and binding
each of them to a different CPU, then consulting /proc/net/dev in loops
during transfers, the interface should immediately lock up. This issue
also randomly happens upon link state changes during transfers, because
the stats are collected in this situation, but it takes more attempts to
reproduce it.
The comments in netdevice.h suggest using per_cpu stats instead to get
rid of this issue.
This patch implements this. It merges both rx_stats and tx_stats into
a single "stats" member with a single syncp. Both mvneta_rx() and
mvneta_rx() now only update the a single CPU's counters.
In turn, mvneta_get_stats64() does the summing by iterating over all CPUs
to get their respective stats.
With this change, stats are still correct and no more lockup is encountered.
Note that this bug was present since the first import of the mvneta
driver. It might make sense to backport it to some stable trees. If
so, it depends on "d33dc73 net: mvneta: increase the 64-bit rx/tx stats
out of the hot path".
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Gregory CLEMENT <gregory.clement@free-electrons.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Tested-by: Arnaud Ebalard <arno@natisbad.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
willy tarreau [Thu, 16 Jan 2014 07:20:07 +0000 (08:20 +0100)]
net: mvneta: increase the 64-bit rx/tx stats out of the hot path
Better count packets and bytes in the stack and on 32 bit then
accumulate them at the end for once. This saves two memory writes
and two memory barriers per packet. The incoming packet rate was
increased by 4.7% on the Openblocks AX3 thanks to this.
Cc: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Cc: Gregory CLEMENT <gregory.clement@free-electrons.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Tested-by: Arnaud Ebalard <arno@natisbad.org> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Gortmaker [Wed, 8 Jan 2014 20:32:47 +0000 (15:32 -0500)]
drivers/net: delete non-required instances of include <linux/init.h>
None of these files are actually using any __init type directives
and hence don't need to include <linux/init.h>. Most are just a
left over from __devinit and __cpuinit removal, or simply due to
code getting copied from one driver to the next.
This covers everything under drivers/net except for wireless, which
has been submitted separately.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 16 Jan 2014 19:44:54 +0000 (11:44 -0800)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nftables
Pablo Neira Ayuso says:
====================
This small batch contains several Netfilter fixes for your net-next
tree, more specifically:
* Fix compilation warning in nft_ct in NF_CONNTRACK_MARK is not set,
from Kristian Evensen.
* Add dependency to IPV6 for NF_TABLES_INET. This one has been reported
by the several robots that are testing .config combinations, from Paul
Gortmaker.
* Fix default base chain policy setting in nf_tables, from myself.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Thu, 9 Jan 2014 13:13:47 +0000 (14:13 +0100)]
neigh: use NEIGH_VAR_INIT in ndo_neigh_setup functions.
When ndo_neigh_setup is called, the bitfield used by NEIGH_VAR_SET is
not initialized yet. This might cause confusion for the people who use
NEIGH_VAR_SET in ndo_neigh_setup. So rather introduce NEIGH_VAR_INIT for
usage in ndo_neigh_setup.
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 16 Jan 2014 05:48:42 +0000 (21:48 -0800)]
Merge branch 'ixgbe'
Aaron Brown says:
====================
Intel Wired LAN Driver Updates
This series contains several updates from Alex to ixgbe.
To avoid head of line blocking in the event a VF stops cleaning Rx descriptors
he makes sure QDE bits are set for a VF before the Rx queues are enabled.
To avoid a situation where the head write-back registers can remain set ofter
the driver is unloaded he clears them on a VF reset.
Alexander Duyck (2):
ixgbe: Force QDE via PFQDE for VFs during reset
ixgbe: Clear head write-back registers on VF reset
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Thu, 16 Jan 2014 01:38:41 +0000 (17:38 -0800)]
ixgbe: Clear head write-back registers on VF reset
The Tx head write-back registers are not cleared during an FLR or VF reset.
As a result a configuration that had head write-back enabled can leave the
registers set after the driver is unloaded. If the next driver loaded doesn't
use the write-back registers this can lead to a bad configuration where
head write-back is enabled, but the driver didn't request it.
To avoid this situation the PF should be resetting the Tx head write-back
registers when the VF requests a reset.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexander Duyck [Thu, 16 Jan 2014 01:38:40 +0000 (17:38 -0800)]
ixgbe: Force QDE via PFQDE for VFs during reset
This change makes it so that the QDE bits are set for a VF before the Rx
queues are enabled. As such we avoid head of line blocking in the event
that the VF stops cleaning Rx descriptors for whatever reason.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 14 ++++++++++++++
drivers/net/ethernet/intel/ixgbe/ixgbe_type.h | 7 ++++---
2 files changed, 18 insertions(+), 3 deletions(-) Signed-off-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 16 Jan 2014 01:00:47 +0000 (17:00 -0800)]
Merge branch 'noprefixroute'
Thomas Haller says:
====================
ipv6 addrconf: add IFA_F_NOPREFIXROUTE flag to suppress creation of IP6 routes
v1 -> v2: add a second commit, handling NOPREFIXROUTE in ip6_del_addr.
v2 -> v3: reword commit messages, code comments and some refactoring.
v3 -> v4: refactor, rename variables, add enum
v4 -> v5: rebase, so that patch applies cleanly to current net-next/master
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Haller [Wed, 15 Jan 2014 14:36:59 +0000 (15:36 +0100)]
ipv6 addrconf: don't cleanup prefix route for IFA_F_NOPREFIXROUTE
Refactor the deletion/update of prefix routes when removing an
address. Now also consider IFA_F_NOPREFIXROUTE and if there is an address
present with this flag, to not cleanup the route. Instead, assume
that userspace is taking care of this route.
Also perform the same cleanup, when userspace changes an existing address
to add NOPREFIXROUTE (to an address that didn't have this flag). This is
done because when the address was added, a prefix route was created for it.
Since the user now wants to handle this route by himself, we cleanup this
route.
This cleanup of the route is not totally robust. There is no guarantee,
that the route we are about to delete was really the one added by the
kernel. This behavior does not change by the patch, and in practice it
should work just fine.
Signed-off-by: Thomas Haller <thaller@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Haller [Wed, 15 Jan 2014 14:36:58 +0000 (15:36 +0100)]
ipv6 addrconf: add IFA_F_NOPREFIXROUTE flag to suppress creation of IP6 routes
When adding/modifying an IPv6 address, the userspace application needs
a way to suppress adding a prefix route. This is for example relevant
together with IFA_F_MANAGERTEMPADDR, where userspace creates autoconf
generated addresses, but depending on on-link, no route for the
prefix should be added.
Signed-off-by: Thomas Haller <thaller@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
wangweidong [Wed, 15 Jan 2014 09:24:01 +0000 (17:24 +0800)]
sctp: create helper function to enable|disable sackdelay
add sctp_spp_sackdelay_{enable|disable} helper function for
avoiding code duplication.
Signed-off-by: Wang Weidong <wangweidong1@huawei.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 15 Jan 2014 23:52:07 +0000 (15:52 -0800)]
Merge branch 'be2net'
Sathya Perla says:
====================
be2net: patch set
The following patch set is best suited for net-next as it
contains code-cleanup, support for newer versions of FW cmds and
a few minor fixes. Please apply. Thanks!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Suresh Reddy [Wed, 15 Jan 2014 07:53:40 +0000 (13:23 +0530)]
be2net: cleanup wake-on-lan code
This patch cleans-up wake-on-lan code in the following ways:
1) Removes some driver hacks in be_cmd_get_acpi_wol_cap() that were based
on incorrect assumptions.
2) Uses the adapter->wol_en and wol_cap variables for checking if WoL
is supported and enabled on an interface instead of referring to the
exclusion list via the macro be_is_wol_supported()
Signed-off-by: Suresh Reddy <suresh.reddy@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Suresh Reddy [Wed, 15 Jan 2014 07:53:39 +0000 (13:23 +0530)]
be2net: use GET_MAC_LIST cmd to query mac-address from a pmac-id
The use of NTKW_MAC_QUERY cmd has been deprecated for Skyhawk-R.
Replace the last remaining usage in be_vfs_mac_query() routine.
Signed-off-by: Suresh Reddy <suresh.reddy@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Suresh Reddy [Wed, 15 Jan 2014 07:53:38 +0000 (13:23 +0530)]
be2net: do not use frag index in the RX-compl entry
Instead, use the tail of the RXQ to pick the associated RXQ entry
This fix is required in preparation for supporting RXQ lengths greater than 1K.
For such queues, the frag index in the RX-compl entry is not valid as it is only a 10 bit entry not capable of addressing RXQs longer than 1K.
Signed-off-by: Suresh Reddy <suresh.reddy@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Suresh Reddy [Wed, 15 Jan 2014 07:53:37 +0000 (13:23 +0530)]
be2net: Remove "10Gbps" from driver description string
As be2net is used even by the 40Gbps Skyhawk-R chip
Signed-off-by: Suresh Reddy <suresh.reddy@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Suresh Reddy [Wed, 15 Jan 2014 07:53:36 +0000 (13:23 +0530)]
be2net: fix incorrect setting of cmd_privileges for VFs
An earlier commit (f25b119c "Fix error messages while driver load for VFs")
incorrectly set the adapter->cmd_privileges value for VFs (in a
multi-channel config) to MAX_PRIVILEGES. This causes FW cmd failures
and avoidable error logs when certian cmds are issued by a VF.
Also, move the multi-channel hack to be_cmds.c inside
be_cmd_get_fn_privileges() routine.
Fixes: f25b119c "Fix error messages while driver load for VFs" Signed-off-by: Suresh Reddy <suresh.reddy@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vasundhara Volam [Wed, 15 Jan 2014 07:53:35 +0000 (13:23 +0530)]
be2net: ignore mac-addr set call for an already programmed mac-addr
An ndo_set_mac_addr() call may be issued for a mac-addr that is already
active on an interface. If so, silently ignore the request. Sending such
a request to the FW, causes a "mac collision" error. The error is harmless
but is avoidable noise in the kernel log.
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vasundhara Volam [Wed, 15 Jan 2014 07:53:34 +0000 (13:23 +0530)]
be2net: do not call be_set/get_fw_log_level() on Skyhawk-R
Skyhawk-R FW does not support SET/GET_EXT_FAT_CAPABILITIES cmds via which
FW logging level can be controlled. Also, the hack used in BE3 to control
FW logging level via the ethtool interface is not needed in Skyhawk-R.
This patch also cleans up this code by moving be_set/get_fw_log_level()
routines to be_cmds.c where they belong.
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com>
remove new line Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vasundhara Volam [Wed, 15 Jan 2014 07:53:33 +0000 (13:23 +0530)]
be2net: Log the profile-id used by FW during driver initialization
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vasundhara Volam [Wed, 15 Jan 2014 07:53:32 +0000 (13:23 +0530)]
be2net: don't set "pport" field when querying "pvid"
In the GET_HSW_CONFIG cmd, the "pport" field must be set only while
querying the switch mode. When the "pport" field is set, the
"interface_id" field must be set to the port number, otherwise, it
must be set to adapter->if_handle.
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vasundhara Volam [Wed, 15 Jan 2014 07:53:31 +0000 (13:23 +0530)]
be2net: Use MCC_CREATE_EXT_V1 cmd for Skyhawk-R
Currently this cmd is used only for Lancer.
MCC_CREATE_EXT_V1 supports larger CQ-ids and additional event codes for the
async_event_bitmap field.
Signed-off-by: Vasundhara Volam <vasundhara.volam@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: move 6lowpan compression code to separate module
IEEE 802.15.4 and Bluetooth networking stacks share 6lowpan compression
code. Instead of introducing Makefile/Kconfig hacks, build this code as
a separate module referenced from both ieee802154 and bluetooth modules.
This fixes the following build error observed in some kernel
configurations:
net/built-in.o: In function `header_create': 6lowpan.c:(.text+0x166149): undefined reference to `lowpan_header_compress'
net/built-in.o: In function `bt_6lowpan_recv': (.text+0x166b3c): undefined reference to `lowpan_process_data'
Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Dmitry Eremin-Solenikov <dmitry_eremin@mentor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Veaceslav Falico [Tue, 14 Jan 2014 20:58:51 +0000 (21:58 +0100)]
net: rename sysfs symlinks on device name change
Currently, we don't rename the upper/lower_ifc symlinks in
/sys/class/net/*/ , which might result stale/duplicate links/names.
Fix this by adding netdev_adjacent_rename_links(dev, oldname) which renames
all the upper/lower interface's links to dev from the upper/lower_oldname
to the new name.
We don't need a rollback because only we control these symlinks and if we
fail to rename them - sysfs will anyway complain.
Reported-by: Ding Tianhong <dingtianhong@huawei.com> CC: Ding Tianhong <dingtianhong@huawei.com> CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Veaceslav Falico [Tue, 14 Jan 2014 20:58:50 +0000 (21:58 +0100)]
net: add sysfs helpers for netdev_adjacent logic
They clean up the code a bit and can be used further.
CC: Ding Tianhong <dingtianhong@huawei.com> CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <edumazet@google.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> CC: Cong Wang <amwang@redhat.com> Signed-off-by: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
On archs like S390 or um this driver cannot build nor work.
Make it depend on HAS_IOMEM to bypass build failures.
drivers/staging/iio/adc/spear_adc.c: In function ‘spear_adc_probe’:
drivers/staging/iio/adc/spear_adc.c:393:2: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration
Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: David S. Miller <davem@davemloft.net>
On archs like S390 or um this driver cannot build nor work.
Make it depend on HAS_IOMEM to bypass build failures.
drivers/staging/dgap/dgap_driver.c: In function ‘dgap_cleanup_board’:
drivers/staging/dgap/dgap_driver.c:457:3: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
drivers/staging/dgap/dgap_driver.c: In function ‘dgap_do_remap’:
drivers/staging/dgap/dgap_driver.c:694:2: error: implicit declaration of function ‘ioremap’ [-Werror=implicit-function-declaration]
Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: David S. Miller <davem@davemloft.net>
On archs like S390 or um this driver cannot build nor work.
Make it depend on HAS_IOMEM to bypass build failures.
drivers/ptp/ptp_pch.c: In function ‘pch_remove’:
drivers/ptp/ptp_pch.c:571:3: error: implicit declaration of function ‘iounmap’ [-Werror=implicit-function-declaration]
drivers/ptp/ptp_pch.c: In function ‘pch_probe’:
drivers/ptp/ptp_pch.c:621:2: error: implicit declaration of function ‘ioremap’ [-Werror=implicit-function-declaration]
Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: David S. Miller <davem@davemloft.net>
Eugene Crosser [Tue, 14 Jan 2014 14:54:13 +0000 (15:54 +0100)]
qeth: bridgeport support - address notifications
Introduce functions to enable and disable bridgeport address
notification feature, sysfs attributes for access to these
functions from userspace, and udev events emitted when a host
joins or exits a bridgeport-enabled HiperSocket channel.
Signed-off-by: Eugene Crosser <eugene.crosser@ru.ibm.com> Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com> Reviewed-by: Ursula Braun <ursula.braun@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eugene Crosser [Tue, 14 Jan 2014 14:54:12 +0000 (15:54 +0100)]
s390/qdio: bridgeport support - CHSC part
Introduce function for the "Perform network-subchannel operation"
CHSC command with operation code "bridgeport information",
and bit definitions for "characteristics" pertaning to this command.
Signed-off-by: Eugene Crosser <eugene.crosser@ru.ibm.com> Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com> Reviewed-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eugene Crosser [Tue, 14 Jan 2014 14:54:11 +0000 (15:54 +0100)]
qeth: bridgeport support - basic control
Introduce functions to assign roles and check state of bridgeport-capable
HiperSocket devices, and sysfs attributes providing access to these
functions from userspace. Introduce udev events emitted when the state
of a bridgeport device changes.
Signed-off-by: Eugene Crosser <eugene.crosser@ru.ibm.com> Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com> Reviewed-by: Ursula Braun <ursula.braun@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Claudiu Manoil [Tue, 14 Jan 2014 13:35:00 +0000 (15:35 +0200)]
gianfar: Fix portabilty issues for ethtool and ptp
Fixes unhandled register write in gianfar_ethtool.c.
Fixes following endianess related functional issues,
reported by sparse as well, i.e.:
gianfar_ethtool.c:1058:33: warning:
incorrect type in argument 1 (different base types)
expected unsigned int [unsigned] [usertype] value
got restricted __be32 [usertype] ip4src
gianfar_ethtool.c:1164:33: warning:
restricted __be16 degrades to integer
gianfar_ethtool.c:1669:32: warning:
invalid assignment: ^=
left side has type restricted __be16
right side has type int
Solves all the sparse warnings for mixig normal pointers
with __iomem pointers for gianfar_ptp.c, i.e.:
gianfar_ptp.c:163:32: warning:
incorrect type in argument 1 (different address spaces)
expected unsigned int [noderef] <asn:2>*addr
got unsigned int *<noident>
Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Julia Lawall [Mon, 13 Jan 2014 16:17:24 +0000 (17:17 +0100)]
ksz884x: delete useless variable
Delete a variable that is at most only assigned to a constant, but never
used otherwise. In this code, it is the variable result that is used for
the return code, not rc.
A simplified version of the semantic patch that fixes this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
type T;
identifier i;
constant c;
@@
-T i;
<... when != i
-i = c;
...>
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
Kristian Evensen [Fri, 10 Jan 2014 20:43:20 +0000 (21:43 +0100)]
netfilter: nft_ct: fix compilation warning if NF_CONNTRACK_MARK is not set
net/netfilter/nft_ct.c: In function 'nft_ct_set_eval':
net/netfilter/nft_ct.c:136:6: warning: unused variable 'value' [-Wunused-variable]
Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
David S. Miller [Wed, 15 Jan 2014 08:01:03 +0000 (00:01 -0800)]
Merge branch 'i40e-next'
Aaron Brown says:
====================
Intel Wired LAN Driver Updates
This series contains updates to i40e from Greg Rose for VLAN filtering.
Greg Rose (2):
i40e: Warn admin to reload VF driver on port VLAN configuration.
When an administrator sets a port VLAN filters for the virtual
function (VF) after the VF has already set its own VLAN filters a
conflict requiring the VF be reloaded can occur. This patch logs a
message indicating to the system administrator that the VF driver
must be reloaded for the new port VLAN settings to take effect
i40e: Retain MAC filters on port VLAN deletion
On port VLAN deletion the list of MAC filters for the virtual function
(VF) VSI were all deleted. Let's keep them around, they come in
handy for keeping the VF functional.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Greg Rose [Tue, 14 Jan 2014 00:13:04 +0000 (16:13 -0800)]
i40e: Retain MAC filters on port VLAN deletion
On port VLAN deletion the list of MAC filters for the virtual function (VF)
VSI were all deleted. Let's keep them around, they come in handy for keeping
the VF functional.
Change-Id: I335e760392f274dc8b8b40efcb708f65b49d7973 Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Tested-by: Sibai Li <sibai.li@intel.com> Signed-off-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Greg Rose [Tue, 14 Jan 2014 00:13:03 +0000 (16:13 -0800)]
i40e: Warn admin to reload VF driver on port VLAN configuration
The i40e Physical Function (PF) driver will allow the
Virtual Function (VF) driver to configure its own VLAN filters if no port
VLAN filter has been configured. This leads to the possibility of the
administrator setting a port VLAN filter for the VF after the VF has already
configured its own VLAN filters. This leads to a conflict that can only be
resolved by reloading the VF driver. When the conflicting administrative
command is detected in setting the port VLAN then log a message indicating to
the system administrator that he must now reload the VF driver for the new
port VLAN settings to take effect.
Change-Id: I8de73b885d944a043aff32226297e4249862bcad Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Tested-by: Sibai Li <sibai.li@intel.com> Signed-off-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 15 Jan 2014 07:39:05 +0000 (23:39 -0800)]
Merge branch 'vxlan_lower_dev_unregister'
Daniel Borkmann says:
====================
vxlan updates
Did the split into two patches upon request from Cong Wang.
Changelog:
v1->v2:
- Removed BUG_ON as it's not needed.
v2->v3:
- Removed dev->reg_state check for netns.
v3->v4:
- Removed list_del(), we seem to do it in some places and
in some others not; we agreed it's not really necessary.
- Split patch into 2 patches, notifier part and module
unload cleanup part.
====================
Reviewed-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Mon, 13 Jan 2014 17:41:20 +0000 (18:41 +0100)]
net: vxlan: properly cleanup devs on module unload
We should use vxlan_dellink() handler in vxlan_exit_net(), since
i) we're not in fast-path and we should be consistent in dismantle
just as we would remove a device through rtnl ops, and more
importantly, ii) in case future code will kfree() memory in
vxlan_dellink(), we would leak it right here unnoticed. Therefore,
do not only half of the cleanup work, but make it properly.
Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Mon, 13 Jan 2014 17:41:19 +0000 (18:41 +0100)]
net: vxlan: when lower dev unregisters remove vxlan dev as well
We can create a vxlan device with an explicit underlying carrier.
In that case, when the carrier link is being deleted from the
system (e.g. due to module unload) we should also clean up all
created vxlan devices on top of it since otherwise we're in an
inconsistent state in vxlan device. In that case, the user needs
to remove all such devices, while in case of other virtual devs
that sit on top of physical ones, it is usually the case that
these devices do unregister automatically as well and do not
leave the burden on the user.
This work is not necessary when vxlan device was not created with
a real underlying device, as connections can resume in that case
when driver is plugged again. But at least for the other cases,
we should go ahead and do the cleanup on removal.
We don't register the notifier during vxlan_newlink() here since
I consider this event rather rare, and therefore we should not
bloat vxlan's core structure unecessary. Also, we can simply make
use of unregister_netdevice_many() to batch that. fdb is flushed
upon ndo_stop().
E.g. `ip -d link show vxlan13` after carrier removal before
this patch:
5: vxlan13: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default
link/ether 1e:47:da:6d:4d:99 brd ff:ff:ff:ff:ff:ff promiscuity 0
vxlan id 13 group 239.0.0.10 dev 2 port 32768 61000 ageing 300
^^^^^ Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 15 Jan 2014 02:59:33 +0000 (18:59 -0800)]
Merge branch 'intel-next'
Aaron Brown says:
====================
Intel Wired LAN Driver Updates, ixgbe: Add LER support
The following patches add Live Error Recovery (LER) support to the
ixgbe driver. This support also improves behavior in Thunderbolt
environments. This involves checking all register reads for a
value of all ones and when that is seen, to read the status
register, which should never properly return all ones, to
confirm whether the received value was correct. When this detects
a removal, the hw_addr field is cleared to indicate the removal.
This then blocks subsequent access to the device registers.
All register access macros have been changed to static inline
functions and all register accesses now use them.· Macro versions
are temporarily provided.
The __IXGBE_DOWN bit is no longer overloaded to also mean that
device removal has been initiated. Now the bit can be used to
protect ixgbe_down from multiple entry via test_and_set_bit. A
needed smp_mb__before_clear_bit was also added.
V2 Changes:
- Use ACCESS_ONCE where needed, thanks to Ben Hutchings
- Fix crash on module removal
- Use boolean values for boolean returns instead of 0 and 1
- Reword Kconfig help text
V3 Changes:
- Drop config option, per David Miller
- Drop tail register write checks, per Alexander Duyck
- Change writeq implementation to a static inline, thanks to Joe Perches
V4 Changes:
- Change __IXGBE_REMOVE to __IXGBE_REMOVING, per Scott Feldman's comment
- Add #define writeq writeq, per Alexander Duyck
- Change static inline functions to lower case, per David Miller
- Use new lower case names in added and modified register accesses
- Provide temporary upper case macros for register access functions
- Change IXGBE_REMOVED from macro to static inline and change references
- Correct IXGBE_WRITE_FLUSH to properly enclose parameter expansion
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mark Rustad [Wed, 15 Jan 2014 02:53:17 +0000 (18:53 -0800)]
ixgbe: Additional adapter removal checks
Additional checks are needed for a detected removal not to cause
problems. Some involve simply avoiding a lot of stuff that can't
do anything good, and also cases where the phony return value can
cause problems. In addition, down the adapter when the removal is
sensed.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mark Rustad [Wed, 15 Jan 2014 02:53:16 +0000 (18:53 -0800)]
ixgbe: Check for adapter removal on register writes
Prevent writes to an adapter that has been detected as removed
by a previous failing read. This also fixes some include file
ordering confusion that this patch revealed.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mark Rustad [Wed, 15 Jan 2014 02:53:15 +0000 (18:53 -0800)]
ixgbe: Check register reads for adapter removal
Check all register reads for adapter removal by checking the status
register after any register read that returns 0xFFFFFFFF. Since the
status register will never return 0xFFFFFFFF unless the adapter is
removed, such a value from a status register read confirms the
removal.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com> Signed-off-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>