Hariprasad S [Wed, 4 May 2016 19:57:36 +0000 (01:27 +0530)]
RDMA/iw_cxgb4: move QP -> ERROR on fatal disconnect errors
In c4iw_ep_disconnect(), if we fail to initiate a close operation, then
move the qp to ERROR to disassociate the ep from the qp. Failure to do
this will leak the ep resources.
Hariprasad S [Wed, 4 May 2016 19:57:35 +0000 (01:27 +0530)]
RDMA/iw_cxgb4: don't use abort_connection in process_mpa_request()
Instead return whether the caller needs to disconnect. This is part of
getting rid of abort_connection() altogether so we properly clean up on
send_abort() failures.
Hariprasad S [Wed, 4 May 2016 19:57:32 +0000 (01:27 +0530)]
RDMA/iw_cxgb4: remove connection abort from process_mpa_reply
Instead, have the caller, rx_data() handle the close/abort like
it does for process_mpa_request(). This is part of getting rid of
abort_connection() altogether so we properly clean up on send_abort()
failures.
Hariprasad S [Wed, 4 May 2016 19:57:31 +0000 (01:27 +0530)]
RDMA/iw_cxgb4: ensure eps don't get freed while the mutex is held
In rx_data(), with the ep in FPDU_MODE, refcnt=2, if we get unexpected
streaming data, we call c4iw_modify_rc_qp() and move the qp from
RTS -> TERMINATE. In c4iw_modify_rc_qp(), if rdma_fini() returns
an error, the ep will be dereferenced (refcnt=1). Then rx_data()
calls c4iw_ep_disconnect() which starts the close operation.
But if send_halfclose() fails in c4iw_ep_disconnect(), we will call
release_ep_resources() derefing the ep which reduces the refcnt to 0 and
and frees the ep. However we still has the ep mutex at that point, so we
have a touch-after-free bug. There is a similar issue where
peer_close() calls c4iw_ep_disconnect().
The solution is to add a reference to the ep in c4iw_ep_disconnect()
after acquiring the mutex, and release it after releasing the mutex.
Hariprasad S [Wed, 4 May 2016 19:57:30 +0000 (01:27 +0530)]
RDMA/iw_cxgb4: stop ep timer on close failure
In c4iw_ep_disconnect(), if we start the ep timer to begin a close,
but send_halfclose() fails, we need to stop the timer and send a CLOSE
event up to the IWCM before releasing the resources. Otherwise, we can
crash when the ep timer fires if the ep is referencing a previous instance
of the device. This can happen as part of adapter reset/recovery, for
instance.
Hariprasad S [Wed, 4 May 2016 19:57:29 +0000 (01:27 +0530)]
RDMA/iw_cxgb4: release ep resources on accept arp failure
If ARP fails before the CPL_PASS_ACCEPT_RPL is seen by hardware, the tid
will be stuck in SYN_PEND and never released. So create an arp failure
handler specifically for this message to release the endpoint resources.
In pass_accept_rpl_arp_failure(), put the parent endpoint so it will
be freed when destroyed. Also we don't need to call release_tid() here
because _c4iw_free_ep() calls cxgb4_remove_tid() which releases the
hwtid.
If we get an ABORT_REQ_RSS instead of a PASS_ESTABLISH (because the
peer's ACK to our SYN is never received), then put the parent as well
in peer_abort().
Treat accept_cr() failures just like arp failures: put the parent ep
and release the ep resources destroying the tid
The ARP failure handlers are called in an atomic context, so we need to
schedule some of the processing which might block. Namely _c4iw_free_ep()
which needs a mutex. So create a "special" CPL opcode and handler and
schedule it via sched() to be run by process_work() in a blockable context.
Also rework the active open arp failure handler to make use of
release_ep_resources(). This allows both the active and passive arp
failure handlers to use the same deferred cleanup function.
iSER currently has a couple places that set max_sectors in either the host
template or SCSI host, and all of them get it wrong.
This patch instead uses a single assignment that (hopefully) gets it right:
the max_sectors value must be derived from the number of segments in the
FR or FMR structure, but actually be one lower than the page size multiplied
by the number of sectors, as it has to handle the case of non-aligned I/O.
Without this I get trivial to reproduce hangs when running xfstests
(on XFS) over iSER to Linux targets.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Doug Ledford <dledford@redhat.com>
RDMA/i40iw: Fix for checking if the QP is destroyed
Fix for checking if the QP associated with a completion
has been destroyed while processing CQ elements.
If that is the case, move the CQ head to the next element
and continue completion processing.
STag index mask is calculated incorrectly, missing
the 14 bits minimum requirement. Add max macro to use
either # of MRs or 14 bits in the mask size calculation.
Ismail, Mustafa [Mon, 18 Apr 2016 15:33:09 +0000 (10:33 -0500)]
RDMA/i40iw: Adding queue drain functions
Adding sq and rq drain functions, which block until all
previously posted wr-s in the specified queue have completed.
A completion object is signaled to unblock the thread,
when the last cqe for the corresponding queue is processed.
Ismail, Mustafa [Mon, 18 Apr 2016 15:33:08 +0000 (10:33 -0500)]
RDMA/i40iw: Fix SD calculation for initial HMC creation
Correct SD calculation by using base address returned from commit FPM.
This alleviates any assumptions on resource ordering and alignment
requirement. Also consolidate SD estimation code into i40iw_est_sd().
Jubin John [Thu, 14 Apr 2016 15:31:53 +0000 (08:31 -0700)]
IB/hfi1: Serialize hrtimer function calls
hrtimer functions do not guarantee serialization, so we extend the
cca_timer_lock to cover the hrtimer_forward_now() in the hrtimer
callback handler and the hrtimer_start() in process_becn(). This
prevents races between these 2 functions to update the hrtimer state
leading to problems such as:
kernel BUG at kernel/hrtimer.c:1282!
encountered during validation of the CCA feature.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Dean Luick [Thu, 14 Apr 2016 15:31:42 +0000 (08:31 -0700)]
IB/hfi1: Correctly report neighbor link down reason
The code to save the link down reason for reporting to the SMA
was in a location before the actual reason was read. Move the
SMA link down reason assignment to a better location.
Dean Luick [Thu, 14 Apr 2016 15:31:36 +0000 (08:31 -0700)]
IB/hfi1: Use the neighbor link down reason only when valid
The 8051 uses a link down reason to inform the driver why the
link went down. The neighbor planned link down reason code is
only valid when a link down idle message is received by the 8051.
Enhance the explanation on why the link went down.
Dean Luick [Thu, 14 Apr 2016 15:31:30 +0000 (08:31 -0700)]
IB/hfi1: Ignore link downgrade with 0 lanes
Versions of the 8051 firmware < 0.38 may report a link failure
as a link downgrade with a width of 0 followed by a link down
notification. Ignore the zero width downgrade notification -
the driver should follow the link down path.
Dean Luick [Tue, 12 Apr 2016 18:32:06 +0000 (11:32 -0700)]
IB/hfi1: Add RSM rule for user FECN handling
Add a receive side mapping rule to extract expected user packets with
the FECN bit set and place them in an eager buffer. This will allow
user libraries to recognize that a FECN was sent when using header
suppression and respond appropriately.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Dean Luick [Tue, 12 Apr 2016 18:31:11 +0000 (11:31 -0700)]
IB/hfi1: Move QOS decision logic into its own function
The decision to use QOS affects other resource allocation.
Move the QOS decision logic into its own function so it can
be called by other interested parties.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Dean Luick [Tue, 12 Apr 2016 18:30:51 +0000 (11:30 -0700)]
IB/hfi1: Extract RSM map table init from QOS
Refactor the allocation, tracking, and writing of the RSM map table
into its own set of routines. This will allow the map table to be
passed to multiple users to fill in as needed. Start with the original
user, QOS.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
IB/hfi1: Reduce kernel context pio buffer allocation
The pio buffers were pooled evenly among all kernel contexts and
user contexts. However, the demand from kernel contexts is much
lower than user contexts. This patch reduces the allocation for
kernel contexts and thus makes more credits available for PSM,
helping performance. This is especially useful on high core-count
systems where large numbers of contexts are used.
A new context type SC_VL15 is added to distinguish the context used
for VL15 from other kernel contexts. The reason is that VL15 needs
to support 2KB sized packet while other kernel contexts need only
support packets up to the size determined by "piothreshold", which
has a default value of 256.
The new allocation method allows triple buffering of largest pio
packets configured for these contexts. This is sufficient to maintain
verbs performance. The largest pio packet size is 2048B for VL15
and "piothreshold" for other kernel contexts. A cap is applied to
"piothreshold" to avoid excessive buffer allocation.
The special case that SDMA is disable is handled differently. In
that case, the original pooling allocation is used to better
support the much higher pio traffic.
Notice that if adaptive pio is disabled (piothreshold==0), the pio
buffer size doesn't matter for non-VL15 kernel send contexts when
SDMA is enabled because pio is not used at all on these contexts
and thus the new allocation is still valid. If SDMA is disabled then
pooling allocation is used as mentioned in previous paragraph.
Adjustment is also made to the calculation of the credit return
threshold for the kernel contexts. Instead of purely based on
the MTU size, a percentage based threshold is also considered and
the smaller one of the two is chosen. This is necessary to ensure
that with the reduced buffer allocation credits are returned in
time to avoid unnecessary stall in the send path.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Mark Debbage <mark.debbage@intel.com> Reviewed-by: Jubin John <jubin.john@intel.com> Signed-off-by: Jianxin Xiong <jianxin.xiong@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jubin John [Tue, 12 Apr 2016 18:30:08 +0000 (11:30 -0700)]
IB/hfi1: Change default number of user contexts
Change the default number of user contexts to the number of real
(non-HT) cpu cores in order to reduce the division of hfi1 hardware
contexts in the case of high core counts with hyper-threading enabled.
Reviewed-by: Dean Luick <dean.luick@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Mike Marciniszyn [Tue, 12 Apr 2016 18:28:56 +0000 (11:28 -0700)]
IB/hfi1: Remove unreachable code
Remove unreachable code from RC ack handling to fix an
smatch error.
Fixes: 633d27399514 ("staging/rdma/hfi1: use mod_timer when appropriate") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Dean Luick [Tue, 12 Apr 2016 18:28:36 +0000 (11:28 -0700)]
IB/hfi1: Fix double QSFP resource acquire on cache refresh
The function refresh_qsfp_cache() acquires the i2c chain resource,
but one caller already holds the resource. Change the acquire so
all calls to refresh_qsfp_cache() are covered by the acquire and
remove the acquire within refresh_qsfp_cache().
Dean Luick [Tue, 12 Apr 2016 18:26:21 +0000 (11:26 -0700)]
IB/hfi1: Guard against concurrent I2C access across all chains
The discrete ASIC board design makes the two I2C chains not
independent of each other. That is, only one chain can safely
be accessed at a time. For discrete ASIC devices, adjust the
resource locking so that access to one I2C chain will lock both
of the chains.
The pre-LNI SerDes and channel tuning algorithm already checks for
module presence assertion for the relevant port types. The extraneous
check removed in this patch blocks link up for port types for which
the module presence assertion is not relevant.
IB/hfi1: Always turn on CDRs for low power QSFP modules
Clock and data recovery mechanisms (CDRs) in active QSFP modules
can be turned on or off to improve the bit error rate observed on
the channel. Signal integrity and bit error rate requirements require
us to always turn on any CDRs present in low power cables (power
dissipation 2.5W or lower). However, we adhere to the platform
designer's settings (provided in the platform configuration) for
higher power cables (dissipation 3.5W or higher) if the platform
designer has determined that the platform requires the CDRs to be
turned on (or off) and is capable of supplying and cooling the higher
power modules.
This patch also introduces the get_qsfp_power_class function to
centralize the bit twiddling required to determine the QSFP power class
across the code. Reusing this function improves the readability of code
that depends on knowing the power class of the cable, such as the
active and optical channel tuning algorithm.
Reviewed-by: Dean Luick <dean.luick@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Easwar Hariharan <easwar.hariharan@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
IB/hfi1: Check P_KEY for all sent packets from user mode
Add the P_KEY check for user-context mechanism for
both PIO and SDMA. For PIO, the
SendCtxtCheckEnable.DisallowKDETHPackets is set by
default. When the P_KEY is set,
SendCtxtCheckEnable.DisallowKDETHPackets is cleared.
For SDMA, a software check was included. This change
requires user processes to set the P_KEY before sending
any packets, otherwise, the sent packet will fail. The
original submission didn't have this check but it's
required.
Reviewed-by: Dean Luick <dean.luick@intel.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Mikto Haralanov <mitko.haralanov@intel.com> Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Increasing the default MTU size to 10KB improves performance
for PSM. Change the default MTU to 10KB but constrain
Verbs MTU to 8KB. Also update default MTU module parameter
description to be HFI1_DEFAULT_MAX_MTU.
Reviewed-by: Dean Luick <dean.luick@intel.com> Reviewed-by: Mitko Haralanov <mitko.haralanov@intel.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Jubin John <jubin.john@intel.com> Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Dean Luick [Tue, 12 Apr 2016 17:50:35 +0000 (10:50 -0700)]
IB/hfi1: Simplify init_qpmap_table()
Make init_qpmap_table() easier to understand by simplifying
the loop indexing and writing each register when it is "full",
removing the need for a follow-on register write.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Dean Luick [Tue, 12 Apr 2016 17:50:16 +0000 (10:50 -0700)]
IB/hfi1: Remove invalid QOS check
Remove an invalid compare of the number of QOS RSM map table entries
against the number of physical receive contexts. The RSM map table
has its own size and has no relation to the number of physical receive
contexts.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Dean Luick [Tue, 12 Apr 2016 17:50:10 +0000 (10:50 -0700)]
IB/hfi1: Fix QOS num_vl bit width
The bit width for num_vls, n, needs to be calculated based on
the pow2 rounded up of the number of vls. Otherwise num_vls of 3,
5, 6, and 7 will have misplaced QOS RSM map entries.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Dean Luick [Tue, 12 Apr 2016 17:50:04 +0000 (10:50 -0700)]
IB/hfi1: Fix i2c resource reservation checks
The i2c and qsfp read/write routines should check for the resource
reservation of the incoming argument target rather than the implicit
target of the hardware HFI.
Reviewed-by: Easwar Hariharan <easwar.hariharan@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jubin John [Wed, 20 Apr 2016 13:05:24 +0000 (06:05 -0700)]
IB/rdmavt,hfi1,qib: Fix memory leak
rdi->ports has memory allocated in rvt_alloc_device(), but does not get
freed because the hfi1 and qib drivers drivers call ib_dealloc_device()
directly instead of going through rdmavt. Add a rvt_dealloc_device()
that frees rdi->ports and then calls ib_dealloc_device(). Switch hfi1
and qib drivers to calling rvt_dealloc_device() instead of
ib_dealloc_device() directly.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Brian Welty <brian.welty@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
IB/hfi1: Fix buffer cache races which may cause corruption
There are two possible causes for node/memory corruption both
of which are related to the cache eviction algorithm. One way
to cause corruption is due to the asynchronous nature of the
MMU invalidation and the locking used when invalidating node.
The MMU invalidation routine would temporarily release the
RB tree lock to avoid a deadlock. However, this would allow
the eviction function to take the lock resulting in the removal
of cache nodes.
If the node being removed by the eviction code is the same as
the node being invalidated, the result is use after free.
The same is true in the other direction due to the temporary
release of the eviction list lock in the eviction loop.
Another corner case exists when dealing with the SDMA buffer
cache that could cause memory corruption of kernel memory.
The most common way, in which this corruption exhibits itself
is a linked list node corruption. In that case, the kernel will
complain that a node with poisoned pointers is being removed.
The fact that the pointers are already poisoned means that the
node has already been removed from the list.
To root cause of this corruption was a mishandling of the
eviction list maintained by the driver. In order for this
to happen four conditions need to be satisfied:
1. A node describing a user buffer already exists in the
interval RB tree,
2. The beginning of the current user buffer matches that
node but is bigger. This will cause the node to be
extended.
3. The amount of cached buffers is close or at the limit
of the buffer cache size.
4. The node has dropped close to the end of the eviction
list. This will cause the node to be considered for
eviction.
If all of the above conditions have been satisfied, it is
possible for the eviction algorithm to evict the current node,
which will free the node without the driver knowing.
To solve both issues described above:
- the locking around the MMU invalidation loop and cache
eviction loop has been improved so locks are not released in
the loop body,
- a new RB function is introduced which will "atomically" find
and remove the matching node from the RB tree, preventing the
MMU invalidation loop from touching it, and
- the node being extended by the pin_vector_pages() function is
removed from the eviction list prior to calling the eviction
function.
IB/hfi1: Extract and reinsert MMU RB node on lookup
The page pinning function, which also maintains the pin cache,
behaves one of two ways when an exact buffer match is not found:
1. If no node is not found (a buffer with the same starting address
is not found in the cache), a new node is created, the buffer
pages are pinned, and the node is inserted into the RB tree, or
2. If a node is found but the buffer in that node is a subset of
the new user buffer, the node is extended with the new buffer
pages.
Both modes of operation require (re-)insertion into the interval RB
tree.
When the node being inserted is a new node, the operations are pretty
simple. However, when the node is already existing and is being
extended, special care must be taken.
First, we want to guard against an asynchronous attempt to
delete the node by the MMU invalidation notifier. The simplest way to
do this is to remove the node from the RB tree, preventing the search
algorithm from finding it.
Second, the node needs to be re-inserted so it lands in the proper place
in the tree and the tree is correctly re-balanced. This also requires
the node to be removed from the RB tree.
This commit adds the hfi1_mmu_rb_extract() function, which will search
for a node in the interval RB tree matching an address and length and
remove it from the RB tree if found. This allows for both of the above
special cases be handled in a single step.
Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
The computation of the interval of an interval RB node
was incorrect leading to data corruption due to the RB
search algorithm not properly finding the all RB nodes
in an MMU invalidation interval.
The problem stemmed from the fact that the beginning
address of the node's range was being aligned to a page
boundary. For certain buffer sizes, this would lead to
a end address calculation that was off by 1 page.
An important aspect of keeping the RB same is also
updating the node's range in the case it's being extended.
IB/hfi1: Protect the interval RB tree when cleaning up
The current implementation of the clean up function for
the interval RB trees has two flaws which may cause
problems in cases of concurrent executing of the function
and MMU notifier.
The flaws were due to the fact that deregistration of the
MMU callbacks was done after the tree was emptied and,
furthermore, the tree was not being locked.
This commit fixes both of these flaws by, first, switch the
order of operations, and, second, locking the tree while
traversing it to prevent any other operations.
The driver had two memory leaks - one in the user
expected receive code and one in SDMA buffer cache.
The leak in the expected receive code only showed up
when the user/admin had set ulimit sufficiently low
and the driver did not have enough room in the cache
before hitting the limit of allowed cachable memory.
When this condition occurred, the driver returned
early signaling userland that it needed to free some
buffers to free up room in the cache.
The bug was that the driver was not cleaning up
allocated memory prior to returning early.
The leak in the SDMA buffer cache could occur (even
though it never did), when the insertion of a buffer
node in the interval RB tree failed. In this case, the
driver failed to unpin the pages of the node instead
erroneously returning success.
IB/hfi1: Don't remove list entries if they are not in a list
The SDMA cache logic maintains an eviction list which is ordered
by most recently used user buffers. Upon errors or buffer freeing,
the list nodes were unconditionally being deleted. This would lead
to list corruption warnings if the nodes were never inserted in the
eviction list to begin with.
This commit prevents this by checking that the nodes are already
part of the eviction list.
Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Mike Marciniszyn [Tue, 12 Apr 2016 17:46:10 +0000 (10:46 -0700)]
IB/qib, IB/hfi1: Fix up UD loopback use of irq flags
The dual lock patch moved locking around and missed an issue
with handling irq flags when processing UD loopback
packets. This issue was revealed by smatch.
Fix for both qib and hfi1 to pass the saved flags to the UD request
builder and handle the changes correctly.
Fixes: 46a80d62e6e0 ("IB/qib, staging/rdma/hfi1: add s_hlock for use in post send") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jason Gunthorpe [Mon, 11 Apr 2016 01:13:13 +0000 (19:13 -0600)]
IB/security: Restrict use of the write() interface
The drivers/infiniband stack uses write() as a replacement for
bi-directional ioctl(). This is not safe. There are ways to
trigger write calls that result in the return structure that
is normally written to user space being shunted off to user
specified kernel memory instead.
For the immediate repair, detect and deny suspicious accesses to
the write API.
For long term, update the user space libraries and the kernel API
to something that doesn't present the same security vulnerabilities
(likely a structured ioctl() interface).
The impacted uAPI interfaces are generally only available if
hardware from drivers/infiniband is installed in the system.
Reported-by: Jann Horn <jann@thejh.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
[ Expanded check to all known write() entry points ] Cc: stable@vger.kernel.org Signed-off-by: Doug Ledford <dledford@redhat.com>
Dean Luick [Fri, 22 Apr 2016 18:17:03 +0000 (11:17 -0700)]
IB/hfi1: Use kernel default llseek for ui device
The ui device llseek had a mistake with SEEK_END and did
not fully follow seek semantics. Correct all this by
using a kernel supplied function for fixed size devices.
Cc: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
This commit re-arranges the context initialization code in a way that
would allow for context event flags to be used to determine whether
the context has been successfully initialized.
In turn, this can be used to skip the resource de-allocation if they
were never allocated in the first place.
Fixes: 3abb33ac6521 ("staging/hfi1: Add TID cache receive init and free funcs") Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com. Signed-off-by: Doug Ledford <dledford@redhat.com>
Jubin John [Tue, 12 Apr 2016 17:47:00 +0000 (10:47 -0700)]
IB/rdmavt: Fix send scheduling
call_send is used to determine whether to send immediately or schedule
a send for later. The current logic in rdmavt is inverted and has a
negative impact on the latency of the hfi1 and qib drivers. Fix this
regression by correctly calling send immediately when call_send is set.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
IB/hfi1: Fix deadlock caused by locking with wrong scope
The locking around the interval RB tree is designed to prevent
access to the tree while it's being modified. The locking in its
current form is too overzealous, which is causing a deadlock in
certain cases with the following backtrace:
As evident from the backtrace above, the process was being put to sleep
while holding the lock.
Limiting the scope of the lock only to the RB tree operation fixes the
above error allowing for proper locking and the process being put to
sleep when needed.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
IB/hfi1: Prevent NULL pointer deferences in caching code
There is a potential kernel crash when the MMU notifier calls the
invalidation routines in the hfi1 pinned page caching code for sdma.
The invalidation routine could call the remove callback
for the node, which in turn ends up dereferencing the
current task_struct to get a pointer to the mm_struct.
However, the mm_struct pointer could be NULL resulting in
the following backtrace:
To correct this, the mm_struct passed to us by the MMU notifier is
used (which is what should have been done to begin with). This avoids
the broken derefences and ensures that the correct mm_struct is used.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Mitko Haralanov <mitko.haralanov@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Sagi Grimberg [Thu, 31 Mar 2016 16:03:25 +0000 (19:03 +0300)]
IB/mlx5: Expose correct max_sge_rd limit
mlx5 devices (Connect-IB, ConnectX-4, ConnectX-4-LX) has a limitation
where rdma read work queue entries cannot exceed 512 bytes.
A rdma_read wqe needs to fit in 512 bytes:
- wqe control segment (16 bytes)
- rdma segment (16 bytes)
- scatter elements (16 bytes each)
So max_sge_rd should be: (512 - 16 - 16) / 16 = 30.
Cc: linux-stable@vger.kernel.org Reported-by: Christoph Hellwig <hch@lst.de> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagig@grimberg.me> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Merge branch 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fix from Tejun Heo:
"So, it turns out we had a silly bug in the most fundamental part of
workqueue for a very long time. AFAICS, this dates back to pre-git
era and has quite likely been there from the time workqueue was first
introduced.
A work item uses its PENDING bit to synchronize multiple queuers.
Anyone who wins the PENDING bit owns the pending state of the work
item. Whether a queuer wins or loses the race, one thing should be
guaranteed - there will soon be at least one execution of the work
item - where "after" means that the execution instance would be able
to see all the changes that the queuer has made prior to the queueing
attempt.
Unfortunately, we were missing a smp_mb() after clearing PENDING for
execution, so nothing guaranteed visibility of the changes that a
queueing loser has made, which manifested as a reproducible blk-mq
stall.
Lots of kudos to Roman for debugging the problem. The patch for
-stable is the minimal one. For v3.7, Peter is working on a patch to
make the code path slightly more efficient and less fragile"
* 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix ghost PENDING flag while doing MQ IO
Merge branch 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
"Two patches to fix a deadlock which can be easily triggered if memcg
charge moving is used.
This bug was introduced while converting threadgroup locking to a
global percpu_rwsem and is caused by cgroup controller task migration
path depending on the ability to create new kthreads. cpuset had a
similar issue which was fixed by performing heavy-lifting operations
asynchronous to task migration. The two patches fix the same issue in
memcg in a similar way. The first patch makes the mechanism generic
and the second relocates memcg charge moving outside the migration
path.
Given that we don't want to perform heavy operations while
writelocking threadgroup lock anyway, moving them out of the way is a
desirable solution. One thing to note is that the problem was
difficult to debug because lockdep couldn't figure out the deadlock
condition. Looking into how to improve that"
* 'for-4.6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
memcg: relocate charge moving from ->attach to ->post_attach
cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback
Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"I2C has one buildfix, one ABBA deadlock fix, and three simple 'add ID'
patches"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: exynos5: Fix possible ABBA deadlock by keeping I2C clock prepared
i2c: cpm: Fix build break due to incompatible pointer types
i2c: ismt: Add Intel DNV PCI ID
i2c: xlp9xx: add support for Broadcom Vulcan
i2c: rk3x: add support for rk3228
* tag 'arc-4.6-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARC: add support for reserved memory defined by device tree
ARC: support generic per-device coherent dma mem
Documentation: dt: arc: fix spelling mistakes
ARCv2: Enable LOCKDEP
Merge tag 'nios2-v4.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2
Pull arch/nios2 fix from Ley Foon Tan:
"memset: use the right constraint modifier for the %4 output operand"
* tag 'nios2-v4.6-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2:
nios2: memset: use the right constraint modifier for the %4 output operand
Merge tag 'platform-drivers-x86-v4.6-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
Pull x86 platform driver fix from Darren Hart:
"Fix regression caused by hotkey enabling value in toshiba_acpi"
* tag 'platform-drivers-x86-v4.6-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86:
toshiba_acpi: Fix regression caused by hotkey enabling value
nios2: memset: use the right constraint modifier for the %4 output operand
Depending on the size of the area to be memset'ed, the nios2 memset implementation
either uses a naive loop (for buffers smaller or equal than 8 bytes) or a more optimized
implementation (for buffers larger than 8 bytes). This implementation does 4-byte stores
rather than 1-byte stores to speed up memset.
However, we discovered that on our nios2 platform, memset() was not properly setting the
buffer to the expected value. A memset of 0xff would not set the entire buffer to 0xff, but to:
0xff 0x00 0xff 0x00 0xff 0x00 0xff 0x00 ...
Which is obviously incorrect. Our investigation has revealed that the problem lies in the
incorrect constraints used in the inline assembly.
The following piece of assembly, from the nios2 memset implementation, is supposed to
create a 4-byte value that repeats 4 times the 1-byte pattern passed as memset argument:
This is wrong because r5 gets used both for %5 and %4, which leads to the final pattern
stored in r4 to be 0xff00ff00 rather than the expected 0xffffffff.
%4 is defined with the "=r" constraint, i.e as an output operand. However, as explained in
http://www.ethernut.de/en/documents/arm-inline-asm.html, this does not prevent gcc from
using the same register for an output operand (%4) and input operand (%5). By using the
constraint modifier '&', we indicate that the register should be used for output only. With this
change, we get the following assembly output:
Which correctly produces the 0xffffffff pattern when 0xff is passed as the memset() pattern.
It is worth mentioning the observed consequence of this bug: we were hitting the kernel
BUG() in mm/bootmem.c:__free() that verifies when marking a page as free that it was
previously marked as occupied (i.e that the bit was set to 1). The entire bootmem bitmap is
set to 0xff bit via a memset() during the bootmem initialization. The bootmem_free() call right
after the initialization was finding some bits to be set to 0, which didn't make sense since the
bitmap has just been memset'ed to 0xff. Except that due to the bug explained above, the
bitmap was in fact initialized to 0xff00ff00.
Thanks to Marek Vasut for his help and feedback.
Signed-off-by: Romain Perier <romain.perier@free-electrons.com> Acked-by: Marek Vasut <marex@denx.de> Acked-by: Ley Foon Tan <lftan@altera.com>
1) Handle v4/v6 mixed sockets properly in soreuseport, from Craig
Gallak.
2) Bug fixes for the new macsec facility (missing kmalloc NULL checks,
missing locking around netdev list traversal, etc.) from Sabrina
Dubroca.
3) Fix handling of host routes on ifdown in ipv6, from David Ahern.
4) Fix double-fdput in bpf verifier. From Jann Horn.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (31 commits)
bpf: fix double-fdput in replace_map_fd_with_map_ptr()
net: ipv6: Delete host routes on an ifdown
Revert "ipv6: Revert optional address flusing on ifdown."
net/mlx4_en: fix spurious timestamping callbacks
net: dummy: remove note about being Y by default
cxgbi: fix uninitialized flowi6
ipv6: Revert optional address flusing on ifdown.
ipv4/fib: don't warn when primary address is missing if in_dev is dead
net/mlx5: Add pci shutdown callback
net/mlx5_core: Remove static from local variable
net/mlx5e: Use vport MTU rather than physical port MTU
net/mlx5e: Fix minimum MTU
net/mlx5e: Device's mtu field is u16 and not int
net/mlx5_core: Add ConnectX-5 to list of supported devices
net/mlx5e: Fix MLX5E_100BASE_T define
net/mlx5_core: Fix soft lockup in steering error flow
qlcnic: Update version to 5.3.64
net: stmmac: socfpga: Remove re-registration of reset controller
macsec: fix netlink attribute validation
macsec: add missing macsec prefix in uapi
...
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Arnd Bergmann:
"Here are the latest bug fixes for ARM SoCs, mostly addressing recent
regressions. Changes are across several platforms, so I'm listing
every change separately here.
Regressions since 4.5:
- A correction of the psci firmware DT binding, to prevent users from
relying on unintended semantics
- Actually getting the newly merged clock driver for some OMAP
platforms to work
- A revert of patches for the Qualcomm BAM, these need to be reworked
for 4.7 to avoid breaking boards other than the one they were
intended for
- A correction for the I2C device nodes on the Socionext Uniphier
platform
- i.MX SDHCI was broken for non-DT platforms due to a change with the
setting of the DMA mask
- A revert of a patch that accidentally added a nonexisting clock on
the Rensas "Porter" board
- A couple of OMAP fixes that are all related to suspend after the
power domain changes for dra7
- On Mediatek, revert part of the power domain initialization changes
that broke mt8173-evb
Fixes for older bugs:
- Workaround for an "external abort" in the omap34xx suspend/resume
code.
- The USB1/eSATA should not be listed as an excon device on
am57xx-beagle-x15 (broken since v4.0)
- A v4.5 regression in the TI AM33xx and AM43XX DT specifying
incorrect DMA request lines for the GPMC
- The jiffies calibration on Renesas platforms was incorrect for some
modern CPU cores.
- A hardware errata woraround for clockdomains on TI DRA7"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
drivers: firmware: psci: unify enable-method binding on ARM {64,32}-bit systems
arm64: dts: uniphier: fix I2C nodes of PH1-LD20
ARM: shmobile: timer: Fix preset_lpj leading to too short delays
Revert "ARM: dts: porter: Enable SCIF_CLK frequency and pins"
ARM: dts: r8a7791: Don't disable referenced optional clocks
Revert "ARM: OMAP: Catch callers of revision information prior to it being populated"
ARM: OMAP3: Fix external abort on 36xx waking from off mode idle
ARM: dts: am57xx-beagle-x15: remove extcon_usb1
ARM: dts: am437x: Fix GPMC dma properties
ARM: dts: am33xx: Fix GPMC dma properties
Revert "soc: mediatek: SCPSYS: Fix double enabling of regulators"
ARM: mach-imx: sdhci-esdhc-imx: initialize DMA mask
ARM: DRA7: clockdomain: Implement timer workaround for errata i874
ARM: OMAP: Catch callers of revision information prior to it being populated
ARM: dts: dra7: Correct clock tree for sys_32k_ck
ARM: OMAP: DRA7: Provide proper class to omap2_set_globals_tap
ARM: OMAP: DRA7: wakeupgen: Skip SAR save for wakeupgen
Revert "dts: msm8974: Add dma channels for blsp2_i2c1 node"
Revert "dts: msm8974: Add blsp2_bam dma node"
ARM: dts: Add clocks for dm814x ADPLL
This is more prep-work for the upcoming pty changes. Still just code
cleanup with no actual semantic changes.
This removes a bunch pointless complexity by just having the slave pty
side remember the dentry associated with the devpts slave rather than
the inode. That allows us to remove all the "look up the dentry" code
for when we want to remove it again.
Together with moving the tty pointer from "inode->i_private" to
"dentry->d_fsdata" and getting rid of pointless inode locking, this
removes about 30 lines of code. Not only is the end result smaller,
it's simpler and easier to understand.
The old code, for example, depended on the d_find_alias() to not just
find the dentry, but also to check that it is still hashed, which in
turn validated the tty pointer in the inode.
That is a _very_ roundabout way to say "invalidate the cached tty
pointer when the dentry is removed".
The new code just does
dentry->d_fsdata = NULL;
in devpts_pty_kill() instead, invalidating the tty pointer rather more
directly and obviously. Don't do something complex and subtle when the
obvious straightforward approach will do.
The rest of the patch (ie apart from code deletion and the above tty
pointer clearing) is just switching the calling convention to pass the
dentry or file pointer around instead of the inode.
Cc: Eric Biederman <ebiederm@xmission.com> Cc: Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Aurelien Jarno <aurelien@aurel32.net> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> Cc: Jann Horn <jann@thejh.net> Cc: Greg KH <greg@kroah.com> Cc: Jiri Slaby <jslaby@suse.com> Cc: Florian Weimer <fw@deneb.enyo.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bpf: fix double-fdput in replace_map_fd_with_map_ptr()
When bpf(BPF_PROG_LOAD, ...) was invoked with a BPF program whose bytecode
references a non-map file descriptor as a map file descriptor, the error
handling code called fdput() twice instead of once (in __bpf_map_get() and
in replace_map_fd_with_map_ptr()). If the file descriptor table of the
current task is shared, this causes f_count to be decremented too much,
allowing the struct file to be freed while it is still in use
(use-after-free). This can be exploited to gain root privileges by an
unprivileged user.
This bug was introduced in
commit 0246e64d9a5f ("bpf: handle pseudo BPF_LD_IMM64 insn"), but is only
exploitable since
commit 1be7f75d1668 ("bpf: enable non-root eBPF programs") because
previously, CAP_SYS_ADMIN was required to reach the vulnerable code.
(posted publicly according to request by maintainer)
Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Hariprasad S [Tue, 5 Apr 2016 04:53:48 +0000 (10:23 +0530)]
RDMA/iw_cxgb4: Fix bar2 virt addr calculation for T4 chips
For T4, kernel mode qps don't use the user doorbell. User mode qps during
flow control db ringing are forced into kernel, where user doorbell is
treated as kernel doorbell and proper bar2 offset in bar2 virtual space is
calculated, which incase of T4 is a bogus address, causing a kernel panic
due to illegal write during doorbell ringing.
In case of T4, kernel mode qp bar2 virtual address should be 0. Added T4
check during bar2 virtual address calculation to return 0. Fixed Bar2
range checks based on bar2 physical address.
Steve Wise [Tue, 12 Apr 2016 13:55:01 +0000 (06:55 -0700)]
iw_cxgb3: initialize ibdev.iwcm->ifname for port mapping
The IWCM uses ibdev.iwcm->ifname for registration with the iwarp
port map daemon. But iw_cxgb3 did not initialize this field which
causes intermittent registration failures based on the contents of the
uninitialized memory.
Fixes: c1340e8aa628 ("iw_cxgb3: support for iWARP port mapping") Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Steve Wise [Tue, 12 Apr 2016 13:54:54 +0000 (06:54 -0700)]
iw_cxgb4: initialize ibdev.iwcm->ifname for port mapping
The IWCM uses ibdev.iwcm->ifname for registration with the iwarp
port map daemon. But iw_cxgb4 did not initialize this field which
causes intermittent registration failures based on the contents of the
uninitialized memory.
Fixes: 170003c894d9 ("iw_cxgb4: remove port mapper related code") Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
The drain_rq function expects a normal receive qp to drain. A qp can
only have either a normal rq or an srq. If there is an srq, there
is no rq to drain. Until the API supports draining SRQs, simply
skip draining the rq when the qp has an srq attached.
Fixes: 765d67748bcf ("IB: new common API for draining queues") Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Doug Ledford <dledford@redhat.com>