There are several computations of the sc in the
ud receive routine.
Besides the code duplication, all are wrong when the
sc is greater than 15. In that case the code incorrectly
or's a 1 into the computed sc instead of 1 shifted left
by 4.
Fix precomputed sc5 by using an already implemented routine
hdr2sc() and deleting flawed duplicated code.
Cc: Stable <stable@vger.kernel.org> # 4.6+ Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
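The bit arithmetic behind the sc fix above can be sketched in a few lines. This is only an illustration: the helper names and the idea of passing the fifth bit as a separate argument are assumptions made for the example, not the hfi1 driver's hdr2sc() implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: build a 5-bit sc from its low four bits and a
 * separately supplied fifth bit. The fifth bit has to land in bit 4.
 */
static uint8_t sc5_from_parts(uint8_t sc4, int fifth_bit)
{
    assert(sc4 <= 0xf);
    return (uint8_t)(sc4 | ((fifth_bit ? 1u : 0u) << 4));
}

/*
 * The flawed pattern described in the commit message: the bit is ORed
 * into position 0 instead of position 4, so no result can exceed 15.
 */
static uint8_t sc5_from_parts_buggy(uint8_t sc4, int fifth_bit)
{
    assert(sc4 <= 0xf);
    return (uint8_t)(sc4 | (fifth_bit ? 1u : 0u));
}
```

With sc4 = 0x2 and the fifth bit set, the first helper yields 0x12 while the flawed one yields 0x3, which is why only values above 15 were affected.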
Bart Van Assche [Thu, 23 Jun 2016 07:35:48 +0000 (09:35 +0200)]
IB/srpt: Reduce QP buffer size
The memory needed for the send and receive queues associated with
a QP is proportional to the max_sge parameter. The current value
of that parameter is such that with an mlx4 HCA the QP buffer size
is 8 MB. Since DMA is used for communication between HCA and CPU
that buffer either has to be allocated coherently or map_single()
must succeed for that buffer. Since large contiguous allocations
are fragile and since the maximum segment size for e.g. swiotlb
is 256 KB, reduce the max_sge parameter. This patch avoids that
the following text appears on the console after SRP logout and
relogin on a system equipped with multiple IB HCAs:
Shiraz Saleem [Tue, 14 Jun 2016 21:54:16 +0000 (16:54 -0500)]
i40iw: Correct CQ arming
CQ is armed for solicited events only, ignoring other notification
flags. Correct this by arming for next and arming for solicited
event if IB_CQ_SOLICITED is set. Also protect CQ shadow area update
with spinlock.
Ashutosh Dixit [Sat, 18 Jun 2016 02:17:54 +0000 (19:17 -0700)]
IB/hfi1: Don't zero out qp->s_ack_queue in rvt_reset_qp
Since rvt_reset_qp already zeroes out qp->s_ack_queue head and tail
pointers, there is no need to zero out qp->s_ack_queue itself.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Dotan Barak [Wed, 22 Jun 2016 14:27:31 +0000 (17:27 +0300)]
IB/mlx4: Fix memory leak if QP creation failed
When RC, UC, or RAW QPs are created, a qp object is allocated (kzalloc).
If at a later point (in procedure create_qp_common) the qp creation fails,
this qp object must be freed.
Fixes: 1ffeb2eb8be99 ("IB/mlx4: SR-IOV IB context objects and proxy/tunnel SQP support") Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Yishai Hadas [Wed, 22 Jun 2016 14:27:30 +0000 (17:27 +0300)]
IB/mlx4: Verify port number in flow steering create flow
In procedure mlx4_ib_create_flow, passing an invalid port number
will cause an out-of-bounds array access. Data passed to this procedure
can come from user-space. Therefore, the port number must be validated
before proceeding.
Note that we check against the number of physical ports declared at
the verbs (ib core) level; when bonding is active, the verbs level
sees one physical port, even though the low-level driver sees two ports.
Fixes: f77c0162a339 ("IB/mlx4: Add receive flow steering support") Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Reviewed-by: Moni Shoua <monis@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
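The kind of check described in the flow steering commit above might look like the following sketch. The structure, the table size and the function name are hypothetical and do not reflect the mlx4 data structures; the only fixed point is that IB port numbers start at 1 and must be validated before they are used as an array index.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_PORTS 2   /* hypothetical size of the per-port table */

struct flow_dev {
    unsigned int num_ports;        /* ports visible at the verbs level */
    int steer_table[MAX_PORTS];    /* indexed by port - 1 */
};

/* Reject any port outside 1..num_ports before touching the table. */
static int create_flow_checked(struct flow_dev *dev, unsigned int port, int rule)
{
    assert(dev != NULL);
    if (port < 1 || port > dev->num_ports || port > MAX_PORTS)
        return -EINVAL;
    dev->steer_table[port - 1] = rule;
    return 0;
}
```

Checking against num_ports rather than MAX_PORTS mirrors the note about bonding: the limit is what the verbs level reports, even if the table underneath has more slots.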
Yishai Hadas [Wed, 22 Jun 2016 14:27:29 +0000 (17:27 +0300)]
IB/mlx4: Fix error flow when sending mads under SRIOV
Fix mad send error flow to prevent double freeing address handles,
and leaking tx_ring entries when SRIOV is active.
If ib_mad_post_send fails, the address handle pointer in the tx_ring entry
must be set to NULL (or there will be a double-free) and tx_tail must be
incremented (or there will be a leak of tx_ring entries).
The tx_ring is handled the same way in the send-completion handler.
Fixes: 37bfc7c1e83f ("IB/mlx4: SR-IOV multiplex and demultiplex MADs") Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Yishai Hadas [Wed, 22 Jun 2016 14:27:28 +0000 (17:27 +0300)]
IB/mlx4: Fix the SQ size of an RC QP
When calculating the required size of an RC QP send queue, leave
enough space for masked atomic operations, which require more space than
"regular" atomic operations.
Fixes: 6fa8f719844b ("IB/mlx4: Add support for masked atomic operations") Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Reviewed-by: Jack Morgenstein <jackm@mellanox.co.il> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Eli Cohen [Wed, 22 Jun 2016 14:27:26 +0000 (17:27 +0300)]
IB/mlx5: Fix post send fence logic
If the caller specified IB_SEND_FENCE in the send flags of the work
request and no previous work request stated that the successive one
should be fenced, the work request would be executed without a fence.
This could result in RDMA read or atomic operations failure due to a MR
being invalidated. Fix this by adding the mlx5 enumeration for fencing
RDMA/atomic operations and fix the logic to apply this.
Fixes: e126ba97dba9 ('mlx5: Add driver for Mellanox Connect-IB adapters') Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Eli Cohen [Wed, 22 Jun 2016 14:27:24 +0000 (17:27 +0300)]
IB/core: Fix false search of the IB_SA_WELL_KNOWN_GUID
When virtualization is supported, VFs may send SA MADs to a GID formed
by the concatenation of the subnet prefix with the
IB_SA_WELL_KNOWN_GUID. When a response is required, the current code
will search the local HCA's port for the received GID to figure out the
GID index of the entry containing this GID. However, since this is not a
real GID it will not be found and an error will be printed.
We change the logic to check if the destination GID is this special GID
and avoid lookup in this case and use GID index 0.
Fixes: a0c1b2a35087 ('IB/core: Support accessing SA in virtualized environment') Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Alex Vesker [Wed, 22 Jun 2016 14:27:23 +0000 (17:27 +0300)]
IB/core: Fix RoCE v1 multicast join logic issue
During multicast join of RoCEv1, IGMP join state and max hop limit
were updated incorrectly. IGMP join should be sent and marked as
joined only on RoCEv2 after a successful join. Max hops should be
updated to the hop limit on RoCEv2 regardless of the join state.
Fixes: bee3c3c91865 ('IB/cma: Join and leave multicast groups...') Signed-off-by: Alex Vesker <valex@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Talat Batheesh [Wed, 22 Jun 2016 14:27:22 +0000 (17:27 +0300)]
IB/core: Fix no default GIDs when netdevice reregisters
Currently, when the netdevice returned by get_netdev is unregistered,
we delete all GIDs (including the default GIDs) and reset their
attributes. Therefore, when we re-register it, no default GIDs
will be assigned (as their "default GID" attribute will be reset).
Fix this by keeping the "default GID" attribute.
Fixes: 03db3a2d81e6 ('IB/core: Add RoCE GID table management') Signed-off-by: Talat Batheesh <talatb@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
IB/hfi1: Send a pkey change event on driver pkey update
Swapping a cable from a "Mgmt Allowed=No" switch port to a
"Mgmt Allowed=Yes" switch port doesn't send a pkey change
notification. Therefore, the link doesn't become active as
the oib_utils layer uses an old pkey table cache.
Fix by ensuring the pkey change notification is sent when
the table is changed both explicitly by the FM and implicitly
by the driver via a cable swap.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
IB/hfi1: Remove FULL_MGMT_P_KEY from pkey table at link up
FULL_MGMT_P_KEY doesn't get cleared from the pkey table at link bounce
because the link down and link bounce code paths are different when
moving a QSFP cable on a switch. This causes an HFI unit connected to a
switch to try to be initialized to an FM node when the QSFP cable is
moved from a MgmtAllowed=NO port to a MgmtAllowed=YES port and back to a
MgmtAllowed=NO port. Remove FULL_MGMT_P_KEY from pkey table at link up.
Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Tadeusz Struk [Thu, 9 Jun 2016 14:51:45 +0000 (07:51 -0700)]
IB/hfi1: Fix potential NULL ptr dereference
This fixes potential NULL ptr dereference because IS_ERR(dd) doesn't
handle NULL. Fix the issue by initializing the pointer with a not NULL
error code.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
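Why IS_ERR() misses NULL can be shown with a small user-space imitation of the kernel's error-pointer helpers. The three helpers below are re-implemented here for illustration, and the lookup function, its table and the choice of -ENODEV are assumptions made for the example rather than the hfi1 code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* User-space imitations of the kernel helpers of the same names. */
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
    /* Only the top MAX_ERRNO addresses count as errors; NULL does not. */
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct devdata { int unit; };

/*
 * Hypothetical lookup: the result starts out as an error pointer, so a
 * caller that only tests IS_ERR() never receives NULL.
 */
static struct devdata *find_dev(struct devdata *table, size_t n, int unit)
{
    struct devdata *dd = ERR_PTR(-ENODEV);

    for (size_t i = 0; i < n; i++) {
        if (table[i].unit == unit) {
            dd = &table[i];
            break;
        }
    }
    return dd;
}
```

Had dd been initialized to NULL, a failed lookup would pass the IS_ERR() test and be dereferenced by the caller.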
Jubin John [Thu, 9 Jun 2016 14:51:27 +0000 (07:51 -0700)]
IB/hfi1: Increase packet egress timeout
The current value of 500us for the packet egress timeout is too small
which causes the host to declare failure on draining packets too early
and unnecessarily bounces the link. Increase this to 50ms taking into
account the switch packet discard timer default and the worst case
per-VL packet drainage rate.
Reviewed-by: Dean Luick <dean.luick@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jubin John [Thu, 9 Jun 2016 14:51:08 +0000 (07:51 -0700)]
IB/hfi1: Fix credit return threshold adjustment
The credit return threshold adjustment on mtu change algorithm does not
take into account all the kernel send contexts that are assigned per VL.
Use the pio send context map to adjust the credit return thresholds for
all the allocated and assigned kernel send contexts based on the MTU
adjustment per VL.
The pio send context map can be changed dynamically based on the actual
number of operational vls which is set by the fabric manager. When this
happens update the credit return threshold values for all the remapped
kernel send contexts.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Reviewed-by: Jianxin Xiong <jianxin.xiong@intel.com> Signed-off-by: Jubin John <jubin.john@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Bart Van Assche [Fri, 10 Jun 2016 18:08:25 +0000 (11:08 -0700)]
IB/cma: Make the code easier to verify
Static source code analysis tools like smatch cannot handle functions
that either lock or do not lock a mutex depending on the value of the arguments.
Hence inline the function cma_disable_callback(). Additionally, this
patch realizes a small performance optimization by reducing the number of
mutex_lock() and mutex_unlock() calls in the modified functions. With
this patch applied smatch no longer complains about source file cma.c.
Without this patch smatch reports the following for this source file:
drivers/infiniband/core/cma.c:1959: cma_req_handler() warn: inconsistent returns 'mutex:&listen_id->handler_mutex'.
Locked on: line 1880
line 1959
Unlocked on: line 1941
drivers/infiniband/core/cma.c:2112: iw_conn_req_handler() warn: inconsistent returns 'mutex:&listen_id->handler_mutex'.
Locked on: line 2048
Unlocked on: line 2112
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Steve Wise <swise@opengridcomputing.com> Cc: Leon Romanovsky <leonro@mellanox.com> Acked-by: Sean Hefty <sean.hefty@intel.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jason Gunthorpe [Wed, 8 Jun 2016 23:28:29 +0000 (17:28 -0600)]
IB/mlx4: Properly initialize GRH TClass and FlowLabel in AHs
When this code was reworked for IBoE support the order of assignments
for the sl_tclass_flowlabel got flipped around resulting in
TClass & FlowLabel being permanently set to 0 in the packet headers.
This breaks IB routers that rely on these headers, but only affects
kernel users - libmlx4 does this properly for user space.
Cc: stable@vger.kernel.org Fixes: fa417f7b520e ("IB/mlx4: Add support for IBoE") Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Erez Shitrit [Sat, 4 Jun 2016 12:15:19 +0000 (15:15 +0300)]
IB/IPoIB: Don't update neigh validity for unresolved entries
ipoib_neigh_get unconditionally updates the "alive" variable member on
any packet send. This prevents the neighbor garbage collection from
cleaning out a dead neighbor entry if we are still queueing packets
for it. If the queue for this neighbor is full, then don't update the
alive timestamp. That way the neighbor can time out even if packets
are still being queued as long as none of them are being sent.
Fixes: b63b70d87741 ("IPoIB: Use a private hash table for path lookup in xmit path") Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Achiad Shochat [Sat, 4 Jun 2016 12:15:37 +0000 (15:15 +0300)]
IB/mlx5: Fix alternate path code
Userspace flag IBV_QP_ALT_PATH is supposed to set the alternate path
including fields alt_pkey_index and alt_timeout.
Added IB_QP_PKEY_INDEX and IB_QP_TIMEOUT to the attribute mask when
calling mlx5_set_path for the alternate path to force setting the
alt_pkey_index and alt_timeout values.
Fixes: bf24481a3a7c4 ('IB/mlx5: Consider alternate path in pkey ...') Signed-off-by: Achiad Shochat <achiad@mellanox.com> Signed-off-by: Noa Osherovich <noaos@mellanox.com> Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Noa Osherovich [Sat, 4 Jun 2016 12:15:36 +0000 (15:15 +0300)]
IB/mlx5: Fix pkey_index length in the QP path record
Pkey index fields in the QP context path record are extended to 16
bits, as required by IB spec (version 1.3).
This change affects all QP commands which include path records.
To enable this change, moved the free adaptive routing flag bit
(free_ar) to the most significant byte of the QP path record.
Fixes: e126ba97dba9e ('mlx5: Add driver for Mellanox Connect-IB ...') Signed-off-by: Noa Osherovich <noaos@mellanox.com> Reviewed-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Noa Osherovich [Sat, 4 Jun 2016 12:15:31 +0000 (15:15 +0300)]
IB/mlx5: Limit query HCA clock
When PAGE_SIZE is larger than 4K, the user shouldn't be able to query
the HCA core clock. This counter is within 4KB boundary and the
user-space shall not read information that's after this boundary.
Noa Osherovich [Sat, 4 Jun 2016 12:15:29 +0000 (15:15 +0300)]
IB/mlx5: Return PORT_ERR in Active to Initializing transition
FW port-change events are fired on Active <-> non Active port state
transitions only.
When the port state changes from Active to Initializing (Active ->
Down -> Initializing), a single event is fired.
The HCA transitions from Down to Initializing unless prevented from
doing so, hence the driver should also propagate events when the port
state is Initializing to consumers so they'll be aware that the port
is no longer Active and act accordingly.
Fixes: e126ba97dba9e ('mlx5: Add driver for Mellanox Connect-IB...') Signed-off-by: Noa Osherovich <noaos@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Max Gurtovoy [Mon, 6 Jun 2016 16:34:39 +0000 (19:34 +0300)]
IB/core: Fix bit corruption in ib_device_cap_flags structure
ib_device_cap_flags 64-bit expansion caused caps overlapping
and made consumers read wrong device capabilities. For example
IB_DEVICE_SG_GAPS_REG was falsely read by the iser driver causing
it to use a non-existing capability. This happened because signed
int becomes sign extended when it is converted to u64. Fix this by
casting IB_DEVICE_ON_DEMAND_PAGING enumeration to ULL.
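The sign-extension effect is easy to reproduce outside the kernel. In the sketch below the flag names and bit positions are hypothetical stand-ins and not the real ib_device_cap_flags layout; INT32_MIN is used as a signed value with only bit 31 set, which avoids shifting into the sign bit of an int.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flags for illustration only. */
#define FLAG_BIT31_SIGNED  INT32_MIN       /* only bit 31 set, signed type */
#define FLAG_BIT31_ULL     (1ULL << 31)    /* same bit as an unsigned 64-bit value */
#define FLAG_BIT32_ULL     (1ULL << 32)    /* stands in for a higher capability bit */

/* Converting a negative int to a 64-bit value fills bits 32..63 with ones. */
static uint64_t widen_signed(int32_t flag)
{
    return (uint64_t)flag;
}

static int has_flag(uint64_t caps, uint64_t flag)
{
    return (caps & flag) != 0;
}
```

Here widen_signed(FLAG_BIT31_SIGNED) evaluates to 0xFFFFFFFF80000000, so has_flag() reports FLAG_BIT32_ULL as present even though it was never set, whereas FLAG_BIT31_ULL carries no stray high bits.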
Mark Bloch [Sat, 4 Jun 2016 12:15:24 +0000 (15:15 +0300)]
IB/core: Initialize sysfs attributes before sysfs create group
For dynamically allocated sysfs attributes there is a need to call
sysfs_attr_init in order to comply with lockdep, not calling it
will result in error complaining key is not in .data section.
Fixes: b40f4757daa1 ("IB/core: Make device counter infrastructure dynamic") Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Mark Bloch [Sat, 4 Jun 2016 12:15:22 +0000 (15:15 +0300)]
IB/IPoIB: Disable bottom half when dealing with device address
Align locking usage when touching device address with rest
of the kernel. Lock the bottom half when doing so using
netif_addr_lock_bh.
This also solves the following case as reported by lockdep:
       CPU0                    CPU1
       ----                    ----
  lock(_xmit_INFINIBAND);
                               local_irq_disable();
                               lock(&(&mc->mca_lock)->rlock);
                               lock(_xmit_INFINIBAND);
  <Interrupt>
    lock(&(&mc->mca_lock)->rlock);
 *** DEADLOCK ***
Fixes: 492a7e67ff83 ("IB/IPoIB: Allow setting the device address") Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Aviv Heller [Sat, 4 Jun 2016 12:15:21 +0000 (15:15 +0300)]
IB/core: Fix removal of default GID cache entry
When deleting a default GID from the cache, its gid_type field is set
to 0.
This could set the gid_type to RoCE v1 for a RoCE v2 default GID,
essentially making it inaccessible to future modifications, since it
is no longer found by find_gid().
This fix preserves the gid_type value for default gids during cache
operations.
Fixes: b39ffa1df505 ('IB/core: Add gid_type to gid attribute') Signed-off-by: Aviv Heller <avivh@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Erez Shitrit [Sat, 4 Jun 2016 12:15:20 +0000 (15:15 +0300)]
IB/IPoIB: Fix race between ipoib_remove_one to sysfs functions
In ipoib_remove_one the driver holds the rtnl_lock and performs
operations such as dev_change_flags or unregister_netdev. Meanwhile a
sysfs callback such as ipoib_vlan_delete holds the sysfs mutex and tries
to take the rtnl_lock via rtnl_trylock(), calling restart_syscall() if
the lock is not free. At the same time ipoib_remove_one tries to take the
sysfs lock in order to free its sysfs directory, giving an a->b, b->a
deadlock.
Eli Cohen [Sat, 4 Jun 2016 12:15:18 +0000 (15:15 +0300)]
IB/core: Fix query port failure in RoCE
Currently ib_query_port always attempts to read the subnet prefix by
calling ib_query_gid(). For RoCE/iWARP there is no subnet manager and no
subnet prefix. Fix this by querying GID[0] only for IB networks.
Fixes: fad61ad4e755 ('IB/core: Add subnet prefix to port info') Signed-off-by: Eli Cohen <eli@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Doug Ledford [Tue, 7 Jun 2016 11:43:46 +0000 (07:43 -0400)]
IB/core: fix error unwind in sysfs hw counters code
Between the initial and final versions of the function setup_hw_stats,
the order of variable initialization was changed. However, the unwind
flow on error did not properly keep up with the flow changes. Make
the unwind flow match a proper unwind of the allocation flow, then
remove no longer needed variable initializations.
Bart Van Assche [Fri, 3 Jun 2016 19:10:37 +0000 (12:10 -0700)]
IB/hfi1: Use bit 0 instead of bit 1
The first argument of test_bit() and clear_bit() is a bit number and
not a bitmask. Hence change that first argument from (1 << 0) into 0.
This patch avoids that smatch reports the following warnings:
user_sdma.c:1059: sdma_cache_evict() warn: test_bit() takes a bit number
user_sdma.c:1590: sdma_rb_remove() warn: test_bit() takes a bit number
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
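The difference between a bit number and a bitmask can be seen with simplified, non-atomic stand-ins for the kernel bit helpers. The *_sketch functions below are written for this example and are not the kernel implementations.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)

/* Non-atomic stand-ins: nr is a bit NUMBER, not a mask. */
static int test_bit_sketch(unsigned int nr, const unsigned long *addr)
{
    return (int)((addr[nr / BITS_PER_LONG_SKETCH] >>
                  (nr % BITS_PER_LONG_SKETCH)) & 1UL);
}

static void set_bit_sketch(unsigned int nr, unsigned long *addr)
{
    addr[nr / BITS_PER_LONG_SKETCH] |= 1UL << (nr % BITS_PER_LONG_SKETCH);
}

static void clear_bit_sketch(unsigned int nr, unsigned long *addr)
{
    addr[nr / BITS_PER_LONG_SKETCH] &= ~(1UL << (nr % BITS_PER_LONG_SKETCH));
}
```

Since (1 << 0) evaluates to 1, passing it as the first argument addresses bit 1 rather than bit 0, which matches the subject line of the commit.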
Bart Van Assche [Fri, 3 Jun 2016 18:40:24 +0000 (11:40 -0700)]
IB/srp: Fix srp_map_sg_dma()
Because patch "IB/srp: Move common code into the caller" was applied
partially srp_map_sg_dma() doesn't work properly. Fix this by
applying the remainder of that patch. See also
http://thread.gmane.org/gmane.linux.drivers.rdma/35803/focus=35811.
Fixes: 3849e44d1c4b ("IB/srp: Move common code into the caller") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Sagi Grimberg <sai@grimberg.me> Cc: Christoph Hellwig <hch@lst.de> Cc: Laurence Oberman <loberman@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Doug Ledford <dledford@redhat.com>
Bart Van Assche [Fri, 3 Jun 2016 18:39:35 +0000 (11:39 -0700)]
IB/srp: Always initialize use_fast_reg and use_fmr
Avoid that mapping fails due to use_fast_reg != 0 or use_fmr != 0
if both member variables should be zero (if never_register == 1 or
if neither FMR nor FR is supported). Remove an initialization that
became superfluous due to changing a kmalloc() into a kzalloc()
call.
Fixes: 509c5f33f4f6 ("IB/srp: Prevent mapping failures") Cc: Sagi Grimberg <sai@grimberg.m> Cc: Christoph Hellwig <hch@lst.de> Cc: Laurence Oberman <loberman@redhat.com> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Doug Ledford <dledford@redhat.com>
Bart Van Assche [Fri, 3 Jun 2016 14:58:32 +0000 (07:58 -0700)]
IB/mlx4: Fix device managed flow steering support test
Perform the test for device managed flow steering support even if
memory windows are not supported. I noticed this because smatch
reported inconsistent indentation for the device managed flow
steering support test.
Colin Ian King [Wed, 1 Jun 2016 18:06:36 +0000 (19:06 +0100)]
IB/core: fix null pointer deref and mem leak in error handling
The current error handling in setup_hw_stats has a couple of issues.
It is possible to generate a null pointer dereference on the
kfree of hsag->attrs[i] because two of the early error exit paths
jump to the kfree when hsag is NULL and not allocated. Fix this by
moving the kfree on stats and jumping to that, avoiding the hsag
freeing.
Secondly, there is a memory leak of stats if the hsag allocation
fails; instead of returning, jump to the kfree on stats.
Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
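The unwind pattern described above (free only what has been allocated, and free it on every failure path) can be sketched in user space as follows. The function, the allocation counter and the fail_at parameter are invented for the example and do not correspond to setup_hw_stats; on success the sketch releases both objects so that the counter can be checked.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int live_allocs;   /* counts outstanding allocations to expose leaks */

static void *alloc_tracked(int fail)
{
    void *p = fail ? NULL : malloc(16);

    if (p)
        live_allocs++;
    return p;
}

static void free_tracked(void *p)
{
    if (p) {
        live_allocs--;
        free(p);
    }
}

/* fail_at: 0 = no failure, 1 = first allocation fails, 2 = second fails. */
static int setup_two_stage(int fail_at)
{
    void *stats, *hsag;
    int ret = -ENOMEM;

    stats = alloc_tracked(fail_at == 1);
    if (!stats)
        return ret;              /* nothing to unwind yet */

    hsag = alloc_tracked(fail_at == 2);
    if (!hsag)
        goto err_free_stats;     /* hsag is NULL: never dereference it here */

    /* Success path of the sketch: release hsag, then fall through. */
    free_tracked(hsag);
    ret = 0;

err_free_stats:
    free_tracked(stats);
    return ret;
}
```

Jumping to a label that only frees stats covers both problems named in the commit message: the failure path never touches hsag, and stats is not leaked when the second allocation fails.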
Leon Romanovsky [Tue, 31 May 2016 07:54:36 +0000 (10:54 +0300)]
IB/hfi1: Avoid large frame size warning
When CONFIG_FRAME_WARN is set to 1024 bytes, which is useful to find
stack consumers, we get a warning in hfi1 driver.
drivers/infiniband/hw/hfi1/affinity.c: In function
‘hfi1_get_proc_affinity’:
drivers/infiniband/hw/hfi1/affinity.c:415:1: warning: the frame size of
1056 bytes is larger than 1024 bytes [-Wframe-larger-than=]
This change removes unneeded buf[1024] declaration and usage.
Fixes: f48ad614c100 ("IB/hfi1: Move driver out of staging") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Linus Torvalds [Sun, 5 Jun 2016 18:15:33 +0000 (11:15 -0700)]
Merge branch 'parisc-4.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
- Fix printk time stamps on SMP systems which got wrong due to a patch
which was added during the merge window
- Fix two bugs in the stack backtrace code: Races in module unloading
and possible invalid accesses to memory due to wrong instruction
decoding (Mikulas Patocka)
- Fix userspace crash when syscalls access invalid unaligned userspace
addresses. Those syscalls will now return EFAULT as expected.
(tagged for stable kernel series)
* 'parisc-4.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Move die_if_kernel() prototype into traps.h header
parisc: Fix pagefault crash in unaligned __get_user() call
parisc: Fix printk time during boot
parisc: Fix backtrace on PA-RISC
Linus Torvalds [Sun, 5 Jun 2016 18:02:00 +0000 (11:02 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull key handling update from James Morris:
"This alters a new keyctl function added in the current merge window to
allow for a future extension planned for the next merge window"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
KEYS: Add placeholder for KDF usage with DH
devpts: Make each mount of devpts an independent filesystem.
The /dev/ptmx device node is changed to lookup the directory entry "pts"
in the same directory as the /dev/ptmx device node was opened in. If
there is a "pts" entry and that entry is a devpts filesystem /dev/ptmx
uses that filesystem. Otherwise the open of /dev/ptmx fails.
The DEVPTS_MULTIPLE_INSTANCES configuration option is removed, so that
userspace can now safely depend on each mount of devpts creating a new
instance of the filesystem.
Each mount of devpts is now a separate and equal filesystem.
Reserved ttys are now available to all instances of devpts where the
mounter is in the initial mount namespace.
A new vfs helper path_pts is introduced that finds a directory entry
named "pts" in the directory of the passed in path, and changes the
passed in path to point to it. The helper path_pts uses a function
path_parent_directory that was factored out of follow_dotdot.
In the implementation of devpts:
- devpts_mnt is killed as it is no longer meaningful if all mounts of
devpts are equal.
- pts_sb_from_inode is replaced by just inode->i_sb as all cached
inodes in the tty layer are now from the devpts filesystem.
- devpts_add_ref is rolled into the new function devpts_ptmx. And the
unnecessary inode hold is removed.
- devpts_del_ref is renamed devpts_release and reduced to just a
deactivate_super.
- The newinstance mount option continues to be accepted but is now
ignored.
In devpts_fs.h definitions for when !CONFIG_UNIX98_PTYS are removed as
they are never used.
Documentation/filesystems/devices.txt is updated to describe the current
situation.
This has been verified to work properly on openwrt-15.05, centos5,
centos6, centos7, debian-6.0.2, debian-7.9, debian-8.2, ubuntu-14.04.3,
ubuntu-15.10, fedora23, mageia-5, mint-17.3, opensuse-42.1,
slackware-14.1, gentoo-20151225 (13.0?), archlinux-2015-12-01. The
caveat is that on centos6 and on slackware-14.1 there wind up being
two instances of the devpts filesystem mounted on /dev/pts, and the lower
copy does not end up getting used.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg KH <greg@kroah.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Peter Anvin <hpa@zytor.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Willy Tarreau <w@1wt.eu> Cc: Aurelien Jarno <aurelien@aurel32.net> Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> Cc: Jann Horn <jann@thejh.net> Cc: Jiri Slaby <jslaby@suse.com> Cc: Florian Weimer <fw@deneb.enyo.de> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This means the userspace program clock_adjtime called the clock_adjtime()
syscall and then crashed inside the compat_get_timex() function.
Syscalls should never crash programs, but instead return EFAULT.
The IIR register contains the executed instruction, which disassembles
into "ldw 0(sr3,r5),r9".
This load-word instruction is part of __get_user() which tried to read the word
at %r5/IOR (0xfa6f7fff). This means the unaligned handler jumped in. The
unaligned handler is able to emulate all ldw instructions, but it fails if it
fails to read the source e.g. because of page fault.
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    /* allocate 8k */
    char *ptr = mmap(NULL, 2*4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    /* free second half (upper 4k) and make it invalid. */
    munmap(ptr+4096, 4096);
    /* syscall where first int is unaligned and clobbers into invalid memory region */
    /* syscall should return EFAULT */
    return syscall(__NR_clock_adjtime, 0, ptr+4095);
}
To fix this issue we simply need to check if the faulting instruction address
is in the exception fixup table when the unaligned handler failed. If it
is, call the fixup routine instead of crashing.
While looking at the unaligned handler I found another issue as well: The
target register should not be modified if the handler was unsuccessful.
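The two changes described above can be sketched in user space. Everything here is hypothetical: the table layout, the function names and the use of a NULL source pointer to stand in for an unreadable address are assumptions for the example, not the parisc unaligned handler.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixup entry: faulting instruction address -> recovery address. */
struct fixup_entry {
    uintptr_t insn;
    uintptr_t fixup;
};

/* Linear search; returns NULL when the address has no registered fixup. */
static const struct fixup_entry *find_fixup(const struct fixup_entry *tbl,
                                            size_t n, uintptr_t insn)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].insn == insn)
            return &tbl[i];
    return NULL;
}

/*
 * Emulate a word load. src == NULL stands in for a source that cannot be
 * read. The target register is written only when the load succeeds.
 */
static int emulate_ldw(const uint32_t *src, uint32_t *target_reg)
{
    if (src == NULL)
        return -EFAULT;
    *target_reg = *src;
    return 0;
}

/*
 * Returns the address to resume at, or 0 when there is no fixup and the
 * caller has to fall back to its fatal path.
 */
static uintptr_t handle_unaligned(const struct fixup_entry *tbl, size_t n,
                                  uintptr_t insn, const uint32_t *src,
                                  uint32_t *target_reg)
{
    const struct fixup_entry *fix;

    assert(target_reg != NULL);
    if (emulate_ldw(src, target_reg) == 0)
        return insn + 4;   /* resume after the 4-byte instruction */
    fix = find_fixup(tbl, n, insn);
    return fix ? fix->fixup : 0;
}
```

When emulation fails and the faulting address is in the table, execution resumes at the fixup address and the target register keeps its previous value, so the caller of the syscall sees an error instead of a crash.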
Mikulas Patocka [Tue, 28 Jun 2011 22:48:19 +0000 (00:48 +0200)]
parisc: Fix backtrace on PA-RISC
This patch fixes backtrace on PA-RISC.
There were several problems:
1) The code that decodes instructions handles instructions that subtract
from the stack pointer incorrectly. If the instruction subtracts the
number X from the stack pointer the code increases the frame size by
(0x100000000-X). This results in invalid accesses to memory and
recursive page faults.
2) Because gcc reorders blocks, handling instructions that subtract from
the frame pointer is incorrect. For example, this function
int f(int a)
{
    if (__builtin_expect(a, 1))
        return a;
    g();
    return a;
}
is compiled in such a way, that the code that decreases the stack
pointer for the first "return a" is placed before the code for "g" call.
If we recognize this decrement, we mistakenly believe that the frame
size for the "g" call is zero.
To fix problems 1) and 2), the patch doesn't recognize instructions that
decrease the stack pointer at all. To further safeguard the unwind code
against nonsense values, we don't allow frame size larger than
Total_frame_size.
3) The backtrace is not locked. If stack dump races with module unload,
invalid table can be accessed.
This patch adds a spinlock when processing module tables.
Note, that for correct backtrace, you need recent binutils.
Binutils 2.18 from Debian 5 produce garbage unwind tables.
Binutils 2.21 work better (it sometimes forgets function frames, but at
least it doesn't generate garbage).
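Problem 1) and the fix for problems 1) and 2) can be illustrated with plain integer arithmetic. The function names are hypothetical, and clamping to the upper bound is an assumption about how "don't allow frame size larger than Total_frame_size" is enforced; the sketch only shows the wraparound and the effect of ignoring decrements.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Flawed accounting: a negative stack adjustment is treated as a 32-bit
 * unsigned quantity, so subtracting X adds (0x100000000 - X).
 */
static uint64_t frame_size_buggy(uint64_t frame, int32_t sp_delta)
{
    return frame + (uint32_t)sp_delta;
}

/*
 * Sketch of the fix: decrements are not recognized at all, and the
 * result is never allowed to exceed the given total (clamped here).
 */
static uint64_t frame_size_fixed(uint64_t frame, int32_t sp_delta,
                                 uint64_t total)
{
    if (sp_delta > 0)
        frame += (uint64_t)sp_delta;
    if (frame > total)
        frame = total;
    return frame;
}
```

For a frame of 64 bytes and an adjustment of -16, the flawed version returns 64 + 0x100000000 - 16, while the fixed version leaves the frame at 64.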
Linus Torvalds [Sat, 4 Jun 2016 19:30:36 +0000 (12:30 -0700)]
Merge tag 'drm-fixes-for-v4.7-rc2' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"A bunch of ARM drivers got into the fixes vibe this time around, so
this contains a bunch of fixes for imx, atmel hlcdc, arm hdlcd (only
so many combos of hlcd), mediatek and omap drm.
Other than that there is one mgag200 fix and a few core drm regression
fixes"
* tag 'drm-fixes-for-v4.7-rc2' of git://people.freedesktop.org/~airlied/linux: (34 commits)
drm/omap: fix unused variable warning.
drm: hdlcd: Add information about the underlying framebuffers in debugfs
drm: hdlcd: Cleanup the atomic plane operations
drm/hdlcd: Fix up crtc_state->event handling
drm: hdlcd: Revamp runtime power management
drm/mediatek: mtk_dsi: Remove spurious drm_connector_unregister
drm/mediatek: mtk_dpi: remove invalid error message
drm: atmel-hlcdc: fix a NULL check
drm: atmel-hlcdc: fix atmel_hlcdc_crtc_reset() implementation
drm/mgag200: Black screen fix for G200e rev 4
drm: Wrap direct calls to driver->gem_free_object from CMA
drm: fix fb refcount issue with atomic modesetting
drm: make drm_atomic_set_mode_prop_for_crtc() more reliable
drm/sti: remove extra mode fixup
drm: add missing drm_mode_set_crtcinfo call
drm/omap: include gpio/consumer.h where needed
drm/omap: include linux/seq_file.h where needed
Revert "drm/omap: no need to select OMAP2_DSS"
drm/omap: Remove regulator API abuse
OMAPDSS: HDMI5: Change DDC timings
...
Linus Torvalds [Sat, 4 Jun 2016 19:25:36 +0000 (12:25 -0700)]
Merge tag 'vfio-v4.7-rc2' of git://github.com/awilliam/linux-vfio
Pull VFIO fixes from Alex Williamson:
"Fix irqfd shutdown ordering, build warning, and VPD short read"
* tag 'vfio-v4.7-rc2' of git://github.com/awilliam/linux-vfio:
vfio/pci: Allow VPD short read
vfio/type1: Fix build warning
vfio/pci: Fix ordering of eventfd vs virqfd shutdown
Linus Torvalds [Sat, 4 Jun 2016 18:56:28 +0000 (11:56 -0700)]
Merge branch 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"The important part of this pull is Filipe's set of fixes for btrfs
device replacement. Filipe fixed a few issues seen on the list and a
number he found on his own"
* 'for-linus-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent
Btrfs: fix race between device replace and read repair
Btrfs: fix race between device replace and discard
Btrfs: fix race between device replace and chunk allocation
Btrfs: fix race setting block group back to RW mode during device replace
Btrfs: fix unprotected assignment of the left cursor for device replace
Btrfs: fix race setting block group readonly during device replace
Btrfs: fix race between device replace and block group removal
Btrfs: fix race between readahead and device replace/removal
Linus Torvalds [Sat, 4 Jun 2016 18:37:53 +0000 (11:37 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
"We have a few follow-up fixes for the libceph refactor from Ilya, and
then some cephfs + fscache fixes from Zheng.
The first two FS-Cache patches are acked by David Howells and deemed
trivial enough to go through our tree. The rest fix some issues with
the ceph fscache handling (disable cache for inodes opened for write,
and simplify the revalidation logic accordingly, dropping the
now-unnecessary work queue)"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: use i_version to check validity of fscache
ceph: improve fscache revalidation
ceph: disable fscache when inode is opened for write
ceph: avoid unnecessary fscache invalidation/revlidation
ceph: call __fscache_uncache_page() if readpages fails
FS-Cache: make check_consistency callback return int
FS-Cache: wake write waiter after invalidating writes
libceph: use %s instead of %pE in dout()s
libceph: put request only if it's done in handle_reply()
libceph: change ceph_osdmap_flag() to take osdc
Linus Torvalds [Sat, 4 Jun 2016 18:26:49 +0000 (11:26 -0700)]
Merge tag 'acpi-4.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"Two fixes for problems introduced recently (ACPICA and the ACPI
backlight driver) and one fix for an older issue that prevents at
least one system from booting.
Specifics:
- Fix an incorrect check introduced by recent ACPICA changes which
causes problems with booting KVM guests to happen, among other
things (Lv Zheng).
- Fix a backlight issue introduced by recent changes to the ACPI
video driver (Aaron Lu).
- Fix the ACPI processor initialization which attempts to register an
IO region without checking if that really is necessary and
sometimes prevents drivers loaded subsequently from registering
their resources which leads to boot issues (Rafael Wysocki)"
* tag 'acpi-4.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / processor: Avoid reserving IO regions too early
ACPICA / Hardware: Fix old register check in acpi_hw_get_access_bit_width()
ACPI / Thermal / video: fix max_level incorrect value
Linus Torvalds [Sat, 4 Jun 2016 18:07:57 +0000 (11:07 -0700)]
Merge tag 'pm-4.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"Two fixes for problems introduced recently in the cpufreq core and the
intel_pstate driver.
Specifics:
- Fix a silly mistake related to the clamp_val() usage in a function
added by a recent commit (Rafael Wysocki).
- Reduce the log level of an annoying message added to intel_pstate
during the recent merge window (Srinivas Pandruvada)"
* tag 'pm-4.7-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: Fix clamp_val() usage in cpufreq_driver_fast_switch()
cpufreq: intel_pstate: Downgrade print level for _PPC
Linus Torvalds [Sat, 4 Jun 2016 17:51:29 +0000 (10:51 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge various fixes from Andrew Morton:
"10 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm, page_alloc: recalculate the preferred zoneref if the context can ignore memory policies
mm, page_alloc: reset zonelist iterator after resetting fair zone allocation policy
mm, oom_reaper: do not use siglock in try_oom_reaper()
mm, page_alloc: prevent infinite loop in buffered_rmqueue()
checkpatch: reduce git commit description style false positives
mm/z3fold.c: avoid modifying HEADLESS page and minor cleanup
memcg: add RCU locking around css_for_each_descendant_pre() in memcg_offline_kmem()
mm: check the return value of lookup_page_ext for all call sites
kdump: fix dmesg gdbmacro to work with record based printk
mm: fix overflow in vm_map_ram()
Linus Torvalds [Fri, 3 Jun 2016 23:12:35 +0000 (16:12 -0700)]
Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
- a few simple fixes for fallout from the recent gic-v3 changes
- a workaround for a Cavium thunderX erratum
- a bugfix for the pic32 irqchip to make external interrupts work properly
- a missing return value in the generic IPI management code
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/irq-pic32-evic: Fix bug with external interrupts.
irqchip/gicv3-its: numa: Enable workaround for Cavium thunderx erratum 23144
irqchip/gic-v3: Fix quiescence check in gic_enable_redist
irqchip/gic-v3: Fix copy+paste mistakes in defines
irqchip/gic-v3: Fix ICC_SGI1R_EL1.INTID decoding mask
genirq: Fix missing return value in irq_destroy_ipi()
Mel Gorman [Fri, 3 Jun 2016 21:56:01 +0000 (14:56 -0700)]
mm, page_alloc: recalculate the preferred zoneref if the context can ignore memory policies
The optimistic fast path may use cpuset_current_mems_allowed instead of
a NULL nodemask supplied by the caller for cpuset allocations. The
preferred zone is calculated on this basis for statistic purposes and as
a starting point in the zonelist iterator.
However, if the context can ignore memory policies due to being atomic
or being able to ignore watermarks then the starting point in the
zonelist iterator is no longer correct. This patch resets the zonelist
iterator in the allocator slowpath if the context can ignore memory
policies. This will alter the zone used for statistics but only after
it is known that it makes sense for that context. Resetting it before
entering the slowpath would potentially allow an ALLOC_CPUSET allocation
to be accounted for against the wrong zone. Note that while nodemask is
not explicitly set to the original nodemask, it would only have been
overwritten if cpuset_enabled() and it was reset before the slowpath was
entered.
Link: http://lkml.kernel.org/r/20160602103936.GU2527@techsingularity.net Fixes: c33d6c06f60f710 ("mm, page_alloc: avoid looking up the first zone in a zonelist twice") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Fri, 3 Jun 2016 21:55:58 +0000 (14:55 -0700)]
mm, page_alloc: reset zonelist iterator after resetting fair zone allocation policy
Geert Uytterhoeven reported the following problem that bisected to
commit c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone
in a zonelist twice") on m68k/ARAnyM
The relationship is not obvious but it's due to a failure to rescan the
full zonelist after the fair zone allocation policy exhausts the batch
count. While this is a functional problem, it's also a performance
issue. A page allocator microbenchmark showed the following
                             4.7.0-rc1           4.7.0-rc1
                               vanilla          reset-v1r2
Min alloc-odr0-1      327.00 (  0.00%)    326.00 (  0.31%)
Min alloc-odr0-2      235.00 (  0.00%)    235.00 (  0.00%)
Min alloc-odr0-4      198.00 (  0.00%)    198.00 (  0.00%)
Min alloc-odr0-8      170.00 (  0.00%)    170.00 (  0.00%)
Min alloc-odr0-16     156.00 (  0.00%)    156.00 (  0.00%)
Min alloc-odr0-32     150.00 (  0.00%)    150.00 (  0.00%)
Min alloc-odr0-64     146.00 (  0.00%)    146.00 (  0.00%)
Min alloc-odr0-128    145.00 (  0.00%)    145.00 (  0.00%)
Min alloc-odr0-256    155.00 (  0.00%)    155.00 (  0.00%)
Min alloc-odr0-512    168.00 (  0.00%)    165.00 (  1.79%)
Min alloc-odr0-1024   175.00 (  0.00%)    174.00 (  0.57%)
Min alloc-odr0-2048   180.00 (  0.00%)    180.00 (  0.00%)
Min alloc-odr0-4096   187.00 (  0.00%)    186.00 (  0.53%)
Min alloc-odr0-8192   190.00 (  0.00%)    190.00 (  0.00%)
Min alloc-odr0-16384  191.00 (  0.00%)    191.00 (  0.00%)
Min alloc-odr1-1      736.00 (  0.00%)    445.00 ( 39.54%)
Min alloc-odr1-2      343.00 (  0.00%)    335.00 (  2.33%)
Min alloc-odr1-4      277.00 (  0.00%)    270.00 (  2.53%)
Min alloc-odr1-8      238.00 (  0.00%)    233.00 (  2.10%)
Min alloc-odr1-16     224.00 (  0.00%)    218.00 (  2.68%)
Min alloc-odr1-32     210.00 (  0.00%)    208.00 (  0.95%)
Min alloc-odr1-64     207.00 (  0.00%)    203.00 (  1.93%)
Min alloc-odr1-128    276.00 (  0.00%)    202.00 ( 26.81%)
Min alloc-odr1-256    206.00 (  0.00%)    202.00 (  1.94%)
Min alloc-odr1-512    207.00 (  0.00%)    202.00 (  2.42%)
Min alloc-odr1-1024   208.00 (  0.00%)    205.00 (  1.44%)
Min alloc-odr1-2048   213.00 (  0.00%)    212.00 (  0.47%)
Min alloc-odr1-4096   218.00 (  0.00%)    216.00 (  0.92%)
Min alloc-odr1-8192   341.00 (  0.00%)    219.00 ( 35.78%)
Note that order-0 allocations are unaffected but higher orders get a
small boost from this patch and a large reduction in system CPU usage
overall as can be seen here:
           4.7.0-rc1   4.7.0-rc1
             vanilla  reset-v1r2
User           85.32       86.31
System       2221.39     2053.36
Elapsed      2368.89     2202.47
Fixes: c33d6c06f60f ("mm, page_alloc: avoid looking up the first zone in a zonelist twice") Link: http://lkml.kernel.org/r/20160531100848.GR2527@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Mikulas Patocka <mpatocka@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 3 Jun 2016 21:55:55 +0000 (14:55 -0700)]
mm, oom_reaper: do not use siglock in try_oom_reaper()
Oleg has noted that siglock usage in try_oom_reaper is both pointless
and dangerous. signal_group_exit can be checked lockless. The problem
is that sighand becomes NULL in __exit_signal so we can crash.
Fixes: 3ef22dfff239 ("oom, oom_reaper: try to reap tasks which skip regular OOM killer path") Link: http://lkml.kernel.org/r/1464679423-30218-1-git-send-email-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Fri, 3 Jun 2016 21:55:52 +0000 (14:55 -0700)]
mm, page_alloc: prevent infinite loop in buffered_rmqueue()
In a DEBUG_VM kernel, we can hit an infinite loop for order == 0 in
buffered_rmqueue() when check_new_pcp() returns 1, because the bad page
is never removed from the pcp list. Fix this by removing the page
before retrying. Also we don't need to check if page is non-NULL,
because we simply grab it from the list which was just tested for being
non-empty.
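The failure mode can be illustrated outside the kernel with a plain singly linked list. This is only a sketch under stated assumptions: struct page_sketch, its bad flag (standing in for check_new_pcp() returning 1) and take_first_good() are hypothetical names, not the kernel's pcp list code.

```c
#include <stddef.h>

/* Hypothetical stand-in for a page sitting on a per-cpu free list. */
struct page_sketch {
    struct page_sketch *next;
    int bad;    /* models check_new_pcp() reporting a bad page */
};

/*
 * Return the first good page, or NULL if the list runs out.
 * Each candidate is unlinked before it is checked. If a bad page
 * were left at the head, every retry would pick the same page
 * again and the loop would never terminate.
 */
static struct page_sketch *take_first_good(struct page_sketch **head)
{
    struct page_sketch *page;

    do {
        page = *head;
        if (!page)
            return NULL;        /* list exhausted */
        *head = page->next;     /* remove the page before retrying */
        page->next = NULL;
    } while (page->bad);

    return page;
}
```

With a list holding one bad page followed by one good page, the call discards the bad page, returns the good one and leaves the list empty.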
Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP") Link: http://lkml.kernel.org/r/20160530090154.GM2527@techsingularity.net Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vitaly Wool [Fri, 3 Jun 2016 21:55:47 +0000 (14:55 -0700)]
mm/z3fold.c: avoid modifying HEADLESS page and minor cleanup
Fix an erroneous z3fold header access in a HEADLESS page in the reclaim
function, and change the one remaining direct handle-to-buddy conversion to
use the appropriate helper.
Tejun Heo [Fri, 3 Jun 2016 21:55:44 +0000 (14:55 -0700)]
memcg: add RCU locking around css_for_each_descendant_pre() in memcg_offline_kmem()
memcg_offline_kmem() may be called from memcg_free_kmem() after a css
init failure. memcg_free_kmem() is a ->css_free callback which is
called without cgroup_mutex and memcg_offline_kmem() ends up using
css_for_each_descendant_pre() without any locking. Fix it by adding rcu
read locking around it.
mkdir: cannot create directory `65530': No space left on device
===============================
[ INFO: suspicious RCU usage. ]
4.6.0-work+ #321 Not tainted
-------------------------------
kernel/cgroup.c:4008 cgroup_mutex or RCU read lock required!
[ 527.243970] other info that might help us debug this:
[ 527.244715]
rcu_scheduler_active = 1, debug_locks = 0
2 locks held by kworker/0:5/1664:
#0: ("cgroup_destroy"){.+.+..}, at: [<ffffffff81060ab5>] process_one_work+0x165/0x4a0
#1: ((&css->destroy_work)#3){+.+...}, at: [<ffffffff81060ab5>] process_one_work+0x165/0x4a0
[ 527.248098] stack backtrace:
CPU: 0 PID: 1664 Comm: kworker/0:5 Not tainted 4.6.0-work+ #321
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
Workqueue: cgroup_destroy css_free_work_fn
Call Trace:
dump_stack+0x68/0xa1
lockdep_rcu_suspicious+0xd7/0x110
css_next_descendant_pre+0x7d/0xb0
memcg_offline_kmem.part.44+0x4a/0xc0
mem_cgroup_css_free+0x1ec/0x200
css_free_work_fn+0x49/0x5e0
process_one_work+0x1c5/0x4a0
worker_thread+0x49/0x490
kthread+0xea/0x100
ret_from_fork+0x1f/0x40
Link: http://lkml.kernel.org/r/20160526203018.GG23194@mtj.duckdns.org Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> [4.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Fri, 3 Jun 2016 21:55:38 +0000 (14:55 -0700)]
mm: check the return value of lookup_page_ext for all call sites
Per the discussion with Joonsoo Kim [1], we need to check the return value
of lookup_page_ext() at all call sites since it might return NULL in
some cases, although it is unlikely, e.g. during memory hotplug.
Corey Minyard [Fri, 3 Jun 2016 21:55:36 +0000 (14:55 -0700)]
kdump: fix dmesg gdbmacro to work with record based printk
Commit 7ff9554bb578 ("printk: convert byte-buffer to variable-length
record buffer") introduced a record based printk buffer. Modify
gdbmacros.txt to parse this new structure so dmesg will work properly.
Link: http://lkml.kernel.org/r/1463515794-1599-1-git-send-email-minyard@acm.org Signed-off-by: Corey Minyard <cminyard@mvista.com> Cc: Dave Young <dyoung@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When remapping pages accounting for 4G or more memory space, the
operation 'count << PAGE_SHIFT' overflows as it is performed on an
integer. Solution: cast before doing the bitshift.
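The effect of casting before the shift can be reproduced in user space. This is a minimal sketch, assuming 4 KiB pages and using fixed-width types so the result does not depend on the platform; PAGE_SHIFT_SKETCH, size_narrow() and size_wide() are illustrative names and not the kernel code.

```c
#include <stdint.h>

#define PAGE_SHIFT_SKETCH 12    /* assumption: 4 KiB pages */

/*
 * Overflowing form: the shift is evaluated in 32 bits, so the high
 * bits are already lost when the result is widened to 64 bits.
 */
static uint64_t size_narrow(uint32_t count)
{
    return (uint32_t)(count << PAGE_SHIFT_SKETCH);
}

/* Fixed form: widen the page count first, then shift. */
static uint64_t size_wide(uint32_t count)
{
    return (uint64_t)count << PAGE_SHIFT_SKETCH;
}
```

For a count of 1 << 20 pages (exactly 4G of address space), size_narrow() wraps around to 0 while size_wide() yields 4294967296. For small counts the two agree.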
[akpm@linux-foundation.org: fix vm_unmap_ram() also]
[akpm@linux-foundation.org: fix vmap() as well, per Guillermo] Link: http://lkml.kernel.org/r/etPan.57175fb3.7a271c6b.2bd@naudit.es Signed-off-by: Guillermo Julián Moreno <guillermo.julian@naudit.es> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 3 Jun 2016 21:39:29 +0000 (14:39 -0700)]
Merge branch 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm
Pull ARM fix from Russell King:
"Just one fix to the ptrace code, spotted by Simon Marchi, where if a
thread migrates to a different CPU and the VFP registers are changed
through ptrace, the application doesn't see the updated VFP registers"
* 'fixes' of git://git.armlinux.org.uk/~rmk/linux-arm:
ARM: fix PTRACE_SETVFPREGS on SMP systems
Linus Torvalds [Fri, 3 Jun 2016 21:29:47 +0000 (14:29 -0700)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"The main thing here is reviving hugetlb support using contiguous ptes,
which we ended up reverting at the last minute in 4.5 pending a fix
which went into the core mm/ code during the recent merge window.
- Revert a previous revert and get hugetlb going with contiguous hints
- Wire up missing compat syscalls
- Enable CONFIG_SET_MODULE_RONX by default
- Add missing line to our compat /proc/cpuinfo output
- Clarify levels in our page table dumps
- Fix booting with RANDOMIZE_TEXT_OFFSET enabled
- Misc fixes to the ARM CPU PMU driver (refcounting, probe failure)
- Remove some dead code and update a comment"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: fix alignment when RANDOMIZE_TEXT_OFFSET is enabled
arm64: move {PAGE,CONT}_SHIFT into Kconfig
arm64: mm: dump: log span level
arm64: update stale PAGE_OFFSET comment
drivers/perf: arm_pmu: Avoid leaking pmu->irq_affinity on error
drivers/perf: arm_pmu: Defer the setting of __oprofile_cpu_pmu
drivers/perf: arm_pmu: Fix reference count of a device_node in of_pmu_irq_cfg
arm64: report CPU number in bad_mode
arm64: unistd32.h: wire up missing syscalls for compat tasks
arm64: Provide "model name" in /proc/cpuinfo for PER_LINUX32 tasks
arm64: enable CONFIG_SET_MODULE_RONX by default
arm64: Remove orphaned __addr_ok() definition
Revert "arm64: hugetlb: partial revert of 66b3923a1a0f"
Linus Torvalds [Fri, 3 Jun 2016 21:20:22 +0000 (14:20 -0700)]
Merge tag 'powerpc-4.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Handle RTAS delay requests in configure_bridge from Russell Currey
- Refactor the configure_bridge RTAS tokens from Russell Currey
- Fix definition of SIAR and SDAR registers from Thomas Huth
- Use privileged SPR number for MMCR2 from Thomas Huth
- Update LPCR only if it is powernv from Aneesh Kumar K.V
- Fix the reference bit update when handling hash fault from Aneesh
Kumar K.V
- Add missing tlb flush from Aneesh Kumar K.V
- Add POWER8NVL support to ibm,client-architecture-support call from
Thomas Huth
* tag 'powerpc-4.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/pseries: Add POWER8NVL support to ibm,client-architecture-support call
powerpc/mm/radix: Add missing tlb flush
powerpc/mm/hash: Fix the reference bit update when handling hash fault
powerpc/mm/radix: Update LPCR only if it is powernv
powerpc: Use privileged SPR number for MMCR2
powerpc: Fix definition of SIAR and SDAR registers
powerpc/pseries/eeh: Refactor the configure_bridge RTAS tokens
powerpc/pseries/eeh: Handle RTAS delay requests in configure_bridge