Keith Busch [Tue, 12 Jan 2016 22:09:31 +0000 (15:09 -0700)]
NVMe: Export NVMe attributes to sysfs group
Adds all controller information to attribute list exposed to sysfs, and
appends the reset_controller attribute to it. The nvme device is created
with this attribute list, so driver no long manages its attributes.
Keith Busch [Tue, 12 Jan 2016 21:41:18 +0000 (14:41 -0700)]
NVMe: Shutdown controller only for power-off
We don't need to shutdown a controller for a reset. A controller in a
shutdown state may take longer to become ready than one that was simply
disabled. This patch has the driver shut down a controller only if the
device is about to be powered off or being removed. When taking the
controller down for a reset reason, the controller will be disabled
instead.
Function names have been updated in this patch to reflect their changed
semantics.
Keith Busch [Tue, 12 Jan 2016 21:41:17 +0000 (14:41 -0700)]
NVMe: IO queue deletion re-write
The nvme driver deletes IO queues asynchronously since this operation
may potentially take an undesirable amount of time with a large number
of queues if done serially.
The driver used to manage coordinating asynchronous deletions. This
patch simplifies that by leveraging the block layer rather than using
kthread workers and chaining more complicated callbacks.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Mon, 4 Jan 2016 16:10:57 +0000 (09:10 -0700)]
NVMe: Remove queue freezing on resets
NVMe submits all commands through the block layer now. This means we
can let requests queue at the blk-mq hardware context since there is no
path that bypasses this anymore so we don't need to freeze the queues
anymore. The driver can simply stop the h/w queues from running during
a reset instead.
This also fixes a WARN in percpu_ref_reinit when the queue was unfrozen
with requeued requests.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Mon, 4 Jan 2016 16:10:56 +0000 (09:10 -0700)]
NVMe: Use a retryable error code on reset
A negative status has the "do not retry" bit set, which makes it not
retryable. Use a fake status that can potentially be retried on reset.
An aborted command's status is overridden by the timeout handler so
that it won't be retried, which is necessary to keep initialization from
getting into a reset loop.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Mon, 4 Jan 2016 16:10:55 +0000 (09:10 -0700)]
NVMe: Fix admin queue ring wrap
The tag set queue depth needs to be one less than the h/w queue depth
so we don't wrap the circular buffer. This conforms to the specification
defined "Full Queue" condition.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Translation SCSI commands to NVMe commands is rather pointless in general
as applications must not expext to be able to use SCSI commands on a
generic block device.
Make the huge translation layer optional and hope no one will ever enable
it in the future.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
nvme: fixes for NVME_IOCTL_IO_CMD on the char device
Make sure we synchronize access to the namespaces list and grab a reference
to the namespace before doing I/O. Make sure to reject the ioctl if multiple
namespaces are present as it's entirely unsafe, and warn when using it even
with a single namespace.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Sudip Mukherjee [Wed, 23 Dec 2015 15:35:26 +0000 (21:05 +0530)]
PCI/AER: include header file
We are having build failure with sparc allmodconfig with the error:
drivers/nvme/host/pci.c:15:0:
include/linux/aer.h: In function 'pci_enable_pcie_error_reporting':
include/linux/aer.h:49:10: error: 'EINVAL' undeclared (first use in this function)
The file aer.h is using the error values but they are defined in
errno.h. Include errno.h so that we have the definitions of the error
codes.
Keith Busch [Mon, 7 Dec 2015 22:30:31 +0000 (15:30 -0700)]
NVMe: Add pci error handlers
Requests enabling pcie aer support. Shuts down the controller on error
detected with io frozen state prior to requesting slot reset; resumes
controller after reset completes.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
This was added for the 'magic' AEN requests in the NVMe driver that never
return. We now handle them purely inside the driver and don't need this
core hack any more.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Now that all commands are executed as block layer requests we can remove the
internal completion in the NVMe driver. Note that we can simply call
blk_mq_complete_request to abort commands as the block layer will protect
against double copletions internally.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
AEN requests are different from other requests in that they don't time out
or can easily be cancelled. Because of that we should not use the blk-mq
infrastructure but just special case them in the completion path.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Sat, 28 Nov 2015 14:41:02 +0000 (15:41 +0100)]
NVMe: Remove device management handles on remove
We don't want to allow new references to open on a device that is
removed. This ties the lifetime of these handles to the physical device's
presence rather than to the open reference count.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Fri, 23 Oct 2015 17:42:02 +0000 (11:42 -0600)]
NVMe: Use unbounded work queue for all work
Removes all usage of the global work queue so work can't be
scheduled on two different work queues, and removes nvme's work queue
singlethreadedness so controllers can be driven in parallel.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: keep the dead controller removal on the system workqueue to avoid
deadlocks] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Thu, 22 Oct 2015 21:45:06 +0000 (15:45 -0600)]
NVMe: Implement namespace list scanning
The NVMe 1.1 specification provides an identify mode to return a
list of active namespaces. This is more efficient to discover which
namespace identifiers are active on a controller, providing potentially
significant improvement in scan time for controllers with sparesly
populated namespaces.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: add quirk for the broken Qemu Identify implementation. To be relaxed
later] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
If we're using two work queues we're always going to run into races where
one item is tearing down what the other one is initializing. So insted
merge the two work queues, and let the old probe_work also tear the
controller down first if it was alive. Together with the better detection
of the probe path using a flag this gives us a properly serialized
reset/probe path that also doesn't accidentally trigger when two commands
time out and the second one tries to reset the controller while the first
reset is still in progress.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Thu, 26 Nov 2015 11:11:07 +0000 (12:11 +0100)]
nvme: do not restart the request timeout if we're resetting the controller
Otherwise we're never going to complete a command when it is restarted just
after we completed all other outstanding commands in nvme_clear_queue.
The controller must be disabled prior to completing a presumed lost
command, do this by directly shutting down the controller before
queueing the reset work, and return EH_HANDLED from the timeout handler
after we shut the controller down.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: split and rebase] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Don't delete the controller from dev_list before queuing a reset, instead
just check for it being reset in the polling kthread. This allows to remove
the dev_list_lock in various places, and in addition we can simply rely on
checking the queue_work return value to see if we could reset a controller.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
We want to be able to return bettern error values frmo nvme_timeout, which
is significantly easier if the two functions are merged. Also clean up and
reduce the printk spew so that we only get one message per abort.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Thu, 26 Nov 2015 11:21:29 +0000 (12:21 +0100)]
nvme: protect against simultaneous shutdown invocations
Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: split from a larger patch] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
nvme: only add a controller to dev_list after it's been fully initialized
Without this we can easily get bad derferences on nvmeq->d_db when the nvme
kthread tries to poll the CQs for controllers that are in half initialized
state.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Timer context is not very useful for drivers to perform any meaningful abort
action from. So instead of calling the driver from this useless context
defer it to a workqueue as soon as possible.
Note that while a delayed_work item would seem the right thing here I didn't
dare to use it due to the magic in blk_add_timer that pokes deep into timer
internals. But maybe this encourages Tejun to add a sensible API for that to
the workqueue API and we'll all be fine in the end :)
Contains a major update from Keith Bush:
"This patch removes synchronizing the timeout work so that the timer can
start a freeze on its own queue. The timer enters the queue, so timer
context can only start a freeze, but not wait for frozen."
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Dan Carpenter [Wed, 9 Dec 2015 10:24:06 +0000 (13:24 +0300)]
nvme: precedence bug in nvme_pr_clear()
The "|" operator has higher precedence than "?:" so this didn't work as
intended. I had previously fixed this bug, but it we copied the older
unfixed version when we moved the function between files.
Fixes: 1673f1f08c88 ('nvme: move block_device_operations and ns/ctrl freeing to common code') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Arnd Bergmann [Tue, 8 Dec 2015 15:22:17 +0000 (16:22 +0100)]
nvme: fix another 32-bit build warning
The nvme_user_cmd function was recently moved around from one file
to another, which made a warning reappear that I had fixed before
at some point:
drivers/nvme/host/core.c: In function 'nvme_user_cmd':
drivers/nvme/host/core.c:424:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
This applies the same workaround that we have elsewhere in the
driver with an extra type cast to uintptr_t.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 1673f1f08c88 ("nvme: move block_device_operations and ns/ctrl freeing to common code") Link: https://lkml.org/lkml/2015/10/9/611 Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Thu, 3 Dec 2015 16:32:21 +0000 (09:32 -0700)]
blk-integrity: empty implementation when disabled
This patch moves the blk_integrity_payload definition outside the
CONFIG_BLK_DEV_INTERITY dependency and provides empty function
implementations when the kernel configuration disables integrity
extensions. This simplifies drivers that make use of these to map user
data so they don't need to repeat the same configuration checks.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Updated by Jens to pass an error pointer return from
bio_integrity_alloc(), otherwise if CONFIG_BLK_DEV_INTEGRITY isn't
set, we return a weird ENOMEM from __nvme_submit_user_cmd()
if a meta buffer is set.
Split out a helper that just issues the Set Features and interprets the
result which can go to common code, and document why we are ignoring
non-timeout error returns in the PCIe driver.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
nvme: move chardev and sysfs interface to common code
For this we need to add a proper controller init routine and a list of
all controllers that is in addition to the list of PCIe controllers,
which stays in pci.c. Note that we remove the sysfs device when the
last reference to a controller is dropped now - the old code would have
kept it around longer, which doesn't make much sense.
This requires a new ->reset_ctrl operation to implement controleller
resets, and a new ->write_reg32 operation that is required to implement
subsystem resets. We also now store caches copied of the NVMe compliance
version and the flag if a controller is attached to a subsystem or not in
the generic controller structure now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Fixes for pr merge] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
The namespace scanning code has been mostly generic already, we just
need to store a pointer to the tagset in the nvme_ctrl structure, and
add a method to check if a controller is I/O incapable. The latter
will hopefully be replaced by a proper controller state machine soon.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Fixed pr conflicts] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
nvme: move remaining CC setup into nvme_enable_ctrl
Remove the calculation of all the bits written into the CC register into
nvme_enable_ctrl, so that they can be moved into the core NVMe driver in
the future.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
nvme: move block_device_operations and ns/ctrl freeing to common code
This moves the block_device_operations over to common code mostly
as-is. The only change is that the ns and ctrl refcounting got some
small refcounting to have wrappers around the kref_put operations.
A new free_ctrl operation is added to allow the PCI driver to free
it's ressources on the final drop.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Moved the integrity and pr changes due to merge conflict] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Fri, 23 Oct 2015 15:47:28 +0000 (09:47 -0600)]
nvme: use the block layer for userspace passthrough metadata
Use the integrity API to pass through metadata from userspace. For PI
enabled devices this means that we now validate the reftag, which seems
like an unintentional ommission in the old code.
Thanks to Keith Busch for testing and fixes.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[Skip metadata setup on admin commands] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Add a separate nvme_submit_user_cmd for commands that directly DMA
to or from userspace. We'll add metadata support to that soon and
the common version would become too messy.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
This "backports" the structure I've used for the fabrics driver. It
mostly started out as a cleanup so that I could actually understand
the code, but I think it also qualifies as a micro-optimization due
to the reduced time we hold q_lock and disable interrupts.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
nvme: split a new struct nvme_ctrl out of struct nvme_dev
The new struct nvme_ctrl will be used by the common NVMe code that sits
on top of struct request_queue and the new nvme_ctrl_ops abstraction.
It only contains the bare minimum required, which consists of values
sampled during controller probe, the admin queue pointer and a second
struct device pointer at the moment, but more will follow later. Only
values that are not used in the I/O fast path should be moved to
struct nvme_ctrl so that drivers can optimize their cache line usage
easily. That's also the reason why we have two device pointers as
the struct device is used for DMA mapping purposes.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Use the vendor ID from the identify data instead of the PCI device to
make the SCSI translation layer independent from the PCI driver. The NVMe
spec defines them as having the same value for current PCIe devices.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
nvme: use offset instead of a struct for registers
This makes life easier for future non-PCI drivers where access to the
registers might be more complicated. Note that Linux drivers are
pretty evenly split between the two versions, and in fact the NVMe
driver already uses offsets for the doorbells.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com>
[Fixed CMBSZ offset] Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
nvme: split command submission helpers out of pci.c
Create a new core.c and start by adding the command submission helpers
to it, which are already abstracted away from the actual hardware queues
by the block layer.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
blk-mq: add a flags parameter to blk_mq_alloc_request
We already have the reserved flag, and a nowait flag awkwardly encoded as
a gfp_t. Add a real flags argument to make the scheme more extensible and
allow for a nicer calling convention.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Thanks for report! After some investigation I found out we allocate
elevator specific data in __get_request() only for non-flush requests. And
this is actually required since the flush machinery uses the space in
struct request for something else. Doh. So my patch is just wrong and not
easy to fix since at the time __get_request() is called we are not sure
whether the flush machinery will be used in the end. Jens, please revert 1b2ff19e6a957b1ef0f365ad331b608af80e932e. Thanks!
I'm somewhat surprised that you can reliably hit the race where flushing
gets disabled for the device just while the request is in flight. But I
guess during boot it makes some sense.
--
So let's just revert it, we can fix the queue run manually after the
fact. This race is rare enough that it didn't trigger in testing, it
requires the specific disable-while-in-flight scenario to trigger.
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).
The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA aligment
matches the kernel's page aligment. On Power, the the IOMMU page size,
as mentioned above, can be 4K, while the device can have a page size of
8K, while the kernel has a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).
In this particular case of page sizes, we clearly want to use the
IOMMU's page size in the driver. And generally, the NVMe driver in this
function should be using the IOMMU's page size for the default device
page size, rather than the kernel's page size. There is not currently an
API to obtain the IOMMU's page size across all architectures and in the
interest of a stop-gap fix to this functional issue, default the NVMe
device page size to 4K, with the intent of adding such an API and
implementation across all architectures in the next merge window.
With the functionally equivalent v3 of this patch, our hardware test
exerciser survives when using 32-bit DMA; without the patch, the kernel
will BUG within a few minutes.
Signed-off-by: Nishanth Aravamudan <nacc at linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Linus Torvalds [Tue, 24 Nov 2015 20:53:11 +0000 (12:53 -0800)]
Merge tag 'dm-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
"Two fixes for 4.4-rc1's DM ioctl changes that introduced the potential
for infinite recursion on ioctl (with DM multipath).
And four stable fixes:
- A DM thin-provisioning fix to restore 'error_if_no_space' setting
when a thin-pool is made writable again (after having been out of
space).
- A DM thin-provisioning fix to properly advertise discard support
for thin volumes that are stacked on a thin-pool whose underlying
data device doesn't support discards.
- A DM ioctl fix to allow ctrl-c to break out of an ioctl retry loop
when DM multipath is configured to 'queue_if_no_path'.
- A DM crypt fix for a possible hang on dm-crypt device removal"
* tag 'dm-4.4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm thin: fix regression in advertised discard limits
dm crypt: fix a possible hang due to race condition on exit
dm mpath: fix infinite recursion in ioctl when no paths and !queue_if_no_path
dm: do not reuse dm_blk_ioctl block_device input as local variable
dm: fix ioctl retry termination with signal
dm thin: restore requested 'error_if_no_space' setting on OODS to WRITE transition
Eric Dumazet [Tue, 24 Nov 2015 19:39:54 +0000 (11:39 -0800)]
pidns: fix NULL dereference in __task_pid_nr_ns()
I got a crash during a "perf top" session that was caused by a race in
__task_pid_nr_ns() :
pid_nr_ns() was inlined, but apparently compiler chose to read
task->pids[type].pid twice, and the pid->level dereference crashed
because we got a NULL pointer at the second read :
if (pid && ns->level <= pid->level) { // CRASH
Just use RCU API properly to solve this race, and not worry about "perf
top" crashing hosts :(
get_task_pid() can benefit from same fix.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Tue, 24 Nov 2015 18:26:30 +0000 (10:26 -0800)]
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
"A round of fixes/updates for the current series.
This looks a little bigger than it is, but that's mainly because we
pushed the lightnvm enabled null_blk change out of the merge window so
it could be updated a bit. The rest of the volume is also mostly
lightnvm. In particular:
- Lightnvm. Various fixes, additions, updates from Matias and
Javier, as well as from Wenwei Tao.
- NVMe:
- Fix for potential arithmetic overflow from Keith.
- Also from Keith, ensure that we reap pending completions from
a completion queue before deleting it. Fixes kernel crashes
when resetting a device with IO pending.
- Various little lightnvm related tweaks from Matias.
- Fixup flushes to go through the IO scheduler, for the cases where a
flush is not required. Fixes a case in CFQ where we would be
idling and not see this request, hence not break the idling. From
Jan Kara.
- Use list_{first,prev,next} in elevator.c for cleaner code. From
Gelian Tang.
- Fix for a warning trigger on btrfs and raid on single queue blk-mq
devices, where we would flush plug callbacks with preemption
disabled. From me.
- A mac partition validation fix from Kees Cook.
- Two merge fixes from Ming, marked stable. A third part is adding a
new warning so we'll notice this quicker in the future, if we screw
up the accounting.
- Cleanup of thread name/creation in mtip32xx from Rasmus Villemoes"
* 'for-linus' of git://git.kernel.dk/linux-block: (32 commits)
blk-merge: warn if figured out segment number is bigger than nr_phys_segments
blk-merge: fix blk_bio_segment_split
block: fix segment split
blk-mq: fix calling unplug callbacks with preempt disabled
mac: validate mac_partition is within sector
mtip32xx: use formatting capability of kthread_create_on_node
NVMe: reap completion entries when deleting queue
lightnvm: add free and bad lun info to show luns
lightnvm: keep track of block counts
nvme: lightnvm: use admin queues for admin cmds
lightnvm: missing free on init error
lightnvm: wrong return value and redundant free
null_blk: do not del gendisk with lightnvm
null_blk: use device addressing mode
null_blk: use ppa_cache pool
NVMe: Fix possible arithmetic overflow for max segments
blk-flush: Queue through IO scheduler when flush not required
null_blk: register as a LightNVM device
elevator: use list_{first,prev,next}_entry
lightnvm: cleanup queue before target removal
...
Ming Lei [Tue, 24 Nov 2015 02:35:31 +0000 (10:35 +0800)]
blk-merge: warn if figured out segment number is bigger than nr_phys_segments
We had seen lots of reports of this kind issue, so add one
warnning in blk-merge, then it can be triggered easily and
avoid to depend on warning/bug from drivers.
Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Ming Lei [Tue, 24 Nov 2015 02:35:30 +0000 (10:35 +0800)]
blk-merge: fix blk_bio_segment_split
Commit bdced438acd83a(block: setup bi_phys_segments after
splitting) introduces function of computing bio->bi_phys_segments
during bio splitting.
Unfortunately both bio->bi_seg_front_size and bio->bi_seg_back_size
arn't computed, so too many physical segments may be obtained
for one request since both the two are used to check if one segment
across two bios can be possible.
This patch fixes the issue by computing the two variables in
blk_bio_segment_split().
Fixes: bdced438acd83a(block: setup bi_phys_segments after splitting) Reported-by: Michael Ellerman <mpe@ellerman.id.au> Reported-by: Mark Salter <msalter@redhat.com> Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: Mark Salter <msalter@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Ming Lei [Tue, 24 Nov 2015 02:35:29 +0000 (10:35 +0800)]
block: fix segment split
Inside blk_bio_segment_split(), previous bvec pointer(bvprvp)
always points to the iterator local variable, which is obviously
wrong, so fix it by pointing to the local variable of 'bvprv'.
Fixes: 5014c311baa2b(block: fix bogus compiler warnings in blk-merge.c) Cc: stable@kernel.org #4.3 Reported-by: Michael Ellerman <mpe@ellerman.id.au> Reported-by: Mark Salter <msalter@redhat.com> Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: Mark Salter <msalter@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Linus Torvalds [Mon, 23 Nov 2015 21:19:27 +0000 (13:19 -0800)]
Merge tag 'linux-kselftest-4.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest fixes from Shuah Khan:
"This update consists of one minor documentation fix and a fix to an
existing test"
* tag 'linux-kselftest-4.4-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/seccomp: Get page size from sysconf
tools:testing/selftests: fix typo in futex/README
Mike Snitzer [Mon, 23 Nov 2015 18:44:38 +0000 (13:44 -0500)]
dm thin: fix regression in advertised discard limits
When establishing a thin device's discard limits we cannot rely on the
underlying thin-pool device's discard capabilities (which are inherited
from the thin-pool's underlying data device) given that DM thin devices
must provide discard support even when the thin-pool's underlying data
device doesn't support discards.
Users were exposed to this thin device discard limits regression if
their thin-pool's underlying data device does _not_ support discards.
This regression caused all upper-layers that called the
blkdev_issue_discard() interface to not be able to issue discards to
thin devices (because discard_granularity was 0). This regression
wasn't caught earlier because the device-mapper-test-suite's extensive
'thin-provisioning' discard tests are only ever performed against
thin-pool's with data devices that support discards.
Fix is to have thin_io_hints() test the pool's 'discard_enabled' feature
rather than inferring whether or not a thin device's discard support
should be enabled by looking at the thin-pool's discard_granularity.
Fixes: 216076705 ("dm thin: disable discard support for thin devices if pool's is disabled") Reported-by: Mike Gerber <mike@sprachgewalt.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # 4.1+
Linus Torvalds [Sun, 22 Nov 2015 23:21:40 +0000 (15:21 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge slub bulk allocator updates from Andrew Morton:
"This missed the merge window because I was waiting for some repairs to
come in. Nothing actually uses the bulk allocator yet and the changes
to other code paths are pretty small. And the net guys are waiting
for this so they can start merging the client code"
More comments from Jesper Dangaard Brouer:
"The kmem_cache_alloc_bulk() call, in mm/slub.c, were included in
previous kernel. The present version contains a bug. Vladimir
Davydov noticed it contained a bug, when kernel is compiled with
CONFIG_MEMCG_KMEM (see commit 03ec0ed57ffc: "slub: fix kmem cgroup
bug in kmem_cache_alloc_bulk"). Plus the mem cgroup counterpart in
kmem_cache_free_bulk() were missing (see commit 033745189b1b "slub:
add missing kmem cgroup support to kmem_cache_free_bulk").
I don't consider the fix stable-material because there are no in-tree
users of the API.
But with known bugs (for memcg) I cannot start using the API in the
net-tree"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
slab/slub: adjust kmem_cache_alloc_bulk API
slub: add missing kmem cgroup support to kmem_cache_free_bulk
slub: fix kmem cgroup bug in kmem_cache_alloc_bulk
slub: optimize bulk slowpath free by detached freelist
slub: support for bulk free with SLUB freelists
Linus Torvalds [Sun, 22 Nov 2015 23:10:57 +0000 (15:10 -0800)]
Merge tag 'tty-4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are a few small tty/serial driver fixes for 4.4-rc2 that resolve
some reported problems.
All have been in linux-next, full details are in the shortlog below"
* tag 'tty-4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: export fsl8250_handle_irq
serial: 8250_mid: Add missing dependency
tty: audit: Fix audit source
serial: etraxfs-uart: Fix crash
serial: fsl_lpuart: Fix earlycon support
bcm63xx_uart: Use the device name when registering an interrupt
tty: Fix direct use of tty buffer work
tty: Fix tty_send_xchar() lock order inversion
Linus Torvalds [Sun, 22 Nov 2015 21:26:24 +0000 (13:26 -0800)]
Merge tag 'staging-4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging/IIO fixes from Greg KH:
"Here are some staging and iio driver fixes for 4.4-rc2. All of these
are in response to issues that have been reported and have been in
linux-next for a while"
* tag 'staging-4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
Revert "Staging: wilc1000: coreconfigurator: Drop unneeded wrapper functions"
iio: adc: xilinx: Fix VREFN scale
iio: si7020: Swap data byte order
iio: adc: vf610_adc: Fix division by zero error
iio:ad7793: Fix ad7785 product ID
iio: ad5064: Fix ad5629/ad5669 shift
iio:ad5064: Make sure ad5064_i2c_write() returns 0 on success
iio: lpc32xx_adc: fix warnings caused by enabling unprepared clock
staging: iio: select IRQ_WORK for IIO_DUMMY_EVGEN
vf610_adc: Fix internal temperature calculation
Linus Torvalds [Sun, 22 Nov 2015 21:15:05 +0000 (13:15 -0800)]
Merge tag 'usb-4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are a number of USB fixes and new device ids for 4.4-rc2. All
have been in linux-next and the details are in the shortlog"
* tag 'usb-4.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (28 commits)
usblp: do not set TASK_INTERRUPTIBLE before lock
USB: MAINTAINERS: cxacru
usb: kconfig: fix warning of select USB_OTG
USB: option: add XS Stick W100-2 from 4G Systems
xhci: Fix a race in usb2 LPM resume, blocking U3 for usb2 devices
usb: xhci: fix checking ep busy for CFC
xhci: Workaround to get Intel xHCI reset working more reliably
usb: chipidea: imx: fix a possible NULL dereference
usb: chipidea: usbmisc_imx: fix a possible NULL dereference
usb: chipidea: otg: gadget module load and unload support
usb: chipidea: debug: disable usb irq while role switch
ARM: dts: imx27.dtsi: change the clock information for usb
usb: chipidea: imx: refine clock operations to adapt for all platforms
usb: gadget: atmel_usba_udc: Expose correct device speed
usb: musb: enable usb_dma parameter
usb: phy: phy-mxs-usb: fix a possible NULL dereference
usb: dwc3: gadget: let us set lower max_speed
usb: musb: fix tx fifo flush handling
usb: gadget: f_loopback: fix the warning during the enumeration
usb: dwc2: host: Fix remote wakeup when not in DWC2_L2
...
Linus Torvalds [Sun, 22 Nov 2015 20:59:46 +0000 (12:59 -0800)]
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
- Fix a flood of annoying build warnings
- A number of fixes for Atheros 79xx platforms
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: ath79: Add a machine entry for booting OF machines
MIPS: ath79: Fix the size of the MISC INTC registers in ar9132.dtsi
MIPS: ath79: Fix the DDR control initialization on ar71xx and ar934x
MIPS: Fix flood of warnings about comparsion being always true.
Linus Torvalds [Sun, 22 Nov 2015 20:50:58 +0000 (12:50 -0800)]
Merge branch 'parisc-4.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc update from Helge Deller:
"This patchset adds Huge Page and HUGETLBFS support for parisc"
Honestly, the hugepage support should have gone through in the merge
window, and is not really an rc-time fix. But it only touches
arch/parisc, and I cannot find it in myself to care. If one of the
three parisc users notices a breakage, I will point at Helge and make
rude farting noises.
* 'parisc-4.4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Map kernel text and data on huge pages
parisc: Add Huge Page and HUGETLBFS support
parisc: Use long branch to do_syscall_trace_exit
parisc: Increase initial kernel mapping to 32MB on 64bit kernel
parisc: Initialize the fault vector earlier in the boot process.
parisc: Add defines for Huge page support
parisc: Drop unused MADV_xxxK_PAGES flags from asm/mman.h
parisc: Drop definition of start_thread_som for HP-UX SOM binaries
parisc: Fix wrong comment regarding first pmd entry flags
Linus Torvalds [Sun, 22 Nov 2015 20:37:20 +0000 (12:37 -0800)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf tool fixes from Thomas Gleixner:
"A couple of fixes for perf tools:
- Build system updates
- Plug a memory leak in an error path of perf probe
- Tear down probes correctly when adding fails
- Fixes to the perf symbol handling
- Fix ordering of event processing in buildid-list
- Fix per DSO filtering in the histogram browser"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf probe: Clear probe_trace_event when add_probe_trace_event() fails
perf probe: Fix memory leaking on failure by clearing all probe_trace_events
perf inject: Also re-pipe lost_samples event
perf buildid-list: Requires ordered events
perf symbols: Fix dso lookup by long name and missing buildids
perf symbols: Allow forcing reading of non-root owned files by root
perf hists browser: The dso can be obtained from popup_action->ms.map->dso
perf hists browser: Fix 'd' hotkey action to filter by DSO
perf symbols: Rebuild rbtree when adjusting symbols for kcore
tools: Add a "make all" rule
tools: Actually install tmon in the install rule
Linus Torvalds [Sun, 22 Nov 2015 20:00:12 +0000 (12:00 -0800)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"This update contains:
- MPX updates for handling 32bit processes
- A fix for a long standing bug in 32bit signal frame handling
related to FPU/XSAVE state
- Handle get_xsave_addr() correctly in KVM
- Fix SMAP check under paravirtualization
- Add a comment to the static function trace entry to avoid further
confusion about the difference to dynamic tracing"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Fix SMAP check in PVOPS environments
x86/ftrace: Add comment on static function tracing
x86/fpu: Fix get_xsave_addr() behavior under virtualization
x86/fpu: Fix 32-bit signal frame handling
x86/mpx: Fix 32-bit address space calculation
x86/mpx: Do proper get_user() when running 32-bit binaries on 64-bit kernels
Adjust kmem_cache_alloc_bulk API before we have any real users.
Adjust API to return type 'int' instead of previously type 'bool'. This
is done to allow future extension of the bulk alloc API.
A future extension could be to allow SLUB to stop at a page boundary, when
specified by a flag, and then return the number of objects.
The advantage of this approach, would make it easier to make bulk alloc
run without local IRQs disabled. With an approach of cmpxchg "stealing"
the entire c->freelist or page->freelist. To avoid overshooting we would
stop processing at a slab-page boundary. Else we always end up returning
some objects at the cost of another cmpxchg.
To keep compatible with future users of this API linking against an older
kernel when using the new flag, we need to return the number of allocated
objects with this API change.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
slub: add missing kmem cgroup support to kmem_cache_free_bulk
Initial implementation missed support for kmem cgroup support in
kmem_cache_free_bulk() call, add this.
If CONFIG_MEMCG_KMEM is not enabled, the compiler should be smart enough
to not add any asm code.
Incoming bulk free objects can belong to different kmem cgroups, and
object free call can happen at a later point outside memcg context. Thus,
we need to keep the orig kmem_cache, to correctly verify if a memcg object
match against its "root_cache" (s->memcg_params.root_cache).
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
slub: fix kmem cgroup bug in kmem_cache_alloc_bulk
The call slab_pre_alloc_hook() interacts with kmemgc and is not allowed to
be called several times inside the bulk alloc for loop, due to the call to
memcg_kmem_get_cache().
This would result in hitting the VM_BUG_ON in __memcg_kmem_get_cache.
As suggested by Vladimir Davydov, change slab_post_alloc_hook() to be able
to handle an array of objects.
A subtle detail is, loop iterator "i" in slab_post_alloc_hook() must have
same type (size_t) as size argument. This helps the compiler to easier
realize that it can remove the loop, when all debug statements inside loop
evaluates to nothing. Note, this is only an issue because the kernel is
compiled with GCC option: -fno-strict-overflow
In slab_alloc_node() the compiler inlines and optimizes the invocation of
slab_post_alloc_hook(s, flags, 1, &object) by removing the loop and access
object directly.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reported-by: Vladimir Davydov <vdavydov@virtuozzo.com> Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com> Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
slub: optimize bulk slowpath free by detached freelist
This change focus on improving the speed of object freeing in the
"slowpath" of kmem_cache_free_bulk.
The calls slab_free (fastpath) and __slab_free (slowpath) have been
extended with support for bulk free, which amortize the overhead of
the (locked) cmpxchg_double.
To use the new bulking feature, we build what I call a detached
freelist. The detached freelist takes advantage of three properties:
1) the free function call owns the object that is about to be freed,
thus writing into this memory is synchronization-free.
2) many freelist's can co-exist side-by-side in the same slab-page
each with a separate head pointer.
3) it is the visibility of the head pointer that needs synchronization.
Given these properties, the brilliant part is that the detached
freelist can be constructed without any need for synchronization. The
freelist is constructed directly in the page objects, without any
synchronization needed. The detached freelist is allocated on the
stack of the function call kmem_cache_free_bulk. Thus, the freelist
head pointer is not visible to other CPUs.
All objects in a SLUB freelist must belong to the same slab-page.
Thus, constructing the detached freelist is about matching objects
that belong to the same slab-page. The bulk free array is scanned is
a progressive manor with a limited look-ahead facility.
Kmem debug support is handled in call of slab_free().
Notice kmem_cache_free_bulk no longer need to disable IRQs. This
only slowed down single free bulk with approx 3 cycles.
Performance data:
Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz
SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns
To get stable and comparable numbers, the kernel have been booted with
"slab_merge" (this also improve performance for larger bulk sizes).
Performance data, compared against fallback bulking:
Performance with normal SLUB merging is significantly slower for
larger bulking. This is believed to (primarily) be an effect of not
having to share the per-CPU data-structures, as tuning per-CPU size
can achieve similar performance.
Make it possible to free a freelist with several objects by adjusting API
of slab_free() and __slab_free() to have head, tail and an objects counter
(cnt).
Tail being NULL indicate single object free of head object. This allow
compiler inline constant propagation in slab_free() and
slab_free_freelist_hook() to avoid adding any overhead in case of single
object free.
This allows a freelist with several objects (all within the same
slab-page) to be free'ed using a single locked cmpxchg_double in
__slab_free() and with an unlocked cmpxchg_double in slab_free().
Object debugging on the free path is also extended to handle these
freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if
objects don't belong to the same slab-page.
These changes are needed for the next patch to bulk free the detached
freelists it introduces and constructs.
Micro benchmarking showed no performance reduction due to this change,
when debugging is turned off (compiled with CONFIG_SLUB_DEBUG).
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Helge Deller [Sat, 21 Nov 2015 23:07:06 +0000 (00:07 +0100)]
parisc: Add Huge Page and HUGETLBFS support
This patch adds huge page support to allow userspace to allocate huge
pages and to use hugetlbfs filesystem on 32- and 64-bit Linux kernels.
A later patch will add kernel support to map kernel text and data on
huge pages.
The only requirement is, that the kernel needs to be compiled for a
PA8X00 CPU (PA2.0 architecture). Older PA1.X CPUs do not support
variable page sizes. 64bit Kernels are compiled for PA2.0 by default.
Technically on parisc multiple physical huge pages may be needed to
emulate standard 2MB huge pages.