Damien Le Moal [Mon, 3 Jul 2017 06:44:58 +0000 (15:44 +0900)]
dm zoned: fix overflow when converting zone ID to sectors
A zone ID is a 32-bit unsigned int which can overflow when doing the
bit shifts in dmz_start_sect(). With a 256 MB zone size drive, the
overflow happens for a zone ID >= 8192.
Fix this by casting the zone ID to a sector_t before doing the bit
shift. While at it, similarly fix dmz_start_block().
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
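The overflow described in this commit can be reproduced in user space. The following is a minimal sketch, not the dm-zoned code itself: the function names, the `sector_t` typedef and the shift of 19 (256 MB zones counted in 512-byte sectors) are assumptions chosen to match the numbers in the message.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: sector_t is 64 bits wide. */
typedef uint64_t sector_t;

/* Shift done in 32-bit arithmetic: the high bits are lost before widening. */
sector_t start_sect_32bit(uint32_t zone_id, unsigned int shift)
{
	assert(shift < 32); /* shifting a 32-bit value by 32 or more is undefined */
	uint32_t sect = zone_id << shift;
	return sect;
}

/* Cast first, then shift: the whole computation is done in 64 bits. */
sector_t start_sect_64bit(uint32_t zone_id, unsigned int shift)
{
	assert(shift < 32);
	return (sector_t)zone_id << shift;
}
```

With a shift of 19, zone ID 8191 gives the same result in both variants, while zone ID 8192 wraps to sector 0 in the 32-bit variant and yields 2^32 in the 64-bit one.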
Damien Le Moal [Wed, 7 Jun 2017 06:55:39 +0000 (15:55 +0900)]
dm zoned: drive-managed zoned block device target
The dm-zoned device mapper target provides transparent write access
to zoned block devices (ZBC and ZAC compliant block devices).
dm-zoned hides from the device user (a file system or an application
doing raw block device accesses) any constraint imposed on write
requests by the device, equivalent to a drive-managed zoned block
device model.
Write requests are processed using a combination of on-disk buffering
using the device conventional zones and direct in-place processing for
requests aligned to a zone sequential write pointer position.
A background reclaim process implemented using dm_kcopyd_copy ensures
that conventional zones are always available for executing unaligned
write requests. The reclaim process overhead is minimized by managing
buffer zones in a least-recently-written order and first targeting the
oldest buffer zones. Doing so, blocks under regular write access (such
as metadata blocks of a file system) remain stored in conventional
zones, resulting in no apparent overhead.
The dm-zoned implementation focuses on simplicity and on minimizing overhead
(CPU, memory and storage overhead). For a 14TB host-managed disk with
256 MB zones, dm-zoned memory usage per disk instance is at most about
3 MB and as little as 5 zones will be used internally for storing metadata
and performing buffer zone reclaim operations. This is achieved using
zone level indirection rather than a full block indirection system for
managing block movement between zones.
The primary target of dm-zoned is host-managed zoned block devices but it can
also be used with host-aware device models to mitigate potential
device-side performance degradation due to excessive random writing.
Zoned block devices can be formatted and checked for use with the dm-zoned
target using the dmzadm utility available at:
https://github.com/hgst/dm-zoned-tools
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
[Mike Snitzer partly refactored Damien's original work to cleanup the code] Signed-off-by: Mike Snitzer <snitzer@redhat.com>
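The benefit of zone-level indirection mentioned in the dm-zoned commit can be illustrated with some arithmetic. This is only a sketch under stated assumptions: 14 TB is taken as 14 * 10^12 bytes, each mapping entry is assumed to be 8 bytes, and the block-level alternative is assumed to map 4 KB blocks. The helper names are hypothetical and the result is not a statement about the actual dm-zoned memory layout.

```c
#include <assert.h>
#include <stdint.h>

/* Number of whole mapping units (zones or blocks) on a device. */
uint64_t nr_units(uint64_t dev_bytes, uint64_t unit_bytes)
{
	assert(unit_bytes != 0);
	return dev_bytes / unit_bytes;
}

/* Size of a flat mapping table with one fixed-size entry per unit. */
uint64_t map_table_bytes(uint64_t dev_bytes, uint64_t unit_bytes,
			 uint64_t entry_bytes)
{
	return nr_units(dev_bytes, unit_bytes) * entry_bytes;
}
```

Under these assumptions a 14 TB disk has 52154 zones of 256 MB, so a per-zone table needs about 400 KB, whereas a per-block table for the same disk would need about 27 GB.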
Damien Le Moal [Mon, 8 May 2017 23:40:51 +0000 (16:40 -0700)]
dm kcopyd: add sequential write feature
When copying blocks to host-managed zoned block devices, writes must be
sequential. However, dm_kcopyd_copy() does not guarantee this as writes
are issued in the completion order of reads, and reads may complete out
of order despite being issued sequentially.
Fix this by introducing the DM_KCOPYD_WRITE_SEQ feature flag. This can
be specified when calling dm_kcopyd_copy() and should be set
automatically if one of the destinations is a host-managed zoned block
device. For a split job, the master job maintains the write position at
which writes must be issued. This is checked with the pop() function
which is modified to not return any write I/O sub job that is not at the
correct write position.
When DM_KCOPYD_WRITE_SEQ is specified for a job, errors cannot be
ignored and the flag DM_KCOPYD_IGNORE_ERROR is ignored, even if
specified by the user.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
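The write position check described for pop() can be modelled with a small toy queue. This is a simplified sketch, not the dm-kcopyd implementation: the structures, field names and the array-based queue are hypothetical and only illustrate the idea that a write sub job is handed out only when it sits at the master job's current write position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct write_job {
	uint64_t write_offset; /* where this sub job must land, in sectors */
	uint64_t count;        /* length of the sub job, in sectors */
	int pending;           /* non-zero while the job is queued */
};

struct master_job {
	uint64_t write_offset; /* next position at which a write may be issued */
};

/*
 * Return the queued sub job that sits exactly at the master write position,
 * or NULL if reads completed out of order and that job is not queued yet.
 */
struct write_job *pop_seq(struct master_job *master, struct write_job *jobs,
			  size_t nr_jobs)
{
	assert(master != NULL && (jobs != NULL || nr_jobs == 0));

	for (size_t i = 0; i < nr_jobs; i++) {
		struct write_job *job = &jobs[i];

		if (job->pending && job->write_offset == master->write_offset) {
			job->pending = 0;
			master->write_offset += job->count;
			return job;
		}
	}
	return NULL;
}
```

If the read for offset 8 completes before the read for offset 0, pop_seq() returns NULL until the job for offset 0 is queued, then hands out the two jobs in order.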
Damien Le Moal [Mon, 8 May 2017 23:40:50 +0000 (16:40 -0700)]
dm linear: add support for zoned block devices
Add support for zoned block devices by allowing host-managed zoned block
device mapped targets, the remapping of REQ_OP_ZONE_RESET and the post
processing (reply remapping) of REQ_OP_ZONE_REPORT.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Damien Le Moal [Mon, 8 May 2017 23:40:49 +0000 (16:40 -0700)]
dm flakey: add support for zoned block devices
With the development of file system support for zoned block devices
(e.g. f2fs), having dm-flakey support these devices is interesting
to improve testing.
Add host-aware and host-managed zoned block device support to
dm-flakey. The target type feature is set to DM_TARGET_ZONED_HM to
indicate support for host-managed models. Also add hooks for remapping
of REQ_OP_ZONE_RESET and REQ_OP_ZONE_REPORT bios. Additionally, in the
bio completion path, (backward) remapping of a zone report reply is
added.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Damien Le Moal [Mon, 8 May 2017 23:40:48 +0000 (16:40 -0700)]
dm: introduce dm_remap_zone_report()
A target driver that supports zoned block devices and exposes them as such may
receive a REQ_OP_ZONE_REPORT request for the user to determine the mapped
device zone configuration. To process such a request properly, the target
driver may need to remap the zone descriptors provided in the report
reply. The helper function dm_remap_zone_report() does this generically
using only the target start offset and length and the start offset
within the target device.
dm_remap_zone_report() will remap the start sector of all zones
reported. If the report includes sequential zones, the write pointer
position of these zones will also be remapped.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
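The remapping performed on a zone report can be sketched as a plain offset translation. The code below is an illustration under assumptions, not dm_remap_zone_report() itself: the structure, the enum and the parameter names are hypothetical, and the sketch leaves out the use of the target length to drop zones that lie outside the target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

enum zone_type { ZONE_CONVENTIONAL, ZONE_SEQUENTIAL };

struct zone_desc {
	sector_t start; /* zone start sector */
	sector_t len;   /* zone length in sectors */
	sector_t wp;    /* write pointer, meaningful for sequential zones only */
	enum zone_type type;
};

/*
 * Translate zone descriptors from device sectors to mapped-device sectors.
 * dev_start: start offset of the target within the underlying device
 * ti_begin:  start offset of the target within the mapped device
 */
void remap_zones(struct zone_desc *zones, size_t nr_zones,
		 sector_t dev_start, sector_t ti_begin)
{
	for (size_t i = 0; i < nr_zones; i++) {
		struct zone_desc *z = &zones[i];

		assert(z->start >= dev_start);
		z->start = z->start - dev_start + ti_begin;
		if (z->type == ZONE_SEQUENTIAL)
			z->wp = z->wp - dev_start + ti_begin;
	}
}
```

Only sequential zones get their write pointer translated; a conventional zone keeps whatever value the device reported in that field.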
Damien Le Moal [Mon, 8 May 2017 23:40:47 +0000 (16:40 -0700)]
dm: fix REQ_OP_ZONE_REPORT bio handling
A REQ_OP_ZONE_REPORT bio is not a medium access command. Its number of
sectors indicates the maximum size allowed for the report reply size and
not an amount of sectors accessed from the device. REQ_OP_ZONE_REPORT
bios should thus not be split depending on the target device maximum I/O
length but passed as-is. Note that it is the responsibility of the
target to remap and format the report reply.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Damien Le Moal [Mon, 8 May 2017 23:40:46 +0000 (16:40 -0700)]
dm: fix REQ_OP_ZONE_RESET bio handling
The REQ_OP_ZONE_RESET bio has no payload and zero sectors. Its position
is the only information used to indicate the zone to reset on the
device. Due to its zero length, this bio is not cloned and sent to the
target through the non-flush case in __split_and_process_bio(). Add an
additional case in that function to call __split_and_process_non_flush()
without checking the clone info size.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Damien Le Moal [Mon, 8 May 2017 23:40:43 +0000 (16:40 -0700)]
dm table: add zoned block devices validation
1) Introduce DM_TARGET_ZONED_HM feature flag:
The target drivers currently available will not operate correctly if a
table target maps onto a host-managed zoned block device.
To avoid problems, introduce the new feature flag DM_TARGET_ZONED_HM to
allow a target to explicitly state that it supports host-managed zoned
block devices. This feature is checked for all targets in a table if
any of the table's block devices are host-managed.
Note that as host-aware zoned block devices are backward compatible with
regular block devices, they can be used by any of the current target
types. This new feature is thus restricted to host-managed zoned block
devices.
2) Check device area zone alignment:
If a target maps to a zoned block device, check that the device area is
aligned on zone boundaries to avoid problems with REQ_OP_ZONE_RESET
operations (resetting a partially mapped sequential zone would not be
possible). This also facilitates the processing of zone report with
REQ_OP_ZONE_REPORT bios.
3) Check block devices zone model compatibility
When setting the DM device's queue limits, several possibilities exist
for zoned block devices:
1) The DM target driver may want to expose a different zone model
(e.g. host-managed device emulation or regular block device on top of
host-managed zoned block devices)
2) Expose the underlying zone model of the devices as-is
To allow both cases, the underlying block device zone model must be set
in the target limits in dm_set_device_limits() and the compatibility of
all devices checked similarly to the logical block size alignment. For
this last check, introduce validate_hardware_zoned_model() to check that
all targets of a table have the same zone model and that the zone sizes
of the target devices are equal.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
[Mike Snitzer refactored Damien's original work to simplify the code] Signed-off-by: Mike Snitzer <snitzer@redhat.com>
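The zone alignment check (item 2) and the zone model compatibility check (item 3) of this commit can be sketched as two small predicates. This is a hedged illustration only: the types and function names are hypothetical, the zone size is assumed to be a power of two number of sectors, and the real checks in dm-table operate on queue limits rather than on a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

enum zoned_model { ZONED_NONE, ZONED_HOST_AWARE, ZONED_HOST_MANAGED };

struct dev_info {
	enum zoned_model model;
	sector_t zone_sectors; /* zone size in sectors, assumed power of two */
};

/* A device area is usable only if it starts and ends on zone boundaries. */
bool area_is_zone_aligned(sector_t start, sector_t len, sector_t zone_sectors)
{
	assert(zone_sectors != 0 && (zone_sectors & (zone_sectors - 1)) == 0);
	return ((start | len) & (zone_sectors - 1)) == 0;
}

/* All devices of a table must share one zone model and one zone size. */
bool devices_are_compatible(const struct dev_info *devs, size_t nr_devs)
{
	for (size_t i = 1; i < nr_devs; i++) {
		if (devs[i].model != devs[0].model ||
		    devs[i].zone_sectors != devs[0].zone_sectors)
			return false;
	}
	return true;
}
```

An area that starts 8 sectors into a zone, or whose length is not a multiple of the zone size, is rejected by the first predicate, which is the situation where a zone reset would hit a partially mapped zone.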
Milan Broz [Tue, 6 Jun 2017 07:07:01 +0000 (09:07 +0200)]
dm crypt: add big-endian variant of plain64 IV
The big-endian IV (plain64be) is needed to map images from extracted
disks that are used in some external (on-chip FDE) disk encryption
drives, e.g.: data recovery from external USB/SATA drives that support
"internal" encryption.
Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
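A big-endian 64-bit sector IV can be generated portably with byte shifts. The sketch below assumes that the sector number occupies the last 8 bytes of the IV buffer and that the remaining bytes are zero; that layout and the function name are assumptions for illustration, not taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill an IV buffer with the 64-bit sector number in big-endian byte order.
 * Assumption: the number occupies the last 8 bytes and the rest is zero.
 */
void iv_plain64be(uint8_t *iv, size_t iv_size, uint64_t sector)
{
	assert(iv != NULL && iv_size >= sizeof(sector));

	memset(iv, 0, iv_size);
	for (size_t i = 0; i < sizeof(sector); i++) {
		/* The least significant byte goes last. */
		iv[iv_size - 1 - i] = (uint8_t)(sector & 0xff);
		sector >>= 8;
	}
}
```

Writing the bytes one at a time keeps the result independent of the host byte order, which is the point of having a dedicated big-endian variant next to plain64.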
Mikulas Patocka [Mon, 16 Jan 2017 21:07:01 +0000 (16:07 -0500)]
dm ioctl: report event number in DM_LIST_DEVICES
Report the event numbers for all the devices, so that the user doesn't
have to ask them one by one. The event number is reported after the
name field in the dm_name_list structure.
The location of the next record is specified in the dm_name_list->next
field, that means that we can put the new data after the end of name and
it is backward compatible with the old code. The old code just skips
the event number without interpreting it.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Andy Grover <agrover@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mikulas Patocka [Fri, 5 May 2017 18:12:52 +0000 (11:12 -0700)]
dm ioctl: add a new DM_DEV_ARM_POLL ioctl
This ioctl will record the current global event number in the structure
dm_file, so that the next select or poll call will wait until new events
have arrived since this ioctl.
The DM_DEV_ARM_POLL ioctl has the same effect as closing and reopening
the handle.
Using the DM_DEV_ARM_POLL ioctl is optional - if the userspace is OK
with closing and reopening the /dev/mapper/control handle after select
or poll, there is no need to re-arm via ioctl.
Usage:
1. open the /dev/mapper/control device
2. send the DM_DEV_ARM_POLL ioctl
3. scan the event numbers of all devices we are interested in and process
them
4. call select, poll or epoll on the handle (it waits until some new event
happens since the DM_DEV_ARM_POLL ioctl)
5. go to step 2
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Andy Grover <agrover@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Mikulas Patocka [Mon, 16 Jan 2017 21:05:59 +0000 (16:05 -0500)]
dm: add basic support for using the select or poll function
Add the ability to poll on the /dev/mapper/control device. The select
or poll function waits until any event happens on any dm device since
opening the /dev/mapper/control device. When select or poll returns the
device as readable, we must close and reopen the device to wait for new
dm events.
Usage:
1. open the /dev/mapper/control device
2. scan the event numbers of all devices we are interested in and process
them
3. call select, poll or epoll on the handle (it waits until some new event
happens since opening the device)
4. close the /dev/mapper/control handle
5. go to step 1
The next commit allows re-arming the polling without closing and
reopening the device.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Andy Grover <agrover@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Ming Lei [Mon, 19 Jun 2017 02:21:08 +0000 (10:21 +0800)]
nvme: host: unquiesce queue in nvme_kill_queues()
When nvme_kill_queues() is run, queues may be in the
quiesced state, so we forcibly unquiesce the queues to avoid
blocking dispatch, which avoids an I/O hang in the
remove path.
Previously we used blk_mq_start_stopped_hw_queues() as the
counterpart of blk_mq_quiesce_queue(). Now that we have
introduced blk_mq_unquiesce_queue(), use it explicitly.
Cc: linux-nvme@lists.infradead.org Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 6 Jun 2017 15:22:10 +0000 (23:22 +0800)]
Revert "blk-mq: don't use sync workqueue flushing from drivers"
This patch reverts commit 2719aa217e0d02(blk-mq: don't use
sync workqueue flushing from drivers) because only
blk_mq_quiesce_queue() needs the sync flush, and now
we don't need to stop the queue any more, so revert it.
Also changes to cancel_delayed_work() in blk_mq_stop_hw_queue().
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 6 Jun 2017 15:22:09 +0000 (23:22 +0800)]
blk-mq: clarify dispatch may not be drained/blocked by stopping queue
BLK_MQ_S_STOPPED may not be observed in other concurrent I/O paths,
we can't guarantee that dispatching won't happen after returning
from the APIs of stopping queue.
So clarify the fact and avoid potential misuse.
Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 6 Jun 2017 15:22:08 +0000 (23:22 +0800)]
blk-mq: don't stop queue for quiescing
A queue can be started by other blk-mq APIs and can be used in
different cases. This limits the uses of blk_mq_quiesce_queue()
if it is based on stopping the queue, and makes its usage very
difficult, especially since users have to use the stop queue APIs
carefully to avoid breaking blk_mq_quiesce_queue().
We have applied the QUIESCED flag for draining and blocking
dispatch, so it isn't necessary to stop queue any more.
After stopping queue is removed, blk_mq_quiesce_queue() can
be used safely and easily, then users won't worry about queue
restarting during quiescing at all.
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sun, 18 Jun 2017 20:24:27 +0000 (14:24 -0600)]
blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue
It is required that no dispatch can happen any more once
blk_mq_quiesce_queue() returns, and we don't have such requirement
on APIs of stopping queue.
But blk_mq_quiesce_queue() still may not block/drain dispatch in the
case of BLK_MQ_S_START_ON_RUN, so use the newly introduced flag of
QUEUE_FLAG_QUIESCED and evaluate it inside RCU read-side critical
sections for fixing this issue.
Also blk_mq_quiesce_queue() is implemented via stopping queue, which
limits its uses and easily causes races, because any queue restart in
other paths may break blk_mq_quiesce_queue(). With the introduced
flag of QUEUE_FLAG_QUIESCED, we don't need to depend on stopping queue
for quiescing any more.
Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 6 Jun 2017 15:22:03 +0000 (23:22 +0800)]
blk-mq: introduce blk_mq_unquiesce_queue
blk_mq_start_stopped_hw_queues() is used implicitly
as counterpart of blk_mq_quiesce_queue() for unquiescing queue,
so we introduce blk_mq_unquiesce_queue() and make it
as counterpart of blk_mq_quiesce_queue() explicitly.
This function is for improving the current quiescing mechanism
in the following patches.
Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:59 +0000 (14:38 +1000)]
block: don't check for BIO_MAX_PAGES in blk_bio_segment_split()
blk_bio_segment_split() makes sure bios have no more than
BIO_MAX_PAGES entries in the bi_io_vec.
This was done because bio_clone_bioset() (when given a
mempool bioset) could not handle larger io_vecs.
No driver uses bio_clone_bioset() any more, they all
use bio_clone_fast() if anything, and bio_clone_fast()
doesn't clone the bi_io_vec.
The main user of bio_clone_bioset() at this level
is bounce.c, and bouncing now happens before blk_bio_segment_split(),
so that is not of concern.
NeilBrown [Sun, 18 Jun 2017 04:38:59 +0000 (14:38 +1000)]
block: remove bio_clone() and all references.
bio_clone() is no longer used.
Only bio_clone_bioset() or bio_clone_fast() are used.
This is for the best, as bio_clone() used fs_bio_set,
and filesystems are unlikely to want to use bio_clone().
So remove bio_clone() and all references.
This includes a fix to some incorrect documentation.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:59 +0000 (14:38 +1000)]
bcache: use kmalloc to allocate bio in bch_data_verify()
This function allocates a bio, then a collection
of pages. It copes with failure.
It currently uses a mempool() to allocate the bio,
but alloc_page() to allocate the pages. These fail
in different ways, so the usage is inconsistent.
Change the bio_clone() to bio_clone_kmalloc()
so that no pool is used either for the bio or the pages.
Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:59 +0000 (14:38 +1000)]
xen-blkfront: remove bio splitting.
bios that are re-submitted will pass through blk_queue_split() when
blk_queue_bio() is called, and this will split the bio if necessary.
There is no longer any need to do this splitting in xen-blkfront.
NeilBrown [Sun, 18 Jun 2017 04:38:58 +0000 (14:38 +1000)]
lightnvm/pblk-read: use bio_clone_fast()
pblk_submit_read() uses bio_clone_bioset() but doesn't change the
io_vec, so bio_clone_fast() is a better choice.
It also uses fs_bio_set which is intended for filesystems. Using it
in a device driver can deadlock.
So allocate a new bioset, and use bio_clone_fast().
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Javier González <javier@cnexlabs.com> Tested-by: Javier González <javier@cnexlabs.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:58 +0000 (14:38 +1000)]
pktcdvd: use bio_clone_fast() instead of bio_clone()
pktcdvd doesn't change the bi_io_vec of the clone bio,
so it is more efficient to use bio_clone_fast(), and not clone
the bi_io_vec.
This requires providing a bio_set, and it is safest to
provide a dedicated bio_set rather than sharing
fs_bio_set, which filesystems use.
This new bio_set, pkt_bio_set, can also be used for the bio_split()
call as the two allocations (bio_clone_fast, and bio_split) are
independent, neither can block a bio allocated by the other.
NeilBrown [Sun, 18 Jun 2017 04:38:58 +0000 (14:38 +1000)]
drbd: use bio_clone_fast() instead of bio_clone()
drbd does not modify the bi_io_vec of the cloned bio,
so there is no need to clone that part. So bio_clone_fast()
is the better choice.
For bio_clone_fast() we need to specify a bio_set.
We could use fs_bio_set, which bio_clone() uses, or
drbd_md_io_bio_set, which drbd uses for metadata, but it is
generally best to avoid sharing bio_sets unless you can
be certain that there are no interdependencies.
So create a new bio_set, drbd_io_bio_set, and use bio_clone_fast().
Also remove a "XXX cannot fail ???" comment because it definitely
cannot fail - bio_clone_fast() doesn't fail if the GFP flags allow for
sleeping.
NeilBrown [Sun, 18 Jun 2017 04:38:58 +0000 (14:38 +1000)]
rbd: use bio_clone_fast() instead of bio_clone()
bio_clone() makes a copy of the bi_io_vec, but rbd never changes that,
so there is no need for a copy.
bio_clone_fast() can be used instead, which avoids making the copy.
This requires that we provide a bio_set. bio_clone() uses fs_bio_set,
but it isn't, in general, safe to use the same bio_set at different
levels of the stack, as that can lead to deadlocks. As filesystems
use fs_bio_set, block devices shouldn't.
As rbd never stacks, it is safe to have a single global bio_set for
all rbd devices to use. So allocate that when the module is
initialised, and use it with bio_clone_fast().
NeilBrown [Sun, 18 Jun 2017 04:38:58 +0000 (14:38 +1000)]
block: Improvements to bounce-buffer handling
Since commit 23688bf4f830 ("block: ensure to split after potentially
bouncing a bio") blk_queue_bounce() is called *before*
blk_queue_split().
This means that:
1/ the comments in blk_queue_split() about bounce buffers are
irrelevant, and
2/ a very large bio (more than BIO_MAX_PAGES) will no longer be
split before it arrives at blk_queue_bounce(), leading to the
possibility that bio_clone_bioset() will fail and a NULL
will be dereferenced.
Separately, blk_queue_bounce() shouldn't use fs_bio_set as the bio
being copied could be from the same set, and this could lead to a
deadlock.
So:
- allocate 2 private biosets for blk_queue_bounce, one for
splitting enormous bios and one for cloning bios.
- add code to split a bio that exceeds BIO_MAX_PAGES.
- Fix up the comments in blk_queue_split()
Credit-to: Ming Lei <tom.leiming@gmail.com> (suggested using single bio_for_each_segment loop) Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:57 +0000 (14:38 +1000)]
blk: use non-rescuing bioset for q->bio_split.
A rescuing bioset is only useful if there might be bios from
that same bioset on the bio_list_on_stack queue at a time
when bio_alloc_bioset() is called. This never applies to
q->bio_split.
Allocations from q->bio_split are only ever made from
blk_queue_split() which is only ever called early in each of
various make_request_fn()s. The original bio (call this A)
is then passed to generic_make_request() and is placed on
the bio_list_on_stack queue, and the bio that was allocated
from q->bio_split (B) is processed.
The processing of this may cause other bios to be passed to
generic_make_request() or may even cause the bio B itself to
be passed, possibly after some prefix has been split off
(using some other bioset).
generic_make_request() now guarantees that all of these bios
(B and dependants) will be fully processed before the tail
of the original bio A gets handled. None of these early bios
can possibly trigger an allocation from the original
q->bio_split as they are either too small to require
splitting or (more likely) are destined for a different queue.
The next time that the original q->bio_split might be used
by this thread is when A is processed again, as it might
still be too big to handle directly. By this time there
cannot be any other bios allocated from q->bio_split in the
generic_make_request() queue. So no rescuing will ever be
needed.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:57 +0000 (14:38 +1000)]
blk: make the bioset rescue_workqueue optional.
This patch converts bioset_create() to not create a workqueue by
default, so allocations will never trigger punt_bios_to_rescuer(). It
also introduces a new flag BIOSET_NEED_RESCUER which tells
bioset_create() to preserve the old behavior.
All callers of bioset_create() that are inside block device drivers,
are given the BIOSET_NEED_RESCUER flag.
biosets used by filesystems or other top-level users do not
need rescuing as the bio can never be queued behind other
bios. This includes fs_bio_set, blkdev_dio_pool,
btrfs_bioset, xfs_ioend_bioset, and one allocated by
target_core_iblock.c.
biosets used by md/raid do not need rescuing as
their usage was recently audited and revised to never
risk deadlock.
It is hoped that most, if not all, of the remaining biosets
can end up being the non-rescued version.
Reviewed-by: Christoph Hellwig <hch@lst.de> Credit-to: Ming Lei <ming.lei@redhat.com> (minor fixes) Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:57 +0000 (14:38 +1000)]
blk: replace bioset_create_nobvec() with a flags arg to bioset_create()
"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().
To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().
Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().
Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Sun, 18 Jun 2017 04:38:57 +0000 (14:38 +1000)]
blk: remove bio_set arg from blk_queue_split()
blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.
Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.
This is inconsistent and unnecessary. Remove the last arg and always use
q->bio_split inside blk_queue_split()
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Credit-to: Javier González <jg@lightnvm.io> (Noticed that lightnvm was missed) Reviewed-by: Javier González <javier@cnexlabs.com> Tested-by: Javier González <javier@cnexlabs.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch makes sure we always allocate requests in the core blk-mq
code and use a common prepare_request method to initialize them for
both mq I/O schedulers. For Kyber an additional limit_depth method
is added that is called before allocating the request.
Also because none of the initializations can really fail the new method
does not return an error - instead the bfq finish method is hardened
to deal with the no-IOC case.
Last but not least this removes the abuse of RQF_QUEUE by the blk-mq
scheduling code as RQF_ELFPRIV is all that is needed now.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
blk_mq_sched_assign_ioc now only handles the assignment of the ioc if
the scheduler needs it (bfq only at the moment). The caller to the
per-request initializer is moved out so that it can be merged with
a similar call for the kyber I/O scheduler.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
NeilBrown [Fri, 16 Jun 2017 05:02:09 +0000 (15:02 +1000)]
loop: Add PF_LESS_THROTTLE to block/loop device thread.
When a filesystem is mounted from a loop device, writes are
throttled by balance_dirty_pages() twice: once when writing
to the filesystem and once when the loop_handle_cmd() writes
to the backing file. This double-throttling can trigger
positive feedback loops that create significant delays. The
throttling at the lower level is seen by the upper level as
a slow device, so it throttles extra hard.
The PF_LESS_THROTTLE flag was created to handle exactly this
circumstance, though with an NFS filesystem mounted from a
local NFS server. It reduces the throttling on the lower
layer so that it can proceed largely unthrottled.
To demonstrate this, create a filesystem on a loop device
and write (e.g. with dd) several large files which combine
to consume significantly more than the limit set by
/proc/sys/vm/dirty_ratio or dirty_bytes. Measure the total
time taken.
When I do this directly on a device (no loop device) the
total time for several runs (mkfs, mount, write 200 files,
umount) is fairly stable: 28-35 seconds.
When I do this over a loop device the times are much worse
and less stable: 52-460 seconds. Half below 100 seconds,
half above.
When I apply this patch, the times become stable again,
though not as fast as the no-loop-back case: 53-72 seconds.
There may be room for further improvement as the total overhead still
seems too high, but this is a big improvement.
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Fri, 16 Jun 2017 16:14:59 +0000 (10:14 -0600)]
Merge branch 'nvme-4.13' of git://git.infradead.org/nvme into for-4.13/block
Pull NVMe changes for 4.13 from Christoph:
Highlights:
- UUID identifier support from Johannes
- Lots of cleanups from Sagi
- Host Memory Buffer support from me
And lots of cleanups and smaller fixes of course.
Note that the UUID identifier changes are based on top of the uuid tree.
I am the maintainer of that tree and will send it to Linus as soon as
4.12 is released as various other trees depend on it as well (and the
diffstat includes those changes unfortunately)
Arvind Yadav [Fri, 16 Jun 2017 09:54:39 +0000 (15:24 +0530)]
block: swim3: make of_device_ids const.
of_device_ids are not supposed to change at runtime. All functions
working with of_device_ids provided by <linux/of.h> work with const
of_device_ids. So mark the non-const structs as const.
File size before:
   text    data     bss     dec     hex filename
   8908    1096     624   10628    2984 drivers/block/swim3.o
File size after constify swim3_match:
   text    data     bss     dec     hex filename
   9708     296     624   10628    2984 drivers/block/swim3.o
Bart Van Assche [Tue, 13 Jun 2017 15:07:33 +0000 (08:07 -0700)]
block: Dedicated error code fixups
This patch fixes two sparse warnings introduced by the "dedicated
error codes for the block layer V3" patch series. These changes
have not been tested.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Scott Bauer [Thu, 15 Jun 2017 16:44:30 +0000 (10:44 -0600)]
nvme: implement NS Optimal IO Boundary from 1.3 Spec
The NVMe 1.3 spec introduces Namespace Optimal IO Boundaries (NOIOB),
which standardizes the stripe mechanism we currently have quirks for.
This patch implements the necessary logic to handle this new feature.
Signed-off-by: Scott Bauer <scott.bauer@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
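The effect of an optimal I/O boundary can be illustrated with a small check for whether a request straddles it. This is a sketch under assumptions and not the NVMe driver logic: it assumes the boundary has already been converted to 512-byte sectors and that a request crossing a boundary is a candidate for splitting. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* Sectors that fit between 'start' and the next boundary. */
sector_t sectors_to_boundary(sector_t start, sector_t boundary)
{
	assert(boundary != 0);
	return boundary - (start % boundary);
}

/* True if an I/O of 'len' sectors at 'start' straddles a boundary. */
bool crosses_boundary(sector_t start, sector_t len, sector_t boundary)
{
	return len > sectors_to_boundary(start, boundary);
}
```

With a boundary of 256 sectors, a 256-sector request at sector 0 fits exactly, while a 2-sector request at sector 255 crosses into the next stripe.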
nvme: move reset workqueue handling to common code
This moves the nvme_reset function from the PCIe driver to common code,
renaming it to nvme_reset_ctrl in the process. Additionally a new
helper nvme_reset_ctrl_sync is added for the case where we want to
wait for the reset. To facilitate that, the reset_work work structure is
moved to the common nvme_ctrl structure and the ->reset_ctrl method is
removed. For now the drivers initialize the reset_work with their own
callback, but longer term we should move to callouts for specific
parts of the reset process and move even more code to the core.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
nvme: move protection information check into nvme_setup_rw
It only applies to read/write commands, and this way non-PCIe drivers
get the check as well instead of having to duplicate it when adding
metadata support.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Dan Carpenter [Wed, 14 Jun 2017 10:46:45 +0000 (13:46 +0300)]
nvme-rdma: fix error code in nvme_rdma_create_ctrl()
We accidentally return ERR_PTR(0) which is NULL. The caller isn't
explicitly checking for that but I couldn't immediately spot whether
this would lead to a NULL dereference. Anyway, we can add an
error code easily enough.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
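The reason ERR_PTR(0) is a problem can be shown with a small userspace re-implementation of the error-pointer helpers. This is a sketch modeled on the kernel's <linux/err.h>, not the kernel header itself.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer (userspace sketch). */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an error pointer. */
static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Only the top MAX_ERRNO addresses are treated as error pointers. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
```

With these definitions ERR_PTR(0) is just NULL, and IS_ERR(NULL) is false, so a caller that only checks IS_ERR() would treat the failed call as a success.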
Guan Junxiong [Tue, 13 Jun 2017 02:51:24 +0000 (10:51 +0800)]
nvmf: keep track of nvmet connect error status
To let the host know what happens during connection establishment,
adjust the behavior of nvmf_log_connect_error to make more connect
specific error codes human-readable.
Bart Van Assche [Thu, 8 Jun 2017 16:43:29 +0000 (09:43 -0700)]
nvmet-fc: Remove a set-but-not-used variable
This was detected by building the nvmet-fc driver with W=1.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: James Smart <james.smart@broadcom.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
Allow overriding the announced NVMe Version of a target via configfs.
This is particularly helpful when debugging new features for the host
or target side without bumping the hard coded version (as the target
might not be fully compliant to the announced version yet).
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Guan Junxiong <guanjunxiong@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
nvmet: add uuid field to nvme_ns and populate via configfs
Add the UUID field from the NVMe Namespace Identification Descriptor
to the nvmet_ns structure and allow its population via configfs.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
nvmet: implement namespace identify descriptor list
An NVMe Identify NS command with a CNS value of '3' is expecting a list
of Namespace Identification Descriptor structures to be returned to
the host for the namespace requested in the namespace identify
command.
This Namespace Identification Descriptor structure consists of the
type of the namespace identifier, the length of the identifier and the
actual identifier.
Valid types are NGUID and UUID which we have saved in our nvme_ns
structure if they have been configured via configfs. If no value has
been assigned to one of these we return an "invalid opcode" back to
the host to maintain backward compatibility with older implementations
without Namespace Identify Descriptor list support.
Also as the Namespace Identify Descriptor list is the only mandatory
feature change between 1.2.1 and 1.3 we can bump the advertised
version as well.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
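The descriptor layout described above (type, length, identifier) can be sketched in userspace C as follows. This is an illustration under stated assumptions, not the nvmet implementation. The type values follow NVMe 1.3 (1 = EUI64, 2 = NGUID, 3 = UUID) and each descriptor carries a 4-byte header of type, length and two reserved bytes. The function names, the "all zero means not configured" rule and the UUID-before-NGUID order are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Namespace Identifier Type values from NVMe 1.3. */
enum { NIDT_EUI64 = 1, NIDT_NGUID = 2, NIDT_UUID = 3 };

/* Assumption for this sketch: an all-zero identifier is "not configured". */
static int id_is_set(const uint8_t *id, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (id[i])
			return 1;
	return 0;
}

/*
 * Append one descriptor at offset off: 4-byte header (type, length,
 * two reserved bytes) followed by the identifier. Returns the new
 * offset, or off unchanged if the descriptor does not fit.
 */
static size_t put_desc(uint8_t *buf, size_t buflen, size_t off,
		       uint8_t type, const uint8_t *id, uint8_t len)
{
	if (off + 4 + len > buflen)
		return off;
	buf[off] = type;
	buf[off + 1] = len;
	buf[off + 2] = 0;
	buf[off + 3] = 0;
	memcpy(buf + off + 4, id, len);
	return off + 4 + len;
}

/*
 * Build the list for one namespace. Returns the number of bytes used;
 * 0 means no identifier is configured and the caller should fail the
 * command instead of returning an empty list.
 */
static size_t build_desc_list(uint8_t *buf, size_t buflen,
			      const uint8_t uuid[16], const uint8_t nguid[16])
{
	size_t off = 0;

	memset(buf, 0, buflen);	/* a zeroed header terminates the list */
	if (id_is_set(uuid, 16))
		off = put_desc(buf, buflen, off, NIDT_UUID, uuid, 16);
	if (id_is_set(nguid, 16))
		off = put_desc(buf, buflen, off, NIDT_NGUID, nguid, 16);
	return off;
}
```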
Now that we have a way for getting the UUID from a target, provide it
to userspace as well.
Unfortunately there is already a sysfs attribute called UUID which is
a misnomer as it holds the NGUID value. So instead of creating yet
another wrong name, create a new 'nguid' sysfs attribute for the
NGUID. For the UUID attribute add a check whether the namespace has a
UUID assigned to it and return this or return the NGUID to maintain
backwards compatibility. This should give userspace a chance to catch
up.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Sagi Grimberg <sagi@rimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
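The fallback rule for the existing uuid attribute can be sketched as below. This is a userspace illustration only. The struct ns_ids type, the helper names and the "all zero means unset" convention are assumptions for the example and are not taken from the driver.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the two identifiers of a namespace. */
struct ns_ids {
	uint8_t nguid[16];
	uint8_t uuid[16];
};

/* Assumption for this sketch: an all-zero identifier is "unset". */
static bool id_is_set(const uint8_t id[16])
{
	for (size_t i = 0; i < 16; i++)
		if (id[i])
			return true;
	return false;
}

/*
 * Value backing the legacy "uuid" attribute: the real UUID when one is
 * assigned, otherwise the NGUID, so existing userspace keeps working.
 */
static const uint8_t *uuid_attr_value(const struct ns_ids *ids)
{
	return id_is_set(ids->uuid) ? ids->uuid : ids->nguid;
}
```

The new nguid attribute would always report the nguid member, independent of this fallback.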
If a target identifies itself as NVMe 1.3 compliant, try to get the
list of Namespace Identification Descriptors and populate the UUID,
NGUID and EUI64 fields in the NVMe namespace structure with these
values.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
The uuid field in the nvme_ns structure represents the nguid field
from the identify namespace command. And as NVMe 1.3 introduced a
UUID in the NVMe Namespace Identification Descriptor this will
collide.
So rename the uuid to nguid to prevent any further
confusion. Unfortunately we export the nguid to sysfs in the uuid
sysfs attribute, but this can't be changed anymore without possibly
breaking existing userspace.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Use NVME_IDENTIFY_DATA_SIZE define instead of hard coding the magic
4096 value.
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com>
[hch: converted three more users] Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Wed, 7 Jun 2017 18:32:50 +0000 (20:32 +0200)]
nvme-pci: Remove watchdog timer
The controller status polling was added to preemptively reset a failed
controller. This early detection would give commands that would normally
time out a chance for a retry, or find broken links when the platform
didn't support hotplug.
This once-per-second MMIO read, however, created more problems than
it solved. It often raced with PCIe Hotplug events, which required
complicated syncing between work queues, it frequently triggered PCIe
Completion Timeout errors that also led to fatal machine checks, and
it unnecessarily disrupted low power modes by running on idle controllers.
This patch removes the watchdog timer, and instead checks controller
health only on an IO timeout when we have a reason to believe something
is wrong. If the controller has failed, the driver will disable immediately
and request scheduling a reset.
Suggested-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
Xu Yu [Wed, 24 May 2017 08:39:55 +0000 (16:39 +0800)]
nvme-pci: remap BAR0 to cover admin CQ doorbell for large stride
The existing driver initially maps 8192 bytes of BAR0 which is
intended to cover doorbells of admin SQ and CQ. However, if a
large stride, e.g. 10, is used, the doorbell of admin CQ will
lie outside those 8192 bytes. Consequently, a page fault will be raised
when the admin CQ doorbell is accessed in nvme_configure_admin_queue().
This patch fixes this issue by remapping BAR0 before accessing
admin CQ doorbell if the initial mapping is not enough.
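The arithmetic behind this can be checked with a short sketch. It uses the doorbell layout from the NVMe specification: doorbells start at offset 1000h, and the doorbell for queue y sits at 1000h + (2y + is_cq) * (4 << CAP.DSTRD). The function names db_offset() and db_fits() are hypothetical and this is not the driver code.

```c
#include <stdbool.h>
#include <stdint.h>

#define NVME_DB_BASE 0x1000u	/* doorbell registers start at 1000h */

/* Byte offset in BAR0 of the SQ tail or CQ head doorbell of queue qid. */
static uint64_t db_offset(unsigned int qid, bool is_cq, unsigned int dstrd)
{
	uint64_t stride = 4ull << dstrd;	/* stride in bytes */

	return NVME_DB_BASE + (2ull * qid + (is_cq ? 1 : 0)) * stride;
}

/* True if the 4-byte doorbell lies entirely inside the first mapped bytes. */
static bool db_fits(unsigned int qid, bool is_cq, unsigned int dstrd,
		    uint64_t mapped)
{
	return db_offset(qid, is_cq, dstrd) + 4 <= mapped;
}
```

With a stride of 10 the admin CQ doorbell starts at offset 4096 + 4096 = 8192, which is the first byte past the initial 8192-byte mapping, while the admin SQ doorbell at offset 4096 is still covered.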
Sagi Grimberg [Thu, 4 May 2017 10:33:12 +0000 (13:33 +0300)]
nvme: Don't allow to reset a reconnecting controller
The reset operation is guaranteed to fail for all scenarios
but the esoteric case where in the last reconnect attempt
concurrent with the reset we happen to successfully reconnect.
We just deny initiating a reset if we are reconnecting.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
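A minimal sketch of such a guard is shown below. The enum and the function name are hypothetical and only loosely modeled on a controller state machine. The commit message itself only states that a reset is denied while reconnecting. Restricting resets to the NEW and LIVE states is an additional assumption of this example.

```c
#include <stdbool.h>

/* Hypothetical controller states for this sketch. */
enum ctrl_state {
	CTRL_NEW,
	CTRL_LIVE,
	CTRL_RESETTING,
	CTRL_RECONNECTING,
	CTRL_DELETING,
};

/*
 * Decide whether a reset may be started from the given state. A
 * reconnecting controller is refused, because the reset would fail
 * anyway unless the reconnect happened to succeed at the same time.
 */
static bool can_start_reset(enum ctrl_state state)
{
	switch (state) {
	case CTRL_NEW:
	case CTRL_LIVE:
		return true;
	default:
		return false;
	}
}
```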
Instead of introducing a flag for if the queue is allocated,
simply free the rdma resources when we get the error.
We allocate the queue rdma resources when we have an address
resolution, there we allocate (or take a reference on) our device
so we should free it when we have error after the address resolution
namely:
1. route resolution error
2. connect reject
3. connect error
4. peer unreachable error
Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
If a controller supports the host memory buffer we try to provide
it with the requested size up to an upper cap set as a module
parameter. We try to give as few descriptors as possible, eventually
working our way down.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Arnav Dawn [Fri, 12 May 2017 15:12:03 +0000 (17:12 +0200)]
nvme.h: add dword 12 - 15 fields to struct nvme_features
Signed-off-by: Arnav Dawn <a.dawn@samsung.com>
[hch: split from a larger patch, new changelog] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
The merge of 4.12-rc5 into the for-4.13/block tree didn't handle the queue
ready case correctly. Fix this by propagating blk_status_t into
nvme_rdma_queue_is_ready.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>