Jan Kara [Tue, 11 Apr 2017 09:29:01 +0000 (11:29 +0200)]
block: Fix list corruption of blk stats callback list
When CFQ calls wbt_disable_default(), it will call
blk_stat_remove_callback() to stop gathering IO statistics for the
purposes of writeback throttling. Later, when request_queue is
unregistered, wbt_exit() will call blk_stat_remove_callback() again
which will try to delete callback from the list again and possibly cause
list corruption.
Fix the problem by making wbt_disable_default() called wbt_exit() which
is properly guarded against being called multiple times.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
Separating discards and zeroout operations allows us to remove the LBPRZ
block zeroing constraints from discards and honor the device preferences
for UNMAP commands.
If supported by the device, we'll also choose UNMAP over one of the
WRITE SAME variants for discards.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
scsi: sd: Separate zeroout and discard command choices
Now that zeroout and discards are distinct operations we need to
separate the policy of choosing the appropriate command. Create a
zeroing_mode which can be one of:
write: Zeroout assist not present, use regular WRITE
writesame: Allow WRITE SAME(10/16) with a zeroed payload
writesame_16_unmap: Allow WRITE SAME(16) with UNMAP
writesame_10_unmap: Allow WRITE SAME(10) with UNMAP
The last two are conditional on the device being thin provisioned with
LBPRZ=1 and LBPWS=1 or LBPWS10=1 respectively.
Whether to set the UNMAP bit or not depends on the REQ_NOUNMAP flag. And
if none of the _unmap variants are supported, regular WRITE SAME will be
used if the device supports it.
The zeroout_mode is exported in sysfs and the detected mode for a given
device can be overridden using the string constants above.
With this change in place we can now issue WRITE SAME(16) with UNMAP set
for block zeroing applications that require hard guarantees and
logical_block_size granularity. And at the same time use the UNMAP
command with the device's preferred granulary and alignment for discard
operations.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we use the proper REQ_OP_WRITE_ZEROES operation everywhere we can
kill this hack.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
drbd: make intelligent use of blkdev_issue_zeroout
drbd always wants its discard wire operations to zero the blocks, so
use blkdev_issue_zeroout with the BLKDEV_ZERO_UNMAP flag instead of
reinventing it poorly.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Now that we have REQ_OP_WRITE_ZEROES implemented for all devices that
support efficient zeroing, we can remove the call to blkdev_issue_discard.
This means we only have two ways of zeroing left and can simplify the
code.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
But now for the real NVMe Write Zeroes yet, just to get rid of the
discard abuse for zeroing. Also rename the quirk flag to be a bit
more self-explanatory.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Try to use a write same with unmap bit variant if the device supports it
and the caller allows for it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
block_dev: use blkdev_issue_zerout for hole punches
This gets us support for non-discard efficient write of zeroes (e.g. NVMe)
and prepares for removing the discard_zeroes_data flag.
Also remove a pointless discard support check, which is done in
blkdev_issue_discard already.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
This avoids fallbacks to explicit zeroing in (__)blkdev_issue_zeroout if
the caller doesn't want them.
Also clean up the convoluted check for the return condition that this
new flag is added to.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
block: add a REQ_NOUNMAP flag for REQ_OP_WRITE_ZEROES
If this flag is set logical provisioning capable device should
release space for the zeroed blocks if possible, if it is not set
devices should keep the blocks anchored.
Also remove an out of sync kerneldoc comment for a static function
that would have become even more out of data with this change.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
block: add a flags argument to (__)blkdev_issue_zeroout
Turn the existing discard flag into a new BLKDEV_ZERO_UNMAP flag with
similar semantics, but without referring to diѕcard.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
block: stop using blkdev_issue_write_same for zeroing
We'll always use the WRITE ZEROES code for zeroing now.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
It seems like the code currently passes whatever it was using for writes
to WRITE SAME. Just switch it to WRITE ZEROES, although that doesn't
need any payload.
Untested, and confused by the code, maybe someone who understands it
better than me can help..
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
block: implement splitting of REQ_OP_WRITE_ZEROES bios
Copy and past the REQ_OP_WRITE_SAME code to prepare to implementations
that limit the write zeroes size.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Make life easy for implementations that needs to send a data buffer
to the device (e.g. SCSI) by numbering it as a data out command.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Split sd_setup_discard_cmnd into one function per provisioning type. While
this creates some very slight duplication of boilerplate code it keeps the
code modular for additions of new provisioning types, and for reusing the
write same functions for the upcoming scsi implementation of the Write Zeroes
operation.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
blk-mq: Clarify comments in blk_mq_dispatch_rq_list()
The blk_mq_dispatch_rq_list() implementation got modified several
times but the comments in that function were not updated every
time. Since it is nontrivial what is going on, update the comments
in blk_mq_dispatch_rq_list().
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list
Since the next patch in this series will use RCU to iterate over
tag_list, make this safe. Add lockdep_assert_held() statements
in functions that iterate over tag_list to make clear that using
list_for_each_entry() instead of list_for_each_entry_rcu() is
fine in these functions.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
blk-mq-sched: provide hooks for initializing hardware queue data
Schedulers need to be informed when a hardware queue is added or removed
at runtime so they can allocate/free per-hardware queue data. So,
replace the blk_mq_sched_init_hctx_data() helper, which only makes sense
at init time, with .init_hctx() and .exit_hctx() hooks.
We've added a considerable amount of fixes for stalls and issues
with the blk-mq scheduling in the 4.11 series since forking
off the for-4.12/block branch. We need to do improvements on
top of that for 4.12, so pull in the previous fixes to make
our lives easier going forward.
blk-mq: Restart a single queue if tag sets are shared
To improve scalability, if hardware queues are shared, restart
a single hardware queue in round-robin fashion. Rename
blk_mq_sched_restart_queues() to reflect the new semantics.
Remove blk_mq_sched_mark_restart_queue() because this function
has no callers. Remove flag QUEUE_FLAG_RESTART because this
patch removes the code that uses this flag.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
dm rq: Avoid that request processing stalls sporadically
While running the srp-test software I noticed that request
processing stalls sporadically at the beginning of a test, namely
when mkfs is run against a dm-mpath device. Every time when that
happened the following command was sufficient to resume request
processing:
echo run >/sys/kernel/debug/block/dm-0/state
This patch avoids that such request processing stalls occur. The
test I ran is as follows:
while srp-test/run_tests -d -r 30 -t 02-mq; do :; done
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Signed-off-by: Jens Axboe <axboe@fb.com>
If a .queue_rq() function returns BLK_MQ_RQ_QUEUE_BUSY then the block
driver that implements that function is responsible for rerunning the
hardware queue once requests can be queued again successfully.
commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped
queues") removed the blk_mq_stop_hw_queue() call from scsi_queue_rq()
for the BLK_MQ_RQ_QUEUE_BUSY case. Hence change all calls to functions
that are intended to rerun a busy queue such that these examine all
hardware queues instead of only stopped queues.
Since no other functions than scsi_internal_device_block() and
scsi_internal_device_unblock() should ever stop or restart a SCSI
queue, change the blk_mq_delay_queue() call into a
blk_mq_delay_run_hw_queue() call.
Fixes: commit 52d7f1b5c2f3 ("blk-mq: Avoid that requeueing starts stopped queues") Fixes: commit 7e79dadce222 ("blk-mq: stop hardware queue in blk_mq_delay_queue()") Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Long Li <longli@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Introduce a function that runs a hardware queue unconditionally
after a delay. Note: there is already a function that stops and
restarts a hardware queue after a delay, namely blk_mq_delay_queue().
This function will be used in the next patch in this series.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Long Li <longli@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Currently only dm and md/raid5 bios trigger
trace_block_bio_complete(). Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete. Only bio_endio() knows that.
So move the trace_block_bio_complete() call to bio_endio().
Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.
There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
trace event at the 'request' level, there is no point generating
one at the bio level too. In this case the bi_sector and bi_size
will have changed, so the bio level event would be wrong
2/ If the bio hasn't actually been queued yet, but is being aborted
early, then a trace event could be confusing. Some filesystems
call bio_endio() but do not want tracing.
3/ The bio_integrity code interposes itself by replacing bi_end_io,
then restoring it and calling bio_endio() again. This would produce
two identical trace events if left like that.
To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.
When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication. A new bio is split off the front, and
may be handle directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component. To achieve this was can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.
So if generic_make_request() is called (which generates a QUEUED
event), then bi_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.
blk-mq: remap queues when adding/removing hardware queues
blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the
behavior that drivers expect. However, commit 4e68a011428a changed
blk_mq_queue_reinit() to not remap queues for the case of CPU
hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap
queues as well. This breaks, for example, NBD's multi-connection mode,
leaving the added hardware queues unused. Fix it by making
blk_mq_update_nr_hw_queues() explicitly remap the queues.
Fixes: 4e68a011428a ("blk-mq: don't redistribute hardware queues on a CPU hotplug event") Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
back to the original scheduler. However, at this point, we've already
torn down the original scheduler's tags, so this causes a crash. Doing
the fallback like the legacy elevator path is much harder for mq, so fix
it by just falling back to none, instead.
blk-mq-sched: set up scheduler tags when bringing up new queues
If a new hardware queue is added at runtime, we don't allocate scheduler
tags for it, leading to a crash. This hooks up the scheduler framework
to blk_mq_{init,exit}_hctx() to make sure everything gets properly
initialized/freed.
blk-mq: use the right hctx when getting a driver tag fails
While dispatching requests, if we fail to get a driver tag, we mark the
hardware queue as waiting for a tag and put the requests on a
hctx->dispatch list to be run later when a driver tag is freed. However,
blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
queues if using a single-queue scheduler with a multiqueue device. If
blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
are processing. This means we end up using the hardware queue of the
previous request, which may or may not be the same as that of the
current request. If it isn't, the wrong hardware queue will end up
waiting for a tag, and the requests will be on the wrong dispatch list,
leading to a hang.
The fix is twofold:
1. Make sure we save which hardware queue we were trying to get a
request for in blk_mq_get_driver_tag() regardless of whether it
succeeds or not.
2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
blk_mq_hw_queue to make it clear that it must handle multiple
hardware queues, since I've already messed this up on a couple of
occasions.
This didn't appear in testing with nvme and mq-deadline because nvme has
more driver tags than the default number of scheduler tags. However,
with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
block: move timeout field in struct request to pack better
After commit 64c7f1d1572c, we went from 1 to 2 holes in my
test setup. If we move the timeout field a bit, we remove
both of those holes and shrink struct request by 8 bytes.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
->retries is counting the number of times a command is resubmitted, and
be cleared on the first time we see the command. We currently don't do
that for non-PCIe command, which is easily fixed by moving the setup
to common code.
This driver is for pre-IDE hardisk that are only found in PC from the
stoneage of personal computing, and which we don't support elsewhere
in the kernel these days.
It's also been marked broken forever.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>
The block layer core sets blk_mq_queue_data.list but no block
drivers read that member. Hence remove it and also the code that
is used to set this member.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
Jan Kara [Tue, 4 Apr 2017 12:31:30 +0000 (14:31 +0200)]
cfq: Disable writeback throttling by default
Writeback throttling does not play well with CFQ since that also tries
to throttle async writes. As a result async writeback can get starved in
presence of readers. As an example take a benchmark simulating
postgreSQL database running over a standard rotating SATA drive. There
are 16 processes doing random reads from a huge file (2*machine memory),
1 process doing random writes to the huge file and calling fsync once
per 50000 writes and 1 process doing sequential 8k writes to a
relatively small file wrapping around at the end of the file and calling
fsync every 5 writes. Under this load read latency easily exceeds the
target latency of 75 ms (just because there are so many reads happening
against a relatively slow disk) and thus writeback is throttled to a
point where only 1 write request is allowed at a time. Blktrace data
then looks like:
8,0 1 0 8.347751764 0 m N cfq workload slice:40000000
8,0 1 0 8.347755256 0 m N cfq293A / set_active wl_class: 0 wl_type:0
8,0 1 0 8.347784100 0 m N cfq293A / Not idling. st->count:1
8,0 1 3814 8.347763916 5839 UT N [kworker/u9:2] 1
8,0 0 0 8.347777605 0 m N cfq293A / Not idling. st->count:1
8,0 1 0 8.347784100 0 m N cfq293A / Not idling. st->count:1
8,0 3 1596 8.354364057 0 C R 156109528 + 8 (6906954) [0]
8,0 3 0 8.354383193 0 m N cfq6196SN / complete rqnoidle 0
8,0 3 0 8.354386476 0 m N cfq schedule dispatch
8,0 3 0 8.354399397 0 m N cfq293A / Not idling. st->count:1
8,0 3 0 8.354404705 0 m N cfq293A / dispatch_insert
8,0 3 0 8.354409454 0 m N cfq293A / dispatched a request
8,0 3 0 8.354412527 0 m N cfq293A / activate rq, drv=1
8,0 3 1597 8.354414692 0 D W 145961400 + 24 (6718452) [swapper/0]
8,0 3 0 8.354484184 0 m N cfq293A / Not idling. st->count:1
8,0 3 0 8.354487536 0 m N cfq293A / slice expired t=0
8,0 3 0 8.354498013 0 m N / served: vt=5888102466265088 min_vt=5888074869387264
8,0 3 0 8.354502692 0 m N cfq293A / sl_used=6737519 disp=1 charge=6737519 iops=0 sect=24
8,0 3 0 8.354505695 0 m N cfq293A / del_from_rr
...
8,0 0 1810 8.354728768 0 C W 145961400 + 24 (314076) [0]
8,0 0 0 8.354746927 0 m N cfq293A / complete rqnoidle 0
...
8,0 1 3829 8.389886102 5839 G W 145962968 + 24 [kworker/u9:2]
8,0 1 3830 8.389888127 5839 P N [kworker/u9:2]
8,0 1 3831 8.389908102 5839 A W 145978336 + 24 <- (8,4) 44000
8,0 1 3832 8.389910477 5839 Q W 145978336 + 24 [kworker/u9:2]
8,0 1 3833 8.389914248 5839 I W 145962968 + 24 (28146) [kworker/u9:2]
8,0 1 0 8.389919137 0 m N cfq293A / insert_request
8,0 1 0 8.389924305 0 m N cfq293A / add_to_rr
8,0 1 3834 8.389933175 5839 UT N [kworker/u9:2] 1
...
8,0 0 0 9.455290997 0 m N cfq workload slice:40000000
8,0 0 0 9.455294769 0 m N cfq293A / set_active wl_class:0 wl_type:0
8,0 0 0 9.455303499 0 m N cfq293A / fifo=ffff880003166090
8,0 0 0 9.455306851 0 m N cfq293A / dispatch_insert
8,0 0 0 9.455311251 0 m N cfq293A / dispatched a request
8,0 0 0 9.455314324 0 m N cfq293A / activate rq, drv=1
8,0 0 2043 9.455316210 6204 D W 145962968 + 24 (1065401962) [pgioperf]
8,0 0 0 9.455392407 0 m N cfq293A / Not idling. st->count:1
8,0 0 0 9.455395969 0 m N cfq293A / slice expired t=0
8,0 0 0 9.455404210 0 m N / served: vt=5888958194597888 min_vt=5888941810597888
8,0 0 0 9.455410077 0 m N cfq293A / sl_used=4000000 disp=1 charge=4000000 iops=0 sect=24
8,0 0 0 9.455416851 0 m N cfq293A / del_from_rr
...
8,0 0 2045 9.455648515 0 C W 145962968 + 24 (332305) [0]
8,0 0 0 9.455668350 0 m N cfq293A / complete rqnoidle 0
...
8,0 1 4371 9.455710115 5839 G W 145978336 + 24 [kworker/u9:2]
8,0 1 4372 9.455712350 5839 P N [kworker/u9:2]
8,0 1 4373 9.455730159 5839 A W 145986616 + 24 <- (8,4) 52280
8,0 1 4374 9.455732674 5839 Q W 145986616 + 24 [kworker/u9:2]
8,0 1 4375 9.455737563 5839 I W 145978336 + 24 (27448) [kworker/u9:2]
8,0 1 0 9.455742871 0 m N cfq293A / insert_request
8,0 1 0 9.455747550 0 m N cfq293A / add_to_rr
8,0 1 4376 9.455756629 5839 UT N [kworker/u9:2] 1
So we can see a Q event for a write request, then IO is blocked by
writeback throttling and G and I events for the request happen only once
other writeback IO is completed. Thus CFQ always sees only one write
request. When it sees it, it queues the async queue behind all the read
queues and the async queue gets scheduled after about one second. When
it is scheduled, that one request gets dispatched and async queue is
expired as it has no more requests to submit. Overall we submit about
one write request per second.
Although this scheduling is beneficial for read latency, writes are
heavily starved and this causes large delays all over the system (due to
processes blocking on page lock, transaction starts, etc.). When
writeback throttling is disabled, write throughput is about one fifth of
a read throughput which roughly matches readers/writers ratio and
overall the system stalls are much shorter.
Mixing writeback throttling logic with CFQ throttling logic is always a
recipe for surprises as CFQ assumes it sees the big part of the picture
which is not necessarily true when writeback throttling is blocking
requests. So disable writeback throttling logic by default when CFQ is
used as an IO scheduler.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
Adam Manzanares [Tue, 4 Apr 2017 15:25:14 +0000 (08:25 -0700)]
block: fix inheriting request priority from bio
In 4.10 I introduced a patch that associates the ioc priority with
each request in the block layer. This work was done in the single queue
block layer code. This patch unifies ioc priority to request mapping across
the single/multi queue block layers.
I have tested this patch with the null block device driver with the following
parameters.
A requeue means we go through nvme_fc_start_fcp_op again and get
another controller reference. To make sure the refcount doesn't
leak we also need to drop it for every completion that came from
the LLDD.
James Smart [Fri, 24 Mar 2017 03:41:27 +0000 (20:41 -0700)]
nvme_fc: Clean up host fcpio done status handling
As Dan Carpenter pointed out: mixing 16-bit nvme status with 32-bit
error status from driver. Corrected comment on fcp request struct
status field, and converted done routine to explicitly set nvme status
codes for nvme status.
Signed-off-by: James Smart <james.smart@broadcom.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
James Smart [Fri, 24 Mar 2017 03:41:22 +0000 (20:41 -0700)]
nvme_fc: Sync FC-NVME header with standard
Update FC-NVME definitions to match FC-NVME r1.14 (16-020vB) plus
change voted in by 2/22 FC-NVME Adhoc (see HOSTID below).
Includes the following:
- Addition of "status_code" field to ERSP IU
- Addition of FC-NVME LS RJT reason_codes and reason_explanations
- CreateAssociation payload, HostID field shortened to 16 bytes
Signed-off-by: James Smart <james.smart@broadcom.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
Sagi Grimberg [Sat, 18 Mar 2017 18:58:29 +0000 (20:58 +0200)]
nvme-rdma: Support ctrl_loss_tmo
Before scheduling a reconnect attempt, check
nr_reconnects against max_reconnects, if not
exhausted (or max_reconnects is not -1), schedule
a reconnect attempts, otherwise schedule ctrl
removal.
Sagi Grimberg [Sat, 18 Mar 2017 18:52:36 +0000 (20:52 +0200)]
nvme-fabrics: Allow ctrl loss timeout configuration
When a host sense that its controller session is damaged,
it tries to re-establish it periodically (reconnect every
reconnect_delay). It may very well be that the controller
is gone and never coming back, in this case the host will
try to reconnect forever.
Add a ctrl_loss_tmo to bound the number of reconnect attempts
to a specific controller (default to a reasonable 10 minutes).
The timeout configuration is actually translated into number of
reconnect attempts and not a schedule on its own but rather
divided with reconnect_delay. This is useful to prevent
racing flows of remove and reconnect, and it doesn't really
matter if we remove slightly sooner than what the user requested.
Parav Pandit [Tue, 28 Feb 2017 05:21:33 +0000 (23:21 -0600)]
nvmet: Introduced helper routine for controller status check.
This patch introduces helper function for checking controller
status during admin and io command processing which returns u16
status. As to bring consistency on returning status, other
friend functions also now return u16 status instead of int
to match the spec.
As part of the theseerror log prints in also prints qid on
which command error occured.
Sagi Grimberg [Sun, 5 Feb 2017 19:47:22 +0000 (21:47 +0200)]
net/utils: generic inet_pton_with_scope helper
Several locations in the stack need to handle ipv4/ipv6
(with scope) and port strings conversion to sockaddr.
Add a helper that takes either AF_INET, AF_INET6 or
AF_UNSPEC (for wildcard) to centralize this handling.
Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
If we are attacked with establishments/teradowns we need to
make sure we do not consume too much system memory. Thus
let ongoing controller teardowns complete before accepting
new controller establishments.
Sagi Grimberg [Thu, 9 Mar 2017 11:26:07 +0000 (13:26 +0200)]
nvme-rdma: handle cpu unplug when re-establishing the controller
If a cpu unplug event has occured, we need to take the minimum
of the provided nr_io_queues and the number of online cpus,
otherwise we won't be able to connect them as blk-mq mapping
won't dispatch to those queues.
Sagi Grimberg [Thu, 9 Mar 2017 11:45:52 +0000 (13:45 +0200)]
nvmet-rdma: Fix a possible uninitialized variable dereference
When handling a new recv command, we grab a new rsp resource and
check for the queue state being live. In case the queue is not in
live state, we simply restore the rsp back to the free list. However
in this flow we didn't set rsp->queue yet, so we cannot dereference it.
Instead, make sure to initialize rsp->queue (and other rsp members)
as soon as possible so we won't reference uninitialized variables.
Sagi Grimberg [Mon, 13 Mar 2017 11:27:51 +0000 (13:27 +0200)]
nvme-loop: handle cpu unplug when re-establishing the controller
If a cpu unplug event has occured, we need to take the minimum
of the provided nr_io_queues and the number of online cpus,
otherwise we won't be able to connect them as blk-mq mapping
won't dispatch to those queues.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sagi Grimberg <sagi@grimberg.me>