Jens Axboe [Tue, 8 Nov 2016 04:32:37 +0000 (21:32 -0700)]
block: add scalable completion tracking of requests
For legacy block, we simply track them in the request queue. For
blk-mq, we track them on a per-sw queue basis, which we can then
sum up through the hardware queues and finally to a per device
state.
The stats are tracked in, roughly, 0.1s interval windows.
Add sysfs files to display the stats.
The feature is off by default, to avoid any extra overhead. In-kernel
users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
flags. We currently don't turn it on if someone just reads any of
the stats files, that is something we could add as well.
Tejun Heo [Thu, 10 Nov 2016 16:16:37 +0000 (11:16 -0500)]
block: cfq_cpd_alloc() should use @gfp
cfq_cpd_alloc() which is the cpd_alloc_fn implementation for cfq was
incorrectly hard coding GFP_KERNEL instead of using the mask specified
through the @gfp parameter. This currently doesn't cause any actual
issues because all current callers specify GFP_KERNEL. Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: e4a9bde9589f ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods") Signed-off-by: Jens Axboe <axboe@fb.com>
nvme: don't pass the full CQE to nvme_complete_async_event
We only need the status and result fields, and passing them explicitly
makes life a lot easier for the Fibre Channel transport which doesn't
have a full CQE for the fast path case.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
This adds a shared per-request structure for all NVMe I/O. This structure
is embedded as the first member in all NVMe transport drivers request
private data and allows to implement common functionality between the
drivers.
The first use is to replace the current abuse of the SCSI command
passthrough fields in struct request for the NVMe command passthrough,
but it will grow a field more fields to allow implementing things
like common abort handlers in the future.
The passthrough commands are handled by having a pointer to the SQE
(struct nvme_command) in struct nvme_request, and the union of the
possible result fields, which had to be turned from an anonymous
into a named union for that purpose. This avoids having to pass
a reference to a full CQE around and thus makes checking the result
a lot more lightweight.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Arnd Bergmann [Wed, 9 Nov 2016 12:55:34 +0000 (13:55 +0100)]
skd: fix msix error handling
As reported by gcc -Wmaybe-uninitialized, the cleanup path for
skd_acquire_msix tries to free the already allocated msi-x vectors
in reverse order, but the index variable may not have been
used yet:
drivers/block/skd_main.c: In function ‘skd_acquire_irq’:
drivers/block/skd_main.c:3890:8: error: ‘i’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
This changes the failure path to skip releasing the interrupts
if we have not started requesting them yet.
Fixes: 180b0ae77d49 ("skd: use pci_alloc_irq_vectors") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 9 Nov 2016 02:39:28 +0000 (19:39 -0700)]
block: set REQ_SYNC if we clear REQ_FUA|REQ_PREFLUSH
If we insert a flush request, we clear REQ_PREFLUSH and/or REQ_FUA,
depending on flush settings. Since op_is_sync() factors those flags
in for deciding whether this request is sync or not, we should
set REQ_SYNC to avoid screwing up this accounting.
This should be less fragile.
Reported-by: Logan Gunthorpe <logang@deltatee.com> Fixes: b685d3d65ac ("block: treat REQ_FUA and REQ_PREFLUSH as synchronous") Signed-off-by: Jens Axboe <axboe@fb.com>
writeback: track if we're sleeping on progress in balance_dirty_pages()
Note in the bdi_writeback structure whenever a task ends up sleeping
waiting for progress. We can use that information in the lower layers
to increase the priority of writes.
Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Jan Kara <jack@suse.cz>
Switch the skd driver to use pci_alloc_irq_vectors. We need to two calls to
pci_alloc_irq_vectors as skd only supports multiple MSI-X vectors, but not
multiple MSI vectors.
Otherwise this cleans up a lot of cruft and allows to a lot more common code.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
I've been looking over the multi page bio vec work again recently, and
one of the stumbling blocks is raw biovec access in the pktcdvd.
The first issue is that it directly sets up the page and offset pointers
in the biovec just before calling bio_add_page. As bio_add_page already
does the setup it's trivial to just switch it to stack variables for the
arguments.
The second issue is the copy code in pkt_make_local_copy, which
effectively is an opencoded version of bio_copy_data except that it
skips pages that already are the same in the ѕource and destination.
But we look at the only calleer we just set up the bio using bio_add_page
to point exactly to the page array that pkt_make_local_copy compares,
so the pages will always be the same and we can just remove this function.
Note that all this is done based on code inspection, I don't have any
packet writing hardware myself.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Commit 0e87e58bf60e ("blk-mq: improve warning for running a queue on the
wrong CPU") attempts to avoid triggering the WARN_ON in
__blk_mq_run_hw_queue when the expected CPU is dead. Problem is, in the
last batch execution before round robin, blk_mq_hctx_next_cpu can
schedule a dead CPU and also update next_cpu to the next alive CPU in
the mask, which will trigger the WARN_ON despite the previous
workaround.
The following patch fixes this scenario by always scheduling the value
in hctx->next_cpu. This changes the moment when we round-robin the CPU
running the hctx, but it really doesn't matter, since it still executes
BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU.
Fixes: 0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong CPU") Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 30 Mar 2016 16:21:08 +0000 (10:21 -0600)]
block: add code to track actual device queue depth
For blk-mq, ->nr_requests does track queue depth, at least at init
time. But for the older queue paths, it's simply a soft setting.
On top of that, it's generally larger than the hardware setting
on purpose, to allow backup of requests for merging.
Fill a hole in struct request with a 'queue_depth' member, that
drivers can call to more closely inform the block layer of the
real queue depth.
Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Jan Kara <jack@suse.cz>
Shaohua Li [Fri, 4 Nov 2016 00:03:53 +0000 (17:03 -0700)]
block: immediately dispatch big size request
Currently block plug holds up to 16 non-mergeable requests. This makes
sense if the request size is small, eg, reduce lock contention. But if
request size is big enough, we don't need to worry about lock
contention. Holding such request makes no sense and it lows the disk
utilization.
In practice, this improves 10% throughput for my raid5 sequential write
workload.
The size (128k) is arbitrary right now, but it makes sure lock
contention is small. This probably could be more intelligent, eg, check
average request size holded. Since this is mainly for sequential IO,
probably not worthy.
V2: check the last request instead of the first request, so as long as
there is one big size request we flush the plug.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Sat, 29 Oct 2016 00:23:40 +0000 (17:23 -0700)]
nvme: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code
Make nvme_requeue_req() check BLK_MQ_S_STOPPED instead of
QUEUE_FLAG_STOPPED. Remove the QUEUE_FLAG_STOPPED manipulations
that became superfluous because of this change. Change
blk_queue_stopped() tests into blk_mq_queue_stopped().
This patch fixes a race condition: using queue_flag_clear_unlocked()
is not safe if any other function that manipulates the queue flags
can be called concurrently, e.g. blk_cleanup_queue().
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Sat, 29 Oct 2016 00:22:16 +0000 (17:22 -0700)]
dm: Fix a race condition related to stopping and starting queues
Ensure that all ongoing dm_mq_queue_rq() and dm_mq_requeue_request()
calls have stopped before setting the "queue stopped" flag. This
allows to remove the "queue stopped" test from dm_mq_queue_rq() and
dm_mq_requeue_request(). This patch fixes a race condition because
dm_mq_queue_rq() is called without holding the queue lock and hence
BLK_MQ_S_STOPPED can be set at any time while dm_mq_queue_rq() is
in progress. This patch prevents that the following hang occurs
sporadically when using dm-mq:
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Sat, 29 Oct 2016 00:22:00 +0000 (17:22 -0700)]
dm: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code
Instead of manipulating both QUEUE_FLAG_STOPPED and BLK_MQ_S_STOPPED
in the dm start and stop queue functions, only manipulate the latter
flag. Change blk_queue_stopped() tests into blk_mq_queue_stopped().
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Sat, 29 Oct 2016 00:21:41 +0000 (17:21 -0700)]
blk-mq: Add a kick_requeue_list argument to blk_mq_requeue_request()
Most blk_mq_requeue_request() and blk_mq_add_to_requeue_list() calls
are followed by kicking the requeue list. Hence add an argument to
these two functions that allows to kick the requeue list. This was
proposed by Christoph Hellwig.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Wed, 2 Nov 2016 16:09:51 +0000 (10:09 -0600)]
blk-mq: Introduce blk_mq_quiesce_queue()
blk_mq_quiesce_queue() waits until ongoing .queue_rq() invocations
have finished. This function does *not* wait until all outstanding
requests have finished (this means invocation of request.end_io()).
The algorithm used by blk_mq_quiesce_queue() is as follows:
* Hold either an RCU read lock or an SRCU read lock around
.queue_rq() calls. The former is used if .queue_rq() does not
block and the latter if .queue_rq() may block.
* blk_mq_quiesce_queue() first calls blk_mq_stop_hw_queues()
followed by synchronize_srcu() or synchronize_rcu(). The latter
call waits for .queue_rq() invocations that started before
blk_mq_quiesce_queue() was called.
* The blk_mq_hctx_stopped() calls that control whether or not
.queue_rq() will be called are called with the (S)RCU read lock
held. This is necessary to avoid race conditions against
blk_mq_quiesce_queue().
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Sat, 29 Oct 2016 00:20:49 +0000 (17:20 -0700)]
blk-mq: Remove blk_mq_cancel_requeue_work()
Since blk_mq_requeue_work() no longer restarts stopped queues
canceling requeue work is no longer needed to prevent that a
stopped queue would be restarted. Hence remove this function.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Sat, 29 Oct 2016 00:20:32 +0000 (17:20 -0700)]
blk-mq: Avoid that requeueing starts stopped queues
Since blk_mq_requeue_work() starts stopped queues and since
execution of this function can be scheduled after a queue has
been stopped it is not possible to stop queues without using
an additional state variable to track whether or not the queue
has been stopped. Hence modify blk_mq_requeue_work() such that it
does not start stopped queues. My conclusion after a review of
the blk_mq_stop_hw_queues() and blk_mq_{delay_,}kick_requeue_list()
callers is as follows:
* In the dm driver starting and stopping queues should only happen
if __dm_suspend() or __dm_resume() is called and not if the
requeue list is processed.
* In the SCSI core queue stopping and starting should only be
performed by the scsi_internal_device_block() and
scsi_internal_device_unblock() functions but not by any other
function. Although the blk_mq_stop_hw_queue() call in
scsi_queue_rq() may help to reduce CPU load if a LLD queue is
full, figuring out whether or not a queue should be restarted
when requeueing a command would require to introduce additional
locking in scsi_mq_requeue_cmd() to avoid a race with
scsi_internal_device_block(). Avoid this complexity by removing
the blk_mq_stop_hw_queue() call from scsi_queue_rq().
* In the NVMe core only the functions that call
blk_mq_start_stopped_hw_queues() explicitly should start stopped
queues.
* A blk_mq_start_stopped_hwqueues() call must be added in the
xen-blkfront driver in its blkif_recover() function.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Roger Pau Monné <roger.pau@citrix.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: James Bottomley <jejb@linux.vnet.ibm.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Sat, 29 Oct 2016 00:20:02 +0000 (17:20 -0700)]
blk-mq: Move more code into blk_mq_direct_issue_request()
Move the "hctx stopped" test and the insert request calls into
blk_mq_direct_issue_request(). Rename that function into
blk_mq_try_issue_directly() to reflect its new semantics. Pass
the hctx pointer to that function instead of looking it up a
second time. These changes avoid that code has to be duplicated
in the next patch.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Sat, 29 Oct 2016 00:19:37 +0000 (17:19 -0700)]
blk-mq: Introduce blk_mq_queue_stopped()
The function blk_queue_stopped() allows to test whether or not a
traditional request queue has been stopped. Introduce a helper
function that allows block drivers to query easily whether or not
one or more hardware contexts of a blk-mq queue have been stopped.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Sat, 29 Oct 2016 00:19:15 +0000 (17:19 -0700)]
blk-mq: Introduce blk_mq_hctx_stopped()
Multiple functions test the BLK_MQ_S_STOPPED bit so introduce
a helper function that performs this test.
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Sat, 29 Oct 2016 00:18:48 +0000 (17:18 -0700)]
blk-mq: Do not invoke .queue_rq() for a stopped queue
The meaning of the BLK_MQ_S_STOPPED flag is "do not call
.queue_rq()". Hence modify blk_mq_make_request() such that requests
are queued instead of issued if a queue has been stopped.
Reported-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <tom.leiming@gmail.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Cc: <stable@vger.kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
Kent Overstreet [Mon, 31 Oct 2016 17:59:24 +0000 (11:59 -0600)]
block: add bio_iov_iter_get_pages()
This is a helper that pins down a range from an iov_iter and adds it to
a bio without requiring a separate memory allocation for the page array.
It will be used for upcoming direct I/O implementations for block devices
and iomap based file systems.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: ported to the iov_iter interface, renamed and added comments.
All blame should be directed to me and all fame should go to Kent
after this!] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Tue, 1 Nov 2016 16:00:38 +0000 (10:00 -0600)]
writeback: add wbc_to_write_flags()
Add wbc_to_write_flags(), which returns the write modifier flags to use,
based on a struct writeback_control. No functional changes in this
patch, but it prepares us for factoring other wbc fields for write type.
Signed-off-by: Jens Axboe <axboe@fb.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de>
fs: decouple READ and WRITE from the block layer ops
Move READ and WRITE to kernel.h and don't define them in terms of block
layer ops; they are our generic data direction indicators these days
and have no more resemblance with the block layer ops.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Noidle should be the default for writes as seen by all the compounds
definitions in fs.h using it. In fact only direct I/O really should
be using NODILE, so turn the whole flag around to get the defaults
right, which will make our life much easier especially onces the
WRITE_* defines go away.
This assumes all the existing "raw" users of REQ_SYNC for writes
want noidle behavior, which seems to be spot on from a quick audit.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
block: add a proper block layer data direction encoding
Currently the block layer op_is_write, bio_data_dir and rq_data_dir
helper treat every operation that is not a READ as a data out operation.
This worked surprisingly long, but the new REQ_OP_ZONE_REPORT operation
actually adds a second operation that reads data from the device.
Surprisingly nothing critical relied on this direction, but this might
be a good opportunity to properly fix this issue up.
We take a little inspiration and use the least significant bit of the
operation number to encode the data direction, which just requires us
to renumber the operations to fix this scheme.
Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields. This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.
In addition this allows passing around only one value in the block layer
instead of two (and eventuall also in the file systems, but we can do
that later) and thus clean up a lot of code.
Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32-bits. Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
block: split out request-only flags into a new namespace
A lot of the REQ_* flags are only used on struct requests, and only of
use to the block layer and a few drivers that dig into struct request
internals.
This patch adds a new req_flags_t rq_flags field to struct request for
them, and thus dramatically shrinks the number of common requests. It
also removes the unfortunate situation where we have to fit the fields
from the same enum into 32 bits for struct bio and 64 bits for
struct request.
The information that am I/O is a read-ahead can be useful for drivers.
In fact the NVMe driver already checks it, even if it won't ever be set
at the moment.
With the addition of the zoned operations the tests in this function
became incorrect. But I think it's much better to just open code the
allow operations in the only caller anyway.
Jens Axboe [Fri, 28 Oct 2016 01:03:55 +0000 (19:03 -0600)]
blk-mq: get rid of confusing blk_map_ctx structure
We can just use struct blk_mq_alloc_data - it has a few more
members, but we allocate it further down the stack anyway. So
this cleans up the code, and reduces the stack overhead a bit.
Jens Axboe [Thu, 27 Oct 2016 15:49:19 +0000 (09:49 -0600)]
blk-mq: update hardware and software queues for sleeping alloc
If we end up sleeping due to running out of requests, we should
update the hardware and software queues in the map ctx structure.
Otherwise we could end up having rq->mq_ctx point to the pre-sleep
context, and risk corrupting ctx->rq_list since we'll be
grabbing the wrong lock when inserting the request.
Reported-by: Dave Jones <davej@codemonkey.org.uk> Reported-by: Chris Mason <clm@fb.com> Tested-by: Chris Mason <clm@fb.com> Fixes: 63581af3f31e ("blk-mq: remove non-blocking pass in blk_mq_map_request") Signed-off-by: Jens Axboe <axboe@fb.com>
Mike Snitzer [Tue, 25 Oct 2016 14:46:23 +0000 (08:46 -0600)]
brd: remove support for BLKFLSBUF
Discontinue having the brd driver destructively free all pages in the
ramdisk in response to the BLKFLSBUF ioctl. Doing so allows a BLKFLSBUF
ioctl issued to a logical partition to destroy pages of the parent brd
device (and all other partitions of that brd device).
This change breaks compatibility - but in this case the compatibility
breaks more than it helps.
Reported-by: Mikulas Patocka <mpatocka@redhat.com> Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Jan Kara [Tue, 25 Oct 2016 11:53:27 +0000 (13:53 +0200)]
brd: Switch rd_size to unsigned long
Currently rd_size was int which lead to overflow and bogus device size
once the requested ramdisk size was 1 TB or more. Although these days
ramdisks with 1 TB size are mostly a mistake, the days when they are
useful are not far.
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
Arnd Bergmann [Fri, 21 Oct 2016 15:32:24 +0000 (17:32 +0200)]
sd: fix uninitialized variable access in error handling
If sd_zbc_report_zones fails, the check for 'zone_blocks == 0'
later in the function accesses uninitialized data:
drivers/scsi/sd_zbc.c: In function ‘sd_zbc_read_zones’:
drivers/scsi/sd_zbc.c:520:7: error: ‘zone_blocks’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
This sets it to zero, which has the desired effect of leaving
the sd_zbc_read_zones successfully with sdkp->zone_blocks = 0.
Fixes: 89d947561077 ("sd: Implement support for ZBC devices") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
The blkdev_report_zones produces a harmless warning when
-Wmaybe-uninitialized is set, after gcc gets a little confused
about the multiple 'goto' here:
block/blk-zoned.c: In function 'blkdev_report_zones':
block/blk-zoned.c:188:13: error: 'nz' may be used uninitialized in this function [-Werror=maybe-uninitialized]
Moving the assignment to nr_zones makes this a little simpler
while also avoiding the warning reliably. I'm removing the
extraneous initialization of 'int ret' in the same patch, as
that is semi-related and could cause an uninitialized use of
that variable to not produce a warning.
Fixes: 6a0cb1bc106f ("block: Implement support for zoned block devices") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Hannes Reinecke [Tue, 18 Oct 2016 06:40:34 +0000 (15:40 +0900)]
sd: Implement support for ZBC devices
Implement ZBC support functions to setup zoned disks, both
host-managed and host-aware models. Only zoned disks that satisfy
the following conditions are supported:
1) All zones are the same size, with the exception of an eventual
last smaller runt zone.
2) For host-managed disks, reads are unrestricted (reads are not
failed due to zone or write pointer alignement constraints).
Zoned disks that do not satisfy these 2 conditions are setup with
a capacity of 0 to prevent their use.
The function sd_zbc_read_zones, called from sd_revalidate_disk,
checks that the device satisfies the above two constraints. This
function may also change the disk capacity previously set by
sd_read_capacity for devices reporting only the capacity of
conventional zones at the beginning of the LBA range (i.e. devices
reporting rc_basis set to 0).
The capacity message output was moved out of sd_read_capacity into
a new function sd_print_capacity to include this eventual capacity
change by sd_zbc_read_zones. This new function also includes a call
to sd_zbc_print_zones to display the number of zones and zone size
of the device.
Signed-off-by: Hannes Reinecke <hare@suse.de>
[Damien: * Removed zone cache support
* Removed mapping of discard to reset write pointer command
* Modified sd_zbc_read_zones to include checks that the
device satisfies the kernel constraints
* Implemeted REPORT ZONES setup and post-processing based
on code from Shaun Tancheff <shaun.tancheff@seagate.com>
* Removed confusing use of 512B sector units in functions
interface] Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com> Tested-by: Shaun Tancheff <shaun.tancheff@seagate.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Shaun Tancheff [Tue, 18 Oct 2016 06:40:35 +0000 (15:40 +0900)]
blk-zoned: implement ioctls
Adds the new BLKREPORTZONE and BLKRESETZONE ioctls for respectively
obtaining the zone configuration of a zoned block device and resetting
the write pointer of sequential zones of a zoned block device.
The BLKREPORTZONE ioctl maps directly to a single call of the function
blkdev_report_zones. The zone information result is passed as an array
of struct blk_zone identical to the structure used internally for
processing the REQ_OP_ZONE_REPORT operation. The BLKRESETZONE ioctl
maps to a single call of the blkdev_reset_zones function.
Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Hannes Reinecke [Tue, 18 Oct 2016 06:40:33 +0000 (15:40 +0900)]
block: Implement support for zoned block devices
Implement zoned block device zone information reporting and reset.
Zone information are reported as struct blk_zone. This implementation
does not differentiate between host-aware and host-managed device
models and is valid for both. Two functions are provided:
blkdev_report_zones for discovering the zone configuration of a
zoned block device, and blkdev_reset_zones for resetting the write
pointer of sequential zones. The helper function blk_queue_zone_size
and bdev_zone_size are also provided for, as the name suggest,
obtaining the zone size (in 512B sectors) of the zones of the device.
Signed-off-by: Hannes Reinecke <hare@suse.de>
[Damien: * Removed the zone cache
* Implement report zones operation based on earlier proposal
by Shaun Tancheff <shaun.tancheff@seagate.com>] Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com> Tested-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Shaun Tancheff [Tue, 18 Oct 2016 06:40:32 +0000 (15:40 +0900)]
block: Define zoned block device operations
Define REQ_OP_ZONE_REPORT and REQ_OP_ZONE_RESET for handling zones of
host-managed and host-aware zoned block devices. With with these two
new operations, the total number of operations defined reaches 8 and
still fits with the 3 bits definition of REQ_OP_BITS.
Signed-off-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Hannes Reinecke [Tue, 18 Oct 2016 06:40:31 +0000 (15:40 +0900)]
block: update chunk_sectors in blk_stack_limits()
Signed-off-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com> Tested-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Damien Le Moal [Tue, 18 Oct 2016 06:40:29 +0000 (15:40 +0900)]
block: Add 'zoned' queue limit
Add the zoned queue limit to indicate the zoning model of a block device.
Defined values are 0 (BLK_ZONED_NONE) for regular block devices,
1 (BLK_ZONED_HA) for host-aware zone block devices and 2 (BLK_ZONED_HM)
for host-managed zone block devices. The standards defined drive managed
model is not defined here since these block devices do not provide any
command for accessing zone information. Drive managed model devices will
be reported as BLK_ZONED_NONE.
The helper functions blk_queue_zoned_model and bdev_zoned_model return
the zoned limit and the functions blk_queue_is_zoned and bdev_is_zoned
return a boolean for callers to test if a block device is zoned.
The zoned attribute is also exported as a string to applications via
sysfs. BLK_ZONED_NONE shows as "none", BLK_ZONED_HA as "host-aware" and
BLK_ZONED_HM as "host-managed".
Signed-off-by: Damien Le Moal <damien.lemoal@hgst.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Shaun Tancheff <shaun.tancheff@seagate.com> Tested-by: Shaun Tancheff <shaun.tancheff@seagate.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Dave Hansen [Mon, 17 Oct 2016 20:57:09 +0000 (13:57 -0700)]
x86, pkeys: remove cruft from never-merged syscalls
pkey_set() and pkey_get() were syscalls present in older versions
of the protection keys patches. The syscall number definitions
were inadvertently left in place. This patch removes them.
I did a git grep and verified that these are the last places in
the tree that these appear, save for the protection_keys.c tests
and Documentation. Those spots talk about functions called
pkey_get/set() which are wrappers for the direct PKRU
instructions, not the syscalls.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-arch@vger.kernel.org Cc: mgorman@techsingularity.net Cc: arnd@arndb.de Cc: linux-api@vger.kernel.org Cc: linux-mm@kvack.org Cc: luto@kernel.org Cc: akpm@linux-foundation.org Fixes: f9afc6197e9bb ("x86: Wire up protection keys system calls") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave Hansen [Mon, 17 Oct 2016 15:18:15 +0000 (08:18 -0700)]
generic syscalls: kill cruft from removed pkey syscalls
pkey_set() and pkey_get() were syscalls present in older versions
of the protection keys patches. They were fully excised from the
x86 code, but some cruft was left in the generic syscall code. The
C++ comments were intended to help to make it more glaring to me to
fix them before actually submitting them. That technique worked,
but later than I would have liked.
Linus Torvalds [Sat, 15 Oct 2016 19:09:13 +0000 (12:09 -0700)]
Merge tag 'befs-v4.9-rc1' of git://github.com/luisbg/linux-befs
Pull befs fixes from Luis de Bethencourt:
"I recently took maintainership of the befs file system [0]. This is
the first time I send you a git pull request, so please let me know if
all the below is OK.
Salah Triki and myself have been cleaning the code and fixing a few
small bugs.
Sorry I couldn't send this sooner in the merge window, I was waiting
to have my GPG key signed by kernel members at ELCE in Berlin a few
days ago."
[0] https://lkml.org/lkml/2016/7/27/502
* tag 'befs-v4.9-rc1' of git://github.com/luisbg/linux-befs: (39 commits)
befs: befs: fix style issues in datastream.c
befs: improve documentation in datastream.c
befs: fix typos in datastream.c
befs: fix typos in btree.c
befs: fix style issues in super.c
befs: fix comment style
befs: add check for ag_shift in superblock
befs: dump inode_size superblock information
befs: remove unnecessary initialization
befs: fix typo in befs_sb_info
befs: add flags field to validate superblock state
befs: fix typo in befs_find_key
befs: remove unused BEFS_BT_PARMATCH
fs: befs: remove ret variable
fs: befs: remove in vain variable assignment
fs: befs: remove unnecessary *befs_sb variable
fs: befs: remove useless initialization to zero
fs: befs: remove in vain variable assignment
fs: befs: Insert NULL inode to dentry
fs: befs: Remove useless calls to brelse in befs_find_brun_dblindirect
...
Linus Torvalds [Sat, 15 Oct 2016 17:03:15 +0000 (10:03 -0700)]
Merge tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull gcc plugins update from Kees Cook:
"This adds a new gcc plugin named "latent_entropy". It is designed to
extract as much possible uncertainty from a running system at boot
time as possible, hoping to capitalize on any possible variation in
CPU operation (due to runtime data differences, hardware differences,
SMP ordering, thermal timing variation, cache behavior, etc).
At the very least, this plugin is a much more comprehensive example
for how to manipulate kernel code using the gcc plugin internals"
* tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
latent_entropy: Mark functions with __latent_entropy
gcc-plugins: Add latent_entropy plugin
Linus Torvalds [Sat, 15 Oct 2016 16:26:12 +0000 (09:26 -0700)]
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS updates from Ralf Baechle:
"This is the main MIPS pull request for 4.9:
MIPS core arch code:
- traps: 64bit kernels should read CP0_EBase 64bit
- traps: Convert ebase to KSEG0
- c-r4k: Drop bc_wback_inv() from icache flush
- c-r4k: Split user/kernel flush_icache_range()
- cacheflush: Use __flush_icache_user_range()
- uprobes: Flush icache via kernel address
- KVM: Use __local_flush_icache_user_range()
- c-r4k: Fix flush_icache_range() for EVA
- Fix -mabi=64 build of vdso.lds
- VDSO: Drop duplicated -I*/-E* aflags
- tracing: move insn_has_delay_slot to a shared header
- tracing: disable uprobe/kprobe on compact branch instructions
- ptrace: Fix regs_return_value for kernel context
- Squash lines for simple wrapper functions
- Move identification of VP(E) into proc.c from smp-mt.c
- Add definitions of SYNC barrierstype values
- traps: Ensure full EBase is written
- tlb-r4k: If there are wired entries, don't use TLBINVF
- Sanitise coherentio semantics
- dma-default: Don't check hw_coherentio if device is non-coherent
- Support per-device DMA coherence
- Adjust MIPS64 CAC_BASE to reflect Config.K0
- Support generating Flattened Image Trees (.itb)
- generic: Introduce generic DT-based board support
- generic: Convert SEAD-3 to a generic board
- Enable hardened usercopy
- Don't specify STACKPROTECTOR in defconfigs
Octeon:
- Delete dead code and files across the platform.
- Change to use all memory into use by default.
- Rename upper case variables in setup code to lowercase.
- Delete legacy hack for broken bootloaders.
- Leave maintaining the link state to the actual ethernet/PHY drivers.
- Add DTS for D-Link DSR-500N.
- Fix PCI interrupt routing on D-Link DSR-500N.
Pistachio:
- Remove ANDROID_TIMED_OUTPUT from defconfig
TX39xx:
- Move GPIO setup from .mem_setup() to .arch_init()
- Convert to Common Clock Framework
TX49xx:
- Move GPIO setup from .mem_setup() to .arch_init()
- Convert to Common Clock Framework
txx9wdt:
- Add missing clock (un)prepare calls for CCF
BMIPS:
- Add PW, GPIO SDHCI and NAND device node names
- Support APPENDED_DTB
- Add missing bcm97435svmb to DT_NONE
- Rename bcm96358nb4ser to bcm6358-neufbox4-sercom
- Add DT examples for BCM63268, BCM3368 and BCM6362
- Add support for BCM3368 and BCM6362
PCI
- Reduce stack frame usage
- Use struct list_head lists
- Support for CONFIG_PCI_DOMAINS_GENERIC
- Make pcibios_set_cache_line_size an initcall
- Inline pcibios_assign_all_busses
- Split pci.c into pci.c & pci-legacy.c
- Introduce CONFIG_PCI_DRIVERS_LEGACY
- Support generic drivers
CPC
- Convert bare 'unsigned' to 'unsigned int'
- Avoid lock when MIPS CM >= 3 is present
GIC:
- Delete unused file smp-gic.c
mt7620:
- Delete unnecessary assignment for the field "owner" from PCI
BCM63xx:
- Let clk_disable() return immediately if clk is NULL
pm-cps:
- Change FSB workaround to CPU blacklist
- Update comments on barrier instructions
- Use MIPS standard lightweight ordering barrier
- Use MIPS standard completion barrier
- Remove selection of sync types
- Add MIPSr6 CPU support
- Support CM3 changes to Coherence Enable Register
SMP:
- Wrap call to mips_cpc_lock_other in mips_cm_lock_other
- Introduce mechanism for freeing and allocating IPIs
cpuidle:
- cpuidle-cps: Enable use with MIPSr6 CPUs.
SEAD3:
- Rewrite to use DT and generic kernel feature.
USB:
- host: ehci-sead3: Remove SEAD-3 EHCI code
FBDEV:
- cobalt_lcdfb: Drop SEAD3 support
dt-bindings:
- Document a binding for simple ASCII LCDs
auxdisplay:
- img-ascii-lcd: driver for simple ASCII LCD displays
irqchip i8259:
- i8259: Add domain before mapping parent irq
- i8259: Allow platforms to override poll function
- i8259: Remove unused i8259A_irq_pending
Malta:
- Rewrite to use DT
of/platform:
- Probe "isa" busses by default
CM:
- Print CM error reports upon bus errors
Module:
- Migrate exception table users off module.h and onto extable.h
- Make various drivers explicitly non-modular:
- Audit and remove any unnecessary uses of module.h
mailmap:
- Canonicalize to Qais' current email address.
Linus Torvalds [Sat, 15 Oct 2016 16:20:54 +0000 (09:20 -0700)]
Merge tag 'sound-fix-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Just a few trivial small fixes"
* tag 'sound-fix-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: line6: fix a crash in line6_hwdep_write()
ALSA: seq: fix passing wrong pointer in function call of compatibility layer
ALSA: hda - Fix a failure of micmute led when having multi adcs
ALSA: line6: Fix POD X3 Live audio input
Linus Torvalds [Sat, 15 Oct 2016 01:19:05 +0000 (18:19 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more misc uaccess and vfs updates from Al Viro:
"The rest of the stuff from -next (more uaccess work) + assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
score: traps: Add missing include file to fix build error
fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
fs/super.c: fix race between freeze_super() and thaw_super()
overlayfs: Fix setting IOP_XATTR flag
iov_iter: kernel-doc import_iovec() and rw_copy_check_uvector()
blackfin: no access_ok() for __copy_{to,from}_user()
arm64: don't zero in __copy_from_user{,_inatomic}
arm: don't zero in __copy_from_user_inatomic()/__copy_from_user()
arc: don't leak bits of kernel stack into coredump
alpha: get rid of tail-zeroing in __copy_user()
Linus Torvalds [Sat, 15 Oct 2016 00:47:31 +0000 (17:47 -0700)]
Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Including:
- nine bug fixes for stable. Some of these we found at the recent two
weeks of SMB3 test events/plugfests.
- significant improvements in reconnection (e.g. if server or network
crashes) especially when mounted with "persistenthandles" or to
server which advertises Continuous Availability on the share.
- a new mount option "idsfromsid" which improves POSIX compatibility
in some cases (when winbind not configured e.g.) by better (and
faster) fetching uid/gid from acl (when "cifsacl" mount option is
enabled). NB: we are almost complete work on "cifsacl" (querying
mode/uid/gid from ACL) for SMB3, but SMB3 support for cifsacl is
not included in this set.
- improved handling for SMB3 "credits" (even if server is buggy)
Still working on two sets of changes:
- cifsacl enablement for SMB3
- cleanup of RFC1001 length calculation (so we can handle encryption
and multichannel and RDMA)
And a couple of new bugs were reported recently (unrelated to above)
so will probably have another merge request next week"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6: (21 commits)
CIFS: Retrieve uid and gid from special sid if enabled
CIFS: Add new mount option to set owner uid and gid from special sids in acl
CIFS: Reset read oplock to NONE if we have mandatory locks after reopen
CIFS: Fix persistent handles re-opening on reconnect
SMB2: Separate RawNTLMSSP authentication from SMB2_sess_setup
SMB2: Separate Kerberos authentication from SMB2_sess_setup
Expose cifs module parameters in sysfs
Cleanup missing frees on some ioctls
Enable previous version support
Do not send SMB3 SET_INFO request if nothing is changing
SMB3: Add mount parameter to allow user to override max credits
fs/cifs: reopen persistent handles on reconnect
Clarify locking of cifs file and tcon structures and make more granular
Fix regression which breaks DFS mounting
fs/cifs: keep guid when assigning fid to fileinfo
SMB3: GUIDs should be constructed as random but valid uuids
Set previous session id correctly on SMB3 reconnect
cifs: Limit the overall credit acquired
Display number of credits available
Add way to query creation time of file via cifs xattr
...
Linus Torvalds [Sat, 15 Oct 2016 00:44:56 +0000 (17:44 -0700)]
Merge branch 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"Some fixes from Omar and Dave Sterba for our new free space tree.
This isn't heavily used yet, but as we move toward making it the new
default we wanted to nail down an endian bug"
* 'for-linus-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: tests: uninline member definitions in free_space_extent
btrfs: tests: constify free space extent specs
Btrfs: expand free space tree sanity tests to catch endianness bug
Btrfs: fix extent buffer bitmap tests on big-endian systems
Btrfs: catch invalid free space trees
Btrfs: fix mount -o clear_cache,space_cache=v2
Btrfs: fix free space tree bitmaps on big-endian systems
fs/super.c: don't fool lockdep in freeze_super() and thaw_super() paths
sb_wait_write()->percpu_rwsem_release() fools lockdep to avoid the
false-positives. Now that xfs was fixed by Dave's commit dbad7c993053
("xfs: stop holding ILOCK over filldir callbacks") we can remove it and
change freeze_super() and thaw_super() to run with s_writers.rw_sem locks
held; we add two trivial helpers for that, lockdep_sb_freeze_release()
and lockdep_sb_freeze_acquire().
xfstests-dev/check `grep -il freeze tests/*/???` does not trigger any
warning from lockdep.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Sat, 15 Oct 2016 00:23:33 +0000 (17:23 -0700)]
Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
"This update contains fixes to the "use mounter's permission to access
underlying layers" area, and miscellaneous other fixes and cleanups.
No new features this time"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: use vfs_get_link()
vfs: add vfs_get_link() helper
ovl: use generic_readlink
ovl: explain error values when removing acl from workdir
ovl: Fix info leak in ovl_lookup_temp()
ovl: during copy up, switch to mounter's creds early
ovl: lookup: do getxattr with mounter's permission
ovl: copy_up_xattr(): use strnlen
fs/super.c: fix race between freeze_super() and thaw_super()
Change thaw_super() to check frozen != SB_FREEZE_COMPLETE rather than
frozen == SB_UNFROZEN, otherwise it can race with freeze_super() which
drops sb->s_umount after SB_FREEZE_WRITE to preserve the lock ordering.
In this case thaw_super() will wrongly call s_op->unfreeze_fs() before
it was actually frozen, and call sb_freeze_unlock() which leads to the
unbalanced percpu_up_write(). Unfortunately lockdep can't detect this,
so this triggers misc BUG_ON()'s in kernel/rcu/sync.c.
Reported-and-tested-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Vivek Goyal [Fri, 14 Oct 2016 01:03:36 +0000 (03:03 +0200)]
overlayfs: Fix setting IOP_XATTR flag
ovl_fill_super calls ovl_new_inode to create a root inode for the new
superblock before initializing sb->s_xattr. This wrongly causes
IOP_XATTR to be cleared in i_opflags of the new inode, causing SELinux
to log the following message:
SELinux: (dev overlay, type overlay) has no xattr support
Fix this by initializing sb->s_xattr and similar fields before calling
ovl_new_inode.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Vegard Nossum [Sat, 8 Oct 2016 09:18:07 +0000 (11:18 +0200)]
iov_iter: kernel-doc import_iovec() and rw_copy_check_uvector()
Both import_iovec() and rw_copy_check_uvector() take an array
(typically small and on-stack) which is used to hold an iovec array copy
from userspace. This is to avoid an expensive memory allocation in the
fast path (i.e. few iovec elements).
The caller may have to check whether these functions actually used
the provided buffer or allocated a new one -- but this differs between
the too. Let's just add a kernel doc to clarify what the semantics are
for each function.
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Fri, 14 Oct 2016 22:17:12 +0000 (15:17 -0700)]
Merge tag 'linux-kselftest-4.9-rc1-update' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kselftest updates from Shuah Khan:
"This update consists of:
- Fixes and improvements to existing tests
- Moving code from Documentation to selftests, samples, and tools:
* Moves dnotify_test, prctl, ptp, vDSO, ia64, watchdog, and
networking tests from Documentation to selftests.
* Moves mic/mpssd, misc-devices/mei, timers, watchdog, auxdisplay,
and blackfin examples from Documentation to samples.
* Moves accounting, laptops/dslm, and pcmcia/crc32hash tools from
Documentation to tools.
* Deletes BUILD_DOCSRC and its dependencies"
* tag 'linux-kselftest-4.9-rc1-update' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest: (21 commits)
selftests/futex: Check ANSI terminal color support
Doc: update 00-INDEX files to reflect the runnable code move
samples: move blackfin gptimers-example from Documentation
tools: move pcmcia crc32hash tool from Documentation
tools: move laptops dslm tool from Documentation
tools: move accounting tool from Documentation
samples: move auxdisplay example code from Documentation
samples: move watchdog example code from Documentation
samples: move timers example code from Documentation
samples: move misc-devices/mei example code from Documentation
samples: move mic/mpssd example code from Documentation
selftests: Move networking/timestamping from Documentation
selftests: move watchdog tests from Documentation/watchdog
selftests: move ia64 tests from Documentation/ia64
selftests: move vDSO tests from Documentation/vDSO
selftests: move ptp tests from Documentation/ptp
selftests: move prctl tests from Documentation/prctl
selftests: move dnotify_test from Documentation/filesystems
selftests/timers: Add missing error code assignment before test
selftests/zram: replace ZRAM_LZ4_COMPRESS
...
Linus Torvalds [Fri, 14 Oct 2016 22:03:08 +0000 (15:03 -0700)]
Merge branch 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull misc kbuild changes from Michal Marek:
"Just a few patches on the kbuild.git#misc branch this time:
- New Coccinelle patch by Nicholas Mc Guire
- Existing patch fixes by Julia Lawall
- Minor comment fix by Markus Elfring"
* 'misc' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
Coccinelle: flag conditions with no effect
scripts/coccicheck: Update reference for the corresponding documentation
Coccinelle: pm_runtime: ensure relevance of pm_runtime reports
Coccinelle: limit memdup_user transformation to GFP_KERNEL case
Linus Torvalds [Fri, 14 Oct 2016 21:26:58 +0000 (14:26 -0700)]
Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull kbuild updates from Michal Marek:
- EXPORT_SYMBOL for asm source by Al Viro.
This does bring a regression, because genksyms no longer generates
checksums for these symbols (CONFIG_MODVERSIONS). Nick Piggin is
working on a patch to fix this.
Plus, we are talking about functions like strcpy(), which rarely
change prototypes.
- Fixes for PPC fallout of the above by Stephen Rothwell and Nick
Piggin
- fixdep speedup by Alexey Dobriyan.
- preparatory work by Nick Piggin to allow architectures to build with
-ffunction-sections, -fdata-sections and --gc-sections
- CONFIG_THIN_ARCHIVES support by Stephen Rothwell
- fix for filenames with colons in the initramfs source by me.
* 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild: (22 commits)
initramfs: Escape colons in depfile
ppc: there is no clear_pages to export
powerpc/64: whitelist unresolved modversions CRCs
kbuild: -ffunction-sections fix for archs with conflicting sections
kbuild: add arch specific post-link Makefile
kbuild: allow archs to select link dead code/data elimination
kbuild: allow architectures to use thin archives instead of ld -r
kbuild: Regenerate genksyms lexer
kbuild: genksyms fix for typeof handling
fixdep: faster CONFIG_ search
ia64: move exports to definitions
sparc32: debride memcpy.S a bit
[sparc] unify 32bit and 64bit string.h
sparc: move exports to definitions
ppc: move exports to definitions
arm: move exports to definitions
s390: move exports to definitions
m68k: move exports to definitions
alpha: move exports to actual definitions
x86: move exports to actual definitions
...
Linus Torvalds [Fri, 14 Oct 2016 21:11:22 +0000 (14:11 -0700)]
Merge tag 'docs-4.9-2' of git://git.lwn.net/linux
Pull one more documentation update from Jonathan Corbet:
"A single commit converting the mac80211 DocBook template over to
Sphinx. Only 32 more to go..."
* tag 'docs-4.9-2' of git://git.lwn.net/linux:
docs-rst: sphinxify 802.11 documentation
Linus Torvalds [Fri, 14 Oct 2016 20:43:08 +0000 (13:43 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma qedr RoCE driver from Doug Ledford:
"Early on in the merge window I mentioned I had a backlog of new
drivers waiting to be reviewed and that, in addition to the hns-roce
driver, I wanted to get possible a couple more reviewed. I ended up
only having the time to complete one of the additional drivers.
During Dave Miller's pull request this go around, there were a series
of 9 patches to the QLogic qed net driver that add basic support for a
paired RoCE driver. That support is currently not functional because
it is missing the matching RoCE driver in the RDMA subsystem. I
managed to finish that review. However, because it goes against part
of Dave's net pull, and a part that was accepted a day or two after
the merge window opened, to apply cleanly it has to be applied to
either the tip of Dave's net branch, or as I did in this case, I just
applied it to your master after you had taken Dave's pull request."
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma:
qedr: Add events support and register IB device
qedr: Add GSI support
qedr: Add LL2 RoCE interface
qedr: Add support for data path
qedr: Add support for memory registeration verbs
qedr: Add support for QP verbs
qedr: Add support for PD,PKEY and CQ verbs
qedr: Add support for user context verbs
qedr: Add support for RoCE HW init
qedr: Add RoCE driver framework
Linus Torvalds [Fri, 14 Oct 2016 20:35:05 +0000 (13:35 -0700)]
Merge tag 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull more rdma updates from Doug Ledford:
"This merge window was the first where Huawei had to try and coordinate
their patches between their net driver and their new roce driver
(similar to mlx4 and mlx5).
They didn't do horribly, but there were some issues (and we knew that
because they simply didn't know what to do in the beginning). As a
result, I had a set of patches that depended on some patches that
normally would have come to you via Dave's tree. Those patches have
been on netdev@ for a while, so I got Dave to give me his approval to
send them to you. As such, the other 29 patches I had behind them are
also now ready to go.
This catches the hns and hns-roce drivers up to current, and for
future patches we are working with them to get them up to speed on how
to do joint driver development so that they don't have these sorts of
cross tree dependency issues again. BTW, Dave gave me permission to
add his Acked-by: to the patches against the net tree, but I've had
this branch through 0day (but not linux-next since it was off by
itself) and I didn't want to rebase the series just to add Dave's ack
for the 8 patches in the net area.
Updates to the hns drivers:
- Small patch set for hns net driver that the roce patches depend on
- Various fixes to the hns-roce driver
- Add connection manager support to the hns-roce driver"
* tag 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (36 commits)
IB/hns: Fix for removal of redundant code
IB/hns: Delete the redundant lines in hns_roce_v1_m_qp()
IB/hns: Fix the bug when platform_get_resource() exec fail
IB/hns: Update the rq head when modify qp state
IB/hns: Cq has not been freed
IB/hns: Validate mtu when modified qp
IB/hns: Some items of qpc need to take user param
IB/hns: The Ack timeout need a lower limit value
IB/hns: Return bad wr while post send failed
IB/hns: Fix bug of memory leakage for registering user mr
IB/hns: Modify the init of iboe lock
IB/hns: Optimize code of aeq and ceq interrupt handle and fix the bug of qpn
IB/hns: Delete the sqp_start from the structure hns_roce_caps
IB/hns: Fix bug of clear hem
IB/hns: Remove unused parameter named qp_type
IB/hns: Simplify function of pd alloc and qp alloc
IB/hns: Fix bug of using uninit refcount and free
IB/hns: Remove parameters of resize cq
IB/hns: Remove unused parameters in some functions
IB/hns: Add node_guid definition to the bindings document
...
Linus Torvalds [Fri, 14 Oct 2016 20:19:30 +0000 (13:19 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull some more input subsystem updates from Dmitry Torokhov:
"An update to the ALPS driver to support the V8 protocol with
touchstick, a change for i8042 to skip selftest on many Asus laptops
which helps to keep their touchpads working after resume, and a couple
other driver fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: i8042 - skip selftest on ASUS laptops
Input: melfas_mip4 - add ic_name sysfs attribute
Input: melfas_mip4 - add maintainer information
Input: melfas_mip4 - add devicetree binding documentations
Input: elantech - add Fujitsu Lifebook E556 to force crc_enabled
Input: synaptics-rmi4 - fix error handling in I2C transport driver
Input: synaptics-rmi4 - fix error handling in SPI transport driver
Input: ALPS - add V8 protocol documentation
Input: ALPS - set DualPoint flag for 74 03 28 devices
Input: ALPS - allow touchsticks to report pressure
Input: ALPS - handle 0-pressure 1F events
Input: ALPS - add touchstick support for SS5 hardware
Input: elantech - force needed quirks on Fujitsu H760
Input: elantech - fix Lenovo version typo
Drivers:
- ac100: support clock-output-names
- cmos: properly handle ACPI alarms and quirky BIOSes and other fixes
- ds1307: fix century bit support while staying comaptible with
previous behaviour by default
- ds1347: switch to regmap
- isl12057 is now handled by ds1307
- omap: support external wakeup
- rv8803: allow to disable voltage drop detection"
* tag 'rtc-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (25 commits)
rtc: rv8803: set VDETOFF and SWOFF via device tree
dt/bindings: Add bindings for Micro Crystal rv8803
devicetree: Add Micro Crystal AG vendor id
rtc: cmos: avoid unused function warning
rtc: ac100: Add NULL checking for devm_kzalloc call
rtc: ds1347: changed raw spi calls to register map calls
rtc: cmos: Restore alarm after resume
rtc: cmos: Clear ACPI-driven alarms upon resume
rtc: omap: Support ext_wakeup configuration
rtc: cmos: Initialize hpet timer before irq is registered
rtc: asm9260: rework locking
rtc: asm9260: allow COMPILE_TEST
rtc: constify rtc_class_ops structures
rtc: ac100: support clock-output-names in device tree binding
rtc: rx6110: remove owner assignment
rtc: pic32: Delete owner assignment
rtc: bq32k: Fix handling of oscillator failure flag
rtc: bq32k: Use correct mask name for 'minutes' register.
rtc: sysfs: fix a cast removing the const attribute
Documentation: dt: Intersil isl12057 is not a trivial device
...
Linus Torvalds [Fri, 14 Oct 2016 20:09:41 +0000 (13:09 -0700)]
Merge branch 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull more i2c updates from Wolfram Sang:
"A small update pull request from I2C.
This adds one comment to a change we did in this merge window to
handle lockdep better, and pulls in a branch which should have been in
4.8 already improving DT support for I2C"
* 'i2c/for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
gpio: pca953x: add a comment explaining the need for a lockdep subclass
i2c: core: Add support for 'i2c-bus' subnode
dt-bindings: i2c: Add support for 'i2c-bus' subnode
Linus Torvalds [Fri, 14 Oct 2016 19:50:05 +0000 (12:50 -0700)]
Merge tag 'acpi-extra-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more ACPI updates from Rafael Wysocki:
"This includes a couple of fixes needed after recent changes, two ACPI
driver fixes (fan and "Processor Aggregator"), an update of the ACPI
device properties handling code and a new MAINTAINERS entry for ACPI
on ARM64.
Specifics:
- Fix an unused function warning that started to appear after recent
changes in the ACPI EC driver (Eric Biggers).
- Fix the KERN_CONT usage in acpi_os_vprintf() that has become
(particularly) annoying recently (Joe Perches).
- Fix the fan status checking in the ACPI fan driver to avoid
returning incorrect error codes sometimes (Srinivas Pandruvada).
- Fix the ACPI Processor Aggregator driver (PAD) to always let the
special processor_aggregator driver from Xen take over when running
as Xen dom0 (Juergen Gross).
- Update the handling of reference device properties in ACPI by
allowing empty rows ("holes") to appear in reference property lists
(Mika Westerberg).
- Add a new MAINTAINERS entry for ACPI on ARM64 (Lorenzo Pieralisi)"
* tag 'acpi-extra-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
acpi_os_vprintf: Use printk_get_level() to avoid unnecessary KERN_CONT
ACPI / PAD: don't register acpi_pad driver if running as Xen dom0
ACPI / property: Allow holes in reference properties
MAINTAINERS: Add ARM64-specific ACPI maintainers entry
ACPI / EC: Fix unused function warning when CONFIG_PM_SLEEP=n
ACPI / fan: Fix error reading cur_state
Linus Torvalds [Fri, 14 Oct 2016 19:46:13 +0000 (12:46 -0700)]
Merge tag 'pm-extra-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
"This includes a couple of fixes for cpufreq regressions introduced in
4.8, a rework of the intel_pstate algorithm used on Atom processors
(that took some time to test) plus a fix and a couple of cleanups in
that driver, a CPPC cpufreq driver fix, and a some devfreq fixes and
cleanups (core and exynos-nocp).
Specifics:
- Fix two cpufreq regressions causing undesirable changes in behavior
to appear (one in the core and one in the conservative governor)
introduced during the 4.8 cycle (Aaro Koskinen, Rafael Wysocki).
- Fix the way the intel_pstate driver accesses MSRs related to the
hardware-managed P-states (HWP) feature during the initialization
which currently is unsafe and may cause the processor to generate a
general protection fault (Srinivas Pandruvada).
- Rework the intel_pstate's P-state selection algorithm used on Atom
processors to avoid known problems with the current one and to make
the computation more straightforward, which also happens to improve
performance in multiple benchmarks a bit (Rafael Wysocki).
- Improve two comments in the intel_pstate driver (Rafael Wysocki).
- Fix the desired performance computation in the CPPC cpufreq driver
(Hoan Tran).
- Fix the devfreq core to avoid printing misleading error messages in
some cases (Tobias Jakobi).
- Fix the error code path in devfreq_add_device() to use proper
locking around list modifications (Axel Lin).
- Fix a build failure and remove a couple of redundant updates of
variables in the exynos-nocp devfreq driver (Axel Lin)"
* tag 'pm-extra-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: CPPC: Correct desired_perf calculation
cpufreq: conservative: Fix next frequency selection
cpufreq: skip invalid entries when searching the frequency
cpufreq: intel_pstate: Fix struct pstate_adjust_policy kerneldoc
cpufreq: intel_pstate: Proportional algorithm for Atom
PM / devfreq: Skip status update on uninitialized previous_freq
PM / devfreq: Add proper locking around list_del()
PM / devfreq: exynos-nocp: Remove redundant code
PM / devfreq: exynos-nocp: Select REGMAP_MMIO
cpufreq: intel_pstate: Clarify comment in get_target_pstate_use_performance()
cpufreq: intel_pstate: Fix unsafe HWP MSR access
Steve French [Fri, 14 Oct 2016 00:06:23 +0000 (19:06 -0500)]
CIFS: Retrieve uid and gid from special sid if enabled
New mount option "idsfromsid" indicates to cifs.ko that
it should try to retrieve the uid and gid owner fields
from special sids. This patch adds the code to parse the owner
sids in the ACL to see if they match, and if so populate the
uid and/or gid from them. This is faster than upcalling for
them and asking winbind, and is a fairly common case, and is
also helpful when cifs.upcall and idmapping is not configured.
Signed-off-by: Steve French <steve.french@primarydata.com> Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Steve French [Fri, 23 Sep 2016 06:36:34 +0000 (01:36 -0500)]
CIFS: Add new mount option to set owner uid and gid from special sids in acl
Add "idsfromsid" mount option to indicate to cifs.ko that it should
try to retrieve the uid and gid owner fields from special sids in the
ACL if present. This first patch just adds the parsing for the mount
option.
Signed-off-by: Steve French <steve.french@primarydata.com> Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Linus Torvalds [Fri, 14 Oct 2016 19:18:50 +0000 (12:18 -0700)]
Merge branch 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- tracepoints for basic cgroup management operations added
- kernfs and cgroup path formatting functions updated to behave in the
style of strlcpy()
- non-critical bug fixes
* 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
cpuset: fix error handling regression in proc_cpuset_show()
cgroup: add tracepoints for basic operations
cgroup: make cgroup_path() and friends behave in the style of strlcpy()
kernfs: remove kernfs_path_len()
kernfs: make kernfs_path*() behave in the style of strlcpy()
kernfs: add dummy implementation of kernfs_path_from_node()