Jeff Moyer [Thu, 13 Aug 2015 18:57:57 +0000 (14:57 -0400)]
block: bump BLK_DEF_MAX_SECTORS to 2560
A value of 2560 (1280k) will accommodate a 10-data-disk stripe
write with chunk size 128k. In the testing I've done using
iozone, fio, and aio-stress across a number of different storage
devices, a value of 1280 does not show a big performance
difference from 512, but will hopefully help software RAID
setups using SATA disks, as reported by Christoph.
NOTE: drivers/block/aoe/aoeblk.c sets its own max_hw_sectors_kb to
BLK_DEF_MAX_SECTORS. So, this patch essentially changes aoeblk to
use a larger maximum sector size, and I did not test this.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
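The sizing in this message can be checked with a little arithmetic. The sketch below assumes 512-byte sectors; SECTOR_SIZE and both helper names are made up for illustration and are not kernel symbols.

```c
#include <stdint.h>

/* Assumption: block-layer sector counts are in 512-byte units. */
#define SECTOR_SIZE 512u

/* Convert a sector count to KiB. */
static uint32_t sectors_to_kib(uint32_t sectors)
{
    return sectors * SECTOR_SIZE / 1024u;
}

/* Size in KiB of one full-stripe write across the data disks. */
static uint32_t full_stripe_kib(uint32_t data_disks, uint32_t chunk_kib)
{
    return data_disks * chunk_kib;
}
```

With these helpers, sectors_to_kib(2560) and full_stripe_kib(10, 128) both come to 1280, which is why 2560 sectors is enough for one full stripe in the configuration described.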
This reverts commit 34b48db66e08ca1c1bc07cf305d672ac940268dc.
That commit caused performance regressions for streaming I/O
workloads on a number of different storage devices, from
SATA disks to external RAID arrays. It also managed to
trip up some buggy firmware in at least one drive, causing
data corruption.
The next patch will bump the default max_sectors_kb value to
1280, which will accommodate a 10-data-disk stripe write
with chunk size 128k. In the testing I've done using iozone,
fio, and aio-stress, a value of 1280 does not show a big
performance difference from 512. This will hopefully still
help the software RAID setup that Christoph saw the original
performance gains with while still not regressing other
storage configurations.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Ming Lei [Sun, 9 Aug 2015 07:41:51 +0000 (03:41 -0400)]
blk-mq: fix race between timeout and freeing request
Inside the timeout handler, blk_mq_tag_to_rq() is called
to retrieve the request for a tag. This is obviously
wrong because the request can be freed at any time, so some
fields of the request can't be trusted and a kernel oops
might be triggered[1].
Currently, for blk_mq_tag_to_rq(), the only special case is
that the flush request can share the same tag with the request
it was cloned from, and the two requests can't be active at the same
time. This patch therefore fixes the above issue by updating tags->rqs[tag]
with the active request (either the flush rq or the request it was
cloned from) for the tag.
blk_mq_tag_to_rq() also becomes much simpler with this patch.
Since blk_mq_tag_to_rq() is mainly for drivers and the caller must
make sure the request can't be freed, bt_for_each() now uses
tags->rqs[tag] instead of this helper.
Ming Lei [Sun, 9 Aug 2015 07:41:50 +0000 (03:41 -0400)]
blk-mq: fix buffer overflow when reading sysfs file of 'pending'
There may be so many pending requests that a buffer of PAGE_SIZE
can't hold them all.
One typical example is scsi-mq: the queue depth (.can_queue) of
scsi_host and blk-mq is quite big, but scsi_device's queue_depth
(.cmd_per_lun) is rather small, so it is quite easy to have lots
of pending requests in the hw queue.
This patch fixes the following warning and the related memory
corruption.
Dongsu Park [Fri, 19 Dec 2014 13:53:03 +0000 (14:53 +0100)]
Documentation: update notes in biovecs about arbitrarily sized bios
Update block/biovecs.txt so that it includes a note on what kind of
effects arbitrarily sized bios would bring to the block layer.
Also fix a trivial typo, bio_iter_iovec.
Cc: Christoph Hellwig <hch@infradead.org> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Kent Overstreet [Tue, 19 May 2015 12:31:01 +0000 (14:31 +0200)]
block: remove bio_get_nr_vecs()
We can always fill up the bio now, no need to estimate the possible
size based on queue parameters.
Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[hch: rebased and wrote a changelog] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Kent Overstreet [Mon, 22 Dec 2014 11:48:42 +0000 (12:48 +0100)]
fs: use helper bio_add_page() instead of open coding on bi_io_vec
Call pre-defined helper bio_add_page() instead of open coding for
iterating through bi_io_vec[]. Doing that, it's possible to make some
parts in filesystems and mm/page_io.c simpler than before.
Acked-by: Dave Kleikamp <shaggy@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Kent Overstreet [Tue, 28 Apr 2015 06:48:34 +0000 (23:48 -0700)]
block: kill merge_bvec_fn() completely
As generic_make_request() is now able to handle arbitrarily sized bios,
it's no longer necessary for each individual block driver to define its
own ->merge_bvec_fn() callback. Remove every invocation completely.
Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@kernel.org> Cc: ceph-devel@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: also remove ->merge_bvec_fn() in dm-thin as well as
dm-era-target, and resolve merge conflicts] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Kent Overstreet [Wed, 25 Sep 2013 20:37:01 +0000 (13:37 -0700)]
md/raid5: get rid of bio_fits_rdev()
Remove bio_fits_rdev() as sufficient merge_bvec_fn() handling is now
performed by blk_queue_split() in md_make_request().
Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Ming Lin [Thu, 7 May 2015 05:51:24 +0000 (22:51 -0700)]
md/raid5: split bio for chunk_aligned_read
If a read request fits entirely in a chunk, it will be passed directly to the
underlying device (providing it hasn't failed of course). If it doesn't fit,
the slightly less efficient path that uses the stripe_cache is used.
Requests that get to the stripe cache are always completely split up as
necessary.
So with RAID5, ripping out the merge_bvec_fn doesn't cause it to stop working,
but could cause it to take the less efficient path more often.
All that is needed to manage this is for 'chunk_aligned_read' to do some bio
splitting, much like the RAID0 code does.
Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
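The chunk-boundary test described in this message comes down to a little sector arithmetic. The following is a minimal sketch, not the raid5 code itself: it assumes the chunk size in sectors is a power of two, and all three function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sectors from `sector` to the end of its chunk.
 * Assumption: chunk_sectors is a power of two. */
static uint32_t sectors_to_chunk_end(uint64_t sector, uint32_t chunk_sectors)
{
    return chunk_sectors - (uint32_t)(sector & (chunk_sectors - 1));
}

/* True if `len` sectors starting at `sector` stay within one chunk. */
static bool fits_in_chunk(uint64_t sector, uint32_t len, uint32_t chunk_sectors)
{
    return len <= sectors_to_chunk_end(sector, chunk_sectors);
}

/* Length of the first piece if the read is split at the chunk boundary. */
static uint32_t first_split_len(uint64_t sector, uint32_t len,
                                uint32_t chunk_sectors)
{
    uint32_t room = sectors_to_chunk_end(sector, chunk_sectors);

    return len < room ? len : room;
}
```

For a 128k chunk (256 sectors), a 100-sector read starting at sector 200 does not fit in its chunk; splitting it leaves a first piece of 56 sectors that can take the direct path.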
Ming Lin [Fri, 22 May 2015 07:46:56 +0000 (00:46 -0700)]
block: remove split code in blkdev_issue_{discard,write_same}
The split code in blkdev_issue_{discard,write_same} can go away
now that any driver that cares does the split. We have to make
sure bio size doesn't overflow.
For discard, we set max discard sectors to (1<<31)>>9 to ensure
it doesn't overflow bi_size and hopefully it is of the proper
granularity as long as the granularity is a power of two.
Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
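The discard limit above can be worked through numerically. This sketch assumes that bi_size is a 32-bit unsigned byte count and that sectors are 512 bytes; SECTOR_SHIFT_BITS and the helper names are local to the example.

```c
#include <stdint.h>

/* Assumption: 512-byte sectors, so bytes = sectors << 9. */
#define SECTOR_SHIFT_BITS 9

/* Largest discard in sectors, following the (1<<31)>>9 formula. */
static uint32_t max_discard_sectors(void)
{
    /* Use an unsigned constant: shifting a signed 1 by 31 is undefined. */
    return (UINT32_C(1) << 31) >> SECTOR_SHIFT_BITS;
}

/* Byte length of a discard of `sectors` sectors, computed in 64 bits. */
static uint64_t discard_bytes(uint32_t sectors)
{
    return (uint64_t)sectors << SECTOR_SHIFT_BITS;
}
```

The limit works out to 4194304 sectors, or 2 GiB, which still fits in a 32-bit byte count. Because it is itself a power of two, it is a multiple of any smaller power-of-two granularity.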
Kent Overstreet [Tue, 3 Dec 2013 02:30:25 +0000 (18:30 -0800)]
btrfs: remove bio splitting and merge_bvec_fn() calls
Btrfs has been doing bio splitting from btrfs_map_bio(), by checking
device limits as well as calling ->merge_bvec_fn() etc. That is not
necessary any more, because generic_make_request() is now able to
handle arbitrarily sized bios. So clean up unnecessary code paths.
Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
[dpark: add more description in commit message] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Kent Overstreet [Sun, 24 Nov 2013 07:11:25 +0000 (23:11 -0800)]
bcache: remove driver private bio splitting code
The bcache driver has always accepted arbitrarily large bios and split
them internally. Now that every driver must accept arbitrarily large
bios, this code isn't necessary anymore.
Cc: linux-bcache@vger.kernel.org Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: add more description in commit message] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Kent Overstreet [Sun, 24 Nov 2013 06:30:22 +0000 (22:30 -0800)]
block: simplify bio_add_page()
Since generic_make_request() can now handle arbitrary size bios, all we
have to do is make sure the bvec array doesn't overflow.
__bio_add_page() doesn't need to call ->merge_bvec_fn(), so
we can get rid of unnecessary code paths.
Removing the call to ->merge_bvec_fn() is also fine, as no driver that
implements support for BLOCK_PC commands even has a ->merge_bvec_fn()
method.
Cc: Christoph Hellwig <hch@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: rebase and resolve merge conflicts, change a couple of comments,
make bio_add_page() warn once upon a cloned bio.] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Kent Overstreet [Fri, 24 Apr 2015 05:37:18 +0000 (22:37 -0700)]
block: make generic_make_request handle arbitrarily sized bios
The way the block layer is currently written, it goes to great lengths
to avoid having to split bios; upper layer code (such as bio_add_page())
checks what the underlying device can handle and tries to always create
bios that don't need to be split.
But this approach becomes unwieldy and eventually breaks down with
stacked devices and devices with dynamic limits, and it adds a lot of
complexity. If the block layer could split bios as needed, we could
eliminate a lot of complexity elsewhere - particularly in stacked
drivers. Code that creates bios can then create whatever size bios are
convenient, and more importantly stacked drivers don't have to deal with
both their own bio size limitations and the limitations of the
(potentially multiple) devices underneath them. In the future this will
let us delete merge_bvec_fn and a bunch of other code.
We do this by adding calls to blk_queue_split() to the various
make_request functions that need it - a few can already handle arbitrary
size bios. Note that we add the call _after_ any call to
blk_queue_bounce(); this means that blk_queue_split() and
blk_recalc_rq_segments() don't need to be concerned with bouncing
affecting segment merging.
Some make_request_fn() callbacks were simple enough to audit and verify
they don't need blk_queue_split() calls. The skipped ones are:
Some others are almost certainly safe to remove now, but will be left
for future patches.
Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ming Lei <ming.lei@canonical.com> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Geoff Levand <geoff@infradead.org> Cc: Jim Paris <jim@jtan.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Andreas Dilger <andreas.dilger@intel.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md/md.c' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
[dpark: skip more mq-based drivers, resolve merge conflicts, etc.] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Sasha Levin [Mon, 10 Aug 2015 23:05:18 +0000 (19:05 -0400)]
block: don't access bio->bi_error after bio_put()
Commit 4246a0b6 ("block: add a bi_error field to struct bio") has added a few
dereferences of 'bio' after a call to bio_put(). This causes use-after-frees
such as:
This patch fixes a few of those places that I caught while auditing the patch, but the
original patch should be audited further for more occurrences of this issue since I'm
not too familiar with the code.
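The pattern behind this class of bug, and the usual fix, can be sketched in userspace C. struct fake_bio and both functions below are hypothetical stand-ins, not the kernel's struct bio or bio_put(); the point is only the ordering of the read and the put.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted bio. */
struct fake_bio {
    int refcount;
    int error;
};

/* Drop one reference; free the object when the count reaches zero. */
static void fake_bio_put(struct fake_bio *bio)
{
    if (--bio->refcount == 0)
        free(bio);
}

/* Copy the error out first, then drop the reference. */
static int finish_io(struct fake_bio *bio)
{
    int err = bio->error;   /* read while we still hold a reference */

    fake_bio_put(bio);      /* bio may be freed here; do not touch it again */
    return err;
}
```

Reading bio->error after fake_bio_put() instead would dereference memory that may already have been freed.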
block: shrink struct bio down to 2 cache lines again
Commit bcf2843b3f8f added ->bi_error to cleanup the error passing
for struct bio, but that ended up adding 4 bytes and a 4 byte hole
to the size of struct bio. For a clean config, that bumped it from
128 bytes, to 136 bytes, on x86-64.
The ->bi_flags member is currently an unsigned long, but it fits
easily within an int. Change it to an unsigned int, adjust the
pool offset code, and move ->bi_error into the new hole. Then
we end up with a 128 byte bio again.
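The effect of narrowing the flags word can be seen on a much smaller structure. The two structs below are hypothetical and are not the layout of struct bio; on an LP64 target such as x86-64, the first carries 4 bytes of trailing padding after the int, while the second packs the same fields without a hole.

```c
#include <stddef.h>

/* Hypothetical layout: a long-sized flags word followed by an int. */
struct wide_flags {
    void *next;
    unsigned long flags;
    int error;          /* on LP64, followed by 4 bytes of padding */
};

/* Same fields with flags narrowed, so error fills the freed space. */
struct narrow_flags {
    void *next;
    unsigned int flags;
    int error;
};
```

Comparing sizeof(struct wide_flags) with sizeof(struct narrow_flags) on the target of interest shows how many bytes the narrower type saves.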
Change the bio flag set/clear to use cmpxchg to ensure we don't
lose any flags when manipulating them.
Some places use helpers now, others don't. We only have the 'is set'
helper, add helpers for setting and clearing flags too.
It was a bit of a mess of atomic vs non-atomic access. With
BIO_UPTODATE gone, we don't have any risk of concurrent access to the
flags. So relax the restriction and don't make any of them atomic. The
flags that do have serialization issues (reffed and chained), we
already handle those separately.
Currently we have two different ways to signal an I/O error on a BIO:
(1) by clearing the BIO_UPTODATE flag
(2) by returning a Linux errno value to the bi_end_io callback
The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not being persistent
when bios are queued up and of not being passed along from child to parent
bio in the ever more popular chaining scenario. Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.
So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
block: make /sys/block/<dev>/queue/discard_max_bytes writeable
Lots of devices support huge discard sizes these days. Depending
on how the device handles them internally, huge discards can
introduce massive latencies (hundreds of msec) on the device side.
We have a sysfs file, discard_max_bytes, that advertises the max
hardware supported discard size. Make this writeable, and split
the settings into a soft and hard limit. This can be set from
'discard_granularity' and up to the hardware limit.
Add a new sysfs file, 'discard_max_hw_bytes', that shows the hw
set limit.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
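The permitted range for the soft limit can be written as a simple check. This is a sketch of the range stated above only; the function name is hypothetical, and what the kernel does with an out-of-range write is not something this example claims to model.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a requested soft limit lies within [granularity, hw_max]. */
static bool discard_limit_in_range(uint64_t requested, uint64_t granularity,
                                   uint64_t hw_max)
{
    return requested >= granularity && requested <= hw_max;
}
```

For a device with a 4096-byte discard granularity and a 2 GiB hardware limit, a 1 MiB soft limit is inside the range, while 512 bytes or 4 GiB is not.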
block: have drivers use blk_queue_max_discard_sectors()
Some drivers use it now, others just set the limits field manually.
But in preparation for splitting this into a hard and soft limit,
ensure that they all call the proper function for setting the hw
limit for discards.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Merge tag 'pm+acpi-4.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI fixes from Rafael Wysocki:
"These fix two bugs in the cpufreq core (including one recent
regression), fix a 4.0 PCI regression related to the ACPI resources
management and quieten an RCU-related lockdep complaint about a
tracepoint in the suspend-to-idle code.
Specifics:
- Fix a recently introduced issue in the cpufreq policy object
reinitialization that leads to CPU offline/online breakage (Viresh
Kumar)
- Make it possible to access frequency tables of offline CPUs which
is needed by thermal management code among other things (Viresh
Kumar)
- Fix an ACPI resource management regression introduced during the
4.0 cycle that may cause incorrect resource validation results to
appear in 32-bit x86 kernels due to silent truncation of 64-bit
values to 32-bit (Jiang Liu)
- Fix up an RCU-related lockdep complaint about suspicious RCU usage
in idle caused by using a suspend tracepoint in the core suspend-
to-idle code (Rafael J Wysocki)"
* tag 'pm+acpi-4.2-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI / PCI: Fix regressions caused by resource_size_t overflow with 32-bit kernel
cpufreq: Allow freq_table to be obtained for offline CPUs
cpufreq: Initialize the governor again while restoring policy
suspend-to-idle: Prevent RCU from complaining about tick_freeze()
Merge tag 'platform-drivers-x86-v4.2-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
Pull x86 platform driver fixes from Darren Hart:
"Fix SMBIOS call handling and hwswitch state coherency in the
dell-laptop driver. Cleanups for intel_*_ipc drivers. Details:
dell-laptop:
- Do not cache hwswitch state
- Check return value of each SMBIOS call
- Clear buffer before each SMBIOS call
intel_scu_ipc:
- Move local memory initialization out of a mutex
* tag 'platform-drivers-x86-v4.2-3' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86:
intel_scu_ipc: move local memory initialization out of a mutex
intel_pmc_ipc: Update kerneldoc formatting
dell-laptop: Do not cache hwswitch state
dell-laptop: Check return value of each SMBIOS call
dell-laptop: Clear buffer before each SMBIOS call
intel_pmc_ipc: Fix compiler casting warnings
Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
Pull m68knommu/coldfire fixes from Greg Ungerer:
"Contains build fixes and updates for the ColdFire defconfigs.
Specifically there is a couple of fixes that address problems building
allnoconfig. Also fix for enabling PCI bus on the M54xx family of
ColdFire"
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
m68k: enable PCI support for m5475evb defconfig
m68k: fix io functions for ColdFire/MMU/PCI case
m68knommu: update defconfig for ColdFire m5475evb
m68knommu: update defconfig for ColdFire m5407c3
m68knommu: update defconfig for ColdFire m5307c3
m68knommu: update defconfig for ColdFire m5275evb
m68knommu: update defconfig for ColdFire m5272c3
m68knommu: update defconfig for ColdFire m5249evb
m68knommu: update defconfig for m5208evb
m68knommu: make ColdFire SoC selection a choice
m68knommu: improve the clock configuration defaults
m68knommu: force setting of CONFIG_CLOCK_FREQ for ColdFire
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"A collection of fixes from the last few weeks that should go into the
current series. This contains:
- Various fixes for the per-blkcg policy data, fixing regressions
since 4.1. From Arianna and Tejun
- Code cleanup for bcache closure macros from me. Really just
flushing this out, it's been sitting in another branch for months
- FIELD_SIZEOF cleanup from Maninder Singh
- bio integrity oops fix from Mike
- Timeout regression fix for blk-mq from Ming Lei"
* 'for-linus' of git://git.kernel.dk/linux-block:
blk-mq: set default timeout as 30 seconds
NVMe: Reread partitions on metadata formats
bcache: don't embed 'return' statements in closure macros
blkcg: fix blkcg_policy_data allocation bug
blkcg: implement all_blkcgs list
blkcg: blkcg_css_alloc() should grab blkcg_pol_mutex while iterating blkcg_policy[]
blkcg: allow blkcg_pol_mutex to be grabbed from cgroup [file] methods
block/blk-cgroup.c: free per-blkcg data when freeing the blkcg
block: use FIELD_SIZEOF to calculate size of a field
bio integrity: do not assume bio_integrity_pool exists if bioset exists
Merge tag 'jfs-4.2' of git://github.com/kleikamp/linux-shaggy
Pull jfs fixes from David Kleikamp:
"A couple trivial fixes and an error path fix"
* tag 'jfs-4.2' of git://github.com/kleikamp/linux-shaggy:
jfs: clean up jfs_rename and fix out of order unlock
jfs: fix indentation on if statement
jfs: removed a prohibited space after opening parenthesis
Ming Lei [Thu, 16 Jul 2015 11:53:22 +0000 (19:53 +0800)]
blk-mq: set default timeout as 30 seconds
It is reasonable to set the default request timeout to 30 seconds instead of
30000 ticks, which is 300 seconds if HZ is 100; some arm64
based systems, for example, may choose HZ=100.
Signed-off-by: Ming Lei <ming.lei@canonical.com> Fixes: c76cbbcf4044 ("blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()") Signed-off-by: Jens Axboe <axboe@fb.com>
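The difference between a fixed tick count and a timeout expressed in seconds is easy to check. The helpers below are illustrative only and take the tick rate as a parameter, where kernel code would use the HZ constant.

```c
/* Seconds that elapse over `ticks` timer ticks at `hz` ticks per second. */
static unsigned int ticks_to_seconds(unsigned int ticks, unsigned int hz)
{
    return ticks / hz;
}

/* Tick count for a timeout given in seconds (the "seconds * HZ" idiom). */
static unsigned int seconds_to_ticks(unsigned int seconds, unsigned int hz)
{
    return seconds * hz;
}
```

A constant 30000 ticks is 30 seconds at 1000 ticks per second but 300 seconds at 100, whereas a value built with seconds_to_ticks(30, hz) is 30 seconds at either rate.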
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull TPM bugfixes from James Morris.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
tpm, tpm_crb: fail when TPM2 ACPI table contents look corrupted
tpm: Fix initialization of the cdev
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
Pull rdma fixes from Doug Ledford:
"Mainly fix-ups for the various 4.2 items"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (24 commits)
IB/core: Destroy ocrdma_dev_id IDR on module exit
IB/core: Destroy multcast_idr on module exit
IB/mlx4: Optimize do_slave_init
IB/mlx4: Fix memory leak in do_slave_init
IB/mlx4: Optimize freeing of items on error unwind
IB/mlx4: Fix use of flow-counters for process_mad
IB/ipath: Convert use of __constant_<foo> to <foo>
IB/ipoib: Set MTU to max allowed by mode when mode changes
IB/ipoib: Scatter-Gather support in connected mode
IB/ucm: Fix bitmap wrap when devnum > IB_UCM_MAX_DEVICES
IB/ipoib: Prevent lockdep warning in __ipoib_ib_dev_flush
IB/ucma: Fix lockdep warning in ucma_lock_files
rds: rds_ib_device.refcount overflow
RDMA/nes: Fix for incorrect recording of the MAC address
RDMA/nes: Fix for resolving the neigh
RDMA/core: Fixes for port mapper client registration
IB/IPoIB: Fix bad error flow in ipoib_add_port()
IB/mlx4: Do not attemp to report HCA clock offset on VFs
IB/cm: Do not queue work to a device that's going away
IB/srp: Avoid using uninitialized variable
...
Keith Busch [Tue, 14 Jul 2015 17:57:48 +0000 (11:57 -0600)]
NVMe: Reread partitions on metadata formats
This patch has the driver automatically reread partitions if a namespace
has a separate metadata format. Previously revalidating a disk was
sufficient to get the correct capacity set on such formatted drives,
but partitions that may exist would not have been surfaced.
Reported-by: Paul Grabinar <paul.grabinar@ranbarg.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Tested-by: Paul Grabinar <paul.grabinar@ranbarg.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Merge tag 'locks-v4.2-1' of git://git.samba.org/jlayton/linux
Pull file locking updates from Jeff Layton:
"I had thought that I was going to get away without a pull request this
cycle. There was a NFSv4 file locking problem that cropped up that I
tried to fix in the NFSv4 code alone, but that fix has turned out to
be problematic. These patches fix this in the correct way.
Note that this touches some NFSv4 code as well. Ordinarily I'd wait
for Trond to ACK this, but he's on holiday right now and the bug is
rather nasty. So I suggest we merge this and if he raises issues with
it we can sort it out when he gets back"
Acked-by: Bruce Fields <bfields@fieldses.org> Acked-by: Dan Williams <dan.j.williams@intel.com>
[ +1 to this series fixing a 100% reproducible slab corruption +
general protection fault in my nfs-root test environment. - Dan ] Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* tag 'locks-v4.2-1' of git://git.samba.org/jlayton/linux:
locks: inline posix_lock_file_wait and flock_lock_file_wait
nfs4: have do_vfs_lock take an inode pointer
locks: new helpers - flock_lock_inode_wait and posix_lock_inode_wait
locks: have flock_lock_file take an inode pointer instead of a filp
Revert "nfs: take extra reference to fl->fl_file when running a LOCKU operation"
Merge tag 'arc-v4.2-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc
Pull ARC fixes from Vineet Gupta:
- Makefile changes (top-level+ARC) reinstates -O3 builds (regression
since 3.16)
- IDU intc related fixes, IRQ affinity
- patch to make bitops safer for ARC
- perf fix from Alexey to remove signed PC braino
- Futex backend gets llock/scond support
* tag 'arc-v4.2-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc:
ARCv2: support HS38 releases
ARC: make sure instruction_pointer() returns unsigned value
ARC: slightly refactor macros for boot logging
ARC: Add llock/scond to futex backend
arc:irqchip: prepare for drivers/irqchip/irqchip.h removal
ARC: Make ARC bitops "safer" (add anti-optimization)
ARCv2: [axs103] bump CPU frequency from 75 to 90 MHZ
ARCv2: intc: IDU: Fix potential race in installing a chained IRQ handler
ARCv2: intc: IDU: support irq affinity
ARC: fix unused var wanring
ARC: Don't memzero twice in dma_alloc_coherent for __GFP_ZERO
ARC: Override toplevel default -O2 with -O3
kbuild: Allow arch Makefiles to override {cpp,ld,c}flags
ARCv2: guard SLC DMA ops with spinlock
ARC: Kconfig: better way to disable ARC_HAS_LLSC for ARC_CPU_750D
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Martin Schwidefsky:
"One improvement for the zcrypt driver, the quality attribute for the
hwrng device has been missing. Without it the kernel entropy seeding
will not happen automatically.
And six bug fixes, the most important one is the fix for the vector
register corruption due to machine checks"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/nmi: fix vector register corruption
s390/process: fix sfpc inline assembly
s390/dasd: fix kernel panic when alias is set offline
s390/sclp: clear upper register halves in _sclp_print_early
s390/oprofile: fix compile error
s390/sclp: fix compile error
s390/zcrypt: enable s390 hwrng to seed kernel entropy
Dave Kleikamp [Wed, 15 Jul 2015 17:52:47 +0000 (12:52 -0500)]
jfs: clean up jfs_rename and fix out of order unlock
The end of jfs_rename(), which is also used by the error paths,
included a call to IWRITE_UNLOCK(new_ip) after labels out1, out2
and out3. If we come in through these labels, IWRITE_LOCK() has not
been called yet.
In moving that call to the correct spot, I also moved some
exceptional truncate code earlier as well, since the early error
paths don't need to deal with it, and I renamed out4: to out_tx: so
a future patch by Jan Kara doesn't need to deal with renumbering or
confusing out-of-order labels.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Merge tag 'module-final-v4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
Pull final init.h/module.h code relocation from Paul Gortmaker:
"With the release of 4.2-rc2 done, we should not be seeing any new code
added that gets upset by this small code move, and we've banked yet
another complete week of testing with this move in place on top of
4.2-rc1 via linux-next to ensure that remained true.
Given that, I'd like to put it in now so that people formulating new
work for 4.3-rc1 will be exposed to the ever so slightly stricter (but
sensible) requirements wrt. whether they are needing init.h vs.
module.h macros, even if they are not using linux-next.
The diffstat of the move is slightly asymmetrical due to needing to
leave behind a couple #ifdef in the old location and add the same ones
to the new location, but other than that, it is a 1:1 move, complete
with the module_init/exit trailing semicolon that we can't fix. That
is, until/unless someone does a tree-wide sed fix of all the
approximately 800 currently in tree users relying on it"
* tag 'module-final-v4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
module: relocate module_init from init.h to module.h
Merge tag 'trace-v4.2-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fengguang Wu discovered a crash that happened to be because of the
branch tracer (traces unlikely and likely branches) when enabled with
certain debug options.
What happened was that various debug options like lockdep and
DEBUG_PREEMPT can cause parts of the branch tracer to recurse outside
its recursion protection. In fact, part of its recursion protection
used these features that caused the lockup. This cleans up the code a
little and makes the recursion protection a bit more robust"
* tag 'trace-v4.2-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have branch tracer use recursive field of task struct
Destroy ocrdma_dev_id IDR on module exit, reclaiming the allocated memory.
This was detected by the following semantic patch (written by Luis Rodriguez
<mcgrof@suse.com>)
<SmPL>
@ defines_module_init @
declarer name module_init, module_exit;
declarer name DEFINE_IDR;
identifier init;
@@
Destroy multcast_idr on module exit, reclaiming the allocated memory.
This was detected by the following semantic patch (written by Luis Rodriguez
<mcgrof@suse.com>)
<SmPL>
@ defines_module_init @
declarer name module_init, module_exit;
declarer name DEFINE_IDR;
identifier init;
@@
There is little chance our memory allocation will fail, so we can
combine initializing the work structs with allocating them instead of
looping through all of them once to allocate and again to initialize.
Then when we need to actually find out if our device is up or in the
process of going down, have all of our work structs batched up, take the
spin_lock once and only once, and do all of the batch under the one
spin_lock invocation instead of incurring all of the locked memory cycles
we would otherwise incur to take/release the spin_lock over and over
again.
We create a number of work structs to be queued up to a workqueue, and
on completion of the workqueue handler, the workqueue handler frees the
allocated memory. If, however, we don't queue the work struct because
the device is going down, then we need to free the memory ourselves.
IB/mlx4: Optimize freeing of items on error unwind
On failure, we loop through all possible pointers and test them before
calling kfree. But really, why even attempt to free items we didn't
allocate when we can easily loop through exactly and only the devices
for which the original memory allocation succeeded and free just those.
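The unwind described here is a common C idiom: on failure, walk back over only the indices that were actually allocated. A minimal userspace sketch follows; it is not the mlx4 code, and the fail_at parameter is a hypothetical hook added so the failure path can be exercised.

```c
#include <stdlib.h>

/*
 * Allocate `n` buffers of `size` bytes into `items`.
 * `fail_at` is a hypothetical test hook: allocation number `fail_at`
 * is forced to fail (pass a negative value to disable it).
 * Returns 0 on success, -1 on failure with nothing left allocated.
 */
static int alloc_all(void **items, int n, size_t size, int fail_at)
{
    int i;

    for (i = 0; i < n; i++) {
        items[i] = (i == fail_at) ? NULL : malloc(size);
        if (!items[i])
            goto unwind;
    }
    return 0;

unwind:
    /* Only indices below i were successfully allocated. */
    while (--i >= 0) {
        free(items[i]);
        items[i] = NULL;
    }
    return -1;
}
```

Because the loop counter already records how far allocation got, the error path never has to test slots that were never filled.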
Or Gerlitz [Thu, 25 Jun 2015 14:45:38 +0000 (17:45 +0300)]
IB/mlx4: Fix use of flow-counters for process_mad
For IB links, when mlx4_ib_process_mad() is invoked, reading HCA flow
counters through iboe_process_mad() should be used only for VF PMA
queries and for nothing else.
Fixes: 7193a141eb74 ('IB/mlx4: Set VF to read from QP counters') Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Vaishali Thakkar [Tue, 16 Jun 2015 11:43:05 +0000 (17:13 +0530)]
IB/ipath: Convert use of __constant_<foo> to <foo>
In little endian cases, the macros be16_to_cpu and cpu_to_be64
unfold to __swab{16,64}, which provide a special case for constants.
In big endian cases, __constant_be16_to_cpu and be16_to_cpu
expand directly to the same expression. The same applies for
__constant_cpu_to_be64 and cpu_to_be64.
So, replace __constant_be16_to_cpu with be16_to_cpu and
__constant_cpu_to_be64 with cpu_to_be64, with the goal of getting
rid of the definition of __constant_be16_to_cpu and
__constant_cpu_to_be64 completely.
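As a rough illustration of why a separate __constant_ variant is redundant, here is a userspace sketch of a swap macro that special-cases constants itself. CONST_SWAB16, runtime_swab16 and SWAB16 are made-up names, and the sketch assumes a compiler that provides __builtin_constant_p() (GCC and Clang do); it is not the kernel's definition.

```c
#include <stdint.h>

/* Compile-time form: a pure expression the compiler can fold. */
#define CONST_SWAB16(x) \
    ((uint16_t)((((uint16_t)(x) & 0x00ffu) << 8) | \
                (((uint16_t)(x) & 0xff00u) >> 8)))

/* Run-time form; a real kernel could use an arch-optimized routine here. */
uint16_t runtime_swab16(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

/* One macro for both cases, selected by __builtin_constant_p(). */
#define SWAB16(x) \
    (__builtin_constant_p(x) ? CONST_SWAB16(x) : runtime_swab16(x))
```

With a macro of this shape, callers get constant folding for literals without having to pick a different name.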
IB/ipoib: Scatter-Gather support in connected mode
By default, IPoIB-CM driver uses 64k MTU. Larger MTU gives better
performance.
This MTU plus overhead puts the memory allocation for IP based packets at
32 4k pages (order 5), which have to be contiguous.
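The arithmetic can be checked with a small sketch. order_for() and PAGE_SIZE_4K are hypothetical names, and the 64 bytes of overhead used in the usage note is only an assumed figure; the point is that anything above 64k rounds up to the next power-of-two page count.

```c
#include <stddef.h>

#define PAGE_SIZE_4K 4096u

/* Smallest order such that (PAGE_SIZE_4K << order) >= bytes. */
unsigned int order_for(size_t bytes)
{
    unsigned int order = 0;

    while (((size_t)PAGE_SIZE_4K << order) < bytes)
        order++;
    return order;
}
```

For example, order_for(64 * 1024) gives order 4 (16 pages), while order_for(64 * 1024 + 64) gives order 5, i.e. 32 pages or 128k of contiguous memory.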
When system memory is under pressure, it was observed that allocating 128k
of contiguous physical memory is difficult and causes serious errors (such as
the system becoming unusable).
This enhancement resolves the issue by removing the physically contiguous
memory requirement, using the Scatter/Gather feature that exists in the Linux stack.
With this fix, Scatter-Gather is also supported in connected mode.
This change reverts some of the change made in commit e112373fd6aa
("IPoIB/cm: Reduce connected mode TX object size").
The ability to use SG in IPoIB CM is possible because the coupling
between NETIF_F_SG and NETIF_F_CSUM was removed in commit ec5f06156423 ("net: Kill link between CSUM and SG features.")
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> Acked-by: Christian Marie <christian@ponies.io> Signed-off-by: Doug Ledford <dledford@redhat.com>
Haggai Eran [Tue, 7 Jul 2015 14:45:13 +0000 (17:45 +0300)]
IB/ipoib: Prevent lockdep warning in __ipoib_ib_dev_flush
__ipoib_ib_dev_flush calls itself recursively on child devices, and lockdep
complains about locking vlan_rwsem twice (see below). Use down_read_nested
instead of down_read to prevent the warning.
=============================================
[ INFO: possible recursive locking detected ]
4.1.0-rc4+ #36 Tainted: G O
---------------------------------------------
kworker/u20:2/261 is trying to acquire lock:
(&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
but task is already holding lock:
(&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
other info that might help us debug this:
Possible unsafe locking scenario:
Haggai Eran [Tue, 7 Jul 2015 14:45:12 +0000 (17:45 +0300)]
IB/ucma: Fix lockdep warning in ucma_lock_files
The ucma_lock_files() locks the mut mutex on two files, e.g. for migrating
an ID. Use mutex_lock_nested() to prevent the warning below.
=============================================
[ INFO: possible recursive locking detected ]
4.1.0-rc6-hmm+ #40 Tainted: G O
---------------------------------------------
pingpong_rpc_se/10260 is trying to acquire lock:
(&file->mut){+.+.+.}, at: [<ffffffffa047ac55>] ucma_migrate_id+0xc5/0x248 [rdma_ucm]
but task is already holding lock:
(&file->mut){+.+.+.}, at: [<ffffffffa047ac4b>] ucma_migrate_id+0xbb/0x248 [rdma_ucm]
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&file->mut);
lock(&file->mut);
*** DEADLOCK ***
May be due to missing lock nesting notation
1 lock held by pingpong_rpc_se/10260:
#0: (&file->mut){+.+.+.}, at: [<ffffffffa047ac4b>] ucma_migrate_id+0xbb/0x248 [rdma_ucm]
Wengang Wang [Mon, 6 Jul 2015 06:35:11 +0000 (14:35 +0800)]
rds: rds_ib_device.refcount overflow
Fixes: 3e0249f9c05c ("RDS/IB: add refcount tracking to struct rds_ib_device")
A drop of rds_ib_device.refcount is missing in the case where rds_ib_alloc_fmr
fails (the mr pool running out). This leads to the refcount overflowing.
A complaint at line 117 (see following) is seen. From the vmcore:
s_ib_rdma_mr_pool_depleted is 2147485544 and rds_ibdev->refcount is -2147475448.
That is evidence that the mr pool is used up, so rds_ib_alloc_fmr is very likely
to return ERR_PTR(-EAGAIN).
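The shape of the leak can be sketched in userspace as follows. struct dev, struct mr, dev_get(), dev_put() and alloc_mr() are hypothetical stand-ins, not the RDS code; pool_empty models the mr pool running out.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the device and the memory region. */
struct dev { int refcount; };
struct mr { int id; };

void dev_get(struct dev *d) { d->refcount++; }
void dev_put(struct dev *d) { d->refcount--; }

/*
 * Take a device reference, then try to allocate. When the pool is
 * empty, the reference taken above must be dropped before returning.
 */
struct mr *alloc_mr(struct dev *d, int pool_empty, struct mr *slot)
{
    dev_get(d);
    if (pool_empty) {
        dev_put(d);     /* the drop that was missing on the error path */
        return NULL;
    }
    return slot;
}
```

Without the dev_put() on the failure branch, every failed allocation would bump the count by one, which is how a counter can eventually wrap when the pool stays depleted.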
RDMA/core: Fixes for port mapper client registration
Fixes to allow clients to make remove mapping requests after
they have provided the user space service with the mapping
information they are using when the service is restarted.
1) Adding IWPM_REG_VALID, IWPM_REG_INCOMPL and IWPM_REG_UNDEF
registration types for the port mapper clients and functions
to set/check the registration type.
2) If the port mapper user space service is not available to register
the client, then its registration stays IWPM_REG_UNDEF and the
registration isn't checked until the service becomes available
(no mappings are possible, if the user space service isn't running).
3) After the service is restarted, the user space port mapper pid is set
to valid and the client registration is set to IWPM_REG_INCOMPL
to allow the client to make remove mapping requests.
Signed-off-by: Tatyana Nikolova <Tatyana.E.Nikolova@intel.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Tested-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Amir Vadai [Wed, 1 Jul 2015 11:31:01 +0000 (14:31 +0300)]
IB/IPoIB: Fix bad error flow in ipoib_add_port()
Error values of ib_query_port() and ib_query_device() weren't propagated
correctly. Because of that, ipoib_add_port() could return NULL value,
which escaped the IS_ERR() check in ipoib_add_one() and we crashed.
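The following userspace sketch re-creates the error-pointer idea to show why a NULL return slips past an IS_ERR()-style check. err_ptr(), is_err(), add_port_buggy() and add_port_fixed() are illustrative names, and modelling the bug as an error pointer built from 0 is an assumption about one way such a NULL can arise.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_ERRNO 4095

/* Userspace re-creation of the error-pointer helpers (illustrative). */
void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

int is_err(const void *ptr)
{
    /* Only the top MAX_ERRNO addresses count as encoded errors. */
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical constructor that loses the helper's error code. */
void *add_port_buggy(int rc)
{
    (void)rc;
    return err_ptr(0);      /* an error pointer of 0 is just NULL */
}

/* Hypothetical constructor that propagates the error code. */
void *add_port_fixed(int rc)
{
    return err_ptr(rc);
}
```

A caller that only tests is_err() treats the NULL from the buggy variant as a valid object and dereferences it.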
Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
IB/mlx4: Do not attempt to report HCA clock offset on VFs
mlx4 VFs can provide CQE raw time-stamping services, but they
don't have the hca core clock mapped to their PCI bars.
As such, we should not attempt to query and report the clock offset
to user space for VFs. Doing so causes query_device over VFs to fail
with -ENOTSUPP.
Fixes: 4b664c4355b2 ('IB/mlx4: Add support for CQ time-stamping') Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Erez Shitrit [Thu, 25 Jun 2015 14:13:22 +0000 (17:13 +0300)]
IB/cm: Do not queue work to a device that's going away
Whenever ib_cm gets a remove_one call, such as on a hot-unplug
event, the driver should mark itself as going_down and make sure that no
new work is queued for that device.
So the order of the actions is:
1. mark the going_down bit.
2. flush the wq.
3. [make sure no new work is queued for that device.]
4. unregister the mad agent.
Otherwise, work that is already queued can be scheduled after the mad
agent was freed.
Vaishali Thakkar [Wed, 24 Jun 2015 04:42:13 +0000 (10:12 +0530)]
IB/srpt: Convert use of __constant_cpu_to_beXX to cpu_to_beXX
In little endian cases, the macro cpu_to_be{16,32,64} unfolds to
__swab{16,32,64}, which provides a special case for constants. In
big endian cases, __constant_cpu_to_be{16,32,64} and
cpu_to_be{16,32,64} expand directly to the same expression. So,
replace __constant_cpu_to_be{16,32,64} with cpu_to_be{16,32,64}
with the goal of getting rid of the definitions of
__constant_cpu_to_be{16,32,64} completely.
The Coccinelle semantic patch that performs this transformation
is as follows:
Ira Weiny [Thu, 25 Jun 2015 13:52:50 +0000 (09:52 -0400)]
IB/mad: Remove improper use of BUG_ON
We recently added BUG_ON's which were inappropriate for a condition which
should never happen. Change these to be WARN_ON_ONCE as a debugging aid.
Fixes: 4cd7c9479aff ('IB/mad: Add support for additional MAD info to/from drivers') Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Hal Rosenstock [Mon, 29 Jun 2015 13:57:00 +0000 (09:57 -0400)]
IB: Add rdma_cap_ib_switch helper and use where appropriate
Pursuant to Liran's comments on node_type on linux-rdma
mailing list:
In an effort to reform the RDMA core and ULPs to minimize use of
node_type in struct ib_device, an additional bit is added to
struct ib_device for is_switch (IB switch). This bit needs
to be initialized by any IB switch device driver. This is a
NEW requirement on such device drivers which are all
"out of tree".
In addition, an ib_switch helper was added to ib_verbs.h
based on the is_switch device bit rather than node_type
(although those should be consistent).
The RDMA core (MAD, SMI, agent, sa_query, multicast, sysfs)
as well as (IPoIB and SRP) ULPs are updated where
appropriate to use this new helper. In some cases,
the helper is now used under the covers of using
rdma_[start end]_port rather than the open coding
previously used.
Reviewed-by: Sean Hefty <sean.hefty@intel.com> Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Tested-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Hal Rosenstock <hal@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Jarkko Sakkinen [Wed, 24 Jun 2015 14:14:55 +0000 (17:14 +0300)]
tpm, tpm_crb: fail when TPM2 ACPI table contents look corrupted
At least some versions of AMI BIOS have corrupted contents in the TPM2
ACPI table; namely, the physical address of the control area is set to
zero.
This patch changes the driver to fail gracefully when we observe a zero
address instead of continuing to ioremap.
Cc: <stable@vger.kernel.org> Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Reviewed-by: Peter Huewe <peterhuewe@gmx.de> Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
Jason Gunthorpe [Tue, 30 Jun 2015 19:15:31 +0000 (13:15 -0600)]
tpm: Fix initialization of the cdev
When a cdev is contained in a dynamic structure the cdev parent kobj
should be set to the kobj that controls the lifetime of the enclosing
structure. In TPM's case this is the embedded struct device.
Also, cdev_init 0's the whole structure, so all sets must be after,
not before. This fixes module ref counting and cdev.
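The ordering issue can be shown with a small userspace sketch. struct fake_cdev and fake_cdev_init() are hypothetical stand-ins that only mimic the "zero the whole structure, then set ops" behaviour described above.

```c
#include <string.h>

/* Hypothetical stand-in for struct cdev; only the shape matters here. */
struct fake_cdev {
    const void *owner;
    const void *parent;
    const void *ops;
};

/* Zero the whole structure, then record the ops pointer. */
void fake_cdev_init(struct fake_cdev *c, const void *ops)
{
    memset(c, 0, sizeof(*c));
    c->ops = ops;
}
```

Any field assigned before the init call is wiped by the memset, so owner and parent have to be set afterwards.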
Cc: <stable@vger.kernel.org> Fixes: 313d21eeab92 ("tpm: device class for tpm") Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Reviewed-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Tested-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
1) Missing list head init in bluetooth hidp session creation, from Tedd
Ho-Jeong An.
2) Don't leak SKB in bridge netfilter error paths, from Florian
Westphal.
3) ipv6 netdevice private leak in netfilter bridging, fixed by Julien
Grall.
4) Fix regression in IP over hamradio bpq encapsulation, from Ralf
Baechle.
5) Fix race between rhashtable resize events and table walks, from Phil
Sutter.
6) Missing validation of IFLA_VF_INFO netlink attributes, fix from
Daniel Borkmann.
7) Missing security layer socket state initialization in tipc code,
from Stephen Smalley.
8) Fix shared IRQ handling in boomerang 3c59x interrupt handler, from
Denys Vlasenko.
9) Missing minor_idr destroy on module unload on macvtap driver, from
Johannes Thumshirn.
10) Various pktgen kernel thread races, from Oleg Nesterov.
11) Fix races that can cause packets to be processed in the backlog even
after a device attached to that SKB has been fully unregistered.
From Julian Anastasov.
12) bcmgenet driver doesn't account packet drops vs. errors properly,
fix from Petri Gynther.
13) Array index validation and off by one fix in DSA layer from Florian
Fainelli.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (66 commits)
can: replace timestamp as unique skb attribute
ARM: dts: dra7x-evm: Prevent glitch on DCAN1 pinmux
can: c_can: Fix default pinmux glitch at init
can: rcar_can: unify error messages
can: rcar_can: print request_irq() error code
can: rcar_can: fix typo in error message
can: rcar_can: print signed IRQ #
can: rcar_can: fix IRQ check
net: dsa: Fix off-by-one in switch address parsing
net: dsa: Test array index before use
net: switchdev: don't abort unsupported operations
net: bcmgenet: fix accounting of packet drops vs errors
cdc_ncm: update specs URL
Doc: z8530book: Fix typo in API-z8530-sync-txdma-open.html
net: inet_diag: always export IPV6_V6ONLY sockopt for listening sockets
bridge: mdb: allow the user to delete mdb entry if there's a querier
net: call rcu_read_lock early in process_backlog
net: do not process device backlog during unregistration
bridge: fix potential crash in __netdev_pick_tx()
net: axienet: Fix devm_ioremap_resource return value check
...
Pull crypto fixes from Herbert Xu:
"This fixes a duplicate dma_unmap_sg call in omap-des and reentrancy
bugs in the powerpc nx driver which may cause bogus output or, worse,
memory corruption"
Jeff Layton [Sat, 11 Jul 2015 10:43:03 +0000 (06:43 -0400)]
nfs4: have do_vfs_lock take an inode pointer
Now that we have file locking helpers that can deal with an inode
instead of a filp, we can change the NFSv4 locking code to use that
instead.
This should fix the case where we have a filp that is closed while flock
or OFD locks are set on it, and the task is signaled so that it doesn't
wait for the LOCKU reply to come in before the filp is freed. At that
point we can end up with a use-after-free with the current code, which
relies on dereferencing the fl_file in the lock request.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: "J. Bruce Fields" <bfields@fieldses.org> Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
Jeff Layton [Sat, 11 Jul 2015 10:43:02 +0000 (06:43 -0400)]
locks: new helpers - flock_lock_inode_wait and posix_lock_inode_wait
Allow callers to pass in an inode instead of a filp.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: "J. Bruce Fields" <bfields@fieldses.org> Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
Jeff Layton [Sat, 11 Jul 2015 10:43:02 +0000 (06:43 -0400)]
locks: have flock_lock_file take an inode pointer instead of a filp
...and rename it to better describe how it works.
In order to fix a use-after-free in NFS, we need to be able to remove
locks from an inode after the filp associated with them may have already
been freed. flock_lock_file already only dereferences the filp to get to
the inode, so just change it so the callers do that.
All of the callers already pass in a lock request that has the fl_file
set properly, so we don't need to pass it in individually. With that
change it now only dereferences the filp to get to the inode, so just
push that out to the callers.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: "J. Bruce Fields" <bfields@fieldses.org> Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
William reported that he was seeing instability with this patch, which
is likely due to the fact that it can cause the kernel to take a new
reference to a filp after the last reference has already been put.
Revert this patch for now, as we'll need to fix this in another way.
Cc: stable@vger.kernel.org Reported-by: William Dauchy <william@gandi.net> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Reviewed-by: "J. Bruce Fields" <bfields@fieldses.org> Tested-by: "J. Bruce Fields" <bfields@fieldses.org>
If a machine check happens, the machine has the vector facility installed
and the extended save area exists, the cpu will save vector register
contents into the extended save area. This is regardless of control
register 0 contents, which enables and disables the vector facility during
runtime.
On each machine check we should validate the vector registers. The current
code however tries to validate the registers only if the running task is
using vector registers in user space.
However even the current code is broken and causes vector register
corruption on machine checks, if user space uses them:
the prefix area contains a pointer (absolute address) to the machine check
extended save area. In order to save some space the save area was put into
an unused area of the second prefix page.
When validating vector register contents the code uses the absolute address
of the extended save area, which is wrong. Due to prefixing the vector
instructions will then access contents using absolute addresses instead
of real addresses, where the machine stored the contents.
Even if the above worked, there is still the problem that register validation
would only happen if user space uses vector registers. If kernel space uses
them also, this may also lead to vector register content corruption:
if the kernel makes use of vector instructions, but the current running
user space context does not, the machine check handler will validate
floating point registers instead of vector registers.
Given the fact that writing to a floating point register may change the
upper half of the corresponding vector register, we also experience vector
register corruption in this case.
Fix all of these issues, and always validate vector registers on each
machine check, if the machine has the vector facility installed and the
extended save area is defined.
The sfpc inline assembly within execve_tail() may incorrectly set bits
28-31 of the sfpc instruction to a value which is not zero.
These bits however are currently unused and therefore should be zero
so we won't get surprised if these bits will be used in the future.
Therefore remove the second operand from the inline assembly.
Cc: <stable@vger.kernel.org> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Stefan Haberland [Fri, 10 Jul 2015 08:47:09 +0000 (10:47 +0200)]
s390/dasd: fix kernel panic when alias is set offline
The dasd device driver selects which (alias or base) device is used
for a given request when the request is built. If the chosen alias
device is set offline before the request gets queued to the device
queue the starting function may use device structures that are
already freed. This might lead to a hanging offline process or a
kernel panic.
Add a check to the starting function that returns the request to the
upper layer if the device is already in offline processing.
In addition, prevent an alias device that is already in
offline processing from being chosen as the start device.
Reviewed-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Reviewed-by: Peter Oberparleiter <peter.oberparleiter@linux.vnet.ibm.com> Signed-off-by: Stefan Haberland <stefan.haberland@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
ARC: make sure instruction_pointer() returns unsigned value
Currently instruction_pointer() returns pt_regs->ret and so return value
is of type "long", which implicitly stands for "signed long".
While that's perfectly fine when dealing with 32-bit values, if the return
value of instruction_pointer() gets assigned to a 64-bit variable, sign
extension may happen.
And in at least one real use case it already happens.
In perf_prepare_sample() the return value of perf_instruction_pointer()
(which is an alias for instruction_pointer() in the case of ARC) is assigned
to (struct perf_sample_data)->ip (whose type is "u64").
What we see is that if the instruction pointer points to a user-space application,
which in the case of ARC lies below 0x8000_0000, "ip" gets set properly with
32 leading zeros. But if the instruction pointer points to kernel address
space, which starts at 0x8000_0000, then "ip" is set with 32 leading
"f"-s. I.e. if instruction_pointer() returns 0x8100_0000, "ip" will be
assigned 0xffff_ffff__8100_0000, which is obviously wrong.
In particular, that issue broke the output of perf, because perf was unable
to associate addresses like 0xffff_ffff__8100_0000 with anything from
/proc/kallsyms.
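The effect is easy to reproduce in userspace. In this sketch ip_via_signed() and ip_via_unsigned() are made-up helpers, and int32_t/uint32_t stand in for ARC's 32-bit "long" and "unsigned long" so the example behaves the same on a 64-bit host.

```c
#include <stdint.h>

/* Models a signed 32-bit return value being stored in a u64. */
uint64_t ip_via_signed(uint32_t ret)
{
    /* Out-of-range conversion wraps on common compilers. */
    int32_t as_long = (int32_t)ret;

    return (uint64_t)as_long;   /* sign-extends when widened */
}

/* Models an unsigned 32-bit return value being stored in a u64. */
uint64_t ip_via_unsigned(uint32_t ret)
{
    return (uint64_t)ret;       /* zero-extends when widened */
}
```

For the kernel address 0x81000000 the signed path yields 0xffffffff81000000, while the unsigned path yields 0x0000000081000000; addresses below 0x80000000 come out identical either way.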
That's what we used to see:
----------->8----------
6.27% ls [unknown] [k] 0xffffffff8046c5cc
2.96% ls libuClibc-0.9.34-git.so [.] memcpy
2.25% ls libuClibc-0.9.34-git.so [.] memset
1.66% ls [unknown] [k] 0xffffffff80666536
1.54% ls libuClibc-0.9.34-git.so [.] 0x000224d6
1.18% ls libuClibc-0.9.34-git.so [.] 0x00022472
----------->8----------
With that change perf output looks much better now:
----------->8----------
8.21% ls [kernel.kallsyms] [k] memset
3.52% ls libuClibc-0.9.34-git.so [.] memcpy
2.11% ls libuClibc-0.9.34-git.so [.] malloc
1.88% ls libuClibc-0.9.34-git.so [.] memset
1.64% ls [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
1.41% ls [kernel.kallsyms] [k] __d_lookup_rcu
----------->8----------
David S. Miller [Mon, 13 Jul 2015 05:24:01 +0000 (22:24 -0700)]
Merge tag 'linux-can-fixes-for-4.2-20150712' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2015-07-12
this is a pull request of 8 patches for net/master.
Sergei Shtylyov contributes 5 patches for the rcar_can driver, fixing the IRQ
check and several info and error messages. There are two patches by J.D.
Schroeder and Roger Quadros for the c_can driver and dra7x-evm device tree,
which prevent a glitch in the DCAN1 pinmux. Oliver Hartkopp provides a better
approach to make the CAN skbs unique, the timestamp is replaced by a counter.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The inb/outb/... family of IO methods end up being multiply defined when
building PCI support for the ColdFire. Compiling gives this:
CC init/main.o
In file included from ./arch/m68k/include/asm/io.h:4:0,
from include/linux/bio.h:30,
from include/linux/blkdev.h:18,
from init/main.c:75:
./arch/m68k/include/asm/io_mm.h:420:0: warning: "inb" redefined
./arch/m68k/include/asm/io_mm.h:108:0: note: this is the location of the previous definition
...
The ColdFire/PCI case defines its own IO access methods, so no others
should be defined or used in this case. Conditionally disable other
definitions that clash with it.
It would be nice if we could support multiple ColdFire SoC types in a
single binary - but currently the code simply does not support it.
Change the SoC selection config options to be a choice instead of
individual selectable entries.
This fixes problems with building allnoconfig, and means that a sane
linux kernel is generated for a single ColdFire SoC type.
m68knommu: improve the clock configuration defaults
Create some intelligent default settings for each ColdFire SoC type
in the configuration entry for CONFIG_CLOCK_FREQ.
The ColdFire clock frequency is configurable at build time. There is a
lot of variation in the frequency of operation on specific ColdFire based
boards. But we can choose a default that matches the maximum frequency
of clock operation for a particular ColdFire part. That is typically
the most common clock setting.
m68knommu: force setting of CONFIG_CLOCK_FREQ for ColdFire
It is possible to disable the clock selection at configuration time,
but for ColdFire targets we always expect a clock frequency to be
selected. This results in the following compile time error:
CC arch/m68k/kernel/asm-offsets.s
In file included from ./arch/m68k/include/asm/timex.h:14:0,
from include/linux/timex.h:65,
from include/linux/sched.h:19,
from arch/m68k/kernel/asm-offsets.c:14:
./arch/m68k/include/asm/coldfire.h:25:2: error: #error "Don't know what your ColdFire CPU clock frequency is??"
Remove CONFIG_CLOCK_SELECT completely and always enable CONFIG_CLOCK_FREQ
for ColdFire.
So the change to test 'crtc_state->base.active' cannot possibly be
correct as-is.
There may be some other minimal fix (like just checking crtc_state for
NULL), but I'm just reverting it now for the rc2 release, and people
like Daniel Vetter who actually know this code will figure out what the
right solution is in the longer term.
Reported-and-bisected-by: Jörg Otte <jrg.otte@gmail.com> Cc: Ander Conselvan de Oliveira <ander.conselvan.de.oliveira@intel.com> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Daniel Vetter <daniel.vetter@intel.com> CC: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS fixes from Al Viro:
"Fixes for this cycle regression in overlayfs and a couple of
long-standing (== all the way back to 2.6.12, at least) bugs"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
freeing unlinked file indefinitely delayed
fix a braino in ovl_d_select_inode()
9p: don't leave a half-initialized inode sitting around
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
"A fair number of 4.2 fixes also because Markos opened the flood gates.
- Patch up the math used to calculate the location for the page bitmap.
- The FDC (Not what you think, FDC stands for Fast Debug Channel) IRQ
workaround was causing issues on non-Malta platforms, so move the code
to a Malta specific location.
- A spelling fix replicated through several files.
- Fix to the emulation of an R2 instruction for R6 cores.
- Fix the JR emulation for R6.
- Further patching of mindless 64 bit issues.
- Ensure the kernel won't crash on CPUs with L2 caches with >= 8
ways.
- Use compat_sys_getsockopt for O32 ABI on 64 bit kernels.
- Fix cache flushing for multithreaded cores.
- A build fix"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: O32: Use compat_sys_getsockopt.
MIPS: c-r4k: Extend way_string array
MIPS: Pistachio: Support CDMM & Fast Debug Channel
MIPS: Malta: Make GIC FDC IRQ workaround Malta specific
MIPS: c-r4k: Fix cache flushing for MT cores
Revert "MIPS: Kconfig: Disable SMP/CPS for 64-bit"
MIPS: cps-vec: Use macros for various arithmetics and memory operations
MIPS: kernel: cps-vec: Replace KSEG0 with CKSEG0
MIPS: kernel: cps-vec: Use ta0-ta3 pseudo-registers for 64-bit
MIPS: kernel: cps-vec: Replace mips32r2 ISA level with mips64r2
MIPS: kernel: cps-vec: Replace 'la' macro with PTR_LA
MIPS: kernel: smp-cps: Fix 64-bit compatibility errors due to pointer casting
MIPS: Fix erroneous JR emulation for MIPS R6
MIPS: Fix branch emulation for BLTC and BGEC instructions
MIPS: kernel: traps: Fix broken indentation
MIPS: bootmem: Don't use memory holes for page bitmap
MIPS: O32: Do not handle require 32 bytes from the stack to be readable.
MIPS, CPUFREQ: Fix spelling of Institute.
MIPS: Lemote 2F: Fix build caused by recent mass rename.
Oliver Hartkopp [Fri, 26 Jun 2015 09:58:19 +0000 (11:58 +0200)]
can: replace timestamp as unique skb attribute
Commit 514ac99c64b "can: fix multiple delivery of a single CAN frame for
overlapping CAN filters" requires the skb->tstamp to be set to check for
identical CAN skbs.
Without timestamping being required by user space applications this timestamp
was not generated, which led to commit 36c01245eb8 "can: fix loss of CAN frames
in raw_rcv" - which forces the timestamp to be set in all CAN related skbuffs
by introducing several __net_timestamp() calls.
This forces e.g. out of tree drivers which are not using alloc_can{,fd}_skb()
to add __net_timestamp() after skbuff creation to prevent the frame loss fixed
in mainline Linux.
This patch removes the timestamp dependency and uses an atomic counter to
create a unique identifier together with the skbuff pointer.
Btw: the new skbcnt element introduced in struct can_skb_priv has to be
initialized with zero in out-of-tree drivers which are not using
alloc_can{,fd}_skb() too.
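A rough userspace sketch of the counter idea, using C11 atomics. frame_priv, frame_id, frame_counter, tag_frame() and same_frame() are hypothetical names and do not mirror the actual CAN code; the sketch only shows a pointer plus a non-zero counter forming the identity of a frame.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical per-frame private data; 0 means "not tagged yet". */
struct frame_priv { int skbcnt; };

/* Identity of a frame as remembered by a receiver. */
struct frame_id { const void *skb; int skbcnt; };

atomic_int frame_counter;

/* Assign a non-zero counter value the first time a frame is seen. */
void tag_frame(struct frame_priv *p)
{
    while (p->skbcnt == 0)
        p->skbcnt = atomic_fetch_add(&frame_counter, 1) + 1;
}

/* Two deliveries are the same frame only if both parts match. */
int same_frame(const struct frame_id *a, const struct frame_id *b)
{
    return a->skb == b->skb && a->skbcnt == b->skbcnt;
}
```

Because the counter starts out as zero meaning "untagged", a buffer that never had the field zero-initialized could be mistaken for an already tagged frame, which is why the note above about initializing skbcnt matters.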
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Roger Quadros [Tue, 7 Jul 2015 14:27:57 +0000 (17:27 +0300)]
ARM: dts: dra7x-evm: Prevent glitch on DCAN1 pinmux
Driver core sets "default" pinmux on probe and CAN driver
sets "sleep" pinmux during register. This causes a small window
where the CAN pins are in "default" state with the DCAN module
being disabled.
Change the "default" state to be like sleep so this glitch is
avoided. Add a new "active" state that is used by the driver
when CAN is actually active.
Signed-off-by: Roger Quadros <rogerq@ti.com> Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>