Song Liu [Fri, 9 Oct 2015 04:54:13 +0000 (21:54 -0700)]
MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW
When RAID-4/5/6 array suffers from missing journal device, we put
the array in read only state. We should not allow trasition to
read-write states (clean and active) before replacing journal device.
Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Fri, 9 Oct 2015 04:54:12 +0000 (21:54 -0700)]
MD: set journal disk ->raid_disk
Set journal disk ->raid_disk to >=0, I choose raid_disks + 1 instead of
0, because we already have a disk with ->raid_disk 0 and this causes
sysfs entry creation conflict. A lot of places assumes disk with
->raid_disk >=0 is normal raid disk, so we add check for journal disk.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Song Liu [Fri, 9 Oct 2015 04:54:11 +0000 (21:54 -0700)]
MD: kick out journal disk if it's not fresh
When journal disk is faulty and we are reassemabling the raid array, the
journal disk is old. We don't allow the journal disk added to the raid
array. Since journal disk is missing in the array, the raid5 will mark
the array readonly.
Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Fri, 9 Oct 2015 04:54:10 +0000 (21:54 -0700)]
raid5-cache: start raid5 readonly if journal is missing
If raid array is expected to have journal (eg, journal is set in MD
superblock feature map) and the array is started without journal disk,
start the array readonly.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Song Liu [Fri, 9 Oct 2015 04:54:09 +0000 (21:54 -0700)]
MD: add new bit to indicate raid array with journal
If a raid array has journal feature bit set, add a new bit to indicate
this. If the array is started without journal disk existing, we know
there is something wrong.
Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Fri, 9 Oct 2015 04:54:08 +0000 (21:54 -0700)]
raid5-cache: IO error handling
There are 3 places the raid5-cache dispatches IO. The discard IO error
doesn't matter, so we ignore it. The superblock write IO error can be
handled in MD core. The remaining are log write and flush. When the IO
error happens, we mark log disk faulty and fail all write IO. Read IO is
still allowed to run. Userspace will get a notification too and
corresponding daemon can choose setting raid array readonly for example.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Fri, 9 Oct 2015 04:54:07 +0000 (21:54 -0700)]
raid5: journal disk can't be removed
raid5-cache uses journal disk rdev->bdev, rdev->mddev in several places.
Don't allow journal disk disappear magically. On the other hand, we do
need to update superblock for other disks to bump up ->events, so next
time journal disk will be identified as stale.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
raid5-cache: simplify state machine when caches flushes are not needed
For devices without a volatile write cache we don't need to send a FLUSH
command to ensure writes are stable on disk, and thus can avoid the whole
step of batching up bios for processing by the MD thread.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
There is no good reason to keep the I/O unit structures around after the
stripe has been written back to the RAID array. The only information
we need is the log sequence number, and the checkpoint offset of the
highest successfull writeback. Store those in the log structure, and
free the IO units from __r5l_stripe_write_finished.
Besides simplifying the code this also avoid having to keep the allocation
for the I/O unit around for a potentially long time as superblock updates
that checkpoint the log do not happen very often.
This also fixes the previously incorrect calculation of 'free' in
r5l_do_reclaim as a side effect: previous if took the last unit which
isn't checkpointed into account.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Song Liu [Fri, 4 Sep 2015 06:00:35 +0000 (23:00 -0700)]
skip match_mddev_units check for special roles
match_mddev_units is used to check whether 2 RAID arrays share
same disk(s). Arrays that share disk(s) will not do resync at the
same time for better performance (fewer HDD seek). However, this
check should not apply to Spare, Faulty, and Journal disks, as
they do not paticipate in resync.
In this patch, match_mddev_units skips check for disks with flag
"Faulty" or "Journal" or raid_disk < 0.
Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Fri, 4 Sep 2015 21:14:16 +0000 (14:14 -0700)]
raid5-cache: don't delay stripe captured in log
There is a case a stripe gets delayed forever.
1. a stripe finishes construction
2. a new bio hits the stripe
3. handle_stripe runs for the stripe. The stripe gets DELAYED bit set
since construction can't run for new bio (the stripe is locked since
step 1)
Without log, handle_stripe will call ops_run_io. After IO finishes, the
stripe gets unlocked and the stripe will restart and run construction
for the new bio. With log, ops_run_io need to run two times. If the
DELAYED bit set, the stripe can't enter into the handle_list, so the
second ops_run_io doesn't run, which leaves the stripe stalled.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Fri, 4 Sep 2015 21:14:05 +0000 (14:14 -0700)]
raid5-cache: check stripe finish out of order
stripes could finish out of order. Hence r5l_move_io_unit_list() of
__r5l_stripe_write_finished might not move any entry and leave
stripe_end_ios list empty.
This applies on top of http://marc.info/?l=linux-raid&m=144122700510667
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Wed, 2 Sep 2015 20:49:50 +0000 (13:49 -0700)]
md: skip resync for raid array with journal
If a raid array has journal, the journal can guarantee the consistency,
we can skip resync after a unclean shutdown. The exception is raid
creation or user initiated resync, which we still do a raid resync.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Wed, 2 Sep 2015 20:49:49 +0000 (13:49 -0700)]
raid5-cache: optimize FLUSH IO with log enabled
With log enabled, bio is written to raid disks after the bio is settled
down in log disk. The recovery guarantees we can recovery the bio data
from log disk, so we we skip FLUSH IO.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
raid5-cache: move functionality out of __r5l_set_io_unit_state
Just keep __r5l_set_io_unit_state as a small set the state wrapper, and
remove r5l_set_io_unit_state entirely after moving the real
functionality to the two callers that need it.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Wed, 2 Sep 2015 20:49:47 +0000 (13:49 -0700)]
raid5-cache: fix a user-after-free bug
r5l_compress_stripe_end_list() can free an io_unit. This breaks the
assumption only reclaimer can free io_unit. We can add a reference count
based io_unit free, but since only reclaim can wait io_unit becoming to
STRIPE_END state, we use a simple global wait queue here.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Wed, 2 Sep 2015 20:49:46 +0000 (13:49 -0700)]
raid5-cache: switching to state machine for log disk cache flush
Before we write stripe data to raid disks, we must guarantee stripe data
is settled down in log disk. To do this, we flush log disk cache and
wait the flush finish. That wait introduces sleep time in raid5d thread
and impact performance. This patch moves the log disk cache flush
process to the stripe handling state machine, which can remove the wait
in raid5d.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Thu, 13 Aug 2015 21:32:03 +0000 (14:32 -0700)]
raid5: don't allow resize/reshape with cache(log) support
If cache(log) support is enabled, don't allow resize/reshape in current
stage. In the future, we can flush all data from cache(log) to raid
before resize/reshape and then allow resize/reshape.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Thu, 13 Aug 2015 21:32:02 +0000 (14:32 -0700)]
raid5: disable batch with log enabled
With log enabled, r5l_write_stripe will add the stripe to log. With
batch, several stripes are linked together. The stripes must be in the
same state. While with log, the log/reclaim unit is stripe, we can't
guarantee the several stripes are in the same state. Disabling batch for
log now.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Wed, 28 Oct 2015 15:41:25 +0000 (08:41 -0700)]
raid5-cache: use crc32c checksum
crc32c has lower overhead with cpu acceleration. It's a shame I didn't
use it in first post, sorry. This changes disk format, but we are still
ok in current stage.
V2: delete unnecessary type conversion as pointed out by Bart
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com> Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Shaohua Li [Thu, 13 Aug 2015 21:32:01 +0000 (14:32 -0700)]
raid5: log recovery
This is the log recovery support. The process is quite straightforward.
We scan the log and read all valid meta/data/parity into memory. If a
stripe's data/parity checksum is correct, the stripe will be recoveried.
Otherwise, it's discarded and we don't scan the log further. The reclaim
process guarantees stripe which starts to be flushed raid disks has
completed data/parity and has correct checksum. To recovery a stripe, we
just copy its data/parity to corresponding raid disks.
The trick thing is superblock update after recovery. we can't let
superblock point to last valid meta block. The log might look like:
| meta 1| meta 2| meta 3|
meta 1 is valid, meta 2 is invalid. meta 3 could be valid. If superblock
points to meta 1, we write a new valid meta 2n. If crash happens again,
new recovery will start from meta 1. Since meta 2n is valid, recovery
will think meta 3 is valid, which is wrong. The solution is we create a
new meta in meta2 with its seq == meta 1's seq + 10 and let superblock
points to meta2. recovery will not think meta 3 is a valid meta,
because its seq is wrong
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Thu, 13 Aug 2015 21:32:00 +0000 (14:32 -0700)]
raid5: log reclaim support
This is the reclaim support for raid5 log. A stripe write will have
following steps:
1. reconstruct the stripe, read data/calculate parity. ops_run_io
prepares to write data/parity to raid disks
2. hijack ops_run_io. stripe data/parity is appending to log disk
3. flush log disk cache
4. ops_run_io run again and do normal operation. stripe data/parity is
written in raid array disks. raid core can return io to upper layer.
5. flush cache of all raid array disks
6. update super block
7. log disk space used by the stripe can be reused
In practice, several stripes consist of an io_unit and we will batch
several io_unit in different steps, but the whole process doesn't
change.
It's possible io return just after data/parity hit log disk, but then
read IO will need read from log disk. For simplicity, IO return happens
at step 4, where read IO can directly read from raid disks.
Currently reclaim run if there is specific reclaimable space (1/4 disk
size or 10G) or we are out of space. Reclaim is just to free log disk
spaces, it doesn't impact data consistency. The size based force reclaim
is to make sure log isn't too big, so recovery doesn't scan log too
much.
Recovery make sure raid disks and log disk have the same data of a
stripe. If crash happens before 4, recovery might/might not recovery
stripe's data/parity depending on if data/parity and its checksum
matches. In either case, this doesn't change the syntax of an IO write.
After step 3, stripe is guaranteed recoverable, because stripe's
data/parity is persistent in log disk. In some cases, log disk content
and raid disks content of a stripe are the same, but recovery will still
copy log disk content to raid disks, this doesn't impact data
consistency. space reuse happens after superblock update and cache
flush.
There is one situation we want to avoid. A broken meta in the middle of
a log causes recovery can't find meta at the head of log. If operations
require meta at the head persistent in log, we must make sure meta
before it persistent in log too. The case is stripe data/parity is in
log and we start write stripe to raid disks (before step 4). stripe
data/parity must be persistent in log before we do the write to raid
disks. The solution is we restrictly maintain io_unit list order. In
this case, we only write stripes of an io_unit to raid disks till the
io_unit is the first one whose data/parity is in log.
The io_unit list order is important for other cases too. For example,
some io_unit are reclaimable and others not. They can be mixed in the
list, we shouldn't reuse space of an unreclaimable io_unit.
Includes fixes to problems which were... Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Thu, 13 Aug 2015 21:31:59 +0000 (14:31 -0700)]
raid5: add basic stripe log
This introduces a simple log for raid5. Data/parity writing to raid
array first writes to the log, then write to raid array disks. If
crash happens, we can recovery data from the log. This can speed up
raid resync and fix write hole issue.
The log structure is pretty simple. Data/meta data is stored in block
unit, which is 4k generally. It has only one type of meta data block.
The meta data block can track 3 types of data, stripe data, stripe
parity and flush block. MD superblock will point to the last valid
meta data block. Each meta data block has checksum/seq number, so
recovery can scan the log correctly. We store a checksum of stripe
data/parity to the metadata block, so meta data and stripe data/parity
can be written to log disk together. otherwise, meta data write must
wait till stripe data/parity is finished.
For stripe data, meta data block will record stripe data sector and
size. Currently the size is always 4k. This meta data record can be made
simpler if we just fix write hole (eg, we can record data of a stripe's
different disks together), but this format can be extended to support
caching in the future, which must record data address/size.
For stripe parity, meta data block will record stripe sector. It's
size should be 4k (for raid5) or 8k (for raid6). We always store p
parity first. This format should work for caching too.
flush block indicates a stripe is in raid array disks. Fixing write
hole doesn't need this type of meta data, it's for caching extension.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Thu, 13 Aug 2015 21:31:58 +0000 (14:31 -0700)]
raid5: add a new state for stripe log handling
When a stripe finishes construction, we write the stripe to raid in
ops_run_io normally. With log, we do a bunch of other operations before
the stripe is written to raid. Mainly write the stripe to log disk,
flush disk cache and so on. The operations are still driven by raid5d
and run in the stripe state machine. We introduce a new state for such
stripe (trapped into log). The stripe is in this state from the time it
first enters ops_run_io (finish construction) to the time it is written
to raid. Since we know the state is only for log, we bypass other
check/operation in handle_stripe.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Shaohua Li [Thu, 13 Aug 2015 21:31:56 +0000 (14:31 -0700)]
md: override md superblock recovery_offset for journal device
Journal device stores data in a log structure. We need record the log
start. Here we override md superblock recovery_offset for this purpose.
This field of a journal device is meaningless otherwise.
Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
Song Liu [Thu, 13 Aug 2015 21:31:55 +0000 (14:31 -0700)]
MD: add a new disk role to present write journal device
Next patches will use a disk as raid5/6 journaling. We need a new disk
role to present the journal device and add MD_FEATURE_JOURNAL to
feature_map for backward compability.
Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: NeilBrown <neilb@suse.com>
md-cluster: Call update_raid_disks() if another node --grow's raid_disks
To incorporate --grow feature executed on one node, other nodes need to
acknowledge the change in number of disks. Call update_raid_disks()
to update internal data structures.
This leads to call check_reshape() -> md_allow_write() -> md_update_sb(),
this results in a deadlock. This is done so it can safely allocate memory
(which might trigger writeback which might write to raid1). This is
not required for md with a bitmap.
In the clustered case, we don't perform md_update_sb() in md_allow_write(),
but in do_md_run(). Also we disable safemode for clustered mode.
mddev->recovery_cp need not be set in check_sb_changes() because this
is required only when a node reads another node's bitmap. mddev->recovery_cp
(which is read from sb->resync_offset), is set only if mddev is in_sync.
Since we disabled safemode, in_sync is set to zero.
In a clustered environment, the MD may not be in sync because another
node could be writing to it. So make sure that in_sync is not set in
case of clustered node in __md_stop_writes().
NeilBrown [Tue, 13 Oct 2015 20:09:52 +0000 (07:09 +1100)]
Merge branch 'md-next' of git://github.com/goldwynr/linux into for-next
md-cluster: A better way for METADATA_UPDATED processing
The processing of METADATA_UPDATED message is too simple and prone to
errors. Besides, it would not update the internal data structures as
required.
This set of patches reads the superblock from one of the device of the MD
and checks for changes in the in-memory data structures. If there is a change,
it performs the necessary actions to keep the internal data structures
as it would be in the primary node.
An example is if a devices turns faulty. The algorithm is:
1. The initiator node marks the device as faulty and updates the superblock
2. The initiator node sends METADATA_UPDATED with an advisory device number to the rest of the nodes.
3. The receiving node on receiving the METADATA_UPDATED message
3.1 Reads the superblock
3.2 Detects a device has failed by comparing with memory structure
3.3 Calls the necessary functions to record the failure and get the device out of the active array.
3.4 Acknowledges the message.
The patch series also fixes adding the disk which was impacted because of
the changes.
Patches can also be found at
https://github.com/goldwynr/linux branch md-next
Changes since V2:
- Fix status synchrnoization after --add and --re-add operations
- Included Guoqing's patches on endian correctness, zeroing cmsg etc
- Restructure add_new_disk() and cancel()
Guoqing Jiang [Mon, 12 Oct 2015 09:21:23 +0000 (17:21 +0800)]
md-cluster: make sure the node do not receive it's own msg
During the past test, the node occasionally received the msg which is
sent from itself, this case should not happen in theory, but it is
better to avoid it in case something wrong happened.
md-cluster: Do not printk() every received message
The receive daemon prints kernel messages for every network message
received. This would fill the kernel message log with unnecessary messages.
Remove the pr_info() messages.
md-cluster: Fix adding of new disk with new reload code
Adding the disk worked incorrectly with the new reload code. Fix it:
- No operation should be performed on rdev marked as Candidate
- After a metadata update operation, kick disk if role is 0xfffe
else clear Candidate bit and continue with the regular change check.
- Saving the mode of the lock resource to check if token lock is already
locked, because it can be called twice while adding a disk. However,
unlock_comm() must be called only once.
- add_new_disk() is called by the node initiating the --add operation.
If it needs to be canceled, call add_new_disk_cancel(). The operation
is completed by md_update_sb() which will write and unlock the
communication.
md-cluster: Perform resync/recovery under a DLM lock
Resync or recovery must be performed by only one node at a time.
A DLM lock resource, resync_lockres provides the mutual exclusion
so that only one node performs the recovery/resync at a time.
If a node is unable to get the resync_lockres, because recovery is
being performed by another node, it set MD_RECOVER_NEEDED so as
to schedule recovery in the future.
Remove the debug message in resync_info_update()
used during development.
In a clustered environment, a change such as marking a device faulty,
can be recorded by any of the nodes. This is communicated to all the
nodes and re-recording such a change is unnecessary, and quite often
pretty disruptive.
With this patch, just before the update, we detect for the changes
and if the changes are already in superblock, we abort the update
after clearing all the flags
md-cluster: Improve md_reload_sb to be less error prone
md_reload_sb is too simplistic and it explicitly needs to determine
the changes made by the writing node. However, there are multiple areas
where a simple reload could fail.
Instead, read the superblock of one of the "good" rdevs and update
the necessary information:
- read the superblock into a newly allocated page, by temporarily
swapping out rdev->sb_page and calling ->load_super.
- if that fails return
- if it succeeds, call check_sb_changes
1. iterates over list of active devices and checks the matching
dev_roles[] value.
If that is 'faulty', the device must be marked as faulty
- call md_error to mark the device as faulty. Make sure
not to set CHANGE_DEVS and wakeup mddev->thread or else
it would initiate a resync process, which is the responsibility
of the "primary" node.
- clear the Blocked bit
- Call remove_and_add_spares() to hot remove the device.
If the device is 'spare':
- call remove_and_add_spares() to get the number of spares
added in this operation.
- Reduce mddev->degraded to mark the array as not degraded.
2. reset recovery_cp
- read the rest of the rdevs to update recovery_offset. If recovery_offset
is equal to MaxSector, call spare_active() to set it In_sync
This required that recovery_offset be initialized to MaxSector, as
opposed to zero so as to communicate the end of sync for a rdev.
md-cluster: send BITMAP_NEEDS_SYNC when node is leaving cluster
Previously, BITMAP_NEEDS_SYNC message is sent when the resyc
aborts, but it could abort for different reasons, and not all
of reasons require another node to take over the resync ownship.
It is better make BITMAP_NEEDS_SYNC message only be sent when
the node is leaving cluster with dirty bitmap. And we also need
to ensure dlm connection is ok.
Suspending the entire device for resync could take too long. Resync
in small chunks.
cluster's resync window (32M) is maintained in r1conf as
cluster_sync_low and cluster_sync_high and processed in
raid1's sync_request(). If the current resync is outside the cluster
resync window:
1. Set the cluster_sync_low to curr_resync_completed.
2. Check if the sync will fit in the new window, if not issue a
wait_barrier() and set cluster_sync_low to sector_nr.
3. Set cluster_sync_high to cluster_sync_low + resync_window.
4. Send a message to all nodes so they may add it in their suspension
list.
bitmap_cond_end_sync is modified to allow to force a sync inorder
to get the curr_resync_completed uptodate with the sector passed.
md-cluster: complete all write requests before adding suspend_info
process_suspend_info - which handles the RESYNCING request - must not
reply until all writes which were initiated before the request arrived,
have completed.
As a by-product, all process_* functions now take mddev as their
first arguement making it uniform.
Linus Torvalds [Sun, 11 Oct 2015 17:16:59 +0000 (10:16 -0700)]
Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"Three trivial commits:
- Fix a kerneldoc regression
- Export handle_bad_irq to unbreak a driver in next
- Add an accessor for the of_node field so refactoring in next does
not depend on merge ordering"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqdomain: Add an accessor for the of_node field
genirq: Fix handle_bad_irq kerneldoc comment
genirq: Export handle_bad_irq
Linus Torvalds [Sun, 11 Oct 2015 17:02:30 +0000 (10:02 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This is a set of three bug fixes, two of which are regressions from
recent updates (the 3ware one from 4.1 and the device handler fixes
from 4.2)"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
3w-9xxx: don't unmap bounce buffered commands
scsi_dh: Use the correct module name when loading device handler
libiscsi: Fix iscsi_check_transport_timeouts possible infinite loop
Linus Torvalds [Sat, 10 Oct 2015 18:17:45 +0000 (11:17 -0700)]
Merge tag 'usb-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB and PHY fixes and quirk updates for 4.3-rc5.
Nothing major here, full details in the shortlog, and all of these
have been in linux-next for a while"
* tag 'usb-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: Add device quirk for Logitech PTZ cameras
USB: chaoskey read offset bug
USB: Add reset-resume quirk for two Plantronics usb headphones.
usb: renesas_usbhs: Add support for R-Car H3
usb: renesas_usbhs: fix build warning if 64-bit architecture
usb: gadget: bdc: fix memory leak
phy: berlin-sata: Fix module autoload for OF platform driver
phy: rockchip-usb: power down phy when rockchip phy probe
phy: qcom-ufs: fix build error when the component is built as a module
Linus Torvalds [Sat, 10 Oct 2015 18:09:55 +0000 (11:09 -0700)]
Merge tag 'tty-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver fixes from Greg KH:
"Here are a few bug fixes for the tty core that resolve reported
issues, and some serial driver fixes as well (including the
much-reported imx driver problem)
All of these have been in linux-next with no reported problems"
* tag 'tty-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
drivers/tty: require read access for controlling terminal
serial: 8250: add uart_config entry for PORT_RT2880
tty: fix data race on tty_buffer.commit
tty: fix data race in tty_buffer_flush
tty: fix data race in flush_to_ldisc
tty: fix stall caused by missing memory barrier in drivers/tty/n_tty.c
serial: atmel: fix error path of probe function
tty: don't leak cdev in tty_cdev_add()
Revert "serial: imx: remove unbalanced clk_prepare"
Linus Torvalds [Sat, 10 Oct 2015 18:03:31 +0000 (11:03 -0700)]
Merge tag 'staging-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging fixes from Greg KH:
"Here are two tiny staging tree fixes for 4.3-rc5.
One fixes the broken speakup subsystem as reported by a user, and the
other removes an entry in the MAINTAINERS file for a developer that
doesn't want to be listed anymore"
* tag 'staging-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: speakup: fix speakup-r regression
MAINTAINERS: Remove myself as nvec co-maintainer
Linus Torvalds [Sat, 10 Oct 2015 17:58:27 +0000 (10:58 -0700)]
Merge tag 'char-misc-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some small fixes for some misc drivers that resolve some
reported issues. All of these have been linux-next for a while"
* tag 'char-misc-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
mcb: Fix error handling in mcb_pci_probe()
mei: hbm: fix error in state check logic
nvmem: sunxi: Check for memory allocation failure
nvmem: core: Fix memory leak in nvmem_cell_write
nvmem: core: Handle shift bits in-place if cell->nbits is non-zero
nvmem: core: fix the out-of-range leak in read/write()
Linus Torvalds [Sat, 10 Oct 2015 17:31:13 +0000 (10:31 -0700)]
Merge branch 'stable/for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb
Pull swiotlb fixlet from Konrad Rzeszutek Wilk:
"Enable the SWIOTLB under 32-bit PAE kernels.
Nowadays most distros enable this due to CONFIG_HYPERVISOR|XEN=y which
select SWIOTLB. But for those that are not interested in
virtualization and wanting to use 32-bit PAE kernels and wanting to
have working DMA operations - this configures it for them"
* 'stable/for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb:
swiotlb: Enable it under x86 PAE
Trond Myklebust [Fri, 9 Oct 2015 17:44:34 +0000 (13:44 -0400)]
namei: results of d_is_negative() should be checked after dentry revalidation
Leandro Awa writes:
"After switching to version 4.1.6, our parallelized and distributed
workflows now fail consistently with errors of the form:
T34: ./regex.c:39:22: error: config.h: No such file or directory
From our 'git bisect' testing, the following commit appears to be the
possible cause of the behavior we've been seeing: commit 766c4cbfacd8"
Al Viro says:
"What happens is that 766c4cbfacd8 got the things subtly wrong.
We used to treat d_is_negative() after lookup_fast() as "fall with
ENOENT". That was wrong - checking ->d_flags outside of ->d_seq
protection is unreliable and failing with hard error on what should've
fallen back to non-RCU pathname resolution is a bug.
Unfortunately, we'd pulled the test too far up and ran afoul of
another kind of staleness. The dentry might have been absolutely
stable from the RCU point of view (and we might be on UP, etc), but
stale from the remote fs point of view. If ->d_revalidate() returns
"it's actually stale", dentry gets thrown away and the original code
wouldn't even have looked at its ->d_flags.
What we need is to check ->d_flags where 766c4cbfacd8 does (prior to
->d_seq validation) but only use the result in cases where we do not
discard this dentry outright"
Linus Torvalds [Sat, 10 Oct 2015 01:39:04 +0000 (18:39 -0700)]
Merge tag 'pm+acpi-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI fixes from Rafael Wysocki:
"These are four fixes for bugs in the devfreq and cpufreq subsystems,
including two regression fixes (one for a recent regression and one
for a problem introduced in 4.2).
Specifics:
- Two fixes for cpufreq regressions, an acpi-cpufreq driver one
introduced during the 4.2 cycle when we started to preserve cpufreq
directories for offline CPUs and a general one introduced recently
(Srinivas Pandruvada).
- Two devfreq fixes, one for a double kfree() in an error code path
and one for a confusing sysfs-related failure (Geliang Tang, Tobias
Jakobi)"
* tag 'pm+acpi-4.3-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: prevent lockup on reading scaling_available_frequencies
cpufreq: acpi_cpufreq: prevent crash on reading freqdomain_cpus
PM / devfreq: fix double kfree
PM / devfreq: Fix governor_store()
Linus Torvalds [Fri, 9 Oct 2015 23:58:11 +0000 (16:58 -0700)]
Merge tag 'dm-4.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull dm fixes from Mike Snitzer:
"Three stable fixes:
- DM core AB-BA deadlock fix in the device destruction path (vs
device creation's DM table swap).
- DM raid fix to properly round up the region_size to the next
power-of-2.
- DM cache fix for a NULL pointer seen while switching from the
"cleaner" cache policy.
Two fixes for regressions introduced during the 4.3 merge:
- request-based DM error propagation regressed due to incorrect
changes introduced when adding the bi_error field to bio.
- DM snapshot fix to only support snapshots that overflow if the
client (e.g. lvm2) is prepared to deal with the associated
snapshot status interface change"
* tag 'dm-4.3-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm snapshot: add new persistent store option to support overflow
dm cache: fix NULL pointer when switching from cleaner policy
dm: fix request-based dm error reporting
dm raid: fix round up of default region size
dm: fix AB-BA deadlock in __dm_destroy()
Linus Torvalds [Fri, 9 Oct 2015 23:39:35 +0000 (16:39 -0700)]
Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
"These are small and assorted. Neil's is the oldest, I dropped the
ball thinking he was going to send it in"
* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: support NFSv2 export
Btrfs: open_ctree: Fix possible memory leak
Btrfs: fix deadlock when finalizing block group creation
Btrfs: update fix for read corruption of compressed and shared extents
Btrfs: send, fix corner case for reference overwrite detection
Linus Torvalds [Fri, 9 Oct 2015 22:54:14 +0000 (15:54 -0700)]
Merge tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Arnd Bergmann:
"The fixes for this week include one small patch that was years in the
making and that finally fixes using all eight CPUs on exynos542x.
The rest are lots of minor changes for sunxi, imx, exynos and shmobile
- fixing the minimum voltage for Allwinner A20
- thermal boot issue on SMDK5250.
- invalid clock used for FIMD IOMMU.
- audio on Renesas r8a7790/r8a7791
- invalid clock used for FIMD IOMMU
- LEDs on exynos5422-odroidxu3-common
- usb pin control for imx-rex
- imx53: fix PMIC interrupt level
- a Makefile typo"
* tag 'fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
ARM: dts: Fix wrong clock binding for sysmmu_fimd1_1 on exynos5420
ARM: dts: Fix bootup thermal issue on smdk5250
ARM: shmobile: r8a7791 dtsi: Add CPG/MSTP Clock Domain for sound
ARM: shmobile: r8a7790 dtsi: Add CPG/MSTP Clock Domain for sound
arm-cci500: Don't enable PMU driver by default
ARM: dts: fix usb pin control for imx-rex dts
ARM: imx53: qsrb: fix PMIC interrupt level
ARM: imx53: include IRQ dt-bindings header
ARM: dts: add suspend opp to exynos4412
ARM: dts: Fix LEDs on exynos5422-odroidxu3
ARM: EXYNOS: reset Little cores when cpu is up
ARM: dts: Fix Makefile target for sun4i-a10-itead-iteaduino-plus
ARM: dts: sunxi: Raise minimum CPU voltage for sun7i-a20 to meet SoC specifications
Mike Snitzer [Thu, 8 Oct 2015 22:05:41 +0000 (18:05 -0400)]
dm snapshot: add new persistent store option to support overflow
Commit 76c44f6d80 introduced the possibly for "Overflow" to be reported
by the snapshot device's status. Older userspace (e.g. lvm2) does not
handle the "Overflow" status response.
Fix this incompatibility by requiring newer userspace code, that can
cope with "Overflow", request the persistent store with overflow support
by using "PO" (Persistent with Overflow) for the snapshot store type.
Marc Zyngier [Fri, 9 Oct 2015 14:50:11 +0000 (15:50 +0100)]
irqdomain: Add an accessor for the of_node field
As we're about to remove the of_node field from the irqdomain
structure, introduce an accessor for it. Subsequent patches
will take care of the actual repainting.
Joe Thornber [Fri, 9 Oct 2015 13:03:38 +0000 (14:03 +0100)]
dm cache: fix NULL pointer when switching from cleaner policy
The cleaner policy doesn't make use of the per cache block hint space in
the metadata (unlike the other policies). When switching from the
cleaner policy to mq or smq a NULL pointer crash (in dm_tm_new_block)
was observed. The crash was caused by bugs in dm-cache-metadata.c
when trying to skip creation of the hint btree.
The minimal fix is to change hint size for the cleaner policy to 4 bytes
(only hint size supported).
Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org
Mikulas Patocka [Thu, 1 Oct 2015 19:17:43 +0000 (15:17 -0400)]
crash in md-raid1 and md-raid10 due to incorrect list manipulation
The commit 55ce74d4bfe1b9444436264c637f39a152d1e5ac (md/raid1: ensure
device failure recorded before write request returns) is causing crash in
the LVM2 testsuite test shell/lvchange-raid.sh. For me the crash is 100%
reproducible.
The reason for the crash is that the newly added code in raid1d moves the
list from conf->bio_end_io_list to tmp, then tests if tmp is non-empty and
then incorrectly pops the bio from conf->bio_end_io_list (which is empty
because the list was alrady moved).
cpufreq: prevent lockup on reading scaling_available_frequencies
When scaling_available_frequencies is read on an offlined cpu, then
either lockup or junk values are displayed. This is caused by
freed freq_table, which policy is using.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
cpufreq: acpi_cpufreq: prevent crash on reading freqdomain_cpus
When freqdomain_cpus attribute is read from an offlined cpu, it will
cause crash. This change prevents calling cpufreq_show_cpus when
policy driver_data is NULL.