Josh Hunt [Tue, 9 Feb 2016 23:11:53 +0000 (10:11 +1100)]
block: restore /proc/partitions to not display non-partitionable removable devices
We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before.
Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduces this change in behavior. It's not
clear to me from the commit's changelog if this change was intentional or
not. This comment still remains: /* Don't show non-partitionable
removeable devices or empty devices */ so I've decided to send a patch to
restore the behavior of not printing unpartitionable removable devices.
Signed-off-by: Josh Hunt <johunt@akamai.com> Cc: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
San Mehat [Tue, 9 Feb 2016 23:11:52 +0000 (10:11 +1100)]
block: partition: add partition specific uevent callbacks for partition info
This patch has been carried in the Android tree for quite some time and is
one of the few patches required to get a mainline kernel up and running
with an exsiting Android userspace. So I wanted to submit it for review
and consideration if it should be merged.
For partitions, add new uevent parameters 'PARTN' which specifies the
partitions index in the table, and 'PARTNAME', which specifies PARTNAME
specifices the partition name of a partition device.
Android's userspace uses this for creating device node links from the
partition name and number: ie:
/dev/block/platform/soc/by-name/system
or
/dev/block/platform/soc/by-num/p1
One can see its usage here:
https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#355
and
https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#494
[john.stultz@linaro.org: dropped NPARTS and reworded commit message for context] Signed-off-by: Dima Zavin <dima@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Rom Lemarchand <romlem@google.com> Cc: Android Kernel Team <kernel-team@android.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: <harald@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: xuejiufei <xuejiufei@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
xuejiufei [Tue, 9 Feb 2016 23:11:51 +0000 (10:11 +1100)]
ocfs2/dlm: move lock to the tail of grant queue while doing in-place convert
We have found a bug when two nodes doing umount one after another.
1) Node 1 migrate a lockres that has 3 locks in grant queue such as
N2(PR)<->N3(NL)<->N4(PR) to N2. After migration, lvb of the lock
N3(NL) and N4(PR) are empty on node 2 because migration target do not
copy lvb to these two lock.
2) Node 3 want to convert to PR, it can be granted in
__dlmconvert_master(), and the order of these locks is unchanged. The
lvb of the lock N3(PR) on node 2 is copyed from lockres in function
dlm_update_lvb() while the lvb of lock N4(PR) is still empty.
3) Node 2 want to leave domain, it will migrate this lockres to node 3.
Then node 2 will trigger the BUG in dlm_prepare_lvb_for_migration()
when adding the lock N4(PR) to mres with the following message because
the lvb of mres is already copied from lock N3(PR), but the lvb of lock
N4(PR) is empty.
"Mismatched lvb in lock cookie=%u:%llu, name=%.*s, node=%u"
Signed-off-by: xuejiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Tue, 9 Feb 2016 23:11:50 +0000 (10:11 +1100)]
ocfs2: o2hb: fix hb hung time
hr_last_timeout_start should be set as the last time where hb is still OK.
When hb write timeout, hung time will be (jiffies -
hr_last_timeout_start).
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Tue, 9 Feb 2016 23:11:50 +0000 (10:11 +1100)]
ocfs2: o2hb: don't negotiate if last hb fail
Sometimes io error is returned when storage is down for a while. Like for
iscsi device, stroage is made offline when session timeout, and this will
make all io return -EIO. For this case, nodes shouldn't do negotiate
timeout but should fence self. So let nodes fence self when
o2hb_do_disk_heartbeat return an error, this is the same behavior with
o2hb without negotiate timer.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Tue, 9 Feb 2016 23:11:50 +0000 (10:11 +1100)]
ocfs2: o2hb: add some user/debug log
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Tue, 9 Feb 2016 23:11:49 +0000 (10:11 +1100)]
ocfs2: o2hb: add NEGOTIATE_APPROVE message
This message is used to re-queue write timeout timer and negotiate timer
when all nodes suffer a write hung to storage, this makes node not fence
self if storage down.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Tue, 9 Feb 2016 23:11:49 +0000 (10:11 +1100)]
ocfs2: o2hb: add NEGO_TIMEOUT message
This message is sent to master node when non-master nodes's negotiate
timer expired. Master node records these nodes in a bitmap which is used
to do write timeout timer re-queue decision.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Tue, 9 Feb 2016 23:11:49 +0000 (10:11 +1100)]
ocfs2: o2hb: add negotiate timer
This series of patches is to fix the issue that when storage down, all
nodes will fence self due to write timeout.
With this patch set, all nodes will keep going until storage back online,
except if the following issue happens, then all nodes will do as before to
fence self.
1. io error got
2. network between nodes down
3. nodes panic
This patch (of 6):
When storage down, all nodes will fence self due to write timeout. The
negotiate timer is designed to avoid this, with it node will wait until
storage up again.
Negotiate timer working in the following way:
1. The timer expires before write timeout timer, its timeout is half
of write timeout now. It is re-queued along with write timeout timer.
If expires, it will send NEGO_TIMEOUT message to master node(node with
lowest node number). This message does nothing but marks a bit in a
bitmap recording which nodes are negotiating timeout on master node.
2. If storage down, nodes will send this message to master node, then
when master node finds its bitmap including all online nodes, it sends
NEGO_APPROVL message to all nodes one by one, this message will
re-queue write timeout timer and negotiate timer. For any node doesn't
receive this message or meets some issue when handling this message, it
will be fenced. If storage up at any time, o2hb_thread will run and
re-queue all the timer, nothing will be affected by these two steps.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gang He [Tue, 9 Feb 2016 23:11:48 +0000 (10:11 +1100)]
ocfs2: add feature document for online file check
This document will describe OCFS2 online file check feature. OCFS2 is
often used in high-availaibility systems. However, OCFS2 usually converts
the filesystem to read-only when encounters an error. This may not be
necessary, since turning the filesystem read-only would affect other
running processes as well, decreasing availability.
Then, a mount option (errors=continue) is introduced, which would return
the -EIO errno to the calling process and terminate furhter processing so
that the filesystem is not corrupted further. The filesystem is not
converted to read-only, and the problematic file's inode number is
reported in the kernel log. The user can try to check/fix this file via
online filecheck feature.
Signed-off-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gang He [Tue, 9 Feb 2016 23:11:48 +0000 (10:11 +1100)]
ocfs2: check/fix inode block for online file check
Implement online check or fix inode block during reading a inode block to
memory.
Signed-off-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gang He [Tue, 9 Feb 2016 23:11:47 +0000 (10:11 +1100)]
ocfs2: create/remove sysfile for online file check
Create online file check sysfile when ocfs2 mount, remove the related
sysfile when ocfs2 umount.
Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gang He [Tue, 9 Feb 2016 23:11:47 +0000 (10:11 +1100)]
ocfs2: sysfile interfaces for online file check
Implement online file check sysfile interfaces, e.g. how to create the
related sysfile according to device name, how to display/handle file check
request from the sysfile.
Signed-off-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gang He [Tue, 9 Feb 2016 23:11:47 +0000 (10:11 +1100)]
ocfs2: export ocfs2_kset for online file check
When there are errors in the ocfs2 filesystem,
they are usually accompanied by the inode number which caused the error.
This inode number would be the input to fixing the file.
One of these options could be considered:
A file in the sys filesytem which would accept inode numbers.
This could be used to communication back what has to be fixed or is fixed.
You could write:
Compare with second version, I re-design filecheck sysfs interfaces, there
are three sysfs files (check, fix and set) under filecheck directory (see
above), sysfs will accept only one argument <inode>. Second, I adjust
some code in ocfs2_filecheck_repair_inode_block() function according to
upstream feedback, we cannot just add VALID_FL flag back as a inode block
fix, then we will not fix this field corruption currently until having a
complete solution. Compare with first version, I use strncasecmp instead
of double strncmp functions. Second, update the source file contribution
vendor.
This patch (of 4):
Export ocfs2_kset object from ocfs2_stackglue kernel module, then
online file check code will create the related sysfiles under
ocfs2_kset object. We're exporting this because it's built in
ocfs2_stackglue.ko.
Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
jiangyiwen [Tue, 9 Feb 2016 23:11:46 +0000 (10:11 +1100)]
ocfs2: solve a problem of crossing the boundary in updating backups
In update_backups() there exists a problem of crossing the boundary
as follows:
we assume that lun will be resized to 1TB(cluster_size is 32kb),
it will include 0~33554431 cluster, in update_backups func,
it will backup super block in location of 1TB which is the 33554432th
cluster, so the phenomenon of crossing the boundary happens.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
jiangyiwen [Tue, 9 Feb 2016 23:11:46 +0000 (10:11 +1100)]
ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local
This patch fixes a deadlock, as follows:
Node 1 Node 2 Node 3
1)volume a and b are only mount vol a only mount vol b
mounted
2) start to mount b start to mount a
3) check hb of Node 3 check hb of Node 2
in vol a, qs_holds++ in vol b, qs_holds++
4) -------------------- all nodes' network down --------------------
5) progress of mount b the same situation as
failed, and then call Node 2
ocfs2_dismount_volume.
but the process is hung,
since there is a work
in ocfs2_wq cannot beo
completed. This work is
about vol a, because
ocfs2_wq is global wq.
BTW, this work which is
scheduled in ocfs2_wq is
ocfs2_orphan_scan_work,
and the context in this work
needs to take inode lock
of orphan_dir, because
lockres owner are Node 1 and
all nodes' nework has been down
at the same time, so it can't
get the inode lock.
6) Why can't this node be fenced
when network disconnected?
Because the process of
mount is hung what caused qs_holds
is not equal 0.
Because all works in the ocfs2_wq are relative to the super block.
The solution is to change the ocfs2_wq from global to local. In other
words, move it into struct ocfs2_super.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xue jiufei [Tue, 9 Feb 2016 23:11:45 +0000 (10:11 +1100)]
ocfs2: extend enough credits for freeing one truncate record while replaying truncate records
Now function ocfs2_replay_truncate_records() first modifies tl_used, then
calls ocfs2_extend_trans() to extend transactions for gd and alloc inode
used for freeing clusters. jbd2_journal_restart() may be called and it
may happen that tl_used in truncate log is decreased but the clusters are
not freed, which means these clusters are lost. So we should avoid
extending transactions in these two operations.
Signed-off-by: joyce.xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xue jiufei [Tue, 9 Feb 2016 23:11:45 +0000 (10:11 +1100)]
ocfs2: extend transaction for ocfs2_remove_rightmost_path() and ocfs2_update_edge_lengths() before to avoid inconsistency between inode and et
I found that jbd2_journal_restart() is called in some places without
keeping things consistently before. However, jbd2_journal_restart() may
commit the handle's transaction and restart another one. If the first
transaction is committed successfully while another not, it may cause
filesystem inconsistency or read only. This is an effort to fix this kind
of problems.
This patch (of 3):
The following functions will be called while truncating an extent:
ocfs2_remove_btree_range
-> ocfs2_start_trans
-> ocfs2_remove_extent
-> ocfs2_truncate_rec
-> ocfs2_extend_rotate_transaction
-> jbd2_journal_restart if jbd2_journal_extend fail
-> ocfs2_rotate_tree_left
-> ocfs2_remove_rightmost_path
-> ocfs2_extend_rotate_transaction
-> ocfs2_unlink_subtree
-> ocfs2_update_edge_lengths
-> ocfs2_extend_trans
-> jbd2_journal_restart if jbd2_journal_extend fail
-> ocfs2_et_update_clusters
-> ocfs2_commit_trans
jbd2_journal_restart() may be called and it may happened that the buffers
dirtied in ocfs2_truncate_rec() are committed while buffers dirtied in
ocfs2_et_update_clusters() are not, the total clusters on extent tree and
i_clusters in ocfs2_dinode is inconsistency. So the clusters got from
ocfs2_dinode is incorrect, and it also cause read-only problem when call
ocfs2_commit_truncate() with the error message: "Inode %llu has empty
extent block at %llu".
We should extend enough credits for function ocfs2_remove_rightmost_path
and ocfs2_update_edge_lengths to avoid this inconsistency.
Signed-off-by: joyce.xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joseph Qi [Tue, 9 Feb 2016 23:11:45 +0000 (10:11 +1100)]
ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list
When master handles convert request, it queues ast first and then returns
status. This may happen that the ast is sent before the request status
because the above two messages are sent by two threads. And right after
the ast is sent, if master down, it may trigger BUG in
dlm_move_lockres_to_recovery_list in the requested node because ast
handler moves it to grant list without clear lock->convert_pending. So
remove BUG_ON statement and check if the ast is processed in
dlmconvert_remote.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Tariq Saeed <tariq.x.saeed@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joseph Qi [Tue, 9 Feb 2016 23:11:43 +0000 (10:11 +1100)]
ocfs2/dlm: fix race between convert and recovery
There is a race window between dlmconvert_remote and
dlm_move_lockres_to_recovery_list, which will cause a lock with
OCFS2_LOCK_BUSY in grant list, thus system hangs.
status = dlm_send_remote_convert_request();
>>>>>> race window, master has queued ast and return DLM_NORMAL,
and then down before sending ast.
this node detects master down and calls
dlm_move_lockres_to_recovery_list, which will revert the
lock to grant list.
Then OCFS2_LOCK_BUSY won't be cleared as new master won't
send ast any more because it thinks already be authorized.
In this case, check if res->state has DLM_LOCK_RES_RECOVERING bit set (res
is still in recovering) or res master changed (new master has finished
recovery), reset the status to DLM_RECOVERING, then it will retry convert.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Tariq Saeed <tariq.x.saeed@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:43 +0000 (10:11 +1100)]
ocfs2: fix a deadlock issue in ocfs2_dio_end_io_write()
Should call ocfs2_free_alloc_context() to free meta_ac & data_ac before
calling ocfs2_run_deallocs(). Because ocfs2_run_deallocs() will acquire
the system inode's i_mutex hold by meta_ac. So try to release the lock
before ocfs2_run_deallocs().
Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.") Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Acked-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:43 +0000 (10:11 +1100)]
ocfs2: fix disk file size and memory file size mismatch
When doing append direct write in an already allocated cluster, and fast
path in ocfs2_dio_get_block() is triggered, function
ocfs2_dio_end_io_write() will be skipped as there is no context allocated.
So disk file size will not be changed as it should be. The solution is
to skip fast path when we are about to change file size.
Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.") Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Acked-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:42 +0000 (10:11 +1100)]
ocfs2: take ip_alloc_sem in ocfs2_dio_get_block & ocfs2_dio_end_io_write
Take ip_alloc_sem to prevent concurrent access to extent tree, which may
cause the extent tree in an unstable state.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Ryan Ding <ryan.ding@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:41 +0000 (10:11 +1100)]
ocfs2: fix ip_unaligned_aio deadlock with dio work queue
In the current implementation of unaligned aio+dio, lock order behave as
follow:
in user process context:
-> call io_submit()
-> get i_mutex
<== window1
-> get ip_unaligned_aio
-> submit direct io to block device
-> release i_mutex
-> io_submit() return
in dio work queue context(the work queue is created in __blockdev_direct_IO):
-> release ip_unaligned_aio
<== window2
-> get i_mutex
-> clear unwritten flag & change i_size
-> release i_mutex
There is a limitation to the thread number of dio work queue. 256 at
default. If all 256 thread are in the above 'window2' stage, and there is
a user process in the 'window1' stage, the system will became deadlock.
Since the user process hold i_mutex to wait ip_unaligned_aio lock, while
there is a direct bio hold ip_unaligned_aio mutex who is waiting for a dio
work queue thread to be schedule. But all the dio work queue thread is
waiting for i_mutex lock in 'window2'.
This case only happened in a test which send a large number(more than 256)
of aio at one io_submit() call.
My design is to remove ip_unaligned_aio lock. Change it to a sync io
instead. Just like ip_unaligned_aio lock, serialize the unaligned aio
dio.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Tue, 9 Feb 2016 23:11:41 +0000 (10:11 +1100)]
ocfs2-code-clean-up-for-direct-io-fix
remove unused label
Reported-by: kbuild test robot <fengguang.wu@intel.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Ryan Ding <ryan.ding@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:41 +0000 (10:11 +1100)]
ocfs2: code clean up for direct io
Clean up ocfs2_file_write_iter & ocfs2_prepare_inode_for_write:
* remove append dio check: it will be checked in ocfs2_direct_IO()
* remove file hole check: file hole is supported for now
* remove inline data check: it will be checked in ocfs2_direct_IO()
* remove the full_coherence check when append dio: we will get the inode_lock
in ocfs2_dio_get_block, there is no need to fall back to buffer io to ensure
the coherence semantics.
Now the drop dio procedure is gone. :)
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:40 +0000 (10:11 +1100)]
ocfs2: fix sparse file & data ordering issue in direct io
There are mainly 3 issue in the direct io code path after commit 24c40b329e03 ("ocfs2: implement ocfs2_direct_IO_write"):
* Do not support sparse file.
* Do not support data ordering. eg: when write to a file hole, it will alloc
extent first. If system crashed before io finished, data will corrupt.
* Potential risk when doing aio+dio. The -EIOCBQUEUED return value is likely
to be ignored by ocfs2_direct_IO_write().
To resolve above problems, re-design direct io code with following ideas:
* Use buffer io to fill in holes. And this will make better performance also.
* Clear unwritten after direct write finished. So we can make sure meta data
changes after data write to disk. (Unwritten extent is invisible to user,
from user's view, meta data is not changed when allocate an unwritten
extent.)
* Clear ocfs2_direct_IO_write(). Do all ending work in end_io.
This patch has passed fs,dio,ltp-aiodio.part1,ltp-aiodio.part2,ltp-aiodio.part4
test cases of ltp.
For performance improvement, see following test result:
ocfs2 cluster size 1MB, ocfs2 volume is mounted on /mnt/.
The original way:
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB) copied, 1707.83 s, 2.5 MB/s
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct
16384+0 records in
16384+0 records out 4294967296 bytes (4.3 GB) copied, 582.705 s, 7.4 MB/s
After this patch:
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB) copied, 64.6412 s, 66.4 MB/s
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct
16384+0 records in
16384+0 records out 4294967296 bytes (4.3 GB) copied, 34.7611 s, 124 MB/s
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:40 +0000 (10:11 +1100)]
ocfs2: record UNWRITTEN extents when populate write desc
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
There is still one issue in the direct write procedure.
phase 1: alloc extent with UNWRITTEN flag
phase 2: submit direct data to disk, add zero page to page cache
phase 3: clear UNWRITTEN flag when data has been written to disk
When there are 2 direct write A(0~3KB),B(4~7KB) writing to the same
cluster 0~7KB (cluster size 8KB). Write request A arrive phase 2 first,
it will zero the region (4~7KB). Before request A enter to phase 3,
request B arrive phase 2, it will zero region (0~3KB). This is just like
request B steps request A.
To resolve this issue, we should let request B knows this cluster is already
under zero, to prevent it from steps the previous write request.
This patch will add function ocfs2_unwritten_check() to do this job. It
will record all clusters that are under direct write(it will be recorded
in the 'ip_unwritten_list' member of inode info), and prevent the later
direct write writing to the same cluster to do the zero work again.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:39 +0000 (10:11 +1100)]
ocfs2: return the physical address in ocfs2_write_cluster
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
Direct io needs to get the physical address from write_begin, to map the
user page. This patch is to change the arg 'phys' of ocfs2_write_cluster
to a pointer, so it can be retrieved to write_begin. And we can retrieve
it to the direct io procedure.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:39 +0000 (10:11 +1100)]
ocfs2: do not change i_size in write_end for direct io
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
Append direct io do not change i_size in get block phase. It only move to
orphan when starting write. After data is written to disk, it will delete
itself from orphan and update i_size. So skip i_size change section in
write_begin for direct io.
And when there is no extents alloc, no meta data changes needed for direct
io (since write_begin start trans for 2 reason: alloc extents & change
i_size. Now none of them needed). So we can skip start trans procedure.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:39 +0000 (10:11 +1100)]
ocfs2: test target page before change it
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
Direct io data will not appear in buffer. The w_target_page member will
not be filled by direct io. So avoid to use it when it's NULL. Unlinke
buffer io and mmap, direct io will call write_begin with more than 1 page
a time. So the target_index is not sufficient to describe the actual
data. change it to a range start at target_index, end in end_index.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:38 +0000 (10:11 +1100)]
ocfs2: use c_new to indicate newly allocated extents
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
There is a problem in ocfs2's direct io implement: if system crashed after
extents allocated, and before data return, we will get a extent with dirty
data on disk. This problem violate the journal=order semantics, which
means meta changes take effect after data written to disk. To resolve
this issue, direct write can use the UNWRITTEN flag to describe a extent
during direct data writeback. The direct write procedure should act in
the following order:
phase 1: alloc extent with UNWRITTEN flag
phase 2: submit direct data to disk, add zero page to page cache
phase 3: clear UNWRITTEN flag when data has been written to disk
This patch is to change the 'c_unwritten' member of
ocfs2_write_cluster_desc to 'c_clear_unwritten'. Means whether to clear
the unwritten flag. It do not care if a extent is allocated or not. And
use 'c_new' to specify a newly allocated extent. So the direct io
procedure can use c_clear_unwritten to control the UNWRITTEN bit on
extent.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:38 +0000 (10:11 +1100)]
ocfs2: add ocfs2_write_type_t type to identify the caller of write
Patchset: fix ocfs2 direct io code patch to support sparse file and data
ordering semantics
The idea is to use buffer io(more precisely use the interface
ocfs2_write_begin_nolock & ocfs2_write_end_nolock) to do the zero work
beyond block size. And clear UNWRITTEN flag until direct io data has been
written to disk, which can prevent data corruption when system crashed
during direct write.
And we will also archive a better performance:
eg. dd direct write new file with block size 4KB:
before this patchset:
2.5 MB/s
after this patchset:
66.4 MB/s
This patch (of 8):
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
Remove unused args filp & flags. Add new arg type. The type is one of
buffer/direct/mmap. Indicate 3 way to perform write. buffer/mmap type
has implemented. direct type will be implemented later.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
xuejiufei [Tue, 9 Feb 2016 23:11:37 +0000 (10:11 +1100)]
ocfs2/dlm: return EINVAL when the lockres on migration target is in DROPPING_REF state
If master migrate this lock resource to node when it happened to purge it,
a new lock resource will be created and inserted into hash list. If then
master goes down, the lock resource being purged is recovered, so there
exist two lock resource with different owner. So return error to master
if the lock resource is in DROPPING state, master will retry to migrate
this lock resource.
Signed-off-by: xuejiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
xuejiufei [Tue, 9 Feb 2016 23:11:37 +0000 (10:11 +1100)]
ocfs2/dlm: clear DROPPING_REF flag when the master goes down
If the master goes down after return in-progress for deref message. The
lock resource on non-master node can not be purged. Clear the
DROPPING_REF flag and recovery it.
Signed-off-by: xuejiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
xuejiufei [Tue, 9 Feb 2016 23:11:37 +0000 (10:11 +1100)]
ocfs2/dlm: return in progress if master can not clear the refmap bit right now
Master returns in-progress to non-master node when it can not clear the
refmap bit right now. And non-master node will not purge the lock
resource until receiving deref done message.
Signed-off-by: xuejiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
dlm_purge_lockres
dlm_deref_handler
-> find lock resource is in
DLM_LOCK_RES_SETREF_INPROG state,
so dispatch a deref work
dlm_purge_lockres succeed.
call dlmlock again
dlm_do_master_request
dlm_master_request_handler
-> dlm_lockres_set_refmap_bit
deref work trigger, call
dlm_lockres_clear_refmap_bit
to clear Node 1 from refmap
dlm_purge_lockres succeed
dlm_send_remote_lock_request
return DLM_IVLOCKID because
the lockres is not exist
BUG if the lockres is $RECOVERY
This series of patches add a new message to keep the order of set and
clear. Other nodes can purge the lock resource only after the refmap bit
on master is cleared.
This patch is to add DEREF_DONE message and corresponding handler. Node
can purge the lock resource after receiving this message. As a new
message is added, so increase the minor number of dlm protocol version.
Signed-off-by: xuejiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joseph Qi [Tue, 9 Feb 2016 23:11:36 +0000 (10:11 +1100)]
ocfs2/dlm: fix a typo in dlmcommon.h
Refer to cluster/tcp.h, NET_MAX_PAYLOAD_BYTES is a typo for
O2NET_MAX_PAYLOAD_BYTES.
Since currently DLM_MIG_LOCKRES_RESERVED is not actually used, it won't
cause any problem. But we'd better correct it for further use.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
jiangyiwen [Tue, 9 Feb 2016 23:11:35 +0000 (10:11 +1100)]
ocfs2: use spinlock_irqsave() to downconvert lock in ocfs2_osb_dump()
commit a75e9ccabd92 ("ocfs2: use spinlock irqsave for downconvert lock")
missed an unmodified place in ocfs2_osb_dump(), so it still exists a
deadlock scenario.
This patch still uses spin_lock_irqsave() - replace spin_lock() to solve
this situation.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
jiangyiwen [Tue, 9 Feb 2016 23:11:35 +0000 (10:11 +1100)]
ocfs2/cluster: replace the interrupt safe spinlocks with common ones
There actually no hardware or software interrupts in the context
which using o2hb_live_lock, so we don't need to worry about race
conditions caused by irq/softirq with spinlock held. Turning off
irq is not good for system performance after all. Just replace
them with a non interrupt safe function.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
fs/ext4/fsync.c: generic_file_fsync call based on barrier flag
generic_file_fsync has been updated to issue a flush for older
filesystems.
This patch tests for barrier flag in ext4 mount flags and calls the right
function.
Lukas said:
: Note that the actual generic_file_fsync change fixes a real bug in ext4
: where we would _not_ send a flush on sync if we have file system
: without journal.
:
: Ted, it would be useful to mention that in the commit description
: along with the commit id:
:
: ac13a829f6adb674015ab399594c089990104af7 fs/libfs.c: add generic
: data flush to fsync
Signed-off-by: Fabian Frederick <fabf@skynet.be> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sudip Mukherjee [Tue, 9 Feb 2016 23:11:34 +0000 (10:11 +1100)]
m32r: mm: fix build warning
While building we are getting warnings:
arch/m32r/mm/init.c:63:17: warning: unused variable 'low'
arch/m32r/mm/init.c:62:17: warning: unused variable 'max_dma'
max_dma and low are only used if CONFIG_MMU is defined. Lets declare the
variables inside the #ifdef.
Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
VM_DATA_DEFAULT_FLAGS uses READ_IMPLIES_EXEC, so page.h should include
personality.h to provide this.
This is needed for "mm: warn about VmData over RLIMIT_DATA".
Cc: Russell King <linux@arm.linux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dmitry Safonov [Tue, 9 Feb 2016 23:11:33 +0000 (10:11 +1100)]
mm: slab: free kmem_cache_node after destroy sysfs file
v8: reintroduce locking in free_partial & nits from Vladimir
v9: zapped __remove_partial and spin_lock_irq instead of spin_lock_irqsave
(added BUG_ON(irqs_disabled()) to be sure)
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dmitry Safonov [Tue, 9 Feb 2016 23:11:32 +0000 (10:11 +1100)]
mm: slab: free kmem_cache_node after destroy sysfs file
When slub_debug alloc_calls_show is enabled we will try to track location
and user of slab object on each online node, kmem_cache_node structure and
cpu_cache/cpu_slub shouldn't be freed till there is the last reference to
sysfs file.
Separated __kmem_cache_release from __kmem_cache_shutdown which now called
on slab_kmem_cache_release (after the last reference to sysfs file object
has dropped).
Reintroduced locking in free_partial as sysfs file might access cache's
partial list after shutdowning - partiall revert of the commit 69cb8e6b7c2982 ("slub: free slabs without holding locks") Zap
__remove_partial and use remove_partial (w/o underscores) as free_partial
now takes list_lock which s partial revert for commit 1e4dd9461fabfb
("slub: do not assert not having lock in removing freed partial")
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com> Suggested-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Tue, 9 Feb 2016 23:11:32 +0000 (10:11 +1100)]
kernel/locking/lockdep.c: convert hash tables to hlists
Mike said:
: CONFIG_UBSAN_ALIGNMENT breaks x86-64 kernel with lockdep enabled, i. e
: kernel with CONFIG_UBSAN_ALIGNMENT fails to load without even any error
: message.
:
: The problem is that ubsan callbacks use spinlocks and might be called
: before lockdep is initialized. Particularly this line in the
: reserve_ebda_region function causes problem:
:
: lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
:
: If i put lockdep_init() before reserve_ebda_region call in
: x86_64_start_reservations kernel loads well.
Fix this ordering issue permanently: change lockdep so that it uses hlists
for the hash tables. Unlike a list_head, an hlist_head is in its
initialized state when it is all-zeroes, so lockdep is ready for operation
immediately upon boot - lockdep_init() need not have run.
The patch will also save some memory.
lockdep_init() and lockdep_initialized can be done away with now - a 4.6
patch has been prepared to do this.
Reported-by: Mike Krinkin <krinkin.m.u@gmail.com> Suggested-by: Mike Krinkin <krinkin.m.u@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ipc/shm: handle removed segments gracefully in shm_mmap()
remap_file_pages(2) emulation can reach file which represents removed IPC
ID as long as a memory segment is mapped. It breaks expectations of IPC
subsystem.
Test case (rewritten to be more human readable, originally autogenerated
by syzkaller[1]):
We need to use post-decrement to get percpu_counter_destroy() called on
&wb->stat[0]. Moreover, the pre-decremebt would cause infinite
out-of-bounds accesses if the setup code failed at i==0.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm, dax: check for pmd_none() after split_huge_pmd()
DAX implements split_huge_pmd() by clearing pmd. This simple approach
reduces memory overhead, as we don't need to deposit page table on huge
page mapping to make split_huge_pmd() never-fail. PTE table can be
allocated and populated later on page fault from backing store.
But one side effect is that have to check if pmd is pmd_none() after
split_huge_pmd(). In most places we do this already to deal with parallel
MADV_DONTNEED.
But I found two call sites which is not affected by MADV_DONTNEED (due
down_write(mmap_sem)), but need to have the check to work with DAX
properly.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kptr_restrict flag, when set to 1, only prints the kernel address when
the user has CAP_SYSLOG. When it is set to 2, the kernel address is
always printed as zero. When set to 1, this needs to check whether or not
we're in IRQ. However, when set to 2, this check is unneccessary, and
produces confusing results in dmesg. Thus, only make sure we're not in
IRQ when mode 1 is used, but not mode 2.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yang Shi [Tue, 9 Feb 2016 23:11:28 +0000 (10:11 +1100)]
ubsan: cosmetic fix to Kconfig text
When enabling UBSAN_SANITIZE_ALL, the kernel image size gets increased
significantly (~3x). So, it sounds better to have some note in Kconfig.
And, fixed a typo.
Signed-off-by: Yang Shi <yang.shi@linaro.org> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Linus Torvalds [Mon, 8 Feb 2016 18:32:30 +0000 (10:32 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"KVM-ARM fixes, mostly coming from the PMU work"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
arm64: KVM: Fix guest dead loop when register accessor returns false
arm64: KVM: Fix comments of the CP handler
arm64: KVM: Fix wrong use of the CPSR MODE mask for 32bit guests
arm64: KVM: Obey RES0/1 reserved bits when setting CPTR_EL2
arm64: KVM: Fix AArch64 guest userspace exception injection
Linus Torvalds [Mon, 8 Feb 2016 18:20:06 +0000 (10:20 -0800)]
Merge tag 'regmap-fix-v4.5-big-endian' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap fix from Mark Brown:
"A single revert back to v4.4 endianness handling.
Commit 29bb45f25ff3 ("regmap-mmio: Use native endianness for
read/write") attempted to fix some long standing bugs in the MMIO
implementation for big endian systems caused by duplicate byte
swapping in both regmap and readl()/writel(). Sadly the fix makes
things worse rather than better, so revert it for now"
* tag 'regmap-fix-v4.5-big-endian' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: mmio: Revert to v4.4 endianness handling
Paolo Bonzini [Mon, 8 Feb 2016 15:20:51 +0000 (16:20 +0100)]
Merge tag 'kvm-arm-for-4.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
KVM/ARM fixes for v4.5-rc2
A few random fixes, mostly coming from the PMU work by Shannon:
- fix for injecting faults coming from the guest's userspace
- cleanup for our CPTR_EL2 accessors (reserved bits)
- fix for a bug impacting perf (user/kernel discrimination)
- fix for a 32bit sysreg handling bug
Linus Torvalds [Sun, 7 Feb 2016 23:23:20 +0000 (15:23 -0800)]
Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
Pull ARM SoC fixes from Olof Johansson:
"The first real batch of fixes for this release cycle, so there are a
few more than usual.
Most of these are fixes and tweaks to board support (DT bugfixes,
etc). I've also picked up a couple of small cleanups that seemed
innocent enough that there was little reason to wait (const/
__initconst and Kconfig deps).
Quite a bit of the changes on OMAP were due to fixes to no longer
write to rodata from assembly when ARM_KERNMEM_PERMS was enabled, but
there were also other fixes.
Kirkwood had a bunch of gpio fixes for some boards. OMAP had RTC
fixes on OMAP5, and Nomadik had changes to MMC parameters in DT.
All in all, mostly the usual mix of various fixes"
* tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (46 commits)
ARM: multi_v7_defconfig: enable DW_WATCHDOG
ARM: nomadik: fix up SD/MMC DT settings
ARM64: tegra: Add chosen node for tegra132 norrin
ARM: realview: use "depends on" instead of "if" after prompt
ARM: tango: use "depends on" instead of "if" after prompt
ARM: tango: use const and __initconst for smp_operations
ARM: realview: use const and __initconst for smp_operations
bus: uniphier-system-bus: revive tristate prompt
arm64: dts: Add missing DMA Abort interrupt to Juno
bus: vexpress-config: Add missing of_node_put
ARM: dts: am57xx: sbc-am57x: correct Eth PHY settings
ARM: dts: am57xx: cl-som-am57x: fix CPSW EMAC pinmux
ARM: dts: am57xx: sbc-am57x: fix UART3 pinmux
ARM: dts: am57xx: cl-som-am57x: update SPI Flash frequency
ARM: dts: am57xx: cl-som-am57x: set HOST mode for USB2
ARM: dts: am57xx: sbc-am57x: fix SB-SOM EEPROM I2C address
ARM: dts: LogicPD Torpedo: Revert Duplicative Entries
ARM: dts: am437x: pixcir_tangoc: use correct flags for irq types
ARM: dts: am4372: fix irq type for arm twd and global timer
ARM: dts: at91: sama5d4 xplained: fix phy0 IRQ type
...
Linus Torvalds [Sun, 7 Feb 2016 06:14:46 +0000 (22:14 -0800)]
Merge tag 'usb-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some USB fixes for 4.5-rc3.
The usual, xhci fixes for reported issues, combined with some small
gadget driver fixes, and a MAINTAINERS file update. All have been in
linux-next with no reported issues"
* tag 'usb-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
xhci: harden xhci_find_next_ext_cap against device removal
xhci: Fix list corruption in urb dequeue at host removal
usb: host: xhci-plat: fix NULL pointer in probe for device tree case
usb: xhci-mtk: fix AHB bus hang up caused by roothubs polling
usb: xhci-mtk: fix bpkts value of LS/HS periodic eps not behind TT
usb: xhci: apply XHCI_PME_STUCK_QUIRK to Intel Broxton-M platforms
usb: xhci: set SSIC port unused only if xhci_suspend succeeds
usb: xhci: add a quirk bit for ssic port unused
usb: xhci: handle both SSIC ports in PME stuck quirk
usb: dwc3: gadget: set the OTG flag in dwc3 gadget driver.
Revert "xhci: don't finish a TD if we get a short-transfer event mid TD"
MAINTAINERS: fix my email address
usb: dwc2: Fix probe problem on bcm2835
Revert "usb: dwc2: Move reset into dwc2_get_hwparams()"
usb: musb: ux500: Fix NULL pointer dereference at system PM
usb: phy: mxs: declare variable with initialized value
usb: phy: msm: fix error handling in probe.
Linus Torvalds [Sun, 7 Feb 2016 06:13:16 +0000 (22:13 -0800)]
Merge tag 'staging-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging and IIO driver fixes from Greg KH:
"Here are some IIO and staging driver fixes for 4.5-rc3.
All of them, except one, are for IIO drivers, and one is for a speakup
driver fix caused by some earlier patches, to resolve a reported build
failure"
* tag 'staging-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
Staging: speakup: Fix allyesconfig build on mn10300
iio: dht11: Use boottime
iio: ade7753: avoid uninitialized data
iio: pressure: mpl115: fix temperature offset sign
iio: imu: Fix dependencies for !HAS_IOMEM archs
staging: iio: Fix dependencies for !HAS_IOMEM archs
iio: adc: Fix dependencies for !HAS_IOMEM archs
iio: inkern: fix a NULL dereference on error
iio:adc:ti_am335x_adc Fix buffered mode by identifying as software buffer.
iio: light: acpi-als: Report data as processed
iio: dac: mcp4725: set iio name property in sysfs
iio: add HAS_IOMEM dependency to VF610_ADC
iio: add IIO_TRIGGER dependency to STK8BA50
iio: proximity: lidar: correct return value
iio-light: Use a signed return type for ltr501_match_samp_freq()
Linus Torvalds [Sat, 6 Feb 2016 04:20:07 +0000 (20:20 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
"22 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT
radix-tree: fix oops after radix_tree_iter_retry
MAINTAINERS: trim the file triggers for ABI/API
dax: dirty inode only if required
thp: make deferred_split_scan() work again
mm: replace vma_lock_anon_vma with anon_vma_lock_read/write
ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup
um: asm/page.h: remove the pte_high member from struct pte_t
mm, hugetlb: don't require CMA for runtime gigantic pages
mm/hugetlb: fix gigantic page initialization/allocation
mm: downgrade VM_BUG in isolate_lru_page() to warning
mempolicy: do not try to queue pages from !vma_migratable()
mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress
vmstat: make vmstat_update deferrable
mm, vmstat: make quiet_vmstat lighter
mm/Kconfig: correct description of DEFERRED_STRUCT_PAGE_INIT
memblock: don't mark memblock_phys_mem_size() as __init
dump_stack: avoid potential deadlocks
mm: validate_mm browse_rb SMP race condition
m32r: fix build failure due to SMP and MMU
...
Linus Torvalds [Sat, 6 Feb 2016 03:52:57 +0000 (19:52 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull Ceph fixes from Sage Weil:
"We have a few wire protocol compatibility fixes, ports of a few recent
CRUSH mapping changes, and a couple error path fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
libceph: MOSDOpReply v7 encoding
libceph: advertise support for TUNABLES5
crush: decode and initialize chooseleaf_stable
crush: add chooseleaf_stable tunable
crush: ensure take bucket value is valid
crush: ensure bucket id is valid before indexing buckets array
ceph: fix snap context leak in error path
ceph: checking for IS_ERR instead of NULL
Linus Torvalds [Sat, 6 Feb 2016 03:38:15 +0000 (19:38 -0800)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Fixes all over the place:
- amdkfd: two static checker fixes
- mst: a bunch of static checker and spec/hw interaction fixes
- amdgpu: fix Iceland hw properly, and some fiji bugs, along with
some write-combining fixes.
- exynos: some regression fixes
- adv7511: fix some EDID reading issues"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux: (38 commits)
drm/dp/mst: deallocate payload on port destruction
drm/dp/mst: Reverse order of MST enable and clearing VC payload table.
drm/dp/mst: move GUID storage from mgr, port to only mst branch
drm/dp/mst: change MST detection scheme
drm/dp/mst: Calculate MST PBN with 31.32 fixed point
drm: Add drm_fixp_from_fraction and drm_fixp2int_ceil
drm/mst: Add range check for max_payloads during init
drm/mst: Don't ignore the MST PBN self-test result
drm: fix missing reference counting decrease
drm/amdgpu: disable uvd and vce clockgating on Fiji
drm/amdgpu: remove exp hardware support from iceland
drm/amdgpu: load MEC ucode manually on iceland
drm/amdgpu: don't load MEC2 on topaz
drm/amdgpu: drop topaz support from gmc8 module
drm/amdgpu: pull topaz gmc bits into gmc_v7
drm/amdgpu: The VI specific EXE bit should only apply to GMC v8.0 above
drm/amdgpu: iceland use CI based MC IP
drm/amdgpu: move gmc7 support out of CIK dependency
drm/amdgpu/gfx7: enable cp inst/reg error interrupts
drm/amdgpu/gfx8: enable cp inst/reg error interrupts
...
Linus Torvalds [Sat, 6 Feb 2016 02:11:23 +0000 (18:11 -0800)]
Merge tag 'pm+acpi-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI fixes from Rafael Wysocki:
"These are: a fix for a recently introduced false-positive warnings
about PM domain pointers being changed inappropriately (harmless but
annoying), an MCH size workaround quirk for one more platform, a
compiler warning fix (generic power domains framework), an ACPI LPSS
(Intel SoCs) driver fixup and a cleanup of the ACPI CPPC core code.
Specifics:
- PM core fix to avoid false-positive warnings generated when the
pm_domain field is cleared for a device that appears to be bound to
a driver (Rafael Wysocki).
- New MCH size workaround quirk for Intel Haswell-ULT (Josh Boyer).
- Fix for an "unused function" compiler warning in the generic power
domains framework (Ulf Hansson).
- Fixup for the ACPI driver for Intel SoCs (acpi-lpss) to set the PM
domain pointer of a device properly in one place that was
overlooked by a recent PM core update (Andy Shevchenko).
- Removal of a redundant function declaration in the ACPI CPPC core
code (Timur Tabi)"
* tag 'pm+acpi-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: Avoid false-positive warnings in dev_pm_domain_set()
PM / Domains: Silence compiler warning for an unused function
ACPI / CPPC: remove redundant mbox_send_message() declaration
ACPI / LPSS: set PM domain via helper setter
PNP: Add Haswell-ULT to Intel MCH size workaround
Jason Baron [Fri, 5 Feb 2016 23:37:04 +0000 (15:37 -0800)]
epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUT
In the current implementation of the EPOLLEXCLUSIVE flag (added for
4.5-rc1), if epoll waiters create different POLL* sets and register them
as exclusive against the same target fd, the current implementation will
stop waking any further waiters once it finds the first idle waiter.
This means that waiters could miss wakeups in certain cases.
For example, when we wake up a pipe for reading we do:
wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM); So if
one epoll set or epfd is added to pipe p with POLLIN and a second set
epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the
wakeup since the current implementation will stop after it finds any
intersection of events with a waiter that is blocked in epoll_wait().
We could potentially address this by requiring all epoll waiters that
are added to p be required to pass the same set of POLL* events. IE the
first EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL*
flags to be used by any other epfds that are added as EPOLLEXCLUSIVE.
However, I think it might be somewhat confusing interface as we would
have to reference count the number of users for that set, and so
userspace would have to keep track of that count, or we would need a
more involved interface. It also adds some shared state that we'd have
store somewhere. I don't think anybody will want to bloat
__wait_queue_head for this.
I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE
such that it can only be specified with EPOLLIN and/or EPOLLOUT. So
that way if the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop
once we hit the first idle waiter that specifies the EPOLLIN bit, since
any remaining waiters that only have 'POLLOUT' set wouldn't need to be
woken. Likewise, we can do the same thing if 'POLLOUT' is in the wakeup
bit set and not 'POLLIN'. If both 'POLLOUT' and 'POLLIN' are set in the
wake bit set (there is at least one example of this I saw in fs/pipe.c),
then we just wake the entire exclusive list. Having both 'POLLOUT' and
'POLLIN' both set should not be on any performance critical path, so I
think that's ok (in fs/pipe.c its in pipe_release()). We also continue
to include EPOLLERR and EPOLLHUP by default in any exclusive set. Thus,
the user can specify EPOLLERR and/or EPOLLHUP but is not required to do
so.
Since epoll waiters may be interested in other events as well besides
EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by
doing a 'dup' call on the target fd and adding that as one normally
would with EPOLL_CTL_ADD. Since I think that the POLLIN and POLLOUT
events are what we are interest in balancing, I think that the 'dup'
thing could perhaps be added to only one of the waiter threads.
However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should be
sufficient for the majority of use-cases.
Since EPOLLEXCLUSIVE is intended to be used with a target fd shared
among multiple epfds, where between 1 and n of the epfds may receive an
event, it does not satisfy the semantics of EPOLLONESHOT where only 1
epfd would get an event. Thus, it is not allowed to be specified in
conjunction with EPOLLEXCLUSIVE.
EPOLL_CTL_MOD is also not allowed if the fd was previously added as
EPOLLEXCLUSIVE. It seems with the limited number of flags to not be as
interesting, but this could be relaxed at some further point.
Signed-off-by: Jason Baron <jbaron@akamai.com> Tested-by: Madars Vitolins <m@silodev.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Eric Wong <normalperson@yhbt.net> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Helper radix_tree_iter_retry() resets next_index to the current index.
In following radix_tree_next_slot current chunk size becomes zero. This
isn't checked and it tries to dereference null pointer in slot.
Tagged iterator is fine because retry happens only at slot 0 where tag
bitmask in iter->tags is filled with single bit.
Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup") Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ohad Ben-Cohen <ohad@wizery.com> Cc: Jeremiah Mahler <jmmahler@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit ea8f8fc8631 ("MAINTAINERS: add linux-api for review of API/ABI
changes") added file triggers for various paths that likely indicated
API/ABI changes. However, catching all changes in Documentation/ABI/
and include/uapi/ produces a large volume of mail to linux-api, rather
than only API/ABI changes. Drop those two entries, but leave
include/linux/syscalls.h and kernel/sys_ni.c to catch syscall-related
changes.
Dmitry Monakhov [Fri, 5 Feb 2016 23:36:55 +0000 (15:36 -0800)]
dax: dirty inode only if required
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need to iterate over split_queue, not local empty list to get
anything split from the shrinker.
Fixes: e3ae19535c66 ("thp: limit number of object to scan on deferred_split_scan()") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: replace vma_lock_anon_vma with anon_vma_lock_read/write
Sequence vma_lock_anon_vma() - vma_unlock_anon_vma() isn't safe if
anon_vma appeared between lock and unlock. We have to check anon_vma
first or call anon_vma_prepare() to be sure that it's here. There are
only few users of these legacy helpers. Let's get rid of them.
This patch fixes anon_vma lock imbalance in validate_mm(). Write lock
isn't required here, read lock is enough.
And reorders expand_downwards/expand_upwards: security_mmap_addr() and
wrapping-around check don't have to be under anon vma lock.
Link: https://lkml.kernel.org/r/CACT4Y+Y908EjM2z=706dv4rV6dWtxTLK9nFg9_7DhRMLppBo2g@mail.gmail.com Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xuejiufei [Fri, 5 Feb 2016 23:36:47 +0000 (15:36 -0800)]
ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanup
When recovery master down, dlm_do_local_recovery_cleanup() only remove
the $RECOVERY lock owned by dead node, but do not clear the refmap bit.
Which will make umount thread falling in dead loop migrating $RECOVERY
to the dead node.
Signed-off-by: xuejiufei <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nicolai Stange [Fri, 5 Feb 2016 23:36:44 +0000 (15:36 -0800)]
um: asm/page.h: remove the pte_high member from struct pte_t
Commit 16da306849d0 ("um: kill pfn_t") introduced a compile warning for
defconfig (SUBARCH=i386):
arch/um/kernel/skas/mmu.c:38:206:
warning: right shift count >= width of type [-Wshift-count-overflow]
Aforementioned patch changes the definition of the phys_to_pfn() macro
from
((pfn_t) ((p) >> PAGE_SHIFT))
to
((p) >> PAGE_SHIFT)
This effectively changes the phys_to_pfn() expansion's type from
unsigned long long to unsigned long.
Through the callchain init_stub_pte() => mk_pte(), the expansion of
phys_to_pfn() is (indirectly) fed into the 'phys' argument of the
pte_set_val(pte, phys, prot) macro, eventually leading to
(pte).pte_high = (phys) >> 32;
This results in the warning from above.
Since UML only deals with 32 bit addresses, the upper 32 bits from
'phys' used to be always zero anyway. Also, all page protection flags
defined by UML don't use any bits beyond bit 9. Since the contents of a
PTE are defined within architecture scope only, the ->pte_high member
can be safely removed.
Remove the ->pte_high member from struct pte_t.
Rename ->pte_low to ->pte.
Adapt the pte helper macros in arch/um/include/asm/page.h.
Noteworthy is the pte_copy() macro where a smp_wmb() gets dropped. This
write barrier doesn't seem to be paired with any read barrier though and
thus, was useless anyway.
Fixes: 16da306849d0 ("um: kill pfn_t") Signed-off-by: Nicolai Stange <nicstange@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Richard Weinberger <richard@nod.at> Cc: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Fri, 5 Feb 2016 23:36:41 +0000 (15:36 -0800)]
mm, hugetlb: don't require CMA for runtime gigantic pages
Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
at runtime") has added the runtime gigantic page allocation via
alloc_contig_range(), making this support available only when CONFIG_CMA
is enabled. Because it doesn't depend on MIGRATE_CMA pageblocks and the
associated infrastructure, it is possible with few simple adjustments to
require only CONFIG_MEMORY_ISOLATION instead of full CONFIG_CMA.
After this patch, alloc_contig_range() and related functions are
available and used for gigantic pages with just CONFIG_MEMORY_ISOLATION
enabled. Note CONFIG_CMA selects CONFIG_MEMORY_ISOLATION. This allows
supporting runtime gigantic pages without the CMA-specific checks in
page allocator fastpaths.
Attempting to preallocate 1G gigantic huge pages at boot time with
"hugepagesz=1G hugepages=1" on the kernel command line will prevent
booting with the following:
kernel BUG at mm/hugetlb.c:1218!
When mapcount accounting was reworked, the setting of
compound_mapcount_ptr in prep_compound_gigantic_page was overlooked. As
a result, the validation of mapcount in free_huge_page fails.
The "BUG_ON" checks in free_huge_page were also changed to
"VM_BUG_ON_PAGE" to assist with debugging.
Fixes: 53f9263baba69 ("mm: rework mapcount accounting to enable 4k mapping of THPs") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: David Rientjes <rientjes@google.com> Tested-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The only case when we can queue pages from such VMA is MPOL_MF_STRICT
plus MPOL_MF_MOVE or MPOL_MF_MOVE_ALL for VMA which has pages on LRU,
but gfp mask is not sutable for migaration (see mapping_gfp_mask() check
in vma_migratable()). That's looks like a bug to me.
Let's filter out non-migratable vma at start of queue_pages_test_walk()
and go to queue_pages_pte_range() only if MPOL_MF_MOVE or
MPOL_MF_MOVE_ALL flag is set.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Tetsuo Handa [Fri, 5 Feb 2016 23:36:30 +0000 (15:36 -0800)]
mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progress
Jan Stancek has reported that system occasionally hanging after "oom01"
testcase from LTP triggers OOM. Guessing from a result that there is a
kworker thread doing memory allocation and the values between "Node 0
Normal free:" and "Node 0 Normal:" differs when hanging, vmstat is not
up-to-date for some reason.
According to commit 373ccbe59270 ("mm, vmstat: allow WQ concurrency to
discover memory reclaim doesn't make any progress"), it meant to force
the kworker thread to take a short sleep, but it by error used
schedule_timeout(1). We missed that schedule_timeout() in state
TASK_RUNNING doesn't do anything.
Fix it by using schedule_timeout_uninterruptible(1) which forces the
kworker thread to take a short sleep in order to make sure that vmstat
is up-to-date.
Fixes: 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Jan Stancek <jstancek@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Cristopher Lameter <clameter@sgi.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Arkadiusz Miskiewicz <arekm@maven.pl> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 5 Feb 2016 23:36:27 +0000 (15:36 -0800)]
vmstat: make vmstat_update deferrable
Commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again and
shut down on idle") made vmstat_shepherd deferrable. vmstat_update
itself is still useing standard timer which might interrupt idle task.
This is possible because "mm, vmstat: make quiet_vmstat lighter" removed
cancel_delayed_work from the quiet_vmstat.
Change vmstat_work to use DEFERRABLE_WORK to prevent from pointless
wakeups from the idle context.
Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is caused by commit 0eb77e988032 ("vmstat: make vmstat_updater
deferrable again and shut down on idle") which has placed quiet_vmstat
into cpu_idle_loop. The main reason here seems to be that the idle
entry has to get over all zones and perform atomic operations for each
vmstat entry even though there might be no per cpu diffs. This is a
pointless overhead for _each_ idle entry.
Make sure that quiet_vmstat is as light as possible.
First of all it doesn't make any sense to do any local sync if the
current cpu is already set in oncpu_stat_off because vmstat_update puts
itself there only if there is nothing to do.
Then we can check need_update which should be a cheap way to check for
potential per-cpu diffs and only then do refresh_cpu_vm_stats.
The original patch also did cancel_delayed_work which we are not doing
here. There are two reasons for that. Firstly cancel_delayed_work from
idle context will blow up on RT kernels (reported by Mike):
And secondly, even on !RT kernels it might add some non trivial overhead
which is not necessary. Even if the vmstat worker wakes up and preempts
idle then it will be most likely a single shot noop because the stats
were already synced and so it would end up on the oncpu_stat_off anyway.
We just need to teach both vmstat_shepherd and vmstat_update to stop
scheduling the worker if there is nothing to do.
[mgalbraith@suse.de: cancel pending work of the cpu_stat_off CPU] Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Gibson [Fri, 5 Feb 2016 23:36:19 +0000 (15:36 -0800)]
memblock: don't mark memblock_phys_mem_size() as __init
At the moment memblock_phys_mem_size() is marked as __init, and so is
discarded after boot. This is different from most of the memblock
functions which are marked __init_memblock, and are only discarded after
boot if memory hotplug is not configured.
To allow for upcoming code which will need memblock_phys_mem_size() in
the hotplug path, change it from __init to __init_memblock.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The mmap_sem for reading in validate_mm called from expand_stack is not
enough to prevent the argumented rbtree rb_subtree_gap information to
change from under us because expand_stack may be running from other
threads concurrently which will hold the mmap_sem for reading too.
The argumented rbtree is updated with vma_gap_update under the
page_table_lock so use it in browse_rb() too to avoid false positives.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sudip Mukherjee [Fri, 5 Feb 2016 23:36:10 +0000 (15:36 -0800)]
m32r: fix build failure due to SMP and MMU
One of the randconfig build failed with the error:
arch/m32r/kernel/smp.c: In function 'smp_flush_tlb_mm':
arch/m32r/kernel/smp.c:283:20: error: subscripted value is neither array nor pointer nor vector
mmc = &mm->context[cpu_id];
^
arch/m32r/kernel/smp.c: In function 'smp_flush_tlb_page':
arch/m32r/kernel/smp.c:353:20: error: subscripted value is neither array nor pointer nor vector
mmc = &mm->context[cpu_id];
^
arch/m32r/kernel/smp.c: In function 'smp_invalidate_interrupt':
arch/m32r/kernel/smp.c:479:41: error: subscripted value is neither array nor pointer nor vector
unsigned long *mmc = &flush_mm->context[cpu_id];
It turned out that CONFIG_SMP was defined but CONFIG_MMU was not
defined. But arch/m32r/include/asm/mmu.h only defines mm_context_t as
an array when both CONFIG_SMP and CONFIG_MMU are defined. And
arch/m32r/kernel/smp.c is always using context as an array. So without
MMU SMP can not work.
Ross Zwisler [Fri, 5 Feb 2016 23:36:08 +0000 (15:36 -0800)]
block: fix pfn_mkwrite() DAX fault handler
Previously the pfn_mkwrite() fault handler for raw block devices called
bldev_dax_fault() -> __dax_fault() to do a full DAX page fault.
Really what the pfn_mkwrite() fault handler needs to do is call
dax_pfn_mkwrite() to make sure that the radix tree entry for the given
PTE is marked as dirty so that a follow-up fsync or msync call will
flush it durably to media.
Fixes: 5a023cdba50c ("block: enable dax for raw block devices") Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 5 Feb 2016 20:46:38 +0000 (12:46 -0800)]
Merge tag 'media/v4.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
- vb2: fix a vb2_thread regression and DVB read() breakages
- vsp1: fix compilation and links creation
- s5k6a3: Fix VIDIOC_SUBDEV_G_FMT ioctl for TRY format
- exynos4-is: fix a build issue, format negotiation and sensor detection
- Fix a regression with pvrusb2 and ir-kbd-i2c
- atmel-isi: fix debug message which only show the first format
- tda1004x: fix a tuning bug if G_PROPERTY is called too early
- saa7134-alsa: fix a bug at device unbinding/driver removal
- Fix build of one driver if !HAS_DMA
- soc_camera: cleanup control device on async_unbind
* tag 'media/v4.5-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
[media] saa7134-alsa: Only frees registered sound cards
[media] vb2-core: call threadio->fnc() if !VB2_BUF_STATE_ERROR
[media] vb2: fix nasty vb2_thread regression
[media] tda1004x: only update the frontend properties if locked
[media] media: i2c: Don't export ir-kbd-i2c module alias
[media] exynos4-is: make VIDEO_SAMSUNG_EXYNOS4_IS tristate
[media] media: Kconfig: add dependency of HAS_DMA
[media] exynos4-is: Wait for 100us before opening sensor
[media] exynos4-is: Open shouldn't fail when sensor entity is not linked
[media] s5k6a3: Fix VIDIOC_SUBDEV_G_FMT ioctl for TRY format
[media] exynos4-is: fix a format string bug
[media] drivers/media: vsp1_video: fix compile error
[media] atmel-isi: fix debug message which only show the first format
[media] soc_camera: cleanup control device on async_unbind
[media] v4l: vsp1: Fix wrong entities links creation
Linus Torvalds [Fri, 5 Feb 2016 20:34:44 +0000 (12:34 -0800)]
Merge tag 'sound-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"This was a busy week and I had to prepare a pile of duct tapes for the
bugs reported by syzkaller fuzzer in wide range of ALSA core APIs:
timer, rawmidi, sequencer, and PCM OSS emulation. Let's see how many
other holes we need to plug.
Besides that, a few usual boring stuff, HD- and USB-audio quirks, have
been added"
* tag 'sound-4.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: timer: Fix leftover link at closing
ALSA: seq: Fix lockdep warnings due to double mutex locks
ALSA: rawmidi: Fix race at copying & updating the position
ALSA: rawmidi: Make snd_rawmidi_transmit() race-free
ALSA: hda - Add fixup for Mac Mini 7,1 model
ALSA: hda/realtek - Support headset mode for ALC225
ALSA: hda/realtek - Support Dell headset mode for ALC225
ALSA: hda/realtek - New codec support of ALC225
ALSA: timer: Sync timer deletion at closing the system timer
ALSA: timer: Fix link corruption due to double start or stop
ALSA: seq: Fix yet another races among ALSA timer accesses
ALSA: pcm: Fix potential deadlock in OSS emulation
ALSA: rawmidi: Remove kernel WARNING for NULL user-space buffer check
ALSA: seq: Fix race at closing in virmidi driver
ALSA: emu10k1: correctly handling failed thread creation
ALSA: usb-audio: Add quirk for Microsoft LifeCam HD-6000
ALSA: usb-audio: Add native DSD support for PS Audio NuWave DAC
ALSA: usb-audio: Fix OPPO HA-1 vendor ID
Linus Torvalds [Fri, 5 Feb 2016 19:20:15 +0000 (11:20 -0800)]
Merge git://www.linux-watchdog.org/linux-watchdog
Pull watchdog fixes from Wim Van Sebroeck:
"This fixes several Kconfig dependencies, a compilation warning in
pcwd_usb, a failure to abort the sp805 wdt after a ping and the
max63xx wdt's MODULE_LICENSE"
* git://www.linux-watchdog.org/linux-watchdog:
watchdog: Fix dependencies for !HAS_IOMEM archs
watchdog: imgdpc: select WATCHDOG_CORE
watchdog: tango: rename ARCH_TANGOX to ARCH_TANGO
watchdog: pcwd_usb: fix compilation warning
watchdog: sp805: ping fails to abort wdt reset
watchdog: max63xx: make module's license marker match the header
Mark Brown [Fri, 5 Feb 2016 11:22:04 +0000 (11:22 +0000)]
regmap: mmio: Revert to v4.4 endianness handling
Commit 29bb45f25ff3 (regmap-mmio: Use native endianness for read/write)
attempted to fix some long standing bugs in the MMIO implementation for
big endian systems caused by duplicate byte swapping in both regmap and
readl()/writel() which affected MIPS systems as when they are in big
endian mode they flip the endianness of all registers in the system, not
just the CPU. MIPS systems had worked around this by declaring regmap
using IPs as little endian which is inaccurate, unfortunately the issue
had not been reported.
Sadly the fix makes things worse rather than better. By changing the
behaviour to match the documentation it caused behaviour changes for
other IPs which broke them and by using the __raw I/O accessors to avoid
the endianness swapping in readl()/writel() it removed some memory
ordering guarantees and could potentially generate unvirtualisable
instructions on some architectures.
Unfortunately sorting out all this mess in any half way sensible fashion
was far too invasive to go in during an -rc cycle so instead let's go
back to the old broken behaviour for v4.5, the better fixes are already
queued for v4.6. This does mean that we keep the broken MIPS DTs for
another release but that seems the least bad way of handling the
situation.
Reported-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Mark Brown <broonie@kernel.org>
Dave Airlie [Fri, 5 Feb 2016 05:24:17 +0000 (15:24 +1000)]
Merge branch 'drm-fixes-mst' of git://people.freedesktop.org/~airlied/linux into drm-fixes
displayport multistream fixes from AMD.
* 'drm-fixes-mst' of git://people.freedesktop.org/~airlied/linux:
drm/dp/mst: deallocate payload on port destruction
drm/dp/mst: Reverse order of MST enable and clearing VC payload table.
drm/dp/mst: move GUID storage from mgr, port to only mst branch
drm/dp/mst: change MST detection scheme
drm/dp/mst: Calculate MST PBN with 31.32 fixed point
drm: Add drm_fixp_from_fraction and drm_fixp2int_ceil
drm/mst: Add range check for max_payloads during init
drm/mst: Don't ignore the MST PBN self-test result
drm: fix missing reference counting decrease