Jan Kara [Tue, 13 Dec 2016 02:34:12 +0000 (21:34 -0500)]
dax: Fix sleep in atomic contex in grab_mapping_entry()
Commit 642261ac995e: "dax: add struct iomap based DAX PMD support" has
introduced unmapping of page tables if huge page needs to be split in
grab_mapping_entry(). However the unmapping happens after
radix_tree_preload() call which disables preemption and thus
unmap_mapping_range() tries to acquire i_mmap_lock in atomic context
which is a bug. Fix the problem by moving unmapping before
radix_tree_preload() call.
Fixes: 642261ac995e01d7837db1f4b90181496f7e6835 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Sergey Karamov [Sat, 10 Dec 2016 22:54:58 +0000 (17:54 -0500)]
ext4: do not perform data journaling when data is encrypted
Currently data journalling is incompatible with encryption: enabling both
at the same time has never been supported by design, and would result in
unpredictable behavior. However, users are not precluded from turning on
both features simultaneously. This change programmatically replaces data
journaling for encrypted regular files with ordered data journaling mode.
Background:
Journaling encrypted data has not been supported because it operates on
buffer heads of the page in the page cache. Namely, when the commit
happens, which could be up to five seconds after caching, the commit
thread uses the buffer heads attached to the page to copy the contents of
the page to the journal. With encryption, it would have been required to
keep the bounce buffer with ciphertext for up to the aforementioned five
seconds, since the page cache can only hold plaintext and could not be
used for journaling. Alternatively, it would be required to setup the
journal to initiate a callback at the commit time to perform deferred
encryption - in this case, not only would the data have to be written
twice, but it would also have to be encrypted twice. This level of
complexity was not justified for a mode that in practice is very rarely
used because of the overhead from the data journalling.
Solution:
If data=journaled has been set as a mount option for a filesystem, or if
journaling is enabled on a regular file, do not perform journaling if the
file is also encrypted, instead fall back to the data=ordered mode for the
file.
Rationale:
The intent is to allow seamless and proper filesystem operation when
journaling and encryption have both been enabled, and have these two
conflicting features gracefully resolved by the filesystem.
Dan Carpenter [Sat, 10 Dec 2016 14:56:01 +0000 (09:56 -0500)]
ext4: return -ENOMEM instead of success
We should set the error code if kzalloc() fails.
Fixes: 67cf5b09a46f ("ext4: add the basic function for inline data support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
Dan Carpenter [Sat, 3 Dec 2016 21:46:58 +0000 (16:46 -0500)]
ext4: remove another test in ext4_alloc_file_blocks()
Before commit c3fe493ccdb1 ('ext4: remove unneeded test in
ext4_alloc_file_blocks()') then it was possible for "depth" to be -1
but now, it's not possible that it is negative.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
Jan Kara [Sat, 3 Dec 2016 21:20:53 +0000 (16:20 -0500)]
ext4: fix checks for data=ordered and journal_async_commit options
Combination of data=ordered mode and journal_async_commit mount option
is invalid. However the check in parse_options() fails to detect the
case where we simply end up defaulting to data=ordered mode and we
detect the problem only on remount which triggers hard to understand
failure to remount the filesystem.
Fix the checking of mount options to take into account also the default
mode by moving the check somewhat later in the mount sequence.
Reported-by: Wolfgang Walter <linux@stwm.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Eric Biggers [Sat, 3 Dec 2016 20:43:48 +0000 (15:43 -0500)]
mbcache: use consistent type for entry count
mbcache used several different types to represent the number of entries
in the cache. For consistency within mbcache and with the shrinker API,
always use unsigned long.
This does not change behavior for current mbcache users (ext2 and ext4)
since they limit the entry count to a value which easily fits in an int.
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
Eric Biggers [Sat, 3 Dec 2016 20:38:29 +0000 (15:38 -0500)]
mbcache: remove unnecessary module_get/module_put
When mbcache is built as a module, any modules that use it (ext2 and/or
ext4) will depend on its symbols directly, incrementing its reference
count. Therefore, there is no need to do module_get/module_put.
Also note that since the module_get/module_put were in the mbcache
module itself, executing those lines of code was already dependent on
another reference to the mbcache module being held.
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
Eric Biggers [Sat, 3 Dec 2016 20:28:53 +0000 (15:28 -0500)]
mbcache: don't BUG() if entry cache cannot be allocated
mbcache can be a module that is loaded long after startup, when someone
asks to mount an ext2 or ext4 filesystem. Therefore it should not BUG()
if kmem_cache_create() fails, but rather just fail the module load.
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
Eric Biggers [Sat, 3 Dec 2016 20:13:15 +0000 (15:13 -0500)]
mbcache: correctly handle 'e_referenced' bit
mbcache entries have an 'e_referenced' bit which users can set with
mb_cache_entry_touch() to indicate that an entry should be given another
pass through the LRU list before the shrinker can delete it. However,
mb_cache_shrink() actually would, when seeing an e_referenced entry at
the front of the list (the least-recently used end), place it right at
the front of the list again. The next iteration would then remove the
entry from the list and delete it. Consequently, e_referenced had
essentially no effect, so ext2/ext4 xattr blocks would sometimes not be
reused as often as expected.
Fix this by making the shrinker move e_referenced entries to the back of
the list rather than the front.
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
Theodore Ts'o [Fri, 2 Dec 2016 17:12:53 +0000 (12:12 -0500)]
ext4: fix reading new encrypted symlinks on no-journal file systems
On a filesystem with no journal, a symlink longer than about 32
characters (exact length depending on padding for encryption) could not
be followed or read immediately after being created in an encrypted
directory. This happened because when the symlink data went through the
delayed allocation path instead of the journaling path, the symlink was
incorrectly detected as a "fast" symlink rather than a "slow" symlink
until its data was written out.
To fix this, disable delayed allocation for symlinks, since there is
no benefit for delayed allocation anyway.
Reported-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Eryu Guan [Thu, 1 Dec 2016 20:08:37 +0000 (15:08 -0500)]
ext4: validate s_first_meta_bg at mount time
Ralf Spenneberg reported that he hit a kernel crash when mounting a
modified ext4 image. And it turns out that kernel crashed when
calculating fs overhead (ext4_calculate_overhead()), this is because
the image has very large s_first_meta_bg (debug code shows it's 842150400), and ext4 overruns the memory in count_overhead() when
setting bitmap buffer, which is PAGE_SIZE.
Eric Biggers [Thu, 1 Dec 2016 19:57:29 +0000 (14:57 -0500)]
ext4: correctly detect when an xattr value has an invalid size
It was possible for an xattr value to have a very large size, which
would then pass validation on 32-bit architectures due to a pointer
wraparound. Fix this by validating the size in a way which avoids
pointer wraparound.
It was also possible that a value's size would fit in the available
space but its padded size would not. This would cause an out-of-bounds
memory write in ext4_xattr_set_entry when replacing the xattr value.
For example, if an xattr value of unpadded size 253 bytes went until the
very end of the inode or block, then using setxattr(2) to replace this
xattr's value with 256 bytes would cause a write to the 3 bytes past the
end of the inode or buffer, and the new xattr value would be incorrectly
truncated. Fix this by requiring that the padded size fit in the
available space rather than the unpadded size.
This patch shouldn't have any noticeable effect on
non-corrupted/non-malicious filesystems.
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Eric Biggers [Thu, 1 Dec 2016 19:51:58 +0000 (14:51 -0500)]
ext4: don't read out of bounds when checking for in-inode xattrs
With i_extra_isize equal to or close to the available space, it was
possible for us to read past the end of the inode when trying to detect
or validate in-inode xattrs. Fix this by checking for the needed extra
space first.
This patch shouldn't have any noticeable effect on
non-corrupted/non-malicious filesystems.
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Eric Biggers [Thu, 1 Dec 2016 19:43:33 +0000 (14:43 -0500)]
ext4: forbid i_extra_isize not divisible by 4
i_extra_isize not divisible by 4 is problematic for several reasons:
- It causes the in-inode xattr space to be misaligned, but the xattr
header and entries are not declared __packed to express this
possibility. This may cause poor performance or incorrect code
generation on some platforms.
- When validating the xattr entries we can read past the end of the
inode if the size available for xattrs is not a multiple of 4.
- It allows the nonsensical i_extra_isize=1, which doesn't even leave
enough room for i_extra_isize itself.
Therefore, update ext4_iget() to consider i_extra_isize not divisible by
4 to be an error, like the case where i_extra_isize is too large.
This also matches the rule recently added to e2fsck for determining
whether an inode has valid i_extra_isize.
This patch shouldn't have any noticeable effect on
non-corrupted/non-malicious filesystems, since the size of ext4_inode
has always been a multiple of 4.
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Eric Biggers [Thu, 1 Dec 2016 16:55:51 +0000 (11:55 -0500)]
ext4: disable pwsalt ioctl when encryption disabled by config
On a CONFIG_EXT4_FS_ENCRYPTION=n kernel, the ioctls to get and set
encryption policies were disabled but EXT4_IOC_GET_ENCRYPTION_PWSALT was
not. But there's no good reason to expose the pwsalt ioctl if the
kernel doesn't support encryption. The pwsalt ioctl was also disabled
pre-4.8 (via ext4_sb_has_crypto() previously returning 0 when encryption
was disabled by config) and seems to have been enabled by mistake when
ext4 encryption was refactored to use fs/crypto/. So let's disable it
again.
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Eric Biggers [Thu, 1 Dec 2016 16:54:18 +0000 (11:54 -0500)]
ext4: get rid of ext4_sb_has_crypto()
ext4_sb_has_crypto() just called through to ext4_has_feature_encrypt(),
and all callers except one were already using the latter. So remove it
and switch its one caller to ext4_has_feature_encrypt().
Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Daeho Jeong [Thu, 1 Dec 2016 16:49:12 +0000 (11:49 -0500)]
ext4: fix inode checksum calculation problem if i_extra_size is small
We've fixed the race condition problem in calculating ext4 checksum
value in commit b47820edd163 ("ext4: avoid modifying checksum fields
directly during checksum veficationon"). However, by this change,
when calculating the checksum value of inode whose i_extra_size is
less than 4, we couldn't calculate the checksum value in a proper way.
This problem was found and reported by Nix, Thank you.
Reported-by: Nix <nix@esperi.org.uk> Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com> Signed-off-by: Youngjin Gil <youngjin.gil@samsung.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Jan Kara [Thu, 1 Dec 2016 16:46:40 +0000 (11:46 -0500)]
ext4: warn when page is dirtied without buffers
Warn when a page is dirtied without buffers (as that will likely lead to
a crash in ext4_writepages()) or when it gets newly dirtied without the
page being locked (as there is nothing that prevents buffers to get
stripped just before calling set_page_dirty() under memory pressure).
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Jan Kara [Tue, 29 Nov 2016 16:18:39 +0000 (11:18 -0500)]
ext4: be more strict when verifying flags set via SETFLAGS ioctls
Currently we just silently ignore flags that we don't understand (or
that cannot be manipulated) through EXT4_IOC_SETFLAGS and
EXT4_IOC_FSSETXATTR ioctls. This makes it problematic for the unused
flags to be used in future (some app may be inadvertedly setting them
and we won't notice until the flag gets used). Also this is inconsistent
with other filesystems like XFS or BTRFS which return EOPNOTSUPP when
they see a flag they cannot set.
ext4 has the additional problem that there are flags which are returned
by EXT4_IOC_GETFLAGS ioctl but which cannot be modified via
EXT4_IOC_SETFLAGS. So we have to be careful to ignore value of these
flags and not fail the ioctl when they are set (as e.g. chattr(1) passes
flags returned from EXT4_IOC_GETFLAGS to EXT4_IOC_SETFLAGS without any
masking and thus we'd break this utility).
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Jan Kara [Tue, 29 Nov 2016 16:13:13 +0000 (11:13 -0500)]
ext4: add EXT4_JOURNAL_DATA_FL and EXT4_EXTENTS_FL to modifiable mask
Add EXT4_JOURNAL_DATA_FL and EXT4_EXTENTS_FL to EXT4_FL_USER_MODIFIABLE
to recognize that they are modifiable by userspace. So far we got away
without having them there because ext4_ioctl_setflags() treats them in a
special way. But it was really confusing like that.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Eric Sandeen [Sat, 26 Nov 2016 19:24:51 +0000 (14:24 -0500)]
ext4: fix mmp use after free during unmount
In ext4_put_super, we call brelse on the buffer head containing
the ext4 superblock, but then try to use it when we stop the
mmp thread, because when the thread shuts down it does:
which reaches into sb->s_fs_info->s_es->s_feature_ro_compat,
which lives in the superblock buffer s_sbh which we just released.
Fix this by moving the brelse down to a point where we are no
longer using it.
Reported-by: Wang Shu <shuwang@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
When a file is created in an encrypted directory, ext4_set_context() is
called to set an encryption context on the new file. This calls
ext4_xattr_set(), which contains a retry loop where the journal is
forced to commit if an ENOSPC error is encountered.
If the task actually were to wait for the journal to commit in this
case, then it would deadlock because a handle remains open from
__ext4_new_inode(), so the running transaction can't be committed yet.
Fortunately, __jbd2_journal_force_commit() avoids the deadlock by not
allowing the running transaction to be committed while the current task
has it open. However, the above lockdep warning is still triggered.
This was a false positive which was introduced by: 1eaa566d368b: jbd2:
track more dependencies on transaction commit
Fix the problem by passing the handle through the 'fs_data' argument to
ext4_set_context(), then using ext4_xattr_set_handle() instead of
ext4_xattr_set(). And in the case where no journal handle is specified
and ext4_set_context() has to open one, add an ENOSPC retry loop since
in that case it is the outermost transaction.
Ross Zwisler [Mon, 21 Nov 2016 16:51:44 +0000 (11:51 -0500)]
ext4: remove unused function ext4_aligned_io()
The last user of ext4_aligned_io() was the DAX path in
ext4_direct_IO_write(). This usage was removed by Jan Kara's patch
entitled "ext4: Rip out DAX handling from direct IO path".
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Jan Kara [Mon, 21 Nov 2016 01:47:07 +0000 (20:47 -0500)]
ext2: use iomap_zero_range() for zeroing truncated page in DAX path
Currently the last user of ext2_get_blocks() for DAX inodes was
dax_truncate_page(). Convert that to iomap_zero_range() so that all DAX
IO uses the iomap path.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Jan Kara [Sun, 20 Nov 2016 23:51:24 +0000 (18:51 -0500)]
ext4: convert DAX faults to iomap infrastructure
Convert DAX faults to use iomap infrastructure. We would not have to start
transaction in ext4_dax_fault() anymore since ext4_iomap_begin takes
care of that but so far we do that to avoid lock inversion of
transaction start with DAX entry lock which gets acquired in
dax_iomap_fault() before calling ->iomap_begin handler.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Jan Kara [Sun, 20 Nov 2016 23:10:09 +0000 (18:10 -0500)]
ext4: avoid split extents for DAX writes
Currently mapping of blocks for DAX writes happen with
EXT4_GET_BLOCKS_PRE_IO flag set. That has a result that each
ext4_map_blocks() call creates a separate written extent, although it
could be merged to the neighboring extents in the extent tree. The
reason for using this flag is that in case the extent is unwritten, we
need to convert it to written one and zero it out. However this "convert
mapped range to written" operation is already implemented by
ext4_map_blocks() for the case of data writes into unwritten extent. So
just use flags for that mode of operation, simplify the code, and avoid
unnecessary split extents.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Jan Kara [Sun, 20 Nov 2016 23:08:05 +0000 (18:08 -0500)]
ext4: use iomap for zeroing blocks in DAX mode
Use iomap infrastructure for zeroing blocks when in DAX mode.
ext4_iomap_begin() handles read requests just fine and that's all that
is needed for iomap_zero_range().
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Jan Kara [Sun, 20 Nov 2016 22:32:59 +0000 (17:32 -0500)]
ext4: only set S_DAX if DAX is really supported
Currently we have S_DAX set inode->i_flags for a regular file whenever
ext4 is mounted with dax mount option. However in some cases we cannot
really do DAX - e.g. when inode is marked to use data journalling, when
inode data is being encrypted, or when inode is stored inline. Make sure
S_DAX flag is appropriately set/cleared in these cases.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Theodore Ts'o [Fri, 18 Nov 2016 18:37:47 +0000 (13:37 -0500)]
ext4: add sanity checking to count_overhead()
The commit "ext4: sanity check the block and cluster size at mount
time" should prevent any problems, but in case the superblock is
modified while the file system is mounted, add an extra safety check
to make sure we won't overrun the allocated buffer.
Theodore Ts'o [Fri, 18 Nov 2016 18:24:26 +0000 (13:24 -0500)]
ext4: fix in-superblock mount options processing
Fix a large number of problems with how we handle mount options in the
superblock. For one, if the string in the superblock is long enough
that it is not null terminated, we could run off the end of the string
and try to interpret superblocks fields as characters. It's unlikely
this will cause a security problem, but it could result in an invalid
parse. Also, parse_options is destructive to the string, so in some
cases if there is a comma-separated string, it would be modified in
the superblock. (Fortunately it only happens on file systems with a
1k block size.)
Theodore Ts'o [Fri, 18 Nov 2016 18:00:24 +0000 (13:00 -0500)]
ext4: sanity check the block and cluster size at mount time
If the block size or cluster size is insane, reject the mount. This
is important for security reasons (although we shouldn't be just
depending on this check).
Eric Whitney [Tue, 15 Nov 2016 02:48:35 +0000 (21:48 -0500)]
ext4: allow inode expansion for nojournal file systems
Runs of xfstest ext4/022 on nojournal file systems result in failures
because the inodes of some of its test files do not expand as expected.
The cause is a conditional in ext4_mark_inode_dirty() that prevents inode
expansion unless the test file system has a journal. Remove this
unnecessary restriction.
Signed-off-by: Eric Whitney <enwlinux@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Deepa Dinamani [Tue, 15 Nov 2016 02:40:10 +0000 (21:40 -0500)]
ext4: use current_time() for inode timestamps
CURRENT_TIME_SEC and CURRENT_TIME are not y2038 safe.
current_time() will be transitioned to be y2038 safe
along with vfs.
current_time() returns timestamps according to the
granularities set in the super_block.
The granularity check in ext4_current_time() to call
current_time() or CURRENT_TIME_SEC is not required.
Use current_time() directly to obtain timestamps
unconditionally, and remove ext4_current_time().
Quota files are assumed to be on the same filesystem.
Hence, use current_time() for these files as well.
Chandan Rajendra [Tue, 15 Nov 2016 02:26:26 +0000 (21:26 -0500)]
ext4: fix stack memory corruption with 64k block size
The number of 'counters' elements needed in 'struct sg' is
super_block->s_blocksize_bits + 2. Presently we have 16 'counters'
elements in the array. This is insufficient for block sizes >= 32k. In
such cases the memcpy operation performed in ext4_mb_seq_groups_show()
would cause stack memory corruption.
Fixes: c9de560ded61f Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org
Chandan Rajendra [Tue, 15 Nov 2016 02:04:37 +0000 (21:04 -0500)]
ext4: fix mballoc breakage with 64k block size
'border' variable is set to a value of 2 times the block size of the
underlying filesystem. With 64k block size, the resulting value won't
fit into a 16-bit variable. Hence this commit changes the data type of
'border' to 'unsigned int'.
Theodore Ts'o [Mon, 14 Nov 2016 03:02:29 +0000 (22:02 -0500)]
ext4: don't lock buffer in ext4_commit_super if holding spinlock
If there is an error reported in mballoc via ext4_grp_locked_error(),
the code is holding a spinlock, so ext4_commit_super() must not try to
lock the buffer head, or else it will trigger a BUG:
Since ext4_grp_locked_error() calls ext4_commit_super with sync == 0
(and it is the only caller which does so), avoid locking and unlocking
the buffer in this case.
This can result in races with ext4_commit_super() if there are other
problems (which is what commit 4743f83990614 was trying to address),
but a Warning is better than BUG.
Fixes: 4743f83990614 Cc: stable@vger.kernel.org # 4.9 Reported-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
Theodore Ts'o [Mon, 14 Nov 2016 03:02:26 +0000 (22:02 -0500)]
ext4: allow ext4_truncate() to return an error
This allows us to properly propagate errors back up to
ext4_truncate()'s callers. This also means we no longer have to
silently ignore some errors (e.g., when trying to add the inode to the
orphan inode list).
Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
Eric Biggers [Mon, 14 Nov 2016 01:41:09 +0000 (20:41 -0500)]
fscrypto: don't use on-stack buffer for key derivation
With the new (in 4.9) option to use a virtually-mapped stack
(CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for
the scatterlist crypto API because they may not be directly mappable to
struct page. get_crypt_info() was using a stack buffer to hold the
output from the encryption operation used to derive the per-file key.
Fix it by using a heap buffer.
This bug could most easily be observed in a CONFIG_DEBUG_SG kernel
because this allowed the BUG in sg_set_buf() to be triggered.
Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Eric Biggers [Mon, 14 Nov 2016 01:35:52 +0000 (20:35 -0500)]
fscrypto: don't use on-stack buffer for filename encryption
With the new (in 4.9) option to use a virtually-mapped stack
(CONFIG_VMAP_STACK), stack buffers cannot be used as input/output for
the scatterlist crypto API because they may not be directly mappable to
struct page. For short filenames, fname_encrypt() was encrypting a
stack buffer holding the padded filename. Fix it by encrypting the
filename in-place in the output buffer, thereby making the temporary
buffer unnecessary.
This bug could most easily be observed in a CONFIG_DEBUG_SG kernel
because this allowed the BUG in sg_set_buf() to be triggered.
Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
David Gstir [Sun, 13 Nov 2016 21:20:48 +0000 (22:20 +0100)]
fscrypt: Let fs select encryption index/tweak
Avoid re-use of page index as tweak for AES-XTS when multiple parts of
same page are encrypted. This will happen on multiple (partial) calls of
fscrypt_encrypt_page on same page.
page->index is only valid for writeback pages.
Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
David Gstir [Sun, 13 Nov 2016 21:20:45 +0000 (22:20 +0100)]
fscrypt: Allow fscrypt_decrypt_page() to function with non-writeback pages
Some filesystem might pass pages which do not have page->mapping->host
set to the encrypted inode. We want the caller to explicitly pass the
corresponding inode.
Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
David Gstir [Sun, 13 Nov 2016 21:20:44 +0000 (22:20 +0100)]
fscrypt: Add in-place encryption mode
ext4 and f2fs require a bounce page when encrypting pages. However, not
all filesystems will need that (eg. UBIFS). This is handled via a
flag on fscrypt_operations where a fs implementation can select in-place
encryption over using a bounce page (which is the default).
Signed-off-by: David Gstir <david@sigma-star.at> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Jan Kara [Wed, 9 Nov 2016 23:26:50 +0000 (10:26 +1100)]
dax: Introduce IOMAP_FAULT flag
Introduce a flag telling iomap operations whether they are handling a
fault or other IO. That may influence behavior wrt inode size and
similar things.
Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
Ross Zwisler [Tue, 8 Nov 2016 00:35:02 +0000 (11:35 +1100)]
xfs: use struct iomap based DAX PMD fault path
Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
improved dax_iomap_pmd_fault(). Also, now that it has no more users,
remove xfs_get_blocks_dax_fault().
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
Ross Zwisler [Tue, 8 Nov 2016 00:34:45 +0000 (11:34 +1100)]
dax: add struct iomap based DAX PMD support
DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking. This patch allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled using the new struct
iomap based fault handlers.
There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
mappings that have an associated block allocation, and 4k DAX empty
entries. The empty entries exist to provide locking for the duration of a
given page fault.
This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
entries, PMD DAX entries that have associated block allocations, and 2 MiB
DAX empty entries.
Unlike the 4k case where we insert a struct page* into the radix tree for
4k zero pages, for HZP we insert a DAX exceptional entry with the new
RADIX_DAX_HZP flag set. This is because we use a single 2 MiB zero page in
every 2MiB hole mapping, and it doesn't make sense to have that same struct
page* with multiple entries in multiple trees. This would cause contention
on the single page lock for the one Huge Zero Page, and it would break the
page->index and page->mapping associations that are assumed to be valid in
many other places in the kernel.
One difficult use case is when one thread is trying to use 4k entries in
radix tree for a given offset, and another thread is using 2 MiB entries
for that same offset. The current code handles this by making the 2 MiB
user fall back to 4k entries for most cases. This was done because it is
the simplest solution, and because the use of 2MiB pages is already
opportunistic.
If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
we run into the problem of how we lock out 4k page faults for the entire
2MiB range while we clean out the radix tree so we can insert the 2MiB
entry. We can solve this problem if we need to, but I think that the cases
where both 2MiB entries and 4K entries are being used for the same range
will be rare enough and the gain small enough that it probably won't be
worth the complexity.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
Ross Zwisler [Tue, 8 Nov 2016 00:33:44 +0000 (11:33 +1100)]
dax: move put_(un)locked_mapping_entry() in dax.c
No functional change.
The static functions put_locked_mapping_entry() and
put_unlocked_mapping_entry() will soon be used in error cases in
grab_mapping_entry(), so move their definitions above this function.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
Ross Zwisler [Tue, 8 Nov 2016 00:33:35 +0000 (11:33 +1100)]
dax: move RADIX_DAX_* defines to dax.h
The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
mm/filemap.c. When we add PMD support, though, mm/filemap.c will also need
access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
empty entry.
Instead of shifting the defines between dax.c and dax.h as they are
individually used in other code, just move them wholesale to dax.h so
they'll be available when we need them.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
Ross Zwisler [Tue, 8 Nov 2016 00:33:26 +0000 (11:33 +1100)]
dax: dax_iomap_fault() needs to call iomap_end()
Currently iomap_end() doesn't do anything for DAX page faults for both ext2
and XFS. ext2_iomap_end() just checks for a write underrun, and
xfs_file_iomap_end() checks to see if it needs to finish a delayed
allocation. However, in the future iomap_end() calls might be needed to
make sure we have balanced allocations, locks, etc. So, add calls to
iomap_end() with appropriate error handling to dax_iomap_fault().
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
Ross Zwisler [Tue, 8 Nov 2016 00:33:09 +0000 (11:33 +1100)]
dax: add dax_iomap_sector() helper function
To be able to correctly calculate the sector from a file position and a
struct iomap there is a complex little bit of logic that currently happens
in both dax_iomap_actor() and dax_iomap_fault(). This will need to be
repeated yet again in the DAX PMD fault handler when it is added, so break
it out into a helper function.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
Ross Zwisler [Tue, 8 Nov 2016 00:32:46 +0000 (11:32 +1100)]
dax: correct dax iomap code namespace
The recently added DAX functions that use the new struct iomap data
structure were named iomap_dax_rw(), iomap_dax_fault() and
iomap_dax_actor(). These are actually defined in fs/dax.c, though, so
should be part of the "dax" namespace and not the "iomap" namespace.
Rename them to dax_iomap_rw(), dax_iomap_fault() and dax_iomap_actor()
respectively.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
Ross Zwisler [Tue, 8 Nov 2016 00:32:35 +0000 (11:32 +1100)]
dax: remove dax_pmd_fault()
dax_pmd_fault() is the old struct buffer_head + get_block_t based 2 MiB DAX
fault handler. This fault handler has been disabled for several kernel
releases, and support for PMDs will be reintroduced using the struct iomap
interface instead.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
Ross Zwisler [Tue, 8 Nov 2016 00:32:20 +0000 (11:32 +1100)]
dax: coordinate locking for offsets in PMD range
DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need to have all the
offsets within the range covered by the PMD to map to the same bit lock.
To accomplish this, for ranges covered by a PMD entry we will instead lock
based on the page offset of the beginning of the PMD entry. The 'mapping'
pointer is still used in the same way.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
Ross Zwisler [Tue, 8 Nov 2016 00:32:12 +0000 (11:32 +1100)]
dax: consistent variable naming for DAX entries
No functional change.
Consistently use the variable name 'entry' instead of 'ret' for DAX radix
tree entries. This was already happening in most of the code, so update
get_unlocked_mapping_entry(), grab_mapping_entry() and
dax_unlock_mapping_entry().
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
Ross Zwisler [Tue, 8 Nov 2016 00:32:00 +0000 (11:32 +1100)]
dax: remove the last BUG_ON() from fs/dax.c
Don't take down the kernel if we get an invalid 'from' and 'length'
argument pair. Just warn once and return an error.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
Ross Zwisler [Tue, 8 Nov 2016 00:31:44 +0000 (11:31 +1100)]
dax: make 'wait_table' global variable static
The global 'wait_table' variable is only used within fs/dax.c, and
generates the following sparse warning:
fs/dax.c:39:19: warning: symbol 'wait_table' was not declared. Should it be static?
Make it static so it has scope local to fs/dax.c, and to make sparse happy.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
I believe this path to be untested as ext2 doesn't reliably provide block
allocations that are aligned to 2MiB. In my testing I've been unable to
get ext2 to actually fault in a PMD. It always fails with a "pfn
unaligned" message because the sector returned by ext2_get_block() isn't
aligned.
I've tried various settings for the "stride" and "stripe_width" extended
options to mkfs.ext2, without any luck.
Since we can't reliably get PMDs, remove support so that we don't have an
untested code path that we may someday traverse when we happen to get an
aligned block allocation. This should also make 4k DAX faults in ext2 a
bit faster since they will no longer have to call the PMD fault handler
only to get a response of VM_FAULT_FALLBACK.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
Ross Zwisler [Tue, 8 Nov 2016 00:31:14 +0000 (11:31 +1100)]
dax: remove buffer_size_valid()
Now that ext4 properly sets bh.b_size when we call get_block() for a hole,
rely on that value and remove the buffer_size_valid() sanity check.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
Ross Zwisler [Tue, 8 Nov 2016 00:30:58 +0000 (11:30 +1100)]
ext4: tell DAX the size of allocation holes
When DAX calls _ext4_get_block() and the file offset points to a hole we
currently don't set bh->b_size. This is current worked around via
buffer_size_valid() in fs/dax.c.
_ext4_get_block() has the hole size information from ext4_map_blocks(), so
populate bh->b_size so we can remove buffer_size_valid() in a later patch.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <david@fromorbit.com>
Linus Torvalds [Sat, 5 Nov 2016 18:46:02 +0000 (11:46 -0700)]
Merge branches 'sched-urgent-for-linus' and 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull stack vmap fixups from Thomas Gleixner:
"Two small patches related to sched_show_task():
- make sure to hold a reference on the task stack while accessing it
- remove the thread_saved_pc printout
.. and add a sanity check into release_task_stack() to catch problems
with task stack references"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Remove pointless printout in sched_show_task()
sched/core: Fix oops in sched_show_task()
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fork: Add task stack refcounting sanity check and prevent premature task stack freeing
Linus Torvalds [Sat, 5 Nov 2016 18:34:07 +0000 (11:34 -0700)]
Merge tag 'md/4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md
Pull MD fixes from Shaohua Li:
"There are several bug fixes queued:
- fix raid5-cache recovery bugs
- fix discard IO error handling for raid1/10
- fix array sync writes bogus position to superblock
- fix IO error handling for raid array with external metadata"
* tag 'md/4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md: be careful not lot leak internal curr_resync value into metadata. -- (all)
raid1: handle read error also in readonly mode
raid5-cache: correct condition for empty metadata write
md: report 'write_pending' state when array in sync
md/raid5: write an empty meta-block when creating log super-block
md/raid5: initialize next_checkpoint field before use
RAID10: ignore discard error
RAID1: ignore discard error
Linus Torvalds [Sat, 5 Nov 2016 18:28:21 +0000 (11:28 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Two more important data integrity fixes related to RAID device drivers
which wrongly throw away the SYNCHRONIZE CACHE command in the non-RAID
path and a memory leak in the scsi_debug driver"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: arcmsr: Send SYNCHRONIZE_CACHE command to firmware
scsi: scsi_debug: Fix memory leak if LBP enabled and module is unloaded
scsi: megaraid_sas: Fix data integrity failure for JBOD (passthrough) devices
Linus Torvalds [Sat, 5 Nov 2016 18:15:09 +0000 (11:15 -0700)]
Merge tag 'media/v4.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
"A series of fixup patches meant to fix the usage of DMA on stack, plus
one warning fixup"
* tag 'media/v4.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (32 commits)
[media] radio-bcm2048: don't ignore errors
[media] pctv452e: fix semicolon.cocci warnings
[media] flexcop-usb: don't use stack for DMA
[media] stk-webcam: don't use stack for DMA
[media] s2255drv: don't use stack for DMA
[media] cpia2_usb: don't use stack for DMA
[media] digitv: handle error code on RC query
[media] dw2102: return error if su3000_power_ctrl() fails
[media] nova-t-usb2: handle error code on RC query
[media] technisat-usb2: use DMA buffers for I2C transfers
[media] pctv452e: don't call BUG_ON() on non-fatal error
[media] pctv452e: don't do DMA on stack
[media] nova-t-usb2: don't do DMA on stack
[media] gp8psk: don't go past the buffer size
[media] gp8psk: don't do DMA on stack
[media] dtv5100: don't do DMA on stack
[media] dtt200u: handle USB control message errors
[media] dtt200u: don't do DMA on stack
[media] dtt200u-fe: handle errors on USB control messages
[media] dtt200u-fe: don't do DMA on stack
...
Linus Torvalds [Sat, 5 Nov 2016 18:11:31 +0000 (11:11 -0700)]
Merge tag 'pci-v4.9-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
- fix for a Qualcomm driver issue that causes a use-before-set crash
- fix for DesignWare iATU unroll support that causes external aborts
when enabling the host bridge
* tag 'pci-v4.9-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: designware: Check for iATU unroll support after initializing host
PCI: qcom: Fix pp->dev usage before assignment
Linus Torvalds [Sat, 5 Nov 2016 17:52:29 +0000 (10:52 -0700)]
Merge tag 'for-linus-20161104' of git://git.infradead.org/linux-mtd
Pull MTD fixes from Brian Norris:
- MAINTAINERS updates to reflect some new maintainers/submaintainers.
We have some great volunteers who've been developing and reviewing
already. We're going to try a group maintainership model, so
eventually you'll probably see pull requests from people besides me.
- NAND fixes from Boris:
"Three simple fixes:
- fix a non-critical bug in the gpmi driver
- fix a bug in the 'automatic NAND timings selection' feature
introduced in 4.9-rc1
- fix a false positive uninitialized-var warning"
* tag 'for-linus-20161104' of git://git.infradead.org/linux-mtd:
mtd: mtk: avoid warning in mtk_ecc_encode
mtd: nand: Fix data interface configuration logic
mtd: nand: gpmi: disable the clocks on errors
MAINTAINERS: add more people to the MTD maintainer team
MAINTAINERS: add a maintainer for the SPI NOR subsystem
Linus Torvalds [Sat, 5 Nov 2016 17:42:20 +0000 (10:42 -0700)]
Merge tag 'gpio-v4.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
"Some GPIO fixes for the v4.9 series:
- Fix a nasty file descriptor leak when getting line handles.
- A fix for a cleanup that seemed innocent but created a problem for
drivers instantiating several gpiochips for one single OF node.
- Fix a unpredictable problem using irq_domain_simple() in the mvebu
driver by converting it to a lineas irqdomain"
* tag 'gpio-v4.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio/mvebu: Use irq_domain_add_linear
gpio: of: fix GPIO drivers with multiple gpio_chip for a single node
gpio: GPIO_GET_LINE{HANDLE,EVENT}_IOCTL: Fix file descriptor leak
Linus Torvalds [Sat, 5 Nov 2016 03:12:10 +0000 (20:12 -0700)]
Merge tag 'nfsd-4.9-1' of git://linux-nfs.org/~bfields/linux
Pull nfsd bugfixes from Bruce Fields:
"Fixes for some recent regressions including fallout from the vmalloc'd
stack change (after which we can no longer encrypt stuff on the
stack)"
* tag 'nfsd-4.9-1' of git://linux-nfs.org/~bfields/linux:
nfsd: Fix general protection fault in release_lock_stateid()
svcrdma: backchannel cannot share a page for send and rcv buffers
sunrpc: fix some missing rq_rbuffer assignments
sunrpc: don't pass on-stack memory to sg_set_buf
nfsd: move blocked lock handling under a dedicated spinlock
Linus Torvalds [Sat, 5 Nov 2016 03:08:16 +0000 (20:08 -0700)]
Merge branch 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from Chris Mason:
"Some fixes that Dave Sterba collected. We held off on these last week
because I was focused on the memory corruption testing"
* 'for-4.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix WARNING in btrfs_select_ref_head()
Btrfs: remove some no-op casts
btrfs: pass correct args to btrfs_async_run_delayed_refs()
btrfs: make file clone aware of fatal signals
btrfs: qgroup: Prevent qgroup->reserved from going subzero
Btrfs: kill BUG_ON in do_relocation
Linus Torvalds [Sat, 5 Nov 2016 03:03:14 +0000 (20:03 -0700)]
Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs fixes from Miklos Szeredi:
"Fix two more POSIX ACL bugs introduced in 4.8 and add a missing fsync
during copy up to prevent possible data loss"
* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fsync after copy-up
ovl: fix get_acl() on tmpfs
ovl: update S_ISGID when setting posix ACLs
Linus Torvalds [Fri, 4 Nov 2016 20:30:13 +0000 (13:30 -0700)]
Merge tag 'drm-fixes-for-v4.9-rc4' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"Fixes for amdgpu, radeon, intel, imx and virtio-gpu.
This is a bit larger than I'd like, but I had some stuff I meant to
send for -rc3 but was waiting for the PAT regression fix to land. So
this is really fixes for rc3 and rc4 in one go.
There are a set of fixes for an oops we've been seeing around MST
display unplug, along with more suspend/resume and shutdown fixes for
amdgpu, one power management follow on fix for nouveau, and set of imx
fixes, and a single virtio-gpu regression fix"
* tag 'drm-fixes-for-v4.9-rc4' of git://people.freedesktop.org/~airlied/linux: (54 commits)
virtio-gpu: fix vblank events
drm/nouveau/acpi: fix check for power resources support
drm/i915: Fix SKL+ 90/270 degree rotated plane coordinate computation
drm/i915: Remove two invalid warns
drm/i915: Rotated view does not need a fence
drm/i915/fbc: fix CFB size calculation for gen8+
drm: i915: Wait for fences on new fb, not old
drm/i915: Clean up DDI DDC/AUX CH sanitation
drm/i915: Respect alternate_aux_channel for all DDI ports
drm/i915/gen9: fix watermarks when using the pipe scaler
drm/i915: Fix mismatched INIT power domain disabling during suspend
drm/i915: fix a read size argument
drm/i915: Use fence_write() from rpm resume
drm/i915/gen9: fix DDB partitioning for multi-screen cases
drm/i915: workaround sparse warning on variable length arrays
drm/i915: keep declarations in i915_drv.h
drm/amd/powerplay: fix bug get wrong evv voltage of Polaris.
drm/amdgpu/si_dpm: workaround for SI kickers
drm/radeon/si_dpm: workaround for SI kickers
drm/amdgpu: fix s3 resume back, uvd dpm randomly can't disable.
...
Niklas Cassel [Fri, 14 Oct 2016 21:54:55 +0000 (23:54 +0200)]
PCI: designware: Check for iATU unroll support after initializing host
dw_pcie_iatu_unroll_enabled() reads a dbi_base register. Reading any
dbi_base register before pp->ops->host_init has been called causes
"imprecise external abort" on platforms like ARTPEC-6, where the PCIe
module is disabled at boot and first enabled in pp->ops->host_init. Move
dw_pcie_iatu_unroll_enabled() to dw_pcie_setup_rc(), since it is after
pp->ops->host_init, but before pp->iatu_unroll_enabled is actually used.
Fixes: a0601a470537 ("PCI: designware: Add iATU Unroll feature") Tested-by: James Le Cuirot <chewi@gentoo.org> Signed-off-by: Niklas Cassel <niklas.cassel@axis.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Joao Pinto <jpinto@synopsys.com> Acked-by: Olof Johansson <olof@lixom.net>
Linus Torvalds [Fri, 4 Nov 2016 20:08:05 +0000 (13:08 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"One NULL pointer dereference, and two fixes for regressions introduced
during the merge window.
The rest are fixes for MIPS, s390 and nested VMX"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: x86: Check memopp before dereference (CVE-2016-8630)
kvm: nVMX: VMCLEAR an active shadow VMCS after last use
KVM: x86: drop TSC offsetting kvm_x86_ops to fix KVM_GET/SET_CLOCK
KVM: x86: fix wbinvd_dirty_mask use-after-free
kvm/x86: Show WRMSR data is in hex
kvm: nVMX: Fix kernel panics induced by illegal INVEPT/INVVPID types
KVM: document lock orders
KVM: fix OOPS on flush_work
KVM: s390: Fix STHYI buffer alignment for diag224
KVM: MIPS: Precalculate MMIO load resume PC
KVM: MIPS: Make ERET handle ERL before EXL
KVM: MIPS: Fix lazy user ASID regenerate for SMP
Linus Torvalds [Fri, 4 Nov 2016 20:03:57 +0000 (13:03 -0700)]
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
"A set of MIPS fixes for 4.9:
- lots of fixes for printk continuations
- six fixes for FP related code.
- fix max_low_pfn with disabled highmem
- fix KASLR handling of NULL FDT and KASLR for generic kernels
- fix build of compressed image
- provide default mips_cpc_default_phys_base to ignore CPC
- fix reboot on Malta"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: Fix max_low_pfn with disabled highmem
MIPS: Correct MIPS I FP sigcontext layout
MIPS: Fix ISA I/II FP signal context offsets
MIPS: Remove FIR from ISA I FP signal context
MIPS: Fix ISA I FP sigcontext access violation handling
MIPS: Fix FCSR Cause bit handling for correct SIGFPE issue
MIPS: ptrace: Also initialize the FP context on individual FCSR writes
MIPS: dump_tlb: Fix printk continuations
MIPS: Fix __show_regs() output
MIPS: traps: Fix output of show_code
MIPS: traps: Fix output of show_stacktrace
MIPS: traps: Fix output of show_backtrace
MIPS: Fix build of compressed image
MIPS: generic: Fix KASLR for generic kernel.
MIPS: KASLR: Fix handling of NULL FDT
MIPS: Malta: Fixup reboot
MIPS: CPC: Provide default mips_cpc_default_phys_base to ignore CPC
Linus Torvalds [Fri, 4 Nov 2016 20:01:13 +0000 (13:01 -0700)]
Merge branch 'parisc-4.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc updates from Helge Deller:
"The first three patches are trivial and add some required KERN_CONT,
ignore the new pkey syscalls on parisc and use the LINUX_GATEWAY_ADDR
define instead of hardcoded values.
The two patches from Dave Anglin are important.
The first one avoids trashing the sr2 and sr3 space registers in the
Light-weight syscall path. Especially the usage of sr3 is critical
since it may get trashed by the interrupt handler.
The second patch is even more important and tagged for stable series.
It protects one critical section in the syscall entry path by
disabling local interrupts. Without disabling interrupts, the sr7
space register may not be in sync with the current stack setup and
thus an incoming hardware interrupt may destroy memory in random
userspace areas"
* 'parisc-4.9-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Ignore the pkey system calls for now
parisc: Use LINUX_GATEWAY_ADDR define instead of hardcoded value
parisc: Ensure consistent state when switching to kernel stack at syscall entry
parisc: Avoid trashing sr2 and sr3 in LWS code
parisc: use KERN_CONT when printing device inventory
i2c: core: fix NULL pointer dereference under race condition
Race condition between registering an I2C device driver and
deregistering an I2C adapter device which is assumed to manage that
I2C device may lead to a NULL pointer dereference due to the
uninitialized list head of driver clients.
The root cause of the issue is that the I2C bus may know about the
registered device driver and thus it is matched by bus_for_each_drv(),
but the list of clients is not initialized and commonly it is NULL,
because I2C device drivers define struct i2c_driver as static and
clients field is expected to be initialized by I2C core:
To solve the problem it is sufficient to do clients list head
initialization before calling driver_register().
The problem was found while using an I2C device driver with a sluggish
registration routine on a bus provided by a physically detachable I2C
master controller, but practically the oops may be reproduced under
the race between arbitraty I2C device driver registration and managing
I2C bus device removal e.g. by unbinding the latter over sysfs:
% echo 21a4000.i2c > /sys/bus/platform/drivers/imx-i2c/unbind
Unable to handle kernel NULL pointer dereference at virtual address 00000000
Internal error: Oops: 17 [#1] SMP ARM
CPU: 2 PID: 533 Comm: sh Not tainted 4.9.0-rc3+ #61
Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
task: e5ada400 task.stack: e4936000
PC is at i2c_do_del_adapter+0x20/0xcc
LR is at __process_removed_adapter+0x14/0x1c
Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 10c5387d Table: 35bd004a DAC: 00000051
Process sh (pid: 533, stack limit = 0xe4936210)
Stack: (0xe4937d28 to 0xe4938000)
Backtrace:
[<c0667be0>] (i2c_do_del_adapter) from [<c0667cc0>] (__process_removed_adapter+0x14/0x1c)
[<c0667cac>] (__process_removed_adapter) from [<c0516998>] (bus_for_each_drv+0x6c/0xa0)
[<c051692c>] (bus_for_each_drv) from [<c06685ec>] (i2c_del_adapter+0xbc/0x284)
[<c0668530>] (i2c_del_adapter) from [<bf0110ec>] (i2c_imx_remove+0x44/0x164 [i2c_imx])
[<bf0110a8>] (i2c_imx_remove [i2c_imx]) from [<c051a838>] (platform_drv_remove+0x2c/0x44)
[<c051a80c>] (platform_drv_remove) from [<c05183d8>] (__device_release_driver+0x90/0x12c)
[<c0518348>] (__device_release_driver) from [<c051849c>] (device_release_driver+0x28/0x34)
[<c0518474>] (device_release_driver) from [<c0517150>] (unbind_store+0x80/0x104)
[<c05170d0>] (unbind_store) from [<c0516520>] (drv_attr_store+0x28/0x34)
[<c05164f8>] (drv_attr_store) from [<c0298acc>] (sysfs_kf_write+0x50/0x54)
[<c0298a7c>] (sysfs_kf_write) from [<c029801c>] (kernfs_fop_write+0x100/0x214)
[<c0297f1c>] (kernfs_fop_write) from [<c0220130>] (__vfs_write+0x34/0x120)
[<c02200fc>] (__vfs_write) from [<c0221088>] (vfs_write+0xa8/0x170)
[<c0220fe0>] (vfs_write) from [<c0221e74>] (SyS_write+0x4c/0xa8)
[<c0221e28>] (SyS_write) from [<c0108a20>] (ret_fast_syscall+0x0/0x1c)
Signed-off-by: Vladimir Zapolskiy <vladimir_zapolskiy@mentor.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Cc: stable@kernel.org
James Hogan [Tue, 1 Nov 2016 13:59:09 +0000 (13:59 +0000)]
MIPS: Fix max_low_pfn with disabled highmem
When low memory doesn't reach HIGHMEM_START (e.g. up to 256MB at PA=0 is
common) and highmem is present above HIGHMEM_START (e.g. on Malta the
RAM overlayed by the IO region is aliased at PA=0x90000000), max_low_pfn
will be initially calculated very large and then clipped down to
HIGHMEM_START.
This causes crashes when reading /sys/kernel/mm/page_idle/bitmap
(i.e. CONFIG_IDLE_PAGE_TRACKING=y) when highmem is disabled. pfn_valid()
will compare against max_mapnr which is derived from max_low_pfn when
there is no highend_pfn set up, and will return true for PFNs right up
to HIGHMEM_START, even though they are beyond the end of low memory and
no page structs will actually exist for these PFNs.
This is fixed by skipping high memory regions when initially calculating
max_low_pfn if highmem is disabled, so it doesn't get clipped too high.
We also clip regions which overlap the highmem boundary when highmem is
disabled, so that max_pfn doesn't extend into highmem either.
Signed-off-by: James Hogan <james.hogan@imgtec.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/14490/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Complement commit 80cbfad79096 ("MIPS: Correct MIPS I FP context
layout") and correct the way Floating Point General registers are stored
in a signal context with MIPS I hardware.
Use the S.D and L.D assembly macros to have pairs of SWC1 instructions
and pairs of LWC1 instructions produced, respectively, in an arrangement
which makes the memory representation of floating-point data passed
compatible with that used by hardware SDC1 and LDC1 instructions, where
available, regardless of the hardware endianness used. This matches the
layout used by r4k_fpu.S, ensuring run-time compatibility for MIPS I
software across all o32 hardware platforms.
Define an EX2 macro to handle exceptions from both hardware instructions
implicitly produced from S.D and L.D assembly macros.
Signed-off-by: Maciej W. Rozycki <macro@imgtec.com> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/14477/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Fix a regression introduced with commit 2db9ca0a3551 ("MIPS: Use struct
mips_abi offsets to save FP context") for MIPS I/I FP signal contexts,
by converting save/restore code to the updated internal API. Start FGR
offsets from 0 rather than SC_FPREGS from $a0 and use $a1 rather than
the offset of SC_FPC_CSR from $a0 for the Floating Point Control/Status
Register (FCSR).
Document the new internal API and adjust assembly code formatting for
consistency.
Signed-off-by: Maciej W. Rozycki <macro@imgtec.com> Cc: Paul Burton <paul.burton@imgtec.com> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/14476/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Complement commit e50c0a8fa60d ("Support the MIPS32 / MIPS64 DSP ASE.")
and remove the Floating Point Implementation Register (FIR) from the FP
register set recorded in a signal context with MIPS I processors too, in
line with the change applied to r4k_fpu.S.
The `sc_fpc_eir' slot is unused according to our current ABI and the FIR
register is read-only and always directly accessible from user software.
[ralf@linux-mips.org: This is also required because the next commit depends
on it.]
Signed-off-by: Maciej W. Rozycki <macro@imgtec.com> Cc: linux-mips@linux-mips.org Cc: linux-kernel@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/14475/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>