Eric Sandeen [Sat, 29 Oct 2011 14:15:35 +0000 (10:15 -0400)]
ext4: fix race in xattr block allocation path
Ceph users reported that when using Ceph on ext4, the filesystem
would often become corrupted, containing inodes with incorrect
i_blocks counters.
I managed to reproduce this with a very hacked-up "streamtest"
binary from the Ceph tree.
Ceph is doing a lot of xattr writes, to out-of-inode blocks.
There is also another thread which does sync_file_range and close,
of the same files. The problem appears to happen due to this race:
do_writepages ext4_xattr_set
ext4_da_writepages ext4_xattr_set_handle
mpage_da_map_blocks ext4_xattr_block_set
set DELALLOC_RESERVE
ext4_new_meta_blocks
ext4_mb_new_blocks
if (!i_delalloc_reserved_flag)
vfs_dq_alloc_block
ext4_get_blocks
down_write(i_data_sem)
set i_delalloc_reserved_flag
...
up_write(i_data_sem)
if (i_delalloc_reserved_flag)
vfs_dq_alloc_block_nofail
In other words, the sync/flush thread pops in and sets
i_delalloc_reserved_flag on the inode, which makes the xattr thread
think that it's in a delalloc path in ext4_new_meta_blocks(),
and add the block for a second time, after already having added
it once in the !i_delalloc_reserved_flag case in ext4_mb_new_blocks
The real problem is that we shouldn't be using the DELALLOC_RESERVED
state flag, and instead we should be passing
EXT4_GET_BLOCKS_DELALLOC_RESERVE down to ext4_map_blocks() instead of
using an inode state flag. We'll fix this for now with using
i_data_sem to prevent this race, but this is really not the right way
to fix things.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
Dmitry Monakhov [Sat, 29 Oct 2011 13:05:00 +0000 (09:05 -0400)]
ext4: fix quota accounting during migration
The tmp_inode should have same uid/gid as the original inode.
Otherwise new metadata blocks will be accounted to wrong quota-id,
which will result in a quota leak after the inode migration is
completed.
Dmitry Monakhov [Sat, 29 Oct 2011 13:03:00 +0000 (09:03 -0400)]
ext4: migrate cleanup
This patch cleanup code a bit, actual logic not changed
- Move current block pointer to migrate_structure, let's all
walk info will be in one structure.
- Get rid of usless null ind-block ptr checks, caller already
does that check.
Theodore Ts'o [Sat, 29 Oct 2011 12:24:18 +0000 (08:24 -0400)]
fs: optimize out 16 bytes worth of padding in struct inode
Rearrange the fields in struct inode so that on an x86_64 system,
fields that require 8-byte alignment don't end up causing 4-byte holes
in the structure. It reduces the size of struct inode from 568 bytes
to 552 bytes.
Also move the fields protected by i_lock (i_blocks, i_bytes, and
i_size) into the same cache line as i_lock.
Eric Gouriou [Thu, 27 Oct 2011 15:52:18 +0000 (11:52 -0400)]
ext4: optimize memmmove lengths in extent/index insertions
ext4_ext_insert_extent() (respectively ext4_ext_insert_index())
was using EXT_MAX_EXTENT() (resp. EXT_MAX_INDEX()) to determine
how many entries needed to be moved beyond the insertion point.
In practice this means that (320 - I) * 24 bytes were memmove()'d
when I is the insertion point, rather than (#entries - I) * 24 bytes.
This patch uses EXT_LAST_EXTENT() (resp. EXT_LAST_INDEX()) instead
to only move existing entries. The code flow is also simplified
slightly to highlight similarities and reduce code duplication in
the insertion logic.
This patch reduces system CPU consumption by over 25% on a 4kB
synchronous append DIO write workload when used with the
pre-2.6.39 x86_64 memmove() implementation. With the much faster
2.6.39 memmove() implementation we still see a decrease in
system CPU usage between 2% and 7%.
Note that the ext_debug() output changes with this patch, splitting
some log information between entries. Users of the ext_debug() output
should note that the "move %d" units changed from reporting the number
of bytes moved to reporting the number of entries moved.
Signed-off-by: Eric Gouriou <egouriou@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Eric Gouriou [Thu, 27 Oct 2011 15:43:23 +0000 (11:43 -0400)]
ext4: optimize ext4_ext_convert_to_initialized()
This patch introduces a fast path in ext4_ext_convert_to_initialized()
for the case when the conversion can be performed by transferring
the newly initialized blocks from the uninitialized extent into
an adjacent initialized extent. Doing so removes the expensive
invocations of memmove() which occur during extent insertion and
the subsequent merge.
In practice this should be the common case for clients performing
append writes into files pre-allocated via
fallocate(FALLOC_FL_KEEP_SIZE). In such a workload performed via
direct IO and when using a suboptimal implementation of memmove()
(x86_64 prior to the 2.6.39 rewrite), this patch reduces kernel CPU
consumption by 32%.
Two new trace points are added to ext4_ext_convert_to_initialized()
to offer visibility into its operations. No exit trace point has
been added due to the multiplicity of return points. This can be
revisited once the upstream cleanup is backported.
Signed-off-by: Eric Gouriou <egouriou@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tao Ma [Wed, 26 Oct 2011 15:08:39 +0000 (11:08 -0400)]
ext4: don't check io->flag when setting EXT4_STATE_DIO_UNWRITTEN inode state
When we want to convert the unitialized extent in direct write, we can
either do it in ext4_end_io_nolock(AIO case) or in
ext4_ext_direct_IO(non AIO case) and EXT4_I(inode)->cur_aio_dio is a
guard for ext4_ext_map_blocks to find the right case. In e9e3bcecf,
we mistakenly change it by:
So now if we map 2 blocks, and the first one set the
EXT_IO_END_UNWRITTEN, the 2nd mapping will set inode state because of
the check for the flag. This is wrong.
Cc: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
after we set /sys/fs/ext4/sda/mb_group_prealloc to zero and create new files in an ext4 filesystem.
The reason is: ac_b_ex.fe_len also set to zero(mb_group_prealloc) in ext4_mb_normalize_group_request
because the ac_flags contains EXT4_MB_HINT_GROUP_ALLOC.
I think when someone set mb_group_prealloc to zero, it means DO NOT USE GROUP PREALLOCATION,
so we should set alloc-strategy to STREAM in this case.
Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Yongqiang Yang [Wed, 26 Oct 2011 09:00:19 +0000 (05:00 -0400)]
ext4: let ext4_page_mkwrite stop started handle in failure
The started journal handle should be stopped in failure case.
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org
Curt Wohlgemuth [Wed, 26 Oct 2011 08:38:59 +0000 (04:38 -0400)]
ext4: handle NULL p_ext in ext4_ext_next_allocated_block()
In ext4_ext_next_allocated_block(), the path[depth] might
have a p_ext that is NULL -- see ext4_ext_binsearch(). In
such a case, dereferencing it will crash the machine.
This patch checks for p_ext == NULL in
ext4_ext_next_allocated_block() before dereferencinging it.
Tested using a hand-crafted an inode with eh_entries == 0 in
an extent block, verified that running FIEMAP on it crashes
without this patch, works fine with it.
Dan Carpenter [Wed, 26 Oct 2011 07:42:36 +0000 (03:42 -0400)]
ext4: error handling fix in ext4_ext_convert_to_initialized()
When allocated is unsigned it breaks the error handling at the end
of the function when we call:
allocated = ext4_split_extent(...);
if (allocated < 0)
err = allocated;
I've made it a signed int instead of unsigned.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Andreas Dilger [Wed, 26 Oct 2011 07:22:31 +0000 (03:22 -0400)]
ext4: avoid setting directory i_nlink to zero
If a directory with more than EXT4_LINK_MAX subdirectories, the nlink
count is set to 1. Subsequently, if any subdirectories are deleted,
ext4_dec_count() decrements the i_nlink count, which may go to 0
temporarily before being incremented back to 1.
While this is done under i_mutex, which prevents races for directory
and inode operations that check i_nlink, the temporary i_nlink == 0
case is exposed to userspace via stat() and similar calls that do not
hold i_mutex.
Instead, change the code to not decrement i_nlink count for any
directories that do not already have i_nlink larger than 2.
Reported-by: Cliff White <cliffw@whamcloud.com> Reviewed-by: Johann Lombardi <johann@whamcloud.com> Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Darrick J. Wong [Tue, 25 Oct 2011 13:18:41 +0000 (09:18 -0400)]
ext4: prevent stack overrun in ext4_file_open
In ext4_file_open, the filesystem records the mountpoint of the first
file that is opened after mounting the filesystem. It does this by
allocating a 64-byte stack buffer, calling d_path() to grab the mount
point through which this file was accessed, and then memcpy()ing 64
bytes into the superblock's s_last_mounted field, starting from the
return value of d_path(), which is stored as "cp". However, if cp >
buf (which it frequently is since path components are prepended
starting at the end of buf) then we can end up copying stack data into
the superblock.
Writing stack variables into the superblock doesn't sound like a great
idea, so use strlcpy instead. Andi Kleen suggested using strlcpy
instead of strncpy.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Dmitry Monakhov [Tue, 25 Oct 2011 12:15:12 +0000 (08:15 -0400)]
ext4: update EOFBLOCKS flag on fallocate properly
EOFBLOCK_FL should be updated if called w/o FALLOCATE_FL_KEEP_SIZE
Currently it happens only if new extent was allocated.
TESTCASE:
fallocate test_file -n -l4096
fallocate test_file -l4096
Last fallocate cmd has updated size, but keept EOFBLOCK_FL set. And
fsck will complain about that.
Also remove ping pong in ext4_fallocate() in case of new extents,
where ext4_ext_map_blocks() clear EOFBLOCKS bit, and later
ext4_falloc_update_inode() restore it again.
Dmitry Monakhov [Tue, 25 Oct 2011 09:35:05 +0000 (05:35 -0400)]
ext4: remove messy logic from ext4_ext_rm_leaf
- Both callers(truncate and punch_hole) already aligned left end point
so we no longer need split logic here.
- Remove dead duplicated code.
- Call ext4_ext_dirty only after we have updated eh_entries, otherwise
we'll loose entries update. Regression caused by d583fb87a3ff0
266'th testcase in xfstests (http://patchwork.ozlabs.org/patch/120872)
Dmitry Monakhov [Sat, 22 Oct 2011 05:26:05 +0000 (01:26 -0400)]
ext4: cleanup ext4_ext_grow_indepth code
Currently code make an impression what grow procedure is very complicated
and some mythical paths, blocks are involved. But in fact grow in depth
it relatively simple procedure:
1) Just create new meta block and copy root data to that block.
2) Convert root from extent to index if old depth == 0
3) Update root block pointer
This patch does:
- Reorganize code to make it more self explanatory
- Do not pass path parameter to new_meta_block() in order to
provoke allocation from inode's group because top-level block
should site closer to it's inode, but not to leaf data block.
[ This happens anyway, due to logic in mballoc; we should drop
the path parameter from new_meta_block() entirely. -- tytso ]
Kazuya Mio [Thu, 20 Oct 2011 23:23:08 +0000 (19:23 -0400)]
ext4: fix the deadlock in mpage_da_map_and_submit()
If ext4_jbd2_file_inode() in mpage_da_map_and_submit() fails due to
journal abort, this function returns to caller without unlocking the
page. It leads to the deadlock, and the patch fixes this issue by
calling mpage_da_submit_io().
Akira Fujita [Thu, 20 Oct 2011 22:56:10 +0000 (18:56 -0400)]
ext4: fix deadlock in ext4_ordered_write_end()
If ext4_jbd2_file_inode() in ext4_ordered_write_end() fails for some
reasons, this function returns to caller without unlocking the page.
It leads to the deadlock, and the patch fixes this issue.
The function declarations in ext4.h are already marked extern, so it's
not necessary to do so in the .c files.
This quiets the sparse noise:
warning: function 'ext4_flush_completed_IO' with external linkage has definition
warning: function 'ext4_init_inode_table' with external linkage has definition
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Shaohua Li [Tue, 18 Oct 2011 14:55:51 +0000 (10:55 -0400)]
ext4: add block plug for .writepages
Add block plug for ext4 .writepages. Though ext4 .writepages
already handles request merge very well, block plug is still
helpful to reduce block lock contention.
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Darrick J. Wong [Tue, 18 Oct 2011 14:53:51 +0000 (10:53 -0400)]
ext4: Fix comparison endianness problem in MMP initialization
As part of startup, the MMP initialization code does this:
mmp->mmp_seq = seq = cpu_to_le32(mmp_new_seq());
Next, mmp->mmp_seq is written out to disk, a delay happens, and then
the MMP block is read back in and the sequence value is tested:
if (seq != le32_to_cpu(mmp->mmp_seq)) {
/* fail the mount */
On a LE system such as x86, the *le32* functions do nothing and this
works. Unfortunately, on a BE system such as ppc64, this comparison
becomes:
if (cpu_to_le32(new_seq) != le32_to_cpu(cpu_to_le32(new_seq)) {
/* fail the mount */
Except for a few palindromic sequence numbers, this test always causes
the mount to fail, which makes MMP filesystems generally unmountable
on ppc64. The attached patch fixes this situation.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4: MMP: fix error message rate-limiting logic in kmmpd
Current logic would print an error message only once, and then
'failed_writes' would stay at 1. Rework the loop to increment
'failed_writes' and print the error message every
s_mmp_update_interval * 60 seconds, as intended according to the
comment.
Signed-off-by: Nikitas Angelinas <nikitas_angelinas@xyratex.com> Signed-off-by: Andrew Perepechko <andrew_perepechko@xyratex.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Andreas Dilger <adilger@dilger.ca>
ext4: MMP: kmmpd should use nodename from init_uts_ns.name, not sysname
sysname holds "Linux" by default, i.e. what appears when doing a "uname
-s"; nodename should be used to print the machine's hostname, i.e. what
is returned when doing a "uname -n" or "hostname", and what
gethostname(2)/sethostname(2) manipulate, in order to notify the
administrator of the node which is contending to mount the filesystem.
Acked-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Nikitas Angelinas <nikitas_angelinas@xyratex.com> Signed-off-by: Andrew Perepechko <andrew_perepechko@xyratex.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Fabrice Jouhaud [Sat, 8 Oct 2011 20:26:03 +0000 (16:26 -0400)]
ext4: fix ext4 so it works without CONFIG_PROC_FS
This fixes a bug which was introduced in dd68314ccf3fb. The problem
came from the test of the return value of proc_mkdir which is always
false without procfs, and this would initialization of ext4.
Tao Ma [Sat, 8 Oct 2011 19:56:35 +0000 (15:56 -0400)]
ext4: remove the obsolete/broken EXT4_IOC_WAIT_FOR_READONLY ioctl
There are no users of the EXT4_IOC_WAIT_FOR_READONLY ioctl, and it is
also broken. No one sets the set_ro_timer, no one wakes up us and our
state is set to TASK_INTERRUPTIBLE not RUNNING. So remove it.
Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tao Ma [Sat, 8 Oct 2011 19:53:49 +0000 (15:53 -0400)]
ext4: fix the comment describing ext4_ext_search_right()
The comment describing what ext4_ext_search_right() does is incorrect.
We return 0 in *phys when *logical is the 'largest' allocated block,
not smallest.
Fix a few other typos while we're at it.
Cc: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Lukas Czerner [Sat, 8 Oct 2011 18:34:47 +0000 (14:34 -0400)]
ext4: remove deprecated oldalloc
For a long time now orlov is the default block allocator in the
ext4. It performs better than the old one and no one seems to claim
otherwise so we can safely drop it and make oldalloc and orlov mount
option deprecated.
This is a part of the effort to reduce number of ext4 options hence the
test matrix.
Theodore Ts'o [Sat, 8 Oct 2011 18:01:08 +0000 (14:01 -0400)]
ext4: documentation: remove acl and user_xattr mount options
Acl and user_xattr mount options are no longer needed since those
features are enabled by default if configured in (seee commit ea6633369458992241599c9d9ebadffaeddec164). We can not easily deprecate
mount options itself (since it is probably too early), but we can
remove it from documentation first.
Tao Ma [Thu, 6 Oct 2011 14:22:28 +0000 (10:22 -0400)]
ext4: Free resources in ext4_mb_init()'s error paths
In commit 79a77c5ac, we move ext4_mb_init_backend after the allocation
of s_locality_group to avoid memory leak in error path, but there are
still some other error paths in ext4_mb_init that need to do the same
work. So this patch adds all the error patch for ext4_mb_init. And all
the pointers are reset to NULL in case the caller may double free them.
Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Aditya Kali [Fri, 9 Sep 2011 23:20:51 +0000 (19:20 -0400)]
ext4: attempt to fix race in bigalloc code path
Currently, there exists a race between delayed allocated writes and
the writeback when bigalloc feature is in use. The race was because we
wanted to determine what blocks in a cluster are under delayed
allocation and we were using buffer_delayed(bh) check for it. But, the
writeback codepath clears this bit without any synchronization which
resulted in a race and an ext4 warning similar to:
EXT4-fs (ram1): ext4_da_update_reserve_space: ino 13, used 1 with only 0
reserved data blocks
The race existed in two places.
(1) between ext4_find_delalloc_range() and ext4_map_blocks() when called from
writeback code path.
(2) between ext4_find_delalloc_range() and ext4_da_get_block_prep() (where
buffer_delayed(bh) is set.
To fix (1), this patch introduces a new buffer_head state bit -
BH_Da_Mapped. This bit is set under the protection of
EXT4_I(inode)->i_data_sem when we have actually mapped the delayed
allocated blocks during the writeout time. We can now reliably check
for this bit inside ext4_find_delalloc_range() to determine whether
the reservation for the blocks have already been claimed or not.
To fix (2), it was necessary to set buffer_delay(bh) under the
protection of i_data_sem. So, I extracted the very beginning of
ext4_map_blocks into a new function - ext4_da_map_blocks() - and
performed the required setting of bh_delay bit and the quota
reservation under the protection of i_data_sem. These two fixes makes
the checking of buffer_delay(bh) and buffer_da_mapped(bh) consistent,
thus removing the race.
Tested: I was able to reproduce the problem by running 'dd' and
'fsync' in parallel. Also, xfstests sometimes used to reproduce this
race. After the fix both my test and xfstests were successful and no
race (warning message) was observed.
ext4: Rename ext4_free_blks_{count,set}() to refer to clusters
The field bg_free_blocks_count_{lo,high} in the block group
descriptor has been repurposed to hold the number of free clusters for
bigalloc functions. So rename the functions so it makes it easier to
read and audit the block allocation and block freeing code.
Note: at this point in bigalloc development we doesn't support
online resize, so this also makes it really obvious all of the places
we need to fix up to add support for online resize.
Aditya Kali [Fri, 9 Sep 2011 23:04:51 +0000 (19:04 -0400)]
ext4: Fix bigalloc quota accounting and i_blocks value
With bigalloc changes, the i_blocks value was not correctly set (it was still
set to number of blocks being used, but in case of bigalloc, we want i_blocks
to represent the number of clusters being used). Since the quota subsystem sets
the i_blocks value, this patch fixes the quota accounting and makes sure that
the i_blocks value is set correctly.
Signed-off-by: Aditya Kali <adityakali@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4: tune mballoc's default group prealloc size for bigalloc file systems
The default group preallocation size had been previously set to 512
blocks/clusters, regardless of the block/cluster size. This is
probably too big for large cluster sizes. So adjust the default so
that it is 2 megabytes or 32 clusters, whichever is larger.
ext4: convert s_{dirty,free}blocks_counter to s_{dirty,free}clusters_counter
Convert the percpu counters s_dirtyblocks_counter and
s_freeblocks_counter in struct ext4_super_info to be
s_dirtyclusters_counter and s_freeclusters_counter.
ext4: teach ext4_ext_truncate() about the bigalloc feature
When we are truncating (as opposed unlinking) a file, we need to worry
about partial truncates of a file, especially in the light of sparse
files. The changes here make sure that arbitrary truncates of sparse
files works correctly. Yeah, it's messy.
Note that these functions will need to be revisted when the punch
ioctl is integrated --- in fact this commit will probably have merge
conflicts with the punch changes which Allison Henders and the IBM LTC
have been working on. I will need to fix this up when either patch
hits mainline.
ext4: teach ext4_free_blocks() about bigalloc and clusters
The ext4_free_blocks() function now has two new flags that indicate
whether a partial cluster at the beginning or the end of the block
extents should be freed or not. That will be up the caller (i.e.,
truncate), who can figure out whether partial clusters at the
beginning or the end of a block range can be freed.
We also have to update the ext4_mb_free_metadata() and
release_blocks_on_commit() machinery to be cluster-based, since it is
used by ext4_free_blocks().
ext4: teach mballoc preallocation code about bigalloc clusters
In most of mballoc.c, we do everything in units of clusters, since the
block allocation bitmaps and buddy bitmaps are all denominated in
clusters. The one place where we do deal with absolute block numbers
is in the code that handles the preallocation regions, since in the
case of inode-based preallocation regions, the start of the
preallocation region can't be relative to the beginning of the group.
So this adds a bit of complexity, where pa_pstart and pa_lstart are
block numbers, while pa_free, pa_len, and fe_len are denominated in
units of clusters.
ext4: convert block group-relative offsets to use clusters
Certain parts of the ext4 code base, primarily in mballoc.c, use a
block group number and offset from the beginning of the block group.
This offset is invariably used to index into the allocation bitmap, so
change the offset to be denominated in units of clusters.
The function ext4_free_blocks_after_init() used to be a #define of
ext4_init_block_bitmap(). This actually made it difficult to
understand how the function worked, and made it hard make changes to
support clusters. So as an initial cleanup, I've separated out the
functionality of initializing block bitmap from calculating the number
of free blocks in the new block group.
ext4: factor out block group accounting into functions
This makes it easier to understand how ext4_init_block_bitmap() works,
and it will assist when we split out ext4_free_blocks_after_init() in
the next commit.
ext4: convert instances of EXT4_BLOCKS_PER_GROUP to EXT4_CLUSTERS_PER_GROUP
Change the places in fs/ext4/mballoc.c where EXT4_BLOCKS_PER_GROUP are
used to indicate the number of bits in a block bitmap (which is really
a cluster allocation bitmap in bigalloc file systems). There are
still some places in the ext4 codebase where usage of
EXT4_BLOCKS_PER_GROUP needs to be audited/fixed, in code paths that
aren't used given the initial restricted assumptions for bigalloc.
These will need to be fixed before we can relax those restrictions.
ext4: enforce bigalloc restrictions (e.g., no online resizing, etc.)
At least initially if the bigalloc feature is enabled, we will not
support non-extent mapped inodes, online resizing, online defrag, or
the FITRIM ioctl. This simplifies the initial implementation.
This adds supports for bigalloc file systems. It teaches the mount
code just enough about bigalloc superblock fields that it will mount
the file system without freaking out that the number of blocks per
group is too big.
ext4: add ext4-specific kludge to avoid an oops after the disk disappears
The del_gendisk() function uninitializes the disk-specific data
structures, including the bdi structure, without telling anyone
else. Once this happens, any attempt to call mark_buffer_dirty()
(for example, by ext4_commit_super), will cause a kernel OOPS.
Fix this for now until we can fix things in an architecturally correct
way.
While running extended fsx tests to verify the preceeding patches,
a similar bug was also found in the write operation
When ever a write operation begins or ends in a hole,
or extends EOF, the partial page contained in the hole
or beyond EOF needs to be zeroed out.
To correct this the new ext4_discard_partial_page_buffers_no_lock
routine is used to zero out the partial page, but only for buffer
heads that are already unmapped.
While running extended fsx tests to verify the first
two patches, a similar bug was also found in the
truncate operation.
This bug happens because the truncate routine only zeros
the unblock aligned portion of the last page. This means
that the block aligned portions of the page appearing after
i_size are left unzeroed, and the buffer heads still mapped.
This bug is corrected by using ext4_discard_partial_page_buffers
in the truncate routine to zero the partial page and unmap
the buffer headers.
ext4: only call ext4_jbd2_file_inode when an inode has been extended
In delayed allocation mode, it's important to only call
ext4_jbd2_file_inode when the file has been extended. This is
necessary to avoid a race which first got introduced in commit 678aaf481, but which was made much more common with the introduction
of the "punch hole" functionality. (Especially when dioread_nolock
was enabled; when I could reliably reproduce this problem with
xfstests #74.)
The race is this: If while trying to writeback a delayed allocation
inode, there is a need to map delalloc blocks, and we run out of space
in the journal, *and* at the same time the inode is already on the
committing transaction's t_inode_list (because for example while doing
the punch hole operation, ext4_jbd2_file_inode() is called), then the
commit operation will wait for the inode to finish all of its pending
writebacks by calling filemap_fdatawait(), but since that inode has
one or more pages with the PageWriteback flag set, the commit
operation will wait forever, and the so the writeback of the inode can
never take place, and the kjournald thread and the writeback thread
end up waiting for each other --- forever.
It's important at this point to recall why an inode is placed on the
t_inode_list; it is to provide the data=ordered guarantees that we
don't end up exposing stale data. In the case where we are truncating
or punching a hole in the inode, there is no possibility that stale
data could be exposed in the first place, so we don't need to put the
inode on the t_inode_list!
The right long-term fix is to get rid of data=ordered mode altogether,
and only update the extent tree or indirect blocks after the data has
been written. Until then, this change will also avoid some
unnecessary waiting in the commit operation.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Allison Henderson <achender@linux.vnet.ibm.com> Cc: Jan Kara <jack@suse.cz>
Dan Carpenter [Sun, 4 Sep 2011 14:20:14 +0000 (10:20 -0400)]
jbd2: use gfp_t instead of int
This silences some Sparse warnings:
fs/jbd2/transaction.c:135:69: warning: incorrect type in argument 2 (different base types)
fs/jbd2/transaction.c:135:69: expected restricted gfp_t [usertype] flags
fs/jbd2/transaction.c:135:69: got int [signed] gfp_mask
Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
jbd2: add debugging information to jbd2_journal_dirty_metadata()
Add debugging information in case jbd2_journal_dirty_metadata() is
called with a buffer_head which didn't have
jbd2_journal_get_write_access() called on it, or if the journal_head
has the wrong transaction in it. In addition, return an error code.
This won't change anything for ocfs2, which will BUG_ON() the non-zero
exit code.
For ext4, the caller of this function is ext4_handle_dirty_metadata(),
and on seeing a non-zero return code, will call __ext4_journal_stop(),
which will print the function and line number of the (buggy) calling
function and abort the journal. This will allow us to recover instead
of bug halting, which is better from a robustness and reliability
point of view.
ext4: improve handling of conflicting mount options
If the user explicitly specifies conflicting mount options for
delalloc or dioread_nolock and data=journal, fail the mount, instead
of printing a warning and continuing (since many user's won't look at
dmesg and notice the warning).
Also, print a single warning that data=journal implies that delayed
allocation is not on by default (since it's not supported), and
furthermore that O_DIRECT is not supported. Improve the text in
Documentation/filesystems/ext4.txt so this is clear there as well.
Similarly, if the dioread_nolock mount option is specified when the
file system block size != PAGE_SIZE, fail the mount instead of
printing a warning message and ignoring the mount option.
This patch fixes a second punch hole bug found by xfstests 127.
This bug happens because punch hole needs to flush the pages
of the hole to avoid race conditions. But if the end of the
hole is in the same page as i_size, the buffer heads beyond
i_size need to be unmapped and the page needs to be zeroed
after it is flushed.
To correct this, the new ext4_discard_partial_page_buffers
routine is used to zero and unmap the partial page
beyond i_size if the end of the hole appears in the same
page as i_size.
The code has also been optimized to set the end of the hole
to the page after i_size if the specified hole exceeds i_size,
and the code that flushes the pages has been simplified.
This patch addresses a bug found by xfstests 75, 112, 127
when blocksize = 1k
This bug happens because the punch hole code only zeros
out non block aligned regions of the page. This means that if the
blocks are smaller than a page, then the block aligned regions of
the page inside the hole are left un-zeroed, and their buffer heads
are still mapped. This bug is corrected by using
ext4_discard_partial_page_buffers to properly zero the partial page
at the head and tail of the hole, and unmap the corresponding buffer
heads
This patch also addresses a bug reported by Lukas while working on a
new patch to add discard support for loop devices using punch hole.
The bug happened because of the first and last block number
needed to be cast to a larger data type before calculating the
byte offset, but since now we only need the byte offsets of the
pages, we no longer even need to be calculating the byte offsets
of the blocks. The code to do the block offset calculations is
removed in this patch.
ext4: Add new ext4_discard_partial_page_buffers routines
This patch adds two new routines: ext4_discard_partial_page_buffers
and ext4_discard_partial_page_buffers_no_lock.
The ext4_discard_partial_page_buffers routine is a wrapper
function to ext4_discard_partial_page_buffers_no_lock.
The wrapper function locks the page and passes it to
ext4_discard_partial_page_buffers_no_lock.
Calling functions that already have the page locked can call
ext4_discard_partial_page_buffers_no_lock directly.
The ext4_discard_partial_page_buffers_no_lock function
zeros a specified range in a page, and unmaps the
corresponding buffer heads. Only block aligned regions of the
page will have their buffer heads unmapped. Unblock aligned regions
will be mapped if needed so that they can be updated with the
partial zero out. This function is meant to
be used to update a page and its buffer heads to be zeroed
and unmapped when the corresponding blocks have been released
or will be released.
This routine is used in the following scenarios:
* A hole is punched and the non page aligned regions
of the head and tail of the hole need to be discarded
* The file is truncated and the partial page beyond EOF needs
to be discarded
* The end of a hole is in the same page as EOF. After the
page is flushed, the partial page beyond EOF needs to be
discarded.
* A write operation begins or ends inside a hole and the partial
page appearing before or after the write needs to be discarded
* A write operation extends EOF and the partial page beyond EOF
needs to be discarded
This function takes a flag EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED
which is used when a write operation begins or ends in a hole.
When the EXT4_DISCARD_PARTIAL_PG_ZERO_UNMAPPED flag is used, only
buffer heads that are already unmapped will have the corresponding
regions of the page zeroed.
Theodore Ts'o [Wed, 31 Aug 2011 16:02:51 +0000 (12:02 -0400)]
ext4: call ext4_handle_dirty_metadata with correct inode in ext4_dx_add_entry
ext4_dx_add_entry manipulates bh2 and frames[0].bh, which are two buffer_heads
that point to directory blocks assigned to the directory inode. However, the
function calls ext4_handle_dirty_metadata with the inode of the file that's
being added to the directory, not the directory inode itself. Therefore,
correct the code to dirty the directory buffers with the directory inode, not
the file inode.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
Darrick J. Wong [Wed, 31 Aug 2011 16:00:51 +0000 (12:00 -0400)]
ext4: ext4_mkdir should dirty dir_block with newly created directory inode
ext4_mkdir calls ext4_handle_dirty_metadata with dir_block and the inode "dir".
Unfortunately, dir_block belongs to the newly created directory (which is
"inode"), not the parent directory (which is "dir"). Fix the incorrect
association.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
Darrick J. Wong [Wed, 31 Aug 2011 15:58:51 +0000 (11:58 -0400)]
ext4: ext4_rename should dirty dir_bh with the correct directory
When ext4_rename performs a directory rename (move), dir_bh is a
buffer that is modified to update the '..' link in the directory being
moved (old_inode). However, ext4_handle_dirty_metadata is called with
the old parent directory inode (old_dir) and dir_bh, which is
incorrect because dir_bh does not belong to the parent inode. Fix
this error.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@kernel.org
Theodore Ts'o [Wed, 31 Aug 2011 15:56:51 +0000 (11:56 -0400)]
ext4: fake direct I/O mode for data=journal
Currently attempts to open a file with O_DIRECT in data=journal mode
causes the open to fail with -EINVAL. This makes it very hard to test
data=journal mode. So we will let the open succeed, but then always
fall back to O_DSYNC buffered writes.
Theodore Ts'o [Wed, 31 Aug 2011 15:54:51 +0000 (11:54 -0400)]
ext2,ext3,ext4: don't inherit APPEND_FL or IMMUTABLE_FL for new inodes
This doesn't make much sense, and it exposes a bug in the kernel where
attempts to create a new file in an append-only directory using
O_CREAT will fail (but still leave a zero-length file). This was
discovered when xfstests #79 was generalized so it could run on all
file systems.
Jiaying Zhang [Wed, 31 Aug 2011 15:50:51 +0000 (11:50 -0400)]
ext4: remove i_mutex lock in ext4_evict_inode to fix lockdep complaining
The i_mutex lock and flush_completed_IO() added by commit 2581fdc810
in ext4_evict_inode() causes lockdep complaining about potential
deadlock in several places. In most/all of these LOCKDEP complaints
it looks like it's a false positive, since many of the potential
circular locking cases can't take place by the time the
ext4_evict_inode() is called; but since at the very least it may mask
real problems, we need to address this.
This change removes the flush_completed_IO() and i_mutex lock in
ext4_evict_inode(). Instead, we take a different approach to resolve
the software lockup that commit 2581fdc810 intends to fix. Rather
than having ext4-dio-unwritten thread wait for grabing the i_mutex
lock of an inode, we use mutex_trylock() instead, and simply requeue
the work item if we fail to grab the inode's i_mutex lock.
This should speed up work queue processing in general and also
prevents the following deadlock scenario: During page fault,
shrink_icache_memory is called that in turn evicts another inode B.
Inode B has some pending io_end work so it calls ext4_ioend_wait()
that waits for inode B's i_ioend_count to become zero. However, inode
B's ioend work was queued behind some of inode A's ioend work on the
same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
thread on that cpu is processing inode A's ioend work, it tries to
grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
still hold before the page fault happened, we enter a deadlock.
Linus Torvalds [Mon, 22 Aug 2011 18:25:44 +0000 (11:25 -0700)]
Merge branch 'stable/bug.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
* 'stable/bug.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/tracing: Fix tracing config option properly
xen: Do not enable PV IPIs when vector callback not present
xen/x86: replace order-based range checking of M2P table by linear one
xen: xen-selfballoon.c needs more header files
This change replaced the SMP operations with event based handlers without
taking into account that this only works when the hypervisor supports
callback vectors. This causes unexplainable hangs early on boot for
HVM guests with more than one CPU.
BugLink: http://bugs.launchpad.net/bugs/791850 CC: stable@kernel.org Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Tested-and-Reported-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Josef Bacik [Sat, 20 Aug 2011 12:29:51 +0000 (08:29 -0400)]
Btrfs: fix 64 bit divide problem
This fixes a regression introduced by commit cdcb725c05fe ("Btrfs: check
if there is enough space for balancing smarter"). We can't do 64-bit
divides on 32-bit architectures.
In cases where we need to divide/multiply by 2 we should just left/right
shift respectively, and in cases where theres N number of devices use
do_div. Also make the counters u64 to match up with rw_devices.
Thanks,
Linus Torvalds [Sun, 21 Aug 2011 13:59:41 +0000 (06:59 -0700)]
Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: flush any pending end_io requests before DIO reads w/dioread_nolock
ext4: fix nomblk_io_submit option so it correctly converts uninit blocks
ext4: Resolve the hang of direct i/o read in handling EXT4_IO_END_UNWRITTEN.
ext4: call ext4_ioend_wait and ext4_flush_completed_IO in ext4_evict_inode
ext4: Fix ext4_should_writeback_data() for no-journal mode
Linus Torvalds [Sun, 21 Aug 2011 13:59:02 +0000 (06:59 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
ALSA: sound/aoa/fabrics/layout.c: remove unneeded kfree
ALSA: hda - Fix error check from snd_hda_get_conn_index() in patch_cirrus.c
ALSA: hda - Don't spew too many ELD errors
ALSA: usb-audio - Fix missing mixer dB information
ALSA: hda - Add "PCM" volume to vmaster slave list
ALSA: hda - Fix duplicated capture-volume creation for ALC268 models
ALSA: ac97: Add HP Compaq dc5100 SFF(PT003AW) to Headphone Jack Sense whitelist
ALSA: snd_usb_caiaq: track submitted output urbs
Randy Dunlap [Sat, 20 Aug 2011 18:49:43 +0000 (11:49 -0700)]
pci: fix new kernel-doc warning in pci.c
Fix new kernel-doc warning in pci.c:
Warning(drivers/pci/pci.c:3259): No description found for parameter 'mps'
Warning(drivers/pci/pci.c:3259): Excess function parameter 'rq' description in 'pcie_set_mps'
(
if (...) { ... when != kfree(x)
when != x = E3
when != E3 = x
* return ...;
}
... when != x = E2
when != I(...,x,...) S
if (...) { ... when != x = E4
kfree(x); ... return ...; }
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Takashi Iwai [Sat, 20 Aug 2011 07:14:45 +0000 (09:14 +0200)]
ALSA: hda - Don't spew too many ELD errors
Currently HD-audio driver shows the all error ELD byte as an error
in the kernel message. This is annoying when the video driver doesn't
set the correct ELD from the beginning. e.g. radeon sends a zero-byte
data, but we still check ELD with the fixed 128 byte as a workaround
for some broken devices, it spews 128-times errors.
For avoiding this, the driver aborts reading when the first byte is
invalid. In such a case, the whole data is certainly invalid.
Linus Torvalds [Sat, 20 Aug 2011 06:07:08 +0000 (23:07 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/drm-intel
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/drm-intel:
drm/i915: set GFX_MODE to pre-Ivybridge default value even on Ivybridge
Jiaying Zhang [Fri, 19 Aug 2011 23:13:32 +0000 (19:13 -0400)]
ext4: flush any pending end_io requests before DIO reads w/dioread_nolock
There is a race between ext4 buffer write and direct_IO read with
dioread_nolock mount option enabled. The problem is that we clear
PageWriteback flag during end_io time but will do
uninitialized-to-initialized extent conversion later with dioread_nolock.
If an O_direct read request comes in during this period, ext4 will return
zero instead of the recently written data.
This patch checks whether there are any pending uninitialized-to-initialized
extent conversion requests before doing O_direct read to close the race.
Note that this is just a bandaid fix. The fundamental issue is that we
clear PageWriteback flag before we really complete an IO, which is
problem-prone. To fix the fundamental issue, we may need to implement an
extent tree cache that we can use to look up pending to-be-converted extents.
Jesse Barnes [Fri, 12 Aug 2011 22:28:32 +0000 (15:28 -0700)]
drm/i915: set GFX_MODE to pre-Ivybridge default value even on Ivybridge
Prior to Ivybridge, the GFX_MODE would default to 0x800, meaning that
MI_FLUSH would flush the TLBs in addition to the rest of the caches
indicated in the MI_FLUSH command. However starting with Ivybridge, the
register defaults to 0x2800 out of reset, meaning that to invalidate the
TLB we need to use PIPE_CONTROL. Since we're not doing that yet, go
back to the old default so things work.
v2: don't forget to actually *clear* the new bit
Reviewed-by: Eric Anholt <eric@anholt.net> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> Tested-by: Kenneth Graunke <kenneth@whitecape.org> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>