Miao Xie [Thu, 20 Feb 2014 10:08:52 +0000 (18:08 +0800)]
Btrfs: fix the skipped transaction commit during the file sync
We may abort the wait earlier if ->last_trans_log_full_commit was set to
the current transaction id, at this case, we need commit the current
transaction instead of the log sub-transaction. But the current code
didn't tell the caller to do it (return 0, not -EAGAIN). Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
Miao Xie [Thu, 20 Feb 2014 10:08:51 +0000 (18:08 +0800)]
Btrfs: use ACCESS_ONCE to prevent the optimize accesses to ->last_trans_log_full_commit
->last_trans_log_full_commit may be changed by the other tasks without lock,
so we need prevent the compiler from the optimize access just like
tmp = fs_info->last_trans_log_full_commit
if (tmp == ...)
...
<do something>
if (tmp == ...)
...
In fact, we need get the new value of ->last_trans_log_full_commit during
the second access. Fix it by ACCESS_ONCE().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
Liu Bo [Mon, 10 Feb 2014 09:37:25 +0000 (17:37 +0800)]
Btrfs: avoid warning bomb of btrfs_invalidate_inodes
So after transaction is aborted, we need to cleanup inode resources by
calling btrfs_invalidate_inodes(), and btrfs_invalidate_inodes() hopes
roots' refs to be zero in old times and sets a WARN_ON(), however, this
is not always true within cleaning up transaction, so we get to detect
transaction abortion and not warn at all.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
Liu Bo [Mon, 10 Feb 2014 09:07:16 +0000 (17:07 +0800)]
Btrfs: fix possible deadlock in btrfs_cleanup_transaction
[13654.480669] ======================================================
[13654.480905] [ INFO: possible circular locking dependency detected ]
[13654.481003] 3.12.0+ #4 Tainted: G W O
[13654.481060] -------------------------------------------------------
[13654.481060] btrfs-transacti/9347 is trying to acquire lock:
[13654.481060] (&(&root->ordered_extent_lock)->rlock){+.+...}, at: [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs]
[13654.481060] but task is already holding lock:
[13654.481060] (&(&fs_info->ordered_root_lock)->rlock){+.+...}, at: [<ffffffffa02d3015>] btrfs_cleanup_transaction+0x1e5/0x570 [btrfs]
[13654.481060] which lock already depends on the new lock.
Instead of inserting the new file extent item if we deleted existing
file extent items covering our target file range, also allow to insert
the new file extent item if we didn't find any existing items to delete
and replace_extent != 0, since in this case our caller would do another
tree search to insert the new file extent item anyway, therefore just
combine the two tree searches into a single one, saving cpu time, reducing
lock contention and reducing btree node/leaf COW operations.
This covers the case where applications keep doing tail append writes to
files, which for example is the case of Apache CouchDB (its database and
view index files are always open with O_APPEND).
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
Wang Shilong [Sat, 8 Feb 2014 15:46:36 +0000 (23:46 +0800)]
Revert "Btrfs: remove transaction from btrfs send"
This reverts commit 41ce9970a8a6a362ae8df145f7a03d789e9ef9d2.
Previously i was thinking we can use readonly root's commit root
safely while it is not true, readonly root may be cowed with the
following cases.
1.snapshot send root will cow source root.
2.balance,device operations will also cow readonly send root
to relocate.
So i have two ideas to make us safe to use commit root.
-->approach 1:
make it protected by transaction and end transaction properly and we research
next item from root node(see btrfs_search_slot_for_read()).
-->approach 2:
add another counter to local root structure to sync snapshot with send.
and add a global counter to sync send with exclusive device operations.
So with approach 2, send can use commit root safely, because we make sure
send root can not be cowed during send. Unfortunately, it make codes *ugly*
and more complex to maintain.
To make snapshot and send exclusively, device operations and send operation
exclusively with each other is a little confusing for common users.
So why not drop into previous way.
Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
Hidetoshi Seto [Wed, 5 Feb 2014 07:34:38 +0000 (16:34 +0900)]
Btrfs: skip submitting barrier for missing device
I got an error on v3.13:
BTRFS error (device sdf1) in write_all_supers:3378: errno=-5 IO failure (errors while submitting device barriers.)
how to reproduce:
> mkfs.btrfs -f -d raid1 /dev/sdf1 /dev/sdf2
> wipefs -a /dev/sdf2
> mount -o degraded /dev/sdf1 /mnt
> btrfs balance start -f -sconvert=single -mconvert=single -dconvert=single /mnt
The reason of the error is that barrier_all_devices() failed to submit
barrier to the missing device. However it is clear that we cannot do
anything on missing device, and also it is not necessary to care chunks
on the missing device.
This patch stops sending/waiting barrier if device is missing.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Josef Bacik <jbacik@fb.com>
Josef Bacik [Fri, 7 Feb 2014 17:21:23 +0000 (12:21 -0500)]
Btrfs: unlock extent and pages on error in cow_file_range
When I converted the BUG_ON() for the free_space_cache_inode in cow_file_range I
made it so we just return an error instead of unlocking all of our various
stuff. This is a mistake and causes us to hang when we run into this. This
patch fixes this problem. Thanks,
Josef Bacik [Thu, 6 Feb 2014 21:06:06 +0000 (16:06 -0500)]
Btrfs: balance delayed inode updates
While trying to reproduce a delayed ref problem I noticed the box kept falling
over using all 80gb of my ram with btrfs_inode's and btrfs_delayed_node's.
Turns out this is because we only throttle delayed inode updates in
btrfs_dirty_inode, which doesn't actually get called that often, especially when
all you are doing is creating a bunch of files. So balance delayed inode
updates everytime we create a new inode. With this patch we no longer use up
all of our ram with delayed inode updates. Thanks,
David Sterba [Wed, 5 Feb 2014 15:17:34 +0000 (16:17 +0100)]
btrfs: send: lower memory requirements in common case
The fs_path structure uses an inline buffer and falls back to a chain of
allocations, but vmalloc is not necessary because PATH_MAX fits into
PAGE_SIZE.
The size of fs_path has been reduced to 256 bytes from PAGE_SIZE,
usually 4k. Experimental measurements show that most paths on a single
filesystem do not exceed 200 bytes, and these get stored into the inline
buffer directly, which is now 230 bytes. Longer paths are kmalloced when
needed.
Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
Btrfs: make some tree searches in send.c more efficient
We have this pattern where we do search for a contiguous group of
items in a tree and everytime we find an item, we process it, then
we release our path, increment the offset of the search key, do
another full tree search and repeat these steps until a tree search
can't find more items we're interested in.
Instead of doing these full tree searches after processing each item,
just process the next item/slot in our leaf and don't release the path.
Since all these trees are read only and we always use the commit root
for a search and skip node/leaf locks, we're not affecting concurrency
on the trees.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
David Sterba [Mon, 3 Feb 2014 17:24:09 +0000 (18:24 +0100)]
btrfs: send: replace check with an assert in gen_unique_name
The buffer passed to snprintf can hold the fully expanded format string,
64 = 3x largest ULL + 3x char + trailing null. I don't think that removing the
check entirely is a good idea, hence the ASSERT.
Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>
Btrfs: more send support for parent/child dir relationship inversion
The commit titled "Btrfs: fix infinite path build loops in incremental send"
didn't cover a particular case where the parent-child relationship inversion
of directories doesn't imply a rename of the new parent directory. This was
due to a simple logic mistake, a logical and instead of a logical or.
Btrfs: fix send dealing with file renames and directory moves
This fixes a case that the commit titled:
Btrfs: fix infinite path build loops in incremental send
didn't cover. If the parent-child relationship between 2 directories
is inverted, both get renamed, and the former parent has a file that
got renamed too (but remains a child of that directory), the incremental
send operation would use the file's old path after sending an unlink
operation for that old path, causing receive to fail on future operations
like changing owner, permissions or utimes of the corresponding inode.
This is not a regression from the commit mentioned before, as without
that commit we would fall into the issues that commit fixed, so it's
just one case that wasn't covered before.
Hugo Mills [Thu, 30 Jan 2014 20:17:00 +0000 (20:17 +0000)]
btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl
The structure for BTRFS_SET_RECEIVED_IOCTL packs differently on 32-bit
and 64-bit systems. This means that it is impossible to use btrfs
receive on a system with a 64-bit kernel and 32-bit userspace, because
the structure size (and hence the ioctl number) is different.
This patch adds a compatibility structure and ioctl to deal with the
above case.
Signed-off-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Josef Bacik <jbacik@fb.com>
Btrfs: add missing error check in incremental send
Function wait_for_parent_move() returns negative value if an error
happened, 0 if we don't need to wait for the parent's move, and
1 if the wait is needed.
Before this change an error return value was being treated like the
return value 1, which was not correct.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
Miao Xie [Thu, 30 Jan 2014 08:46:55 +0000 (16:46 +0800)]
Btrfs: fix use-after-free in the finishing procedure of the device replace
During device replace test, we hit a null pointer deference (It was very easy
to reproduce it by running xfstests' btrfs/011 on the devices with the virtio
scsi driver). There were two bugs that caused this problem:
- We might allocate new chunks on the replaced device after we updated
the mapping tree. And we forgot to replace the source device in those
mapping of the new chunks.
- We might get the mapping information which including the source device
before the mapping information update. And then submit the bio which was
based on that mapping information after we freed the source device.
For the first bug, we can fix it by doing mapping tree update and source
device remove in the same context of the chunk mutex. The chunk mutex is
used to protect the allocable device list, the above method can avoid
the new chunk allocation, and after we remove the source device, all
the new chunks will be allocated on the new device. So it can fix
the first bug.
For the second bug, we need make sure all flighting bios are finished and
no new bios are produced during we are removing the source device. To fix
this problem, we introduced a global @bio_counter, we not only inc/dec
@bio_counter outsize of map_blocks, but also inc it before submitting bio
and dec @bio_counter when ending bios.
Since Raid56 is a little different and device replace dosen't support raid56
yet, it is not addressed in the patch and I add comments to make sure we will
fix it in the future.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
Miao Xie [Tue, 14 Jan 2014 12:31:51 +0000 (20:31 +0800)]
Btrfs: don't mix the ordered extents of all files together during logging the inodes
There was a problem in the old code:
If we failed to log the csum, we would free all the ordered extents in the log list
including those ordered extents that were logged successfully, it would make the
log committer not to wait for the completion of the ordered extents.
This patch doesn't insert the ordered extents that is about to be logged into
a global list, instead, we insert them into a local list. If we log the ordered
extents successfully, we splice them with the global list, or we will throw them
away, then do full sync. It can also reduce the lock contention and the traverse
time of list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
Btrfs: use right clone root offset for compressed extents
For non compressed extents, iterate_extent_inodes() gives us offsets
that take into account the data offset from the file extent items, while
for compressed extents it doesn't. Therefore we have to adjust them before
placing them in a send clone instruction. Not doing this adjustment leads to
the receiving end requesting for a wrong a file range to the clone ioctl,
which results in different file content from the one in the original send
root.
Issue reproducible with the following excerpt from the test I made for
xfstests:
Josef Bacik [Fri, 14 Feb 2014 18:43:48 +0000 (13:43 -0500)]
Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol
A user was running into errors from an NFS export of a subvolume that had a
default subvol set. When we mount a default subvol we will use d_obtain_alias()
to find an existing dentry for the subvolume in the case that the root subvol
has already been mounted, or a dummy one is allocated in the case that the root
subvol has not already been mounted. This allows us to connect the dentry later
on if we wander into the path. However if we don't ever wander into the path we
will keep DCACHE_DISCONNECTED set for a long time, which angers NFS. It doesn't
appear to cause any problems but it is annoying nonetheless, so simply unset
DCACHE_DISCONNECTED in the get_default_root case and switch btrfs_lookup() to
use d_materialise_unique() instead which will make everything play nicely
together and reconnect stuff if we wander into the defaul subvol path from a
different way. With this patch I'm no longer getting the NFS errors when
exporting a volume that has been mounted with a default subvol set. Thanks,
cc: bfields@fieldses.org
cc: ebiederm@xmission.com Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Chris Mason <clm@fb.com>
Mitch Harder [Thu, 13 Feb 2014 15:13:16 +0000 (09:13 -0600)]
Btrfs: fix max_inline mount option
Currently, the only mount option for max_inline that has any effect is
max_inline=0. Any other value that is supplied to max_inline will be
adjusted to a minimum of 4k. Since max_inline has an effective maximum
of ~3900 bytes due to page size limitations, the current behaviour
only has meaning for max_inline=0.
This patch will allow the the max_inline mount option to accept non-zero
values as indicated in the documentation.
Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Chris Mason <clm@fb.com>
David Sterba found a different way to provide these features without adding a new
ioctl. We haven't released any progs with this ioctl yet, so I'm taking this out
for now until we finalize things.
Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.cz> CC: Jeff Mahoney <jeffm@suse.com>
Btrfs: fix data corruption when reading/updating compressed extents
When using a mix of compressed file extents and prealloc extents, it
is possible to fill a page of a file with random, garbage data from
some unrelated previous use of the page, instead of a sequence of zeroes.
A simple sequence of steps to get into such case, taken from the test
case I made for xfstests, is:
This results in the following file items in the fs tree:
item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160
inode generation 6 transid 6 size 542872 block group 0 mode 100600
item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16
inode ref index 2 namelen 6 name: foobar
item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
extent data disk byte 0 nr 0 gen 6
extent data offset 0 nr 24576 ram 266240
extent compression 0
item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53
prealloc data disk byte 12849152 nr 241664 gen 6
prealloc data offset 0 nr 241664
item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53
extent data disk byte 12845056 nr 4096 gen 6
extent data offset 0 nr 20480 ram 20480
extent compression 2
item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53
prealloc data disk byte 13090816 nr 405504 gen 6
prealloc data offset 0 nr 258048
The on disk extent at offset 266240 (which corresponds to 1 single disk block),
contains 5 compressed chunks of file data. Each of the first 4 compress 4096
bytes of file data, while the last one only compresses 3024 bytes of file data.
Therefore a read into the file region [285648 ; 286720[ (length = 4096 - 3024 =
1072 bytes) should always return zeroes (our next extent is a prealloc one).
The solution here is the compression code path to zero the remaining (untouched)
bytes of the last page it uncompressed data into, as the information about how
much space the file data consumes in the last page is not known in the upper layer
fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were correctly zeroing
the remainder of the page but only if it corresponds to the last page of the inode
and if the inode's size is not a multiple of the page size.
This would cause not only returning random data on reads, but also permanently
storing random data when updating parts of the region that should be zeroed.
For the example above, it means updating a single byte in the region [285648 ; 286720[
would store that byte correctly but also store random data on disk.
A test case for xfstests follows soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Fri, 7 Feb 2014 18:57:59 +0000 (13:57 -0500)]
Btrfs: don't loop forever if we can't run because of the tree mod log
A user reported a 100% cpu hang with my new delayed ref code. Turns out I
forgot to increase the count check when we can't run a delayed ref because of
the tree mod log. If we can't run any delayed refs during this there is no
point in continuing to look, and we need to break out. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
David Sterba [Fri, 7 Feb 2014 13:34:04 +0000 (14:34 +0100)]
btrfs: reserve no transaction units in btrfs_ioctl_set_features
Added in patch "btrfs: add ioctls to query/change feature bits online"
modifications to superblock don't need to reserve metadata blocks when
starting a transaction.
Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
Jeff Mahoney [Fri, 7 Feb 2014 13:33:57 +0000 (14:33 +0100)]
btrfs: commit transaction after setting label and features
The set_fslabel ioctl uses btrfs_end_transaction, which means it's
possible that the change will be lost if the system crashes, same for
the newly set features. Let's use btrfs_commit_transaction instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Wed, 5 Feb 2014 21:19:21 +0000 (16:19 -0500)]
Btrfs: fix assert screwup for the pending move stuff
Wang noticed that he was failing btrfs/030 even though me and Filipe couldn't
reproduce. Turns out this is because Wang didn't have CONFIG_BTRFS_ASSERT set,
which meant that a key part of Filipe's original patch was not being built in.
This appears to be a mess up with merging Filipe's patch as it does not exist in
his original patch. Fix this by changing how we make sure del_waiting_dir_move
asserts that it did not error and take the function out of the ifdef check.
This makes btrfs/030 pass with the assert on or off. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
It seems that when init_btrfs_fs() is called, crc32c/crc32c-intel might
not always be already initialized, which results in the call to crypto_alloc_shash()
returning -ENOENT, as experienced by Ahmet who reported this.
Therefore make sure init_btrfs_fs() is called after crc32c is initialized (which
is at initialization level 6, module_init), by using late_initcall (which is at
initialization level 7) instead of module_init for btrfs.
Reported-and-Tested-by: Ahmet Inan <ainan@mathematik.uni-freiburg.de> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
Btrfs: use btrfs_crc32c everywhere instead of libcrc32c
After the commit titled "Btrfs: fix btrfs boot when compiled as built-in",
LIBCRC32C requirement was removed from btrfs' Kconfig. This made it not
possible to build a kernel with btrfs enabled (either as module or built-in)
if libcrc32c is not enabled as well. So just replace all uses of libcrc32c
with the equivalent function in btrfs hash.h - btrfs_crc32c.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
Chris Mason [Fri, 10 Jan 2014 01:28:00 +0000 (17:28 -0800)]
Btrfs: setup inode location during btrfs_init_inode_locked
We have a race during inode init because the BTRFS_I(inode)->location is setup
after the inode hash table lock is dropped. btrfs_find_actor uses the location
field, so our search might not find an existing inode in the hash table if we
race with the inode init code.
This commit changes things to setup the location field sooner. Also the find actor now
uses only the location objectid to match inodes. For inode hashing, we just
need a unique and stable test, it doesn't have to reflect the inode numbers we
show to userland.
Signed-off-by: Chris Mason <clm@fb.com> CC: stable@vger.kernel.org
Chris Mason [Sat, 4 Jan 2014 05:07:00 +0000 (21:07 -0800)]
Btrfs: don't use ram_bytes for uncompressed inline items
If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect
the new size. The fixe uses the size directly from the item header when
reading uncompressed inlines, and also fixes truncate to update the
size as it goes.
Reported-by: Jens Axboe <axboe@fb.com> Signed-off-by: Chris Mason <clm@fb.com> CC: stable@vger.kernel.org
If the current path's leaf slot is 0, we do search for the previous
leaf (via btrfs_prev_leaf) and set the new path's leaf slot to a
value corresponding to the number of items - 1 of the former leaf.
Fix this by using the slot set by btrfs_prev_leaf, decrementing it
by 1 if it's equal to the leaf's number of items.
Use of btrfs_search_slot_for_read() for backward iteration is used in
particular by the send feature, which could miss items when the input
leaf has less items than its previous leaf.
This could be reproduced by running btrfs/007 from xfstests in a loop.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Tue, 28 Jan 2014 16:25:35 +0000 (00:25 +0800)]
Btrfs: do not export ulist functions
There are not any users that use ulist except Btrfs,don't
export them.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Tue, 28 Jan 2014 16:25:34 +0000 (00:25 +0800)]
Btrfs: rework ulist with list+rb_tree
We are really suffering from now ulist's implementation, some developers
gave their try, and i just gave some of my ideas for things:
1. use list+rb_tree instead of arrary+rb_tree
2. add cur_list to iterator rather than ulist structure.
3. add seqnum into every node when they are added, this is
used to do selfcheck when iterating node.
I noticed Zach Brown's comments before, long term is to kick off
ulist implementation, however, for now, we need at least avoid
arrary from ulist.
Cc: Liu Bo <bo.li.liu@oracle.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Zach Brown <zab@redhat.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Tue, 28 Jan 2014 11:13:38 +0000 (19:13 +0800)]
Btrfs: fix memory leaks on walking backrefs failure
When walking backrefs, we may iterate every inode's extent
and add/merge them into ulist, and the caller will free memory
from ulist.
However, if we fail to allocate inode's extents element
memory or ulist_add() fail to allocate memory, we won't
add allocated memory into ulist, and the caller won't
free some allocated memory thus memory leaks happen.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Btrfs: fix send file hole detection leading to data corruption
There was a case where file hole detection was incorrect and it would
cause an incremental send to override a section of a file with zeroes.
This happened in the case where between the last leaf we processed which
contained a file extent item for our current inode and the leaf we're
currently are at (and has a file extent item for our current inode) there
are only leafs containing exclusively file extent items for our current
inode, and none of them was updated since the previous send operation.
The file hole detection code would incorrectly consider the file range
covered by these leafs as a hole.
A test case for xfstests follows soon.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Sun, 26 Jan 2014 14:32:18 +0000 (22:32 +0800)]
Btrfs: add a reschedule point in btrfs_find_all_roots()
I can easily trigger the following warnings when enabling quota
in my virtual machine(running Opensuse), Steps are firstly creating
a subvolume full of fragment extents, and then create many snapshots
(500 in my test case).
By adding a reschedule point at the end of btrfs_find_all_roots(), i no longer
hit these warnings.
Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Btrfs: make send's file extent item search more efficient
Instead of looking for a file extent item, process it, release the path
and do a btree search for the next file extent item, just process all
file extent items in a leaf without intermediate btree searches. This way
we save cpu and we're not blocking other tasks or affecting concurrency on
the btree, because send's paths use the commit root and skip btree node/leaf
locking.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Thu, 23 Jan 2014 05:47:48 +0000 (13:47 +0800)]
Btrfs: fix protection between walking backrefs and root deletion
There is a race condition between resolving indirect ref and root deletion,
and we should gurantee that root can not be destroyed to avoid accessing
broken tree here.
Here we fix it by holding @subvol_srcu, and we will release it as soon
as we have held root node lock.
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Gui Hecheng [Thu, 23 Jan 2014 05:41:09 +0000 (13:41 +0800)]
btrfs: fix warning while merging two adjacent extents
When we have two adjacent extents in relink_extent_backref,
we try to merge them. When we use btrfs_search_slot to locate the
slot for the current extent, we shouldn't set "ins_len = 1",
because we will merge it into the previous extent rather than
insert a new item. Otherwise, we may happen to create a new leaf
in btrfs_search_slot and path->slot[0] will be 0. Then we try to
fetch the previous item using "path->slots[0]--", and it will cause
a warning as follows:
I encounter this warning when running defrag having mkfs.btrfs
with option -M. At the same time there are read/writes & snapshots
running at background.
Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Btrfs: fix infinite path build loops in incremental send
The send operation processes inodes by their ascending number, and assumes
that any rename/move operation can be successfully performed (sent to the
caller) once all previous inodes (those with a smaller inode number than the
one we're currently processing) were processed.
This is not true when an incremental send had to process an hierarchical change
between 2 snapshots where the parent-children relationship between directory
inodes was reversed - that is, parents became children and children became
parents. This situation made the path building code go into an infinite loop,
which kept allocating more and more memory that eventually lead to a krealloc
warning being displayed in dmesg:
Even without this loop, the incremental send couldn't succeed, because it would attempt
to send a rename/move operation for the lower inode before the highest inode number was
renamed/move. This issue is easy to trigger with the following steps:
Anand Jain [Wed, 22 Jan 2014 03:15:51 +0000 (11:15 +0800)]
btrfs: undo sysfs when open_ctree() fails
reproducer:
mkfs.btrfs -f /dev/sdb &&\
mount /dev/sdb /btrfs &&\
btrfs dev add -f /dev/sdc /btrfs &&\
umount /btrfs &&\
wipefs -a /dev/sdc &&\
mount -o degraded /dev/sdb /btrfs
//above mount fails so try with RO
mount -o degraded,ro /dev/sdb /btrfs
Btrfs: fix snprintf usage by send's gen_unique_name
The buffer size argument passed to snprintf must account for the
trailing null byte added by snprintf, and it returns a value >= then
sizeof(buffer) when the string can't fit in the buffer.
Since our buffer has a size of 64 characters, and the maximum orphan
name we can generate is 63 characters wide, we must pass 64 as the
buffer size to snprintf, and not 63.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Justin Maggard [Tue, 21 Jan 2014 19:18:29 +0000 (11:18 -0800)]
btrfs: fix defrag 32-bit integer overflow
When defragging a very large file, the cluster variable can wrap its 32-bit
signed int type and become negative, which eventually gets passed to
btrfs_force_ra() as a very large unsigned long value. On 32-bit platforms,
this eventually results in an Oops from the SLAB allocator.
Change the cluster and max_cluster signed int variables to unsigned long to
match the readahead functions. This also allows the min() comparison in
btrfs_defrag_file() to work as intended.
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
David Sterba [Wed, 15 Jan 2014 17:15:52 +0000 (18:15 +0100)]
btrfs: restrict snapshotting to own subvolumes
Currently, any user can snapshot any subvolume if the path is accessible and
thus indirectly create and keep files he does not own under his direcotries.
This is not possible with traditional directories.
In security context, a user can snapshot root filesystem and pin any
potentially buggy binaries, even if the updates are applied.
All the snapshots are visible to the administrator, so it's possible to
verify if there are suspicious snapshots.
Another more practical problem is that any user can pin the space used
by eg. root and cause ENOSPC.
Original report:
https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/484786
CC: stable@vger.kernel.org Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Miao Xie [Wed, 15 Jan 2014 12:00:56 +0000 (20:00 +0800)]
Btrfs: cleanup the code of used_block_group in find_free_extent()
used_block_group is just used for the space cluster which doesn't
belong to the current block group, the other place needn't use it.
Or the logic of code seems unclear.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Miao Xie [Wed, 15 Jan 2014 12:00:54 +0000 (20:00 +0800)]
Btrfs: change the members' order of btrfs_space_info structure to reduce the cache miss
It is better that the position of the lock is close to the data which is
protected by it, because they may be in the same cache line, we will load
less cache lines when we access them. So we rearrange the members' position
of btrfs_space_info structure to make the lock be closer to the its data.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Miao Xie [Tue, 14 Jan 2014 12:31:52 +0000 (20:31 +0800)]
Btrfs: flush the dirty pages of the ordered extent aggressively during logging csum
The performance of fsync dropped down suddenly sometimes, the main reason
of this problem was that we might only flush part dirty pages in a ordered
extent, then got that ordered extent, wait for the csum calcucation. But if
no task flushed the left part, we would wait until the flusher flushed them,
sometimes we need wait for several seconds, it made the performance drop
down suddenly. (On my box, it drop down from 56MB/s to 4-10MB/s)
This patch improves the above problem by flushing left dirty pages aggressively.
Test Environment:
CPU: 2CPU * 2Cores
Memory: 4GB
Partition: 20GB(HDD)
Wang Shilong [Tue, 14 Jan 2014 11:42:20 +0000 (19:42 +0800)]
Btrfs: fix transaction abortion when remounting btrfs from RW to RO
Steps to reproduce:
# mkfs.btrfs -f /dev/sda8
# mount /dev/sda8 /mnt -o flushoncommit
# dd if=/dev/zero of=/mnt/data bs=4k count=102400 &
# mount /dev/sda8 /mnt -o remount, ro
When remounting RW to RO, the logic is to firstly set flag
to RO and then commit transaction, however with option
flushoncommit enabled,we will do RO check within committing
transaction, so we get a transaction abortion here.
Actually,here check is wrong, we should check if FS_STATE_ERROR
is set, fix it.
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Suggested-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Btrfs: faster file extent item search in clone ioctl
When we are looking for file extent items that intersect the cloning
range, for each one that falls completely outside the range, don't
release the path and do another full tree search - just move on
to the next slot and copy the file extent item into our buffer only
if the item intersects the cloning range.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Liu Bo [Mon, 13 Jan 2014 11:53:53 +0000 (19:53 +0800)]
Btrfs: fix extent state leak on transaction abortion
When transaction is aborted, we fail to commit transaction, instead we do
cleanup work. After that when we umount btrfs, we get to free fs roots' log
trees respectively, but that happens after we unpin extents, so those extents
pinned by freeing log trees will remain in memory and lead to the leak.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Qu Wenruo [Mon, 13 Jan 2014 05:36:07 +0000 (13:36 +0800)]
btrfs: Cleanup the btrfs_parse_options for remount.
Since remount will pending the new mount options to the original mount
options, which will make btrfs_parse_options check the old options then
new options, causing some stupid output like "enabling XXX" following by
"disable XXX".
This patch will add extra check before every btrfs_info to skip the
output from old options checking.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Qu Wenruo [Mon, 13 Jan 2014 05:36:06 +0000 (13:36 +0800)]
btrfs: Add noinode_cache mount option
Add noinode_cache mount option for btrfs.
Since inode map cache involves all the btrfs_find_free_ino/return_ino
things and if just trigger the mount_opt,
an inode number get from inode map cache will not returned to inode map
cache.
To keep the find and return inode both in the same behavior,
a new bit in mount_opt, CHANGE_INODE_CACHE, is introduced for this idea.
CHANGE_INODE_CACHE is set/cleared in remounting, and the original
INODE_MAP_CACHE is set/cleared according to CHANGE_INODE_CACHE after a
success transaction.
Since find/return inode is all done between btrfs_start_transaction and
btrfs_commit_transaction, this will keep consistent behavior.
Also noinode_cache mount option will not stop the caching_kthread.
Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Sun, 12 Jan 2014 13:38:33 +0000 (21:38 +0800)]
Btrfs: fix to search previous metadata extent item since skinny metadata
There is a bug that using btrfs_previous_item() to search metadata extent item.
This is because in btrfs_previous_item(), we need type match, however, since
skinny metada was introduced by josef, we may mix this two types. So just
use btrfs_previous_item() is not working right.
To keep btrfs_previous_item() like normal tree search, i introduce another
function btrfs_previous_extent_item().
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Btrfs: fix send to not send non-aligned clone operations
It is possible for the send feature to send clone operations that
request a cloning range (offset + length) that is not aligned with
the block size. This makes the btrfs receive command send issue a
clone ioctl call that will fail, as the ioctl will return an -EINVAL
error because of the unaligned range.
Fix this by not sending clone operations for non block aligned ranges,
and instead send regular write operation for these (less common) cases.
The following xfstest reproduces this issue, which fails on the second
btrfs receive command without this change:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=`mktemp -d`
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -fr $tmp
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_supported_fs btrfs
_supported_os Linux
_require_scratch
_need_to_be_root
QA output created by 025
FSSync 'SCRATCH_MNT'
FSSync 'SCRATCH_MNT'
wrote 2978/2978 bytes at offset 1482752
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
FSSync 'SCRATCH_MNT'
Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap1'
FSSync 'SCRATCH_MNT'
Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap2'
At subvol SCRATCH_MNT/mysnap1
At subvol SCRATCH_MNT/mysnap2 129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/foo 42b6369eae2a8725c1aacc0440e597aa SCRATCH_MNT/mysnap1/foo 129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/mysnap2/foo
At subvol mysnap1 42b6369eae2a8725c1aacc0440e597aa SCRATCH_MNT/mysnap1/foo
At snapshot mysnap2 129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/mysnap2/foo
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
After the change titled "Btrfs: add support for inode properties", if
btrfs was built-in the kernel (i.e. not as a module), it would cause a
kernel panic, as reported recently by Fengguang:
The issue here is that the initialization function of btrfs (super.c:init_btrfs_fs)
started using crc32c (from lib/libcrc32c.c). But when it needs to call crc32c (as
part of the properties initialization routine), the libcrc32c is not yet initialized,
so crc32c derreferenced a NULL pointer (lib/libcrc32c.c:tfm), causing the kernel
panic on boot.
The approach to fix this is to use crypto component directly to use its crc32c (which
is basically what lib/libcrc32c.c is, a wrapper around crypto). This is what ext4 is
doing as well, it uses crypto directly to get crc32c functionality.
Verified this works both when btrfs is built-in and when it's loadable kernel module.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Btrfs: unlock inodes in correct order in clone ioctl
In the clone ioctl, when the source and target inodes are different,
we can acquire their mutexes in 2 possible different orders. After
we're done cloning, we were releasing the mutexes always in the same
order - the most correct way of doing it is to release them by the
reverse order they were acquired.
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Fri, 10 Jan 2014 13:25:46 +0000 (21:25 +0800)]
Btrfs: optimize to remove unnecessary removal with ulist reallocation
Here we are not going to free memory, no need to remove every node
one by one, just init root node here is ok.
Cc: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Liu Bo [Thu, 9 Jan 2014 06:57:06 +0000 (14:57 +0800)]
Btrfs: release subvolume's block_rsv before transaction commit
We don't have to keep subvolume's block_rsv during transaction commit,
and within transaction commit, we may also need the free space reclaimed
from this block_rsv to process delayed refs.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Miao Xie [Thu, 9 Jan 2014 02:06:10 +0000 (10:06 +0800)]
Btrfs: fix the race between write back and nocow buffered write
When we ran the 274th case of xfstests with nodatacow mount option,
We met the following warning message:
WARNING: CPU: 1 PID: 14185 at fs/btrfs/extent-tree.c:3734 btrfs_free_reserved_data_space+0xa6/0xd0
It is caused by the race between the write back and nocow buffered
write:
Task1 Task2
__btrfs_buffered_write()
skip data reservation
reserve the metadata space
copy the data
dirty the pages
unlock the pages
write back the pages
release the data space
becasue there is no
noreserve flag
set the noreserve flag
This patch fixes this problem by unlocking the pages after
the noreserve flag is set.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Fri, 24 Jan 2014 19:05:42 +0000 (14:05 -0500)]
Btrfs: only process as many file extents as there are refs
The backref walking code will search down to the key it is looking for and then
proceed to walk _all_ of the extents on the file until it hits the end. This is
suboptimal with large files, we only need to look for as many extents as we have
references for that inode. I have a testcase that creates a randomly written 4
gig file and before this patch it took 6min 30sec to do the initial send, with
this patch it takes 2min 30sec to do the intial send. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Thu, 23 Jan 2014 21:03:45 +0000 (16:03 -0500)]
Btrfs: fix extent_from_logical to deal with skinny metadata
I don't think this is an issue and I've not seen it in practice but
extent_from_logical will fail to find a skinny extent because it uses
btrfs_previous_item and gives it the normal extent item type. This is just not
a place to use btrfs_previous_item since we care about either normal extents or
skinny extents, so open code btrfs_previous_item to properly check. This would
only affect metadata and the only place this is used for metadata is scrub and
I'm pretty sure it's just for printing stuff out, not actually doing any work so
hopefully it was never a problem other than a cosmetic one. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Thu, 23 Jan 2014 15:54:11 +0000 (10:54 -0500)]
Btrfs: throttle delayed refs better
On one of our gluster clusters we noticed some pretty big lag spikes. This
turned out to be because our transaction commit was taking like 3 minutes to
complete. This is because we have like 30 gigs of metadata, so our global
reserve would end up being the max which is like 512 mb. So our throttling code
would allow a ridiculous amount of delayed refs to build up and then they'd all
get run at transaction commit time, and for a cold mounted file system that
could take up to 3 minutes to run. So fix the throttling to be based on both
the size of the global reserve and how long it takes us to run delayed refs.
This patch tracks the time it takes to run delayed refs and then only allows 1
seconds worth of outstanding delayed refs at a time. This way it will auto-tune
itself from cold cache up to when everything is in memory and it no longer has
to go to disk. This makes our transaction commits take much less time to run.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Thu, 23 Jan 2014 14:21:38 +0000 (09:21 -0500)]
Btrfs: attach delayed ref updates to delayed ref heads
Currently we have two rb-trees, one for delayed ref heads and one for all of the
delayed refs, including the delayed ref heads. When we process the delayed refs
we have to hold onto the delayed ref lock for all of the selecting and merging
and such, which results in quite a bit of lock contention. This was solved by
having a waitqueue and only one flusher at a time, however this hurts if we get
a lot of delayed refs queued up.
So instead just have an rb tree for the delayed ref heads, and then attach the
delayed ref updates to an rb tree that is per delayed ref head. Then we only
need to take the delayed ref lock when adding new delayed refs and when
selecting a delayed ref head to process, all the rest of the time we deal with a
per delayed ref head lock which will be much less contentious.
The locking rules for this get a little more complicated since we have to lock
up to 3 things to properly process delayed refs, but I will address that problem
later. For now this passes all of xfstests and my overnight stress tests.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Josef Bacik [Wed, 15 Jan 2014 18:34:13 +0000 (13:34 -0500)]
Btrfs: make fsync latency less sucky
Looking into some performance related issues with large amounts of metadata
revealed that we can have some pretty huge swings in fsync() performance. If we
have a lot of delayed refs backed up (as you will tend to do with lots of
metadata) fsync() will wander off and try to run some of those delayed refs
which can result in reading from disk and such. Since the actual act of fsync()
doesn't create any delayed refs there is no need to make it throttle on delayed
ref stuff, that will be handled by other people. With this patch we get much
smoother fsync performance with large amounts of metadata. Thanks,
Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
This change adds infrastructure to allow for generic properties for
inodes. Properties are name/value pairs that can be associated with
inodes for different purposes. They are stored as xattrs with the
prefix "btrfs."
Properties can be inherited - this means when a directory inode has
inheritable properties set, these are added to new inodes created
under that directory. Further, subvolumes can also have properties
associated with them, and they can be inherited from their parent
subvolume. Naturally, directory properties have priority over subvolume
properties (in practice a subvolume property is just a regular
property associated with the root inode, objectid 256, of the
subvolume's fs tree).
This change also adds one specific property implementation, named
"compression", whose values can be "lzo" or "zlib" and it's an
inheritable property.
The corresponding changes to btrfs-progs were also implemented.
A patch with xfstests for this feature will follow once there's
agreement on this change/feature.
Further, the script at the bottom of this commit message was used to
do some benchmarks to measure any performance penalties of this feature.
Basically the tests correspond to:
Test 1 - create a filesystem and mount it with compress-force=lzo,
then sequentially create N files of 64Kb each, measure how long it took
to create the files, unmount the filesystem, mount the filesystem and
perform an 'ls -lha' against the test directory holding the N files, and
report the time the command took.
Test 2 - create a filesystem and don't use any compression option when
mounting it - instead set the compression property of the subvolume's
root to 'lzo'. Then create N files of 64Kb, and report the time it took.
The unmount the filesystem, mount it again and perform an 'ls -lha' like
in the former test. This means every single file ends up with a property
(xattr) associated to it.
Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
compression property, have no real effect other than adding more work
when inheriting properties and taking more btree leaf space.
Test 4 - same as test 3 but with 10 properties per file.
Results (in seconds, and averages of 5 runs each), for different N
numbers of files follow.
* Without properties (test 1)
file creation time ls -lha time
10 000 files 3.49 0.76
100 000 files 47.19 8.37
1 000 000 files 518.51 107.06
* With 1 property (compression property set to lzo - test 2)
file creation time ls -lha time
10 000 files 3.63 0.93
100 000 files 48.56 9.74
1 000 000 files 537.72 125.11
* With 4 properties (test 3)
file creation time ls -lha time
10 000 files 3.94 1.20
100 000 files 52.14 11.48
1 000 000 files 572.70 142.13
* With 10 properties (test 4)
file creation time ls -lha time
10 000 files 4.61 1.35
100 000 files 58.86 13.83
1 000 000 files 656.01 177.61
The increased latencies with properties are essencialy because of:
*) When creating an inode, we now synchronously write 1 more item
(an xattr item) for each property inherited from the parent dir
(or subvolume). This could be done in an asynchronous way such
as we do for dir intex items (delayed-inode.c), which could help
reduce the file creation latency;
*) With properties, we now have larger fs trees. For this particular
test each xattr item uses 75 bytes of leaf space in the fs tree.
This could be less by using a new item for xattr items, instead of
the current btrfs_dir_item, since we could cut the 'location' and
'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
total of 26 bytes per xattr item) from the btrfs_dir_item type.
Also tried batching the xattr insertions (ignoring proper hash
collision handling, since it didn't exist) when creating files that
inherit properties from their parent inode/subvolume, but the end
results were (surprisingly) essentially the same.
Test script:
$ cat test.pl
#!/usr/bin/perl -w
use strict;
use Time::HiRes qw(time);
use constant NUM_FILES => 10_000;
use constant FILE_SIZES => (64 * 1024);
use constant DEV => '/dev/sdb4';
use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
use constant TEST_DIR => (MNT_POINT . '/testdir');
system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
# following line for testing without properties
#system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
# following 2 lines for testing with properties
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
my ($t1, $t2);
$t1 = time();
for (my $i = 1; $i <= NUM_FILES; $i++) {
my $p = TEST_DIR . '/file_' . $i;
open(my $f, '>', $p) or die "Error opening file!";
$f->autoflush(1);
for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
print $f ('A' x 4096) or die "Error writing to file!";
}
close($f);
}
$t2 = time();
print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
$t1 = time();
system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
$t2 = time();
print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
system("umount", DEV) == 0 or die "umount failed!";
Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
When writing to a file we drop existing file extent items that cover the
write range and then add a new file extent item that represents that write
range.
Before this change we were doing a tree lookup to remove the file extent
items, and then after we did another tree lookup to insert the new file
extent item.
Most of the time all the file extent items we need to drop are located
within a single leaf - this is the leaf where our new file extent item ends
up at. Therefore, in this common case just combine these 2 operations into
a single one.
By avoiding the second btree navigation for insertion of the new file extent
item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf
COW operations, CPU time on btree node/leaf key binary searches, etc.
Besides for file writes, this is an operation that happens for file fsync's
as well. However log btrees are much less likely to big as big as regular
fs btrees, therefore the impact of this change is smaller.
The following benchmark was performed against an SSD drive and a
HDD drive, both for random and sequential writes:
Wang Shilong [Tue, 7 Jan 2014 09:26:58 +0000 (17:26 +0800)]
Btrfs: handle EAGAIN case properly in btrfs_drop_snapshot()
We may return early in btrfs_drop_snapshot(), we shouldn't
call btrfs_std_err() for this case, fix it.
Cc: stable@vger.kernel.org Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Tue, 7 Jan 2014 09:25:19 +0000 (17:25 +0800)]
Btrfs: fix protection between send and root deletion
We should gurantee that parent and clone roots can not be destroyed
during send, for this we have two ideas.
1.by holding @subvol_sem, this might be a nightmare, because it will
block all subvolumes deletion for a long time.
2.Miao pointed out we can reuse @send_in_progress, that mean we will
skip snapshot deletion if root sending is in progress.
Here we adopt the second approach since it won't block other subvolumes
deletion for a long time.
Besides in btrfs_clean_one_deleted_snapshot(), we only check first root
, if this root is involved in send, we return directly rather than
continue to check.There are several reasons about it:
1.this case happen seldomly.
2.after sending,cleaner thread can continue to drop that root.
3.make code simple
Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
Wang Shilong [Tue, 7 Jan 2014 09:25:18 +0000 (17:25 +0800)]
Btrfs: fix wrong send_in_progress accounting
Steps to reproduce:
# mkfs.btrfs -f /dev/sda8
# mount /dev/sda8 /mnt
# btrfs sub snapshot -r /mnt /mnt/snap1
# btrfs sub snapshot -r /mnt /mnt/snap2
# btrfs send /mnt/snap1 -p /mnt/snap2 -f /mnt/1
# dmesg
The problem is that we will sort clone roots(include @send_root), it
might push @send_root before thus @send_root's @send_in_progress will
be decreased twice.
Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>