Chris Ball [Tue, 29 Sep 2009 17:51:05 +0000 (13:51 -0400)]
Btrfs: Use CONFIG_BTRFS_POSIX_ACL to enable ACL code
We've already defined CONFIG_BTRFS_POSIX_ACL in Kconfig, but we're
currently not using it and are testing CONFIG_FS_POSIX_ACL instead.
CONFIG_FS_POSIX_ACL states "Never use this symbol for ifdefs".
Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Ball [Tue, 29 Sep 2009 17:51:04 +0000 (13:51 -0400)]
Btrfs: Fix setting umask when POSIX ACLs are not enabled
We currently set sb->s_flags |= MS_POSIXACL unconditionally, which is
incorrect -- it tells the VFS that it shouldn't set umask because we
will, yet we don't set it ourselves if we aren't using POSIX ACLs, so
the umask ends up ignored.
Signed-off-by: Chris Ball <cjb@laptop.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef Bacik [Fri, 11 Sep 2009 20:12:44 +0000 (16:12 -0400)]
Btrfs: proper -ENOSPC handling
At the start of a transaction we do a btrfs_reserve_metadata_space() and
specify how many items we plan on modifying. Then once we've done our
modifications and such, just call btrfs_unreserve_metadata_space() for
the same number of items we reserved.
For keeping track of metadata needed for data I've had to add an extent_io op
for when we merge extents. This lets us track space properly when we are doing
sequential writes, so we don't end up reserving way more metadata space than
what we need.
The only place where the metadata space accounting is not done is in the
relocation code. This is because Yan is going to be reworking that code in the
near future, so running btrfs-vol -b could still possibly result in a ENOSPC
related panic. This patch also turns off the metadata_ratio stuff in order to
allow users to more efficiently use their disk space.
This patch makes it so we track how much metadata we need for an inode's
delayed allocation extents by tracking how many extents are currently
waiting for allocation. It introduces two new callbacks for the
extent_io tree's, merge_extent_hook and split_extent_hook. These help
us keep track of when we merge delalloc extents together and split them
up. Reservations are handled prior to any actually dirty'ing occurs,
and then we unreserve after we dirty.
btrfs_unreserve_metadata_for_delalloc() will make the appropriate
unreservations as needed based on the number of reservations we
currently have and the number of extents we currently have. Doing the
reservation outside of doing any of the actual dirty'ing lets us do
things like filemap_flush() the inode to try and force delalloc to
happen, or as a last resort actually start allocation on all delalloc
inodes in the fs. This has survived dbench, fs_mark and an fsx torture
test.
Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
The extent relocation code copy file extents one by one when
relocating data block group. This is inefficient if file
extents are small. This patch makes the relocation code copy
file extents in clusters. So we can can make better use of
read-ahead.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
A recent change enforces only one access point to each subvolume. The first
directory entry (the one added when the subvolume/snapshot was created) is
treated as valid access point, all other subvolume links are linked to dummy
empty directories. The dummy directories are temporary inodes that only in
memory, so we can not rename file into them.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs: check size of inode backref before adding hardlink
For every hardlink in btrfs, there is a corresponding inode back
reference. All inode back references for hardlinks in a given
directory are stored in single b-tree item. The size of b-tree item
is limited by the size of b-tree leaf, so we can only create limited
number of hardlinks to a given file in a directory.
The original code lacks of the check, it oops if the number of
hardlinks goes over the limit. This patch fixes the issue by adding
check to btrfs_link and btrfs_rename.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Thu, 24 Sep 2009 00:28:46 +0000 (20:28 -0400)]
Btrfs: fix releasepage to avoid unlocking extents we haven't locked
During releasepage, we try to drop any extent_state structs for the
bye offsets of the page we're releaseing. But the code was incorrectly
telling clear_extent_bit to delete the state struct unconditionallly.
Normally this would be fine because we have the page locked, but other
parts of btrfs will lock down an entire extent, the most common place
being IO completion.
releasepage was deleting the extent state without first locking the extent,
which may result in removing a state struct that another process had
locked down. The fix here is to leave the NODATASUM and EXTENT_LOCKED
bits alone in releasepage.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Thu, 24 Sep 2009 00:23:16 +0000 (20:23 -0400)]
Btrfs: Fix test_range_bit for whole file extents
If test_range_bit finds an extent that goes all the way to (u64)-1, it
can incorrectly wrap the u64 instead of treaing it like the end of
the address space.
This just adds a check for the highest possible offset so we don't wrap.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Wed, 23 Sep 2009 23:51:09 +0000 (19:51 -0400)]
Btrfs: fix errors handling cached state in set/clear_extent_bit
Both set and clear_extent_bit allow passing a cached
state struct to reduce rbtree search times. clear_extent_bit
was improperly bypassing some of the checks around making sure
the extent state fields were correct for a given operation.
The fix used here (from Yan Zheng) is to use the hit_next
goto target instead of jumping all the way down to start clearing
bits without making sure the cached state was exactly correct
for the operation we were doing.
This also fixes up the setting of the start variable for both
ops in the case where we find an overlapping extent that
begins before the range we want to change. In both cases
we were incorrectly going backwards from the original
requested change.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Tue, 22 Sep 2009 18:48:44 +0000 (14:48 -0400)]
Btrfs: fix early enospc during balancing
We now do extra checks before a balance to make sure
there is room for the balance to take place. One of
the checks was testing to see if we were trying to
balance away the last block group of a given type.
If there is no space available for new chunks, we
should not try and balance away the last block group
of a give type. But, the code wasn't checking for
available chunk space, and so it was exiting too soon.
The fix here is to combine some of the checks and make
sure we try to allocate new chunks when we're balancing
the last block group.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Tue, 22 Sep 2009 18:45:50 +0000 (14:45 -0400)]
Btrfs: deal with NULL space info
After a balance it is briefly possible for the space info
field in the inode to be NULL. This adds some checks
to make sure things properly deal with the NULL value.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef Bacik [Fri, 11 Sep 2009 20:11:20 +0000 (16:11 -0400)]
Btrfs: account for space used by the super mirrors
As we get closer to proper -ENOSPC handling in btrfs, we need more accurate
space accounting for the space info's. Currently we exclude the free space for
the super mirrors, but the space they take up isn't accounted for in any of the
counters. This patch introduces bytes_super, which keeps track of the amount
of bytes used for a super mirror in the block group cache and space info. This
makes sure that our free space caclucations will be completely accurate.
Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef Bacik [Fri, 11 Sep 2009 20:11:20 +0000 (16:11 -0400)]
Btrfs: fix extent entry threshold calculation
There is a slight problem with the extent entry threshold calculation for the
free space cache. We only adjust the threshold down as we add bitmaps, but
never actually adjust the threshold up as we add bitmaps. This means we could
fragment the free space so badly that we end up using all bitmaps to describe
the free space, use all the free space which would result in the bitmaps being
freed, but then go to add free space again as we delete things and immediately
add bitmaps since the extent threshold would still be 0. Now as we free
bitmaps the extent threshold will be ratcheted up to allow more extent entries
to be added.
Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef Bacik [Fri, 11 Sep 2009 20:11:20 +0000 (16:11 -0400)]
Btrfs: fix bitmap size tracking
When we first go to add free space, we allocate a new info and set the offset
and bytes to the space we are adding. This is fine, except we actually set the
size of a bitmap as we set the bits in it, so if we add space to a bitmap, we'd
end up counting the same space twice. This isn't a huge deal, it just makes
the allocator behave weirdly since it will think that a bitmap entry has more
space than it ends up actually having. I used a BUG_ON() to catch when this
problem happened, and with this patch I no longer get the BUG_ON().
Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef Bacik [Fri, 11 Sep 2009 20:11:20 +0000 (16:11 -0400)]
Btrfs: don't keep retrying a block group if we fail to allocate a cluster
The box can get locked up in the allocator if we happen upon a block group
under these conditions:
1) During a commit, so caching threads cannot make progress
2) Our block group currently is in the middle of being cached
3) Our block group currently has plenty of free space in it
4) Our block group is so fragmented that it ends up having no free space chunks
larger than min_bytes calculated by btrfs_find_space_cluster.
What happens is we try and do btrfs_find_space_cluster, which fails because it
is unable to find enough free space chunks that are large than min_bytes and
are close enough together. Since the block group is not cached we do a
wait_block_group_cache_progress, which waits for the number of bytes we need,
except the block group already has _plenty_ of free space, its just severely
fragmented, so we loop and try again, ad infinitum. This patch keeps us from
waiting on the block group to finish caching if we failed to find a free space
cluster before. It also makes sure that we don't even try to find a free space
cluster if we are on our last loop in the allocator, since we will have tried
everything at this point at it is futile.
Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef Bacik [Fri, 11 Sep 2009 20:11:19 +0000 (16:11 -0400)]
Btrfs: make balance code choose more wisely when relocating
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs allows subvolumes and snapshots anywhere in the directory tree.
If we snapshot a subvolume that contains a link to other subvolume
called subvolA, subvolA can be accessed through both the original
subvolume and the snapshot. This is similar to creating hard link to
directory, and has the very similar problems.
The aim of this patch is enforcing there is only one access point to
each subvolume. Only the first directory entry (the one added when
the subvolume/snapshot was created) is treated as valid access point.
The first directory entry is distinguished by checking root forward
reference. If the corresponding root forward reference is missing,
we know the entry is not the first one.
This patch also adds snapshot/subvolume rename support, the code
allows rename subvolume link across subvolumes.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs: do not reuse objectid of deleted snapshot/subvol
The new back reference format does not allow reusing objectid of
deleted snapshot/subvol. So we use ++highest_objectid to allocate
objectid for new snapshot/subvol.
Now we use ++highest_objectid to allocate objectid for both new inode
and new snapshot/subvolume, so this patch removes 'find hole' code in
btrfs_find_free_objectid.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch contains two changes to avoid unnecessary tree block reads during
snapshot dropping.
First, check tree block's reference count and flags before reading the tree
block. if reference count > 1 and there is no need to update backrefs, we can
avoid reading the tree block.
Second, save when snapshot was created in root_key.offset. we can compare block
pointer's generation with snapshot's creation generation during updating
backrefs. If a given block was created before snapshot was created, the
snapshot can't be the tree block's owner. So we can avoid reading the block.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 18 Sep 2009 20:07:03 +0000 (16:07 -0400)]
Btrfs: search for an allocation hint while filling file COW
The allocator has some nice knobs for sending hints about where
to try and allocate new blocks, but when we're doing file allocations
we're not sending any hint at all.
This commit adds a simple extent map search to see if we can
quickly and easily find a hint for the allocator.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 18 Sep 2009 20:03:16 +0000 (16:03 -0400)]
Btrfs: properly honor wbc->nr_to_write changes
When btrfs fills a delayed allocation, it tries to increase
the wbc nr_to_write to cover a big part of allocation. The
theory is that we're doing contiguous IO and writing a few
more blocks will save seeks overall at a very low cost.
The problem is that extent_write_cache_pages could ignore
the new higher nr_to_write if nr_to_write had already gone
down to zero. We fix that by rechecking the nr_to_write
for every page that is processed in the pagevec.
This updates the math around bumping the nr_to_write value
to make sure we don't leave a tiny amount of IO hanging
around for the very end of a new extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng [Fri, 11 Sep 2009 20:11:19 +0000 (16:11 -0400)]
Btrfs: improve async block group caching
This patch gets rid of two limitations of async block group caching.
The old code delays handling pinned extents when block group is in
caching. To allocate logged file extents, the old code need wait
until block group is fully cached. To get rid of the limitations,
This patch introduces a data structure to track the progress of
caching. Base on the caching progress, we know which extents should
be added to the free space cache when handling the pinned extents.
The logged file extents are also handled in a similar way.
This patch also changes how pinned extents are tracked. The old
code uses one tree to track pinned extents, and copy the pinned
extents tree at transaction commit time. This patch makes it use
two trees to track pinned extents. One tree for extents that are
pinned in the running transaction, one tree for extents that can
be unpinned. At transaction commit time, we swap the two trees.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Wed, 16 Sep 2009 00:02:33 +0000 (20:02 -0400)]
Btrfs: Fix async thread shutdown race
It was possible for an async worker thread to be selected to
receive a new work item, but exit before the work item was
actually placed into that thread's work list.
This commit fixes the race by incrementing the num_pending
counter earlier, and making sure to check the number of pending
work items before a thread exits.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Tue, 15 Sep 2009 23:57:42 +0000 (19:57 -0400)]
Btrfs: fix async worker startup race
After a new worker thread starts, it is placed into the
list of idle threads. But, this may race with a
check for idle done by the worker thread itself, resulting
in a double list_add operation.
This fix adds a check to make sure the idle thread addition
is done properly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 11 Sep 2009 16:36:29 +0000 (12:36 -0400)]
Btrfs: zero page past end of inline file items
When btrfs_get_extent is reading inline file items for readpage,
it needs to copy the inline extent into the page. If the
inline extent doesn't cover all of the page, that means there
is a hole in the file, or that our file is smaller than one
page.
readpage does zeroing for the case where the file is smaller than one
page, but nobody is currently zeroing for the case where there is
a hole after the inline item.
This commit changes btrfs_get_extent to zero fill the page past
the end of the inline item.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 11 Sep 2009 16:27:37 +0000 (12:27 -0400)]
Btrfs: Fix extent replacment race
Data COW means that whenever we write to a file, we replace any old
extent pointers with new ones. There was a window where a readpage
might find the old extent pointers on disk and cache them in the
extent_map tree in ram in the middle of a given write replacing them.
Even though both the readpage and the write had their respective bytes
in the file locked, the extent readpage inserts may cover more bytes than
it had locked down.
This commit closes the race by keeping the new extent pinned in the extent
map tree until after the on-disk btree is properly setup with the new
extent pointers.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Wed, 2 Sep 2009 20:53:46 +0000 (16:53 -0400)]
Btrfs: Use PagePrivate2 to track pages in the data=ordered code.
Btrfs writes go through delalloc to the data=ordered code. This
makes sure that all of the data is on disk before the metadata
that references it. The tracking means that we have to make sure
each page in an extent is fully written before we add that extent into
the on-disk btree.
This was done in the past by setting the EXTENT_ORDERED bit for the
range of an extent when it was added to the data=ordered code, and then
clearing the EXTENT_ORDERED bit in the extent state tree as each page
finished IO.
One of the reasons we had to do this was because sometimes pages are
magically dirtied without page_mkwrite being called. The EXTENT_ORDERED
bit is checked at writepage time, and if it isn't there, our page become
dirty without going through the proper path.
These bit operations make for a number of rbtree searches for each page,
and can cause considerable lock contention.
This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
As pages go into the ordered code, PagePrivate2 is set on each one.
This is a cheap operation because we already have all the pages locked
and ready to go.
As IO finishes, the PagePrivate2 bit is cleared and the ordered
accoutning is updated for each page.
At writepage time, if the PagePrivate2 bit is missing, we go into the
writepage fixup code to handle improperly dirtied pages.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Wed, 2 Sep 2009 19:22:30 +0000 (15:22 -0400)]
Btrfs: use a cached state for extent state operations during delalloc
This changes the btrfs code to find delalloc ranges in the extent state
tree to use the new state caching code from set/test bit. It reduces
one of the biggest causes of rbtree searches in the writeback path.
test_range_bit is also modified to take the cached state as a starting
point while searching.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Wed, 2 Sep 2009 19:11:07 +0000 (15:11 -0400)]
Btrfs: don't lock bits in the extent tree during writepage
At writepage time, we have the page locked and we have the
extent_map entry for this extent pinned in the extent_map tree.
So, the page can't go away and its mapping can't change.
There is no need for the extra extent_state lock bits during writepage.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Wed, 2 Sep 2009 19:04:12 +0000 (15:04 -0400)]
Btrfs: cache values for locking extents
Many of the btrfs extent state tree users follow the same pattern.
They lock an extent range in the tree, do some operation and then
unlock.
This translates to at least 2 rbtree searches, and maybe more if they
are doing operations on the extent state tree. A locked extent
in the tree isn't going to be merged or changed, and so we can
safely return the extent state structure as a cached handle.
This changes set_extent_bit to give back a cached handle, and also
changes both set_extent_bit and clear_extent_bit to use the cached
handle if it is available.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Wed, 2 Sep 2009 17:24:36 +0000 (13:24 -0400)]
Btrfs: reduce CPU usage in the extent_state tree
Btrfs is currently mirroring some of the page state bits into
its extent state tree. The goal behind this was to use it in supporting
blocksizes other than the page size.
But, we don't currently support that, and we're using quite a lot of CPU
on the rb tree and its spin lock. This commit starts a series of
cleanups to reduce the amount of work done in the extent state tree as
part of each IO.
This commit:
* Adds the ability to lock an extent in the state tree and also set
other bits. The idea is to do locking and delalloc in one call
* Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits. Btrfs is using
a combination of the page bits and the ordered write code for this
instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 11 Sep 2009 15:25:02 +0000 (11:25 -0400)]
Btrfs: Fix new state initialization order
As the extent state tree is manipulated, there are call backs
that are used to take extra actions when different state bits are set
or cleared. One example of this is a counter for the total number
of delayed allocation bytes in a single inode and in the whole FS.
When new states are inserted, this callback is being done before we
properly setup the new state. This hasn't caused problems before
because the lock bit was always done first, and the existing call backs
don't care about the lock bit.
This patch makes sure the state is properly setup before using the
callback, which is important for later optimizations that do more work
without using the lock bit.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 7 Aug 2009 13:59:15 +0000 (09:59 -0400)]
Btrfs: tweak congestion backoff
The btrfs io submission thread tries to back off congested devices in
favor of rotating off to another disk.
But, it tries to make sure it submits at least some IO before rotating
on (the others may be congested too), and so it has a magic number of
requests it tries to write before it hops.
This makes the magic number smaller. Testing shows that we're spending
too much time on congested devices and leaving the other devices idle.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 7 Aug 2009 13:28:20 +0000 (09:28 -0400)]
Btrfs: use larger nr_to_write for larger extents
When btrfs fills a large delayed allocation extent, it is a good idea
to try and convince the write_cache_pages caller to go ahead and
write a good chunk of that extent. The extra IO is basically free
because we know it is contiguous.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 7 Aug 2009 13:27:38 +0000 (09:27 -0400)]
Btrfs: reduce worker thread spin_lock_irq hold times
This changes the btrfs worker threads to batch work items
into a local list. It allows us to pull work items in
large chunks and significantly reduces the number of times we
need to take the worker thread spinlock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Wed, 5 Aug 2009 20:36:45 +0000 (16:36 -0400)]
Btrfs: keep irqs on more often in the worker threads
The btrfs worker thread spinlock was being used both for the
queueing of IO and for the processing of ordered events.
The ordered events never happen from end_io handlers, and so they
don't need to use the _irq version of spinlocks. This adds a
dedicated lock to the ordered lists so they don't have to run
with irqs off.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Wed, 5 Aug 2009 16:57:59 +0000 (12:57 -0400)]
Btrfs: optimize set extent bit
The Btrfs set_extent_bit call currently searches the rbtree
every time it needs to find more extent_state objects to fill
the requested operation.
This adds a simple test with rb_next to see if the next object
in the tree was adjacent to the one we just found. If so,
we skip the search and just use the next object.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Tue, 4 Aug 2009 20:56:34 +0000 (16:56 -0400)]
Btrfs: Allow worker threads to exit when idle
The Btrfs worker threads don't currently die off after they have
been idle for a while, leading to a lot of threads sitting around
doing nothing for each mount.
Also, they are unable to start atomically (from end_io hanlders).
This commit reworks the worker threads so they can be started
from end_io handlers (just setting a flag that asks for a thread
to be added at a later date) and so they can exit if they
have been idle for a long time.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
"This seems to generate /sys/block/$device/queue and its contents for
everyone who is using queues, not just for those queues that have a
non-NULL queue->request_fn."
Note that embedding a queue inside another object has always been
an illegal construct, since the queues are reference counted and
must persist until the last reference is dropped. So aoe was
always buggy in this respect (Jens).
Signed-off-by: Ed Cashin <ecashin@coraid.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Bruno Premont <bonbons@linux-vserver.org> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
i915: disable interrupts before tearing down GEM state
Reinette Chatre reports a frozen system (with blinking keyboard LEDs)
when switching from graphics mode to the text console, or when
suspending (which does the same thing). With netconsole, the oops
turned out to be
BUG: unable to handle kernel NULL pointer dereference at 0000000000000084
IP: [<ffffffffa03ecaab>] i915_driver_irq_handler+0x26b/0xd20 [i915]
and it's due to the i915_gem.c code doing drm_irq_uninstall() after
having done i915_gem_idle(). And the i915_gem_idle() path will do
but if an i915 interrupt comes in after this stage, it may want to
access that hw_status_page, and gets the above NULL pointer dereference.
And since the NULL pointer dereference happens from within an interrupt,
and with the screen still in graphics mode, the common end result is
simply a silently hung machine.
Fix it by simply uninstalling the irq handler before idling rather than
after. Fixes
Zhenyu Wang [Tue, 8 Sep 2009 06:52:25 +0000 (14:52 +0800)]
drm/i915: fix mask bits setting
eDP is exclusive connector too, and add missing crtc_mask
setting for TV.
This fixes
http://bugzilla.kernel.org/show_bug.cgi?id=14139
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com> Reported-and-tested-by: Carlos R. Mafra <crmafra2@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel:
agp/intel: support for new chip variant of IGDNG mobile
drm/i915: Unref old_obj on get_fence_reg() error path
drm/i915: increase default latency constant (v2 w/comment)
Dave Airlie [Mon, 7 Sep 2009 05:26:19 +0000 (15:26 +1000)]
drm/radeon/kms: add LTE/GTE discard + rv515 two sided stencil register.
This adds some rv350+ register for LTE/GTE discard,
and enables the rv515 two sided stencil register.
It also disables the DEPTHXY_OFFSET register which
can be used to workaround the CS checker.
Moves rs690 to proper place in rs600 and uses correct
table on rs600.
breaks the build of the gianfar driver because "dev" is undefined in
this function. To quickly test rc9 I changed this to priv->ndev but I do
not know if this is the correct one.
--------------------
Signed-off-by: David S. Miller <davem@davemloft.net>
powerpc: Fix i8259 interrupt driver kernel crash on ML510
This patch fixes a null pointer exception caused by removal of
'ack()' for level interrupts in the Xilinx interrupt driver. A recent
change to the xilinx interrupt controller removed the ack hook for
level irqs.
Signed-off-by: Roderick Colenbrander <thunderbird2k@gmail.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm:
dm snapshot: fix on disk chunk size validation
dm exception store: split set_chunk_size
dm snapshot: fix header corruption race on invalidation
dm snapshot: refactor zero_disk_area to use chunk_io
dm log: userspace add luid to distinguish between concurrent log instances
dm raid1: do not allow log_failure variable to unset after being set
dm log: remove incorrect field from userspace table output
dm log: fix userspace status output
dm stripe: expose correct io hints
dm table: add more context to terse warning messages
dm table: fix queue_limit checking device iterator
dm snapshot: implement iterate devices
dm multipath: fix oops when request based io fails when no paths
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6:
sparc64: Fix bootup with mcount in some configs.
sparc64: Kill spurious NMI watchdog triggers by increasing limit to 30 seconds.
Nicolas Pitre [Sat, 5 Sep 2009 04:25:37 +0000 (00:25 -0400)]
ext2: fix unbalanced kmap()/kunmap()
In ext2_rename(), dir_page is acquired through ext2_dotdot(). It is
then released through ext2_set_link() but only if old_dir != new_dir.
Failing that, the pkmap reference count is never decremented and the
page remains pinned forever. Repeat that a couple times with highmem
pages and all pkmap slots get exhausted, and every further kmap() calls
end up stalling on the pkmap_map_wait queue at which point the whole
system comes to a halt.
Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2: ocfs2_write_begin_nolock() should handle len=0
ocfs2: invalidate dentry if its dentry_lock isn't initialized.
pty: don't limit the writes to 'pty_space()' inside 'pty_write()'
The whole write-room thing is something that is up to the _caller_ to
worry about, not the pty layer itself. The total buffer space will
still be limited by the buffering routines themselves, so there is no
advantage or need in having pty_write() artificially limit the size
somehow.
And what happened was that the caller (the n_tty line discipline, in
this case) may have verified that there is room for 2 bytes to be
written (for NL -> CRNL expansion), and it used to then do those writes
as two single-byte writes. And if the first byte written (CR) then
caused a new tty buffer to be allocated, pty_space() may have returned
zero when trying to write the second byte (LF), and then incorrectly
failed the write - leading to a lost newline character.
This should finally fix
http://bugzilla.kernel.org/show_bug.cgi?id=14015
Reported-by: Mikael Pettersson <mikpe@it.uu.se> Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When translating CR to CRNL in the n_tty line discipline, we did it as
two tty_put_char() calls. Which works, but is stupid, and has caused
problems before too with bad interactions with the write_room() logic.
The generic USB serial driver had that problem, for example.
Now the pty layer had similar issues after being moved to the generic
tty buffering code (in commit d945cb9cce20ac7143c2de8d88b187f62db99bdc:
"pty: Rework the pty layer to use the normal buffering logic").
So stop doing the silly separate two writes, and do it as a single write
instead. That's what the n_tty layer already does for the space
expansion of tabs (XTABS), and it means that we'll now always have just
a single write for the CRNL to match the single 'tty_write_room()' test,
which hopefully means that the next time somebody screws up buffering,
it won't cause weeks of debugging.
But the root of the problem lies in the fact that do_execve() path calls
tracehook_report_exec() which can stop if the tracer sets PT_TRACE_EXEC.
The tracee must not sleep in TASK_TRACED holding this mutex. Even if we
remove ->cred_guard_mutex from mm_for_maps() and proc_pid_attr_write(),
another task doing PTRACE_ATTACH should not hang until it is killed or the
tracee resumes.
With this patch do_execve() does not use ->cred_guard_mutex directly and
we do not hold it throughout, instead:
- introduce prepare_bprm_creds() helper, it locks the mutex
and calls prepare_exec_creds() to initialize bprm->cred.
- install_exec_creds() drops the mutex after commit_creds(),
and thus before tracehook_report_exec()->ptrace_stop().
or, if exec fails,
free_bprm() drops this mutex when bprm->cred != NULL which
indicates install_exec_creds() was not called.
Reported-by: Tom Horsley <tom.horsley@att.net> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Roland McGrath <roland@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page-allocator: always change pageblock ownership when anti-fragmentation is disabled
On low-memory systems, anti-fragmentation gets disabled as fragmentation
cannot be avoided on a sufficiently large boundary to be worthwhile. Once
disabled, there is a period of time when all the pageblocks are marked
MOVABLE and the expectation is that they get marked UNMOVABLE at each call
to __rmqueue_fallback().
However, when MAX_ORDER is large the pageblocks do not change ownership
because the normal criteria are not met. This has the effect of
prematurely breaking up too many large contiguous blocks. This is most
serious on NOMMU systems which depend on high-order allocations to boot.
This patch causes pageblocks to change ownership on every fallback when
anti-fragmentation is disabled. This prevents the large blocks being
prematurely broken up.
This is a fix to commit 49255c619fbd482d704289b5eb2795f8e3b7ff2e [page
allocator: move check for disabled anti-fragmentation out of fastpath] and
the problem affects 2.6.31-rc8.
Signed-off-by: Mel Gorman <mel@csn.ul.ie> Tested-by: Paul Mundt <lethal@linux-sh.org> Cc: David Howells <dhowells@redhat.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Sat, 5 Sep 2009 18:17:07 +0000 (11:17 -0700)]
nommu: fix error handling in do_mmap_pgoff()
Fix the error handling in do_mmap_pgoff(). If do_mmap_shared_file() or
do_mmap_private() fail, we jump to the error_put_region label at which
point we cann __put_nommu_region() on the region - but we haven't yet
added the region to the tree, and so __put_nommu_region() may BUG
because the region tree is empty or it may corrupt the region tree.
To get around this, we can afford to add the region to the region tree
before calling do_mmap_shared_file() or do_mmap_private() as we keep
nommu_region_sem write-locked, so no-one can race with us by seeing a
transient region.
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
cancel_delayed_work() has to use del_timer_sync() to guarantee the timer
function is not running after return. But most users doesn't actually
need this, and del_timer_sync() has problems: it is not useable from
interrupt, and it depends on every lock which could be taken from irq.
Introduce __cancel_delayed_work() which calls del_timer() instead.
The immediate reason for this patch is
http://bugzilla.kernel.org/show_bug.cgi?id=13757
but hopefully this helper makes sense anyway.
As for 13757 bug, actually we need requeue_delayed_work(), but its
semantics are not yet clear.
Merge this patch early to resolves cross-tree interdependencies between
input and infiniband.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Roland Dreier <rdreier@cisco.com> Cc: Stefan Richter <stefanr@s5r6.in-berlin.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stefan Richter [Thu, 3 Sep 2009 21:07:35 +0000 (23:07 +0200)]
firewire: sbp2: fix freeing of unallocated memory
If a target writes invalid status (typically status of a command that
already timed out), firewire-sbp2 attempts to put away an ORB that
doesn't exist. https://bugzilla.redhat.com/show_bug.cgi?id=519772
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Stefan Richter [Fri, 28 Aug 2009 11:26:03 +0000 (13:26 +0200)]
firewire: ohci: fix Ricoh R5C832, video reception
In dual-buffer DMA mode, no video frames are ever received from R5C832
by libdc1394. Fallback to packet-per-buffer DMA works reliably.
http://thread.gmane.org/gmane.linux.kernel.firewire.devel/13393/focus=13476
Reported-by: Jonathan Cameron <jic23@cam.ac.uk> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Stefan Richter [Fri, 28 Aug 2009 11:25:15 +0000 (13:25 +0200)]
firewire: ohci: fix Agere FW643 and multiple cameras
An Agere FW643 OHCI 1.1 card works fine for video reception from one
camera but fails early if receiving from two cameras. After a short
while, no IR IRQ events occur and the context control register does not
react anymore. This happens regardless whether both IR DMA contexts are
dual-buffer or one is dual-buffer and the other packet-per-buffer.
This can be worked around by disabling dual buffer DMA mode entirely.
http://sourceforge.net/mailarchive/message.php?msg_name=4A7C0594.2020208%40gmail.com
(Reported by Samuel Audet.)
In another report (by Jonathan Cameron), an FW643 works OK with two
cameras in dual buffer mode. Whether this is due to different chip
revisions or different usage patterns (different video formats) is not
yet clear. However, as far as the current capabilities of
firewire-core's isochronous I/O interface are concerned, simply
switching off dual-buffer on non-working and working FW643s alike is not
a problem in practice. We only need to revisit this issue if we are
going to enhance the interface, e.g. so that applications can explicitly
choose modes.
Reported-by: Samuel Audet <samuel.audet@gmail.com> Reported-by: Jonathan Cameron <jic23@cam.ac.uk> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
As David Moore noted, a previously correct sizeof() expression became
wrong since the commit changed its argument from an array to a pointer.
This resulted in an oops in ohci_cancel_packet in the shared workqueue
thread's context when an isochronous resource was to be freed.
Reported-by: Jonathan Cameron <jic23@cam.ac.uk> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Fix some problems seen in the chunk size processing when activating a
pre-existing snapshot.
For a new snapshot, the chunk size can either be supplied by the creator
or a default value can be used. For an existing snapshot, the
chunk size in the snapshot header on disk should always be used.
If someone attempts to load an existing snapshot and has the 'default
chunk size' option set, the kernel uses its default value even when it
is incorrect for the snapshot being loaded. This patch ensures the
correct on-disk value is always used.
Secondly, when the code does use the chunk size stored on the disk it is
prudent to revalidate it, so the code can exit cleanly if it got
corrupted as happened in
https://bugzilla.redhat.com/show_bug.cgi?id=461506 .
dm snapshot: fix header corruption race on invalidation
If a persistent snapshot fills up, a race can corrupt the on-disk header
which causes a crash on any future attempt to activate the snapshot
(typically while booting). This patch fixes the race.
When the snapshot overflows, __invalidate_snapshot is called, which calls
snapshot store method drop_snapshot. It goes to persistent_drop_snapshot that
calls write_header. write_header constructs the new header in the "area"
location.
Concurrently, an existing kcopyd job may finish, call copy_callback
and commit_exception method, that goes to persistent_commit_exception.
persistent_commit_exception doesn't do locking, relying on the fact that
callbacks are single-threaded, but it can race with snapshot invalidation and
overwrite the header that is just being written while the snapshot is being
invalidated.
The result of this race is a corrupted header being written that can
lead to a crash on further reactivation (if chunk_size is zero in the
corrupted header).
The fix is to use separate memory areas for each.
See the bug: https://bugzilla.redhat.com/show_bug.cgi?id=461506
dm log: userspace add luid to distinguish between concurrent log instances
Device-mapper userspace logs (like the clustered log) are
identified by a universally unique identifier (UUID). This
identifier is used to associate requests from the kernel to
a specific log in userspace. The UUID must be unique everywhere,
since multiple machines may use this identifier when communicating
about a particular log, as is the case for cluster logs.
Sometimes, device-mapper/LVM may re-use a UUID. This is the
case during pvmoves, when moving from one segment of an LV
to another, or when resizing a mirror, etc. In these cases,
a new log is created with the same UUID and loaded in the
"inactive" slot. When a device-mapper "resume" is issued,
the "live" table is deactivated and the new "inactive" table
becomes "live". (The "inactive" table can also be removed
via a device-mapper 'clear' command.)
The above two issues were colliding. More than one log was being
created with the same UUID, and there was no way to distinguish
between them. So, sometimes the wrong log would be swapped
out during the exchange.
The solution is to create a locally unique identifier,
'luid', to go along with the UUID. This new identifier is used
to determine exactly which log is being referenced by the kernel
when the log exchange is made. The identifier is not
universally safe, but it does not need to be, since
create/destroy/suspend/resume operations are bound to a specific
machine; and these are the operations that make up the exchange.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm raid1: do not allow log_failure variable to unset after being set
This patch fixes a bug which was triggering a case where the primary leg
could not be changed on failure even when the mirror was in-sync.
The case involves the failure of the primary device along with
the transient failure of the log device. The problem is that
bios can be put on the 'failures' list (due to log failure)
before 'fail_mirror' is called due to the primary device failure.
Normally, this is fine, but if the log device failure is transient,
a subsequent iteration of the work thread, 'do_mirror', will
reset 'log_failure'. The 'do_failures' function then resets
the 'in_sync' variable when processing bios on the failures list.
The 'in_sync' variable is what is used to determine if the
primary device can be switched in the event of a failure. Since
this has been reset, the primary device is incorrectly assumed
to be not switchable.
The case has been seen in the cluster mirror context, where one
machine realizes the log device is dead before the other machines.
As the responsibilities of the server migrate from one node to
another (because the mirror is being reconfigured due to the failure),
the new server may think for a moment that the log device is fine -
thus resetting the 'log_failure' variable.
In any case, it is inappropiate for us to reset the 'log_failure'
variable. The above bug simply illustrates that it can actually
hurt us.
Cc: stable@kernel.org Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
dm log: remove incorrect field from userspace table output
The output of 'dmsetup table' includes an internal field that should not
be there. This patch removes it. To make the fix simpler, we first
reorder a constructor argument
The 'device size' argument is generated internally. Currently it is
placed as the last space-separated word of the constructor string.
However, we need to use a version of the string without this word, so we
move it to the beginning instead so it is trivial to skip past it.
We keep a copy of the arguments passed to userspace for creating a log,
just in case we need to resend them. These are the same arguments that
are desired in the STATUSTYPE_TABLE request, except for one. When
creating the userspace log, the userspace daemon must know the size of
the mirror, so that is added to the arguments given in the constructor
table. We were printing this extra argument out as well, which is a
mistake.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Fri, 4 Sep 2009 19:40:25 +0000 (20:40 +0100)]
dm stripe: expose correct io hints
Set sensible I/O hints for striped DM devices in the topology
infrastructure added for 2.6.31 for userspace tools to
obtain via sysfs.
Add .io_hints to 'struct target_type' to allow the I/O hints portion
(io_min and io_opt) of the 'struct queue_limits' to be set by each
target and implement this for dm-stripe.
Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>