Josef Bacik [Tue, 27 Sep 2011 15:01:30 +0000 (11:01 -0400)]
Btrfs: only inherit btrfs specific flags when creating files
Xfstests 79 was failing because we were inheriting the S_APPEND flag when we
weren't supposed to. There isn't any specific documentation on this so I'm
taking the test as the standard of how things work, and having S_APPEND set on a
directory doesn't mean that S_APPEND gets inherited by its children according to
this test. So only inherit btrfs specific things. This will let us set
compress/nocompress on specific directories and everything in the directories
will inherit this flag, same with nodatacow. With this patch test 79 passes.
Thanks,
Josef Bacik [Mon, 26 Sep 2011 21:12:22 +0000 (17:12 -0400)]
Btrfs: allow us to overcommit our enospc reservations
One of the things that kills us is the fact that our ENOSPC reservations are
horribly over the top in most normal cases. There isn't too much that can be
done about this because when we are completely full we really need them to work
like this so we don't under reserve. However if there is plenty of unallocated
chunks on the disk we can use that to gauge how much we can overcommit. So this
patch adds chunk free space accounting so we always know how much unallocated
space we have. Then if we fail to make a reservation within our allocated
space, check to see if we can overcommit. In the normal flushing case (like
with delalloc metadata reservations) we'll take the free space and divide it by
2 if our metadata profile is setup for DUP or any of those, and then divide it
by 8 to make sure we don't overcommit too much. Then if we're in a non-flushing
case (we really need this reservation now!) we only limit ourselves to half of
the free space. This makes this fio test
go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB
file system. This doesn't seem to break my other enospc tests, but could really
use some more testing as this is a super scary change. Thanks,
Josef Bacik [Mon, 26 Sep 2011 19:55:20 +0000 (15:55 -0400)]
Btrfs: break out of orphan cleanup if we can't make progress
I noticed while running xfstests 83 that if we didn't have enough space to
delete our inode the orphan cleanup would just loop. This is because it keeps
finding the same orphan item and keeps trying to kill it but can't because we
don't get an error back from iput for deleting the inode. So keep track of the
last guy we tried to kill, if it's the same as the one we're trying to kill
currently we know we are having problems and can just error out. I don't have a
way to test this so look hard and make sure it's right. Thanks,
Josef Bacik [Mon, 26 Sep 2011 19:46:06 +0000 (15:46 -0400)]
Btrfs: use the global reserve as a backup for deleting inodes
Xfstests 83 really stresses our ENOSPC since it uses a 100mb fs which ends up
with the mixed block group stuff. Because of this we can run into a situation
where we don't have enough space to delete inodes, or even worse we can't free
the inodes when we next mount the fs which causes the orphan code to lose its
mind. So if we fail to make our reservation, steal from the global reserve.
The global reserve will end up taking up the entire rest of the free space on
the fs in this worst case so there really is no other option. With this patch
test 83 doesn't freak out. Thanks,
Josef Bacik [Mon, 26 Sep 2011 17:58:47 +0000 (13:58 -0400)]
Btrfs: stop using write_one_page
While looking for a performance regression a user was complaining about, I
noticed that we had a regression with the varmail test of filebench. This was
introduced by
which keeps us from calling writepages in writepage. This is a correct change,
however it happens to help the varmail test because we write out in larger
chunks. This is largly to do with how we write out dirty pages for each
transaction. If you run filebench with
load varmail
set $dir=/mnt/btrfs-test
run 60
prior to this patch you would get ~1420 ops/second, but with the patch you get
~1200 ops/second. This is a 16% decrease. So since we know the range of dirty
pages we want to write out, don't write out in one page chunks, write out in
ranges. So to do this we call filemap_fdatawrite_range() on the range of bytes.
Then we convert the DIRTY extents to NEED_WAIT extents. When we then call
btrfs_wait_marked_extents() we only have to filemap_fdatawait_range() on that
range and clear the NEED_WAIT extents. This doesn't get us back to our original
speeds, but I've been seeing ~1380 ops/second, which is a <5% regression as
opposed to a >15% regression. That is acceptable given that the original commit
greatly reduces our latency to begin with. Thanks,
Josef Bacik [Mon, 26 Sep 2011 17:56:12 +0000 (13:56 -0400)]
Btrfs: introduce convert_extent_bit
If I have a range where I know a certain bit is and I want to set it to another
bit the only option I have is to call set and then clear bit, which will result
in 2 tree searches. This is inefficient, so introduce convert_extent_bit which
will go through and set the bit I want and clear the old bit I don't want.
Thanks,
Josef Bacik [Thu, 22 Sep 2011 18:30:02 +0000 (14:30 -0400)]
Btrfs: check unused against how much space we actually want
There is a bug that may lead to early ENOSPC in our reservation code. We've
been checking against num_bytes which may be above and beyond what we want to
actually reserve, which could give us a false ENOSPC. Fix this by making sure
the unused space is above how much we want to reserve and not how much we're
trying to flush. Thanks,
Josef Bacik [Wed, 21 Sep 2011 20:55:59 +0000 (16:55 -0400)]
Btrfs: fix orphan cleanup regression
In fixing how we deal with bad inodes, we had a regression in the orphan cleanup
code, since it expects to get a bad inode back. So fix it to deal with getting
-ESTALE back by deleting the orphan item manually and moving on. Thanks,
Reported-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Josef Bacik <josef@redhat.com>
Josef Bacik [Wed, 21 Sep 2011 19:05:58 +0000 (15:05 -0400)]
Btrfs: use the inode's mapping mask for allocating pages
Johannes pointed out we were allocating only kernel pages for doing writes,
which is kind of a big deal if you are on 32bit and have more than a gig of ram.
So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we
don't re-enter. Thanks,
Reported-by: Johannes Weiner <jweiner@redhat.com> Signed-off-by: Josef Bacik <josef@redhat.com>
Josef Bacik [Mon, 19 Sep 2011 16:26:24 +0000 (12:26 -0400)]
Btrfs: delay iput when deleting a block group
I kept getting warnings from evict because we were calling
btrfs_start_transaction() with a transaction already started when doing a
balance. This is because we remove a block group which requires a transaction,
and the put the last reference on the cache inode. Instead of doing this we
need to delay the iput so it is done not within a transaction having started.
This gets rid of our warnings. Thanks,
Josef Bacik [Mon, 19 Sep 2011 15:58:54 +0000 (11:58 -0400)]
Btrfs: make sure to unset trans->block_rsv before running delayed refs
Checksums are charged in 2 different ways. The first case is when we're writing
to the disk, we account for the new checksums with the delalloc block rsv. In
order for this to work we check if we're allocating a block for the csum root
and if trans->block_rsv == the delalloc block rsv. But when we're deleting the
csums because of cow, this is charged to the global block rsv, and is done when
we run the delayed refs. So we need to make sure that trans->block_rsv == NULL
when running the delayed refs. So set it to NULL and reset it in
should_end_transaction, and set it to NULL in commit_transaction. This got rid
of the ridiculous amount of warnings I was seeing when trying to do a balance.
Thanks,
Josef Bacik [Tue, 30 Aug 2011 16:34:28 +0000 (12:34 -0400)]
Btrfs: stop passing a trans handle all around the reservation code
The only thing that we need to have a trans handle for is in
reserve_metadata_bytes and thats to know how much flushing we can do. So
instead of passing it around, just check current->journal_info for a
trans_handle so we know if we can commit a transaction to try and free up space
or not. Thanks,
Josef Bacik [Tue, 30 Aug 2011 15:31:29 +0000 (11:31 -0400)]
Btrfs: use the transactions block_rsv for the csum root
The alloc warnings everybody has been seeing is because we have been reserving
space for csums, but we weren't actually using that space. So make
get_block_rsv() return the trans->block_rsv if we're modifying the csum root.
Also set the trans->block_rsv to NULL so that if we modify the csum root when
running delayed ref's that comes out of the global reserve like it's supposed
to. With this patch I'm not seeing those alloc warnings anymore. Thanks,
Josef Bacik [Tue, 30 Aug 2011 14:19:10 +0000 (10:19 -0400)]
Btrfs: handle enospc accounting for free space inodes
Since free space inodes now use normal checksumming we need to make sure to
account for their metadata use. So reserve metadata space, and then if we fail
to write out the metadata we can just release it, otherwise it will be freed up
when the io completes. Thanks,
Josef Bacik [Mon, 29 Aug 2011 18:06:00 +0000 (14:06 -0400)]
Btrfs: put the block group cache after we commit the super
In moving some enospc stuff around I noticed that when we unmount we are often
evicting the free space cache inodes before we do our last commit. This isn't
bad, but it makes us constantly have to re-read the inodes back. So instead
don't evict the cache until after we do our last commit, this will make things a
little less crappy and makes a future enospc change work properly. Thanks,
Josef Bacik [Mon, 29 Aug 2011 15:01:31 +0000 (11:01 -0400)]
Btrfs: set truncate block rsv's size
While debugging a different issue I noticed that we were always reserving space
when we tried to use our truncate block rsv's. This is because they didn't have
a ->size value, so use_block_rsv just assumes there is nothing reserved and it
does a reserve_metadata_bytes. This is because btrfs_check_block_rsv() doesn't
actually add to the size of the block rsv. That seems to be the right thing to
do so set ->size to the minimum truncate size we need, since we will always only
refill to that size anyway, and this way everything works out correctly.
Josef Bacik [Mon, 22 Aug 2011 19:23:19 +0000 (15:23 -0400)]
Btrfs: don't increase the block_rsv's size when emergency allocating space
If we have to emergency reserve space we need to not increase the block_rsv
size, otherwise we'll leak space. Take for instance delalloc, say we reserve
4k, and we use that 4k, and then we have to emergency allocate another 4k, we
bump the size up to 8k, however we've only accounted for 4k in reservations in
all of our supporting logic, so we'll go to free the 4k and end up having a size
of 4k, which will cause us to later not free as much space. I saw this doing
testing where I wasn't reserving enough space for something but was still
leaking space, very frustrating. Thanks,
Josef Bacik [Fri, 19 Aug 2011 19:45:52 +0000 (15:45 -0400)]
Btrfs: fix space leak when we fail to make an allocation
When changing back to using a spin_lock to protect the extent counters I decided
that since we would only be dropping our original extent, it was ok to just drop
the extent and return. However since somebody else could have come in and done
a reservation, we need to do the normal song and dance to clear the reservation
out properly. So calculate how much space we need to free, and then subtract
what we just attempted to reserve. If it's more then we know we need to drop
those bytes from the delalloc block rsv. Thanks,
Josef Bacik [Fri, 19 Aug 2011 16:06:12 +0000 (12:06 -0400)]
Btrfs: fix call to btrfs_search_slot in free space cache
We are setting ins_len to 1 even tho we are just modifying an item that should
be there already. This may cause the search stuff to split nodes on the way
down needelessly. Set this to 0 since we aren't inserting anything. Thanks,
Josef Bacik [Fri, 19 Aug 2011 14:31:56 +0000 (10:31 -0400)]
Btrfs: allow callers to specify if flushing can occur for btrfs_block_rsv_check
If you run xfstest 224 it you will get lots of messages about not being able to
delete inodes and that they will be cleaned up next mount. This is because
btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to
flush, so if there was not enough space, it simply failed. But in truncate and
evict case we could easily flush space to try and get enough space to do our
work, so make btrfs_block_rsv_check take a flush argument to pass down to
reserve_metadata_bytes. Now xfstests 224 runs fine without all those
complaints. Thanks,
Josef Bacik [Fri, 19 Aug 2011 14:29:59 +0000 (10:29 -0400)]
Btrfs: reduce the amount of space needed for truncates
With btrfs_truncate_inode_items we always return if we have to go to another
leaf, which makes us do our reservation again. This means we will only ever
modify one leaf at a time, so we only need 1 items worth of slack space. Also,
since we are deleting we will not be creating nodes as we go down, if anything
we'll be free'ing them as we merge them together, so make a different
calculation for truncate which will only have the worst case useage of COW'ing
the entire path down to the leaf. Thanks,
Josef Bacik [Wed, 17 Aug 2011 14:19:52 +0000 (10:19 -0400)]
Btrfs: only reserve space in fallocate if we have to do a preallocate
Lukas found a problem where if he tries to fallocate over the same region twice
and the first fallocate took up all the space we would fail with ENOSPC. This
is because we reserve the total space we want to use for fallocate, regardless
of wether or not we will have to actually preallocate. So instead move the
check into the loop where we actually have to do the preallocate. Thanks,
Tested-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Josef Bacik <josef@redhat.com>
Josef Bacik [Mon, 8 Aug 2011 17:46:15 +0000 (13:46 -0400)]
Btrfs: optimize how we account for space in truncate
Currently we're starting and stopping a transaction for no real reason, so kill
that and just reserve enough space as if we can truncate all in one transaction.
Also use btrfs_block_rsv_check() for our reserve to minimize the amount of space
we may have to allocate for our slack space. Thanks,
Josef Bacik [Mon, 8 Aug 2011 17:33:21 +0000 (13:33 -0400)]
Btrfs: don't try to commit in btrfs_block_rsv_check
We will try and reserve metadata bytes in btrfs_block_rsv_check and if we cannot
because we have a transaction open it will return EAGAIN, so we do not need to
try and commit the transaction again.
Josef Bacik [Mon, 8 Aug 2011 12:24:46 +0000 (08:24 -0400)]
Btrfs: ratelimit the generation printk for the free space cache
A user reported getting spammed when moving to 3.0 by this message. Since we
switched to the normal checksumming infrastructure all old free space caches
will be wrong and need to be regenerated so people are likely to see this
message a lot, so ratelimit it so it doesn't fill up their logs and freak them
out. Thanks,
Reported-by: Andrew Lutomirski <luto@mit.edu> Signed-off-by: Josef Bacik <josef@redhat.com>
Josef Bacik [Fri, 5 Aug 2011 17:22:24 +0000 (13:22 -0400)]
Btrfs: fix how we reserve space for deleting inodes
I converted btrfs_truncate to do sane reservations for truncate, but didn't
convert btrfs_evict_inode. Basically we need to save the orphan_rsv for
deleting the orphan item, and do normal reservations for our truncate. Thanks,
Josef Bacik [Thu, 4 Aug 2011 19:34:57 +0000 (15:34 -0400)]
Btrfs: kill the orphan space calculation for snapshots
This patch kills off the calculation for the amount of space needed for the
orphan operations during a snapshot. The thing is we only do snapshots on
commit, so any space that is in the block_rsv->freed[] isn't going to be in the
new snapshot anyway, so there isn't any reason to require that space to be
reserved for the snapshot to occur. Thanks,
Josef Bacik [Thu, 4 Aug 2011 14:25:02 +0000 (10:25 -0400)]
Btrfs: calculate checksum space correctly
We have not been reserving enough space for checksums. We were just reserving
bytes for the checksum items themselves, we were not taking into account having
to cow the tree and such. This patch adds a csum_bytes counter to the inode for
keeping track of the number of bytes outstanding we have for checksums. Then we
calculate how many leaves would be required for the checksums we are given and
use that to reserve space. This adds a significant amount of bytes to our
reservations, but we will handle this later. Thanks,
Josef Bacik [Mon, 1 Aug 2011 16:08:18 +0000 (12:08 -0400)]
Btrfs: skip looking for delalloc if we don't have ->fill_delalloc
We always look for delalloc bytes in our io_tree so we can fill in delalloc.
This is fine in most cases, but if we're writing out the btree_inode this is
just a superfluous tree search on the io_tree, and if we have a lot of metadata
dirty this could be an expensive check. So instead check to see if our io_tree
has a ->fill_delalloc op, and if not don't even bother doing the lookup.
Thanks,
Josef Bacik [Tue, 26 Jul 2011 21:00:46 +0000 (17:00 -0400)]
Btrfs: use bytes_may_use for all ENOSPC reservations
We have been using bytes_reserved for metadata reservations, which is wrong
since we use that to keep track of outstanding reservations from the allocator.
This resulted in us doing a lot of silly things to make sure we don't allocate a
bunch of metadata chunks since we never had a real view of how much space was
actually in use by metadata.
This passes Arne's enospc test and xfstests as well as my own enospc tests.
Hopefully this will get us moving in the right direction. Thanks,
Josef Bacik [Mon, 25 Jul 2011 19:55:42 +0000 (15:55 -0400)]
Btrfs: fix how we mount subvol=<whatever>
We've only been able to mount with subvol=<whatever> where whatever was a subvol
within whatever root we had as the default. This allows us to mount -o
subvol=path/to/subvol/you/want relative from the normal fs_tree root. Thanks,
Josef Bacik [Mon, 25 Jul 2011 19:40:35 +0000 (15:40 -0400)]
Btrfs: use d_obtain_alias when mounting subvol/subvolid
Currently what we do is just wrong. We either
1) Alloc a new "root" dentry with sb->s_root as it's parent which is just wrong
as we could walk into this subvol later on via another path and hilarity could
ensue. Also we don't check the return value of d_splice_alias which isn't good
either.
or
2) Do a d_find_alias() which we could have lost our dentry from cache at this
point and found nothing.
So use d_obtain_alias(). In the case that we already have the inode/dentry in
cache we will get the correct dentry. If not we will get a disconnected dentry
tree so if we walk into it later on everything will be connected up properly.
Thanks,
Linus Torvalds [Tue, 4 Oct 2011 17:37:06 +0000 (10:37 -0700)]
Merge git://github.com/davem330/net
* git://github.com/davem330/net:
pch_gbe: Fixed the issue on which a network freezes
pch_gbe: Fixed the issue on which PC was frozen when link was downed.
make PACKET_STATISTICS getsockopt report consistently between ring and non-ring
net: xen-netback: correctly restart Tx after a VM restore/migrate
bonding: properly stop queuing work when requested
can bcm: fix incomplete tx_setup fix
RDSRDMA: Fix cleanup of rds_iw_mr_pool
net: Documentation: Fix type of variables
ibmveth: Fix oops on request_irq failure
ipv6: nullify ipv6_ac_list and ipv6_fl_list when creating new socket
cxgb4: Fix EEH on IBM P7IOC
can bcm: fix tx_setup off-by-one errors
MAINTAINERS: tehuti: Alexander Indenbaum's address bounces
dp83640: reduce driver noise
ptp: fix L2 event message recognition
Linus Torvalds [Tue, 4 Oct 2011 16:59:22 +0000 (09:59 -0700)]
Merge branch 'fix/asoc' of git://github.com/tiwai/sound
* 'fix/asoc' of git://github.com/tiwai/sound:
ASoC: omap_mcpdm_remove cannot be __devexit
ASoC: Fix setting update bits for WM8753_LADC and WM8753_RADC
ASoC: use a valid device for dev_err() in Zylonite
Jon Mason [Mon, 3 Oct 2011 14:50:20 +0000 (09:50 -0500)]
PCI: Disable MPS configuration by default
Add the ability to disable PCI-E MPS turning and using the BIOS
configured MPS defaults. Due to the number of issues recently
discovered on some x86 chipsets, make this the default behavior.
Also, add the option for peer to peer DMA MPS configuration. Peer to
peer DMA is outside the scope of this patch, but MPS configuration could
prevent it from working by having the MPS on one root port different
than the MPS on another. To work around this, simply make the system
wide MPS the smallest possible value (128B).
Signed-off-by: Jon Mason <mason@myri.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alex Deucher [Tue, 4 Oct 2011 14:46:34 +0000 (10:46 -0400)]
drm/radeon/kms: fix channel_remap setup (v2)
Most asics just use the hw default value which requires
no explicit programming. For those that need a different
value, the vbios will program it properly. As such,
there's no need to program these registers explicitly
in the driver. Changing MC_SHARED_CHREMAP requires a reload
of all data in vram otherwise its contents will be scambled.
Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Michel Dänzer <michel.daenzer@amd.com> Cc: stable@kernel.org Signed-off-by: Dave Airlie <airlied@redhat.com>
We found that adding load, Rx data sometimes drops.(with DMA transfer mode)
The cause is that before starting Rx-DMA processing, Tx-DMA processing starts.
This causes FIFO overrun occurs.
This patch fixes the issue by modifying FIFO tx-threshold and DMA descriptor
size like below.
Current this patch
Rx-descriptor 4Byte+12Byte*341 --> 12Byte*340-4Byte-12Byte
Rx-threshold (Not modified)
Tx-descriptor 4Byte+12Byte*341 --> 16Byte-12Byte*340
Rx-threshold 12Byte --> 2Byte
Signed-off-by: Tomoya MORINAGA <tomoya-linux@dsn.okisemi.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
spi-topcliff-pch: add tx-memory clear after complete transmitting
Currently, in case of reading date from SPI flash,
command is sent twice.
The cause is that tx-memory clear processing is missing .
This patch adds the tx-momory clear processing.
Signed-off-by: Tomoya MORINAGA <tomoya-linux@dsn.okisemi.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
Takashi Iwai [Tue, 4 Oct 2011 01:09:14 +0000 (18:09 -0700)]
lis3: fix regression of HP DriveGuard with 8bit chip
Commit 2a7fade7e03 ("hwmon: lis3: Power on corrections") caused a
regression on HP laptops with 8bit chip. Writing CTRL2_BOOT_8B bit seems
clearing the BIOS setup, and no proper interrupt for DriveGuard will be
triggered any more.
Since the init code there is basically only for embedded devices, put a
pdata check so that the problematic initialization will be skipped for
hp_accel stuff.
Borislav Petkov [Mon, 3 Oct 2011 18:28:18 +0000 (14:28 -0400)]
ide-disk: Fix request requeuing
Simon Kirby reported that on his RAID setup with idedisk underneath
the box OOMs after a couple of days of runtime. Running with
CONFIG_DEBUG_KMEMLEAK pointed to idedisk_prep_fn() which unconditionally
allocates an ide_cmd struct. However, ide_requeue_and_plug() can be
called more than once per request, either from the request issue or the
IRQ handler path and do blk_peek_request() ends up in idedisk_prep_fn()
repeatedly, allocating a struct ide_cmd everytime and "forgetting" the
previous pointer.
Make sure the code reuses the old allocated chunk.
pch_gbe: Fixed the issue on which a network freezes
The pch_gbe driver has an issue which a network stops,
when receiving traffic is high.
In the case, The link down and up are necessary to return a network.
This patch fixed this issue.
Signed-off-by: Toshiharu Okada <toshiharu-linux@dsn.okisemi.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Fri, 30 Sep 2011 10:38:28 +0000 (10:38 +0000)]
make PACKET_STATISTICS getsockopt report consistently between ring and non-ring
This is a minor change.
Up until kernel 2.6.32, getsockopt(fd, SOL_PACKET, PACKET_STATISTICS,
...) would return total and dropped packets since its last invocation. The
introduction of socket queue overflow reporting [1] changed drop
rate calculation in the normal packet socket path, but not when using a
packet ring. As a result, the getsockopt now returns different statistics
depending on the reception method used. With a ring, it still returns the
count since the last call, as counts are incremented in tpacket_rcv and
reset in getsockopt. Without a ring, it returns 0 if no drops occurred
since the last getsockopt and the total drops over the lifespan of
the socket otherwise. The culprit is this line in packet_rcv, executed
on a drop:
As it shows, the new drop number it taken from the socket drop counter,
which is not reset at getsockopt. I put together a small example
that demonstrates the issue [2]. It runs for 10 seconds and overflows
the queue/ring on every odd second. The reported drop rates are:
ring: 16, 0, 16, 0, 16, ...
non-ring: 0, 15, 0, 30, 0, 46, 0, 60, 0 , 74.
Note how the even ring counts monotonically increase. Because the
getsockopt adds tp_drops to tp_packets, total counts are similarly
reported cumulatively. Long story short, reinstating the original code, as
the below patch does, fixes the issue at the cost of additional per-packet
cycles. Another solution that does not introduce per-packet overhead
is be to keep the current data path, record the value of sk_drops at
getsockopt() at call N in a new field in struct packetsock and subtract
that when reporting at call N+1. I'll be happy to code that, instead,
it's just more messy.
David Vrabel [Fri, 30 Sep 2011 06:37:51 +0000 (06:37 +0000)]
net: xen-netback: correctly restart Tx after a VM restore/migrate
If a VM is saved and restored (or migrated) the netback driver will no
longer process any Tx packets from the frontend. xenvif_up() does not
schedule the processing of any pending Tx requests from the front end
because the carrier is off. Without this initial kick the frontend
just adds Tx requests to the ring without raising an event (until the
ring is full).
This was caused by 47103041e91794acdbc6165da0ae288d844c820b (net:
xen-netback: convert to hw_features) which reordered the calls to
xenvif_up() and netif_carrier_on() in xenvif_connect().
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Gospodarek [Fri, 23 Sep 2011 10:53:34 +0000 (10:53 +0000)]
bonding: properly stop queuing work when requested
During a test where a pair of bonding interfaces using ARP monitoring
were both brought up and torn down (with an rmmod) repeatedly, a panic
in the timer code was noticed. I tracked this down and determined that
any of the bonding functions that ran as workqueue handlers and requeued
more work might not properly exit when the module was removed.
There was a flag protected by the bond lock called kill_timers that is
set when the interface goes down or the module is removed, but many of
the functions that monitor link status now unlock the bond lock to take
rtnl first. There is a chance that another CPU running the rmmod could
get the lock and set kill_timers after the first check has passed.
This patch does not allow any function to queue work that will make
itself run unless kill_timers is not set. I also noticed while doing
this work that bond_resend_igmp_join_requests did not have a check for
kill_timers, so I added the needed call there as well.
Signed-off-by: Andy Gospodarek <andy@greyhouse.net> Reported-by: Liang Zheng <lzheng@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michel Dänzer [Fri, 30 Sep 2011 15:16:53 +0000 (17:16 +0200)]
drm/radeon: Set cursor x/y to 0 when x/yorigin > 0.
Apart from the obvious cleanup, this should make the line
cursor_end = x - xorigin + w;
correct now.
Signed-off-by: Michel Dänzer <michel.daenzer@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Michel Dänzer [Fri, 30 Sep 2011 15:16:52 +0000 (17:16 +0200)]
drm/radeon: Update AVIVO cursor coordinate origin before x/yorigin calculation.
Fixes cursor disappearing prematurely when moving off a top/left edge which
is not located at the desktop top/left edge.
Signed-off-by: Michel Dänzer <michel.daenzer@amd.com> Cc: stable@kernel.org Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Michel Dänzer <michel.daenzer@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Nicholas Miell <nmiell@gmail.com> Reviewed-by: Michel Dänzer <michel@daenzer.net> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
Alex Deucher [Mon, 3 Oct 2011 13:13:46 +0000 (09:13 -0400)]
drm/radeon/kms: add retry limits for native DP aux defer
The previous code could potentially loop forever. Limit
the number of DP aux defer retries to 4 for native aux
transactions, same as i2c over aux transactions.
Alex Deucher [Mon, 3 Oct 2011 13:13:45 +0000 (09:13 -0400)]
drm/radeon/kms: fix regression in DP aux defer handling
An incorrect ordering in the error checking code lead
to DP aux defer being skipped in the aux native write
path. Move the bytes transferred check (ret == 0)
below the defer check.
Tracked down by: Brad Campbell <brad@fnarfbargle.com>
Linus Torvalds [Sat, 1 Oct 2011 15:37:25 +0000 (08:37 -0700)]
Merge branches 'irq-urgent-for-linus', 'x86-urgent-for-linus' and 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip
* 'irq-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
irq: Fix check for already initialized irq_domain in irq_domain_add
irq: Add declaration of irq_domain_simple_ops to irqdomain.h
* 'x86-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
x86/rtc: Don't recursively acquire rtc_lock
* 'sched-urgent-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
posix-cpu-timers: Cure SMP wobbles
sched: Fix up wchan borkage
sched/rt: Migrate equal priority tasks to available CPUs
Josef Bacik [Fri, 30 Sep 2011 19:23:54 +0000 (15:23 -0400)]
Btrfs: force a page fault if we have a shorty copy on a page boundary
A user reported a problem where ceph was getting into 100% cpu usage while doing
some writing. It turns out it's because we were doing a short write on a not
uptodate page, which means we'd fall back at one page at a time and fault the
page in. The problem is our position is on the page boundary, so our fault in
logic wasn't actually reading the page, so we'd just spin forever or until the
page got read in by somebody else. This will force a readpage if we end up
doing a short copy. Alexandre could reproduce this easily with ceph and reports
it fixes his problem. I also wrote a reproducer that no longer hangs my box
with this patch. Thanks,
Reported-and-tested-by: Alexandre Oliva <aoliva@redhat.com> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
Peter Zijlstra [Thu, 1 Sep 2011 10:42:04 +0000 (12:42 +0200)]
posix-cpu-timers: Cure SMP wobbles
David reported:
Attached below is a watered-down version of rt/tst-cpuclock2.c from
GLIBC. Just build it with "gcc -o test test.c -lpthread -lrt" or
similar.
Run it several times, and you will see cases where the main thread
will measure a process clock difference before and after the nanosleep
which is smaller than the cpu-burner thread's individual thread clock
difference. This doesn't make any sense since the cpu-burner thread
is part of the top-level process's thread group.
I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
64-bit binaries).
The diff of 'process' should always be >= the diff of 'thread'.
I make sure to wrap the 'thread' clock measurements the most tightly
around the nanosleep() call, and that the 'process' clock measurements
are the outer-most ones.
This is due to us using p->se.sum_exec_runtime in
thread_group_cputime() where we iterate the thread group and sum all
data. This does not take time since the last schedule operation (tick
or otherwise) into account. We can cure this by using
task_sched_runtime() at the cost of having to take locks.
This also means we can (and must) do away with
thread_group_sched_runtime() since the modified thread_group_cputime()
is now more accurate and would deadlock when called from
thread_group_sched_runtime().
Aside of that it makes the function safe on 32 bit systems. The old
code added t->se.sum_exec_runtime unprotected. sum_exec_runtime is a
64bit value and could be changed on another cpu at the same time.
Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@kernel.org Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins Tested-by: David Miller <davem@davemloft.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
ALSA: hda - Fix a regression of the position-buffer check
The commit a810364a0424c297242c6c66071a42f7675a5568
ALSA: hda - Handle -1 as invalid position, too
caused a regression on some machines that require the position-buffer
instead of LPIB, e.g. resulting in noises with mic recording with
PulseAudio.
This patch fixes the detection by delaying the test at the timing as
same as 3.0, i.e. doing the position check only when requested in
azx_position_ok().
Ram Pai [Thu, 22 Sep 2011 07:48:58 +0000 (15:48 +0800)]
Resource: fix wrong resource window calculation
__find_resource() incorrectly returns a resource window which overlaps
an existing allocated window. This happens when the parent's
resource-window spans 0x00000000 to 0xffffffff and is entirely allocated
to all its children resource-windows.
__find_resource() looks for gaps in resource allocation among the
children resource windows. When it encounters the last child window it
blindly tries the range next to one allocated to the last child. Since
the last child's window ends at 0xffffffff the calculation overflows,
leading the algorithm to believe that any window in the range 0x0000000
to 0xfffffff is available for allocation. This leads to a conflicting
window allocation.
Michal Ludvig reported this issue seen on his platform. The following
patch fixes the problem and has been verified by Michal. I believe this
bug has been there for ages. It got exposed by git commit 2bbc6942273b
("PCI : ability to relocate assigned pci-resources")
Signed-off-by: Ram Pai <linuxram@us.ibm.com> Tested-by: Michal Ludvig <mludvig@logix.net.nz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge branch 'v4l_for_linus' of git://linuxtv.org/mchehab/for_linus
* 'v4l_for_linus' of git://linuxtv.org/mchehab/for_linus:
[media] omap3isp: Fix build error in ispccdc.c
[media] uvcvideo: Fix crash when linking entities
[media] v4l: Make sure we hold a reference to the v4l2_device before using it
[media] v4l: Fix use-after-free case in v4l2_device_release
[media] uvcvideo: Set alternate setting 0 on resume if the bus has been reset
[media] OMAP_VOUT: Fix build break caused by update_mode removal in DSS2
Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
[S390] cio: fix cio_tpi ignoring adapter interrupts
[S390] gmap: always up mmap_sem properly
[S390] Do not clobber personality flags on exec
* git://github.com/davem330/sparc:
sparc64: Force the execute bit in OpenFirmware's translation entries.
sparc: Make '-p' boot option meaningful again.
sparc, exec: remove redundant addr_limit assignment
sparc64: Future proof Niagara cpu detection.
Merge branch 'drm-intel-fixes' of git://people.freedesktop.org/~keithp/linux
* 'drm-intel-fixes' of git://people.freedesktop.org/~keithp/linux:
drm/i915: FBC off for ironlake and older, otherwise on by default
drm/i915: Enable SDVO hotplug interrupts for HDMI and DVI
drm/i915: Enable dither whenever display bpc < frame buffer bpc
powerpc: Fix device-tree matching for Apple U4 bridge
Apple Quad G5 has some oddity in it's device-tree which causes the new
generic matching code to fail to relate nodes for PCI-E devices below U4
with their respective struct pci_dev. This breaks graphics on those
machines among others.
This fixes it using a quirk which copies the node pointer from the host
bridge for the root complex, which makes the generic code work for the
children afterward.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bootup: move 'usermodehelper_enable()' a little earlier
Commit d5767c53535a ("bootup: move 'usermodehelper_enable()' to the end
of do_basic_setup()") moved 'usermodehelper_enable()' to end of
do_basic_setup() to after the initcalls. But then I get failed to let
uvesafb work on my computer, and lose the splash boot.
So maybe we could start usermodehelper_enable a little early to make
some task work that need eary init with the help of user mode.
[ I would *really* prefer that initcalls not call into user space - even
the real 'init' hasn't been execve'd yet, after all! But for uvesafb
it really does look like we don't have much choice.
I considered doing this when we mount the root filesystem, but
depending on config options that is in multiple places. We could do
the usermode helper enable as a rootfs_initcall()..
So I'm just using wang yanqing's trivial patch. It's not wonderful,
but it's simple and should work. We should revisit this some day,
though. - Linus ]
Oliver Hartkopp [Thu, 29 Sep 2011 19:33:47 +0000 (15:33 -0400)]
can bcm: fix incomplete tx_setup fix
The commit aabdcb0b553b9c9547b1a506b34d55a764745870 ("can bcm: fix tx_setup
off-by-one errors") fixed only a part of the original problem reported by
Andre Naujoks. It turned out that the original code needed to be re-ordered
to reduce complexity and to finally fix the reported frame counting issues.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Olsa [Thu, 29 Sep 2011 15:05:08 +0000 (17:05 +0200)]
perf tools: Fix raw sample reading
Wrong pointer is being passed for raw data sanity checking, when parsing
sample event.
This ends up with invalid event and perf record being stuck in
__perf_session__process_events function during processing build IDs
(process_buildids function).
Following command hangs up in my setup:
./perf record -e raw_syscalls:sys_enter ls
The fix is to use proper pointer to the raw data instead of the 'u'
union.
Reviewed-by: David Ahern <dsahern@gmail.com> Cc: David Ahern <dsahern@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1317308709-9474-2-git-send-email-jolsa@redhat.com Signed-off-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
David S. Miller [Thu, 29 Sep 2011 19:18:59 +0000 (12:18 -0700)]
sparc64: Force the execute bit in OpenFirmware's translation entries.
In the OF 'translations' property, the template TTEs in the mappings
never specify the executable bit. This is the case even though some
of these mappings are for OF's code segment.
Therefore, we need to force the execute bit on in every mapping.
This problem can only really trigger on Niagara/sun4v machines and the
history behind this is a little complicated.
Previous to sun4v, the sun4u TTE entries lacked a hardware execute
permission bit. So OF didn't have to ever worry about setting
anything to handle executable pages. Any valid TTE loaded into the
I-TLB would be respected by the chip.
But sun4v Niagara chips have a real hardware enforced executable bit
in their TTEs. So it has to be set or else the I-TLB throws an
instruction access exception with type code 6 (protection violation).
We've been extremely fortunate to not get bitten by this in the past.
The best I can tell is that the OF's mappings for it's executable code
were mapped using permanent locked mappings on sun4v in the past.
Therefore, the fact that we didn't have the exec bit set in the OF
translations we would use did not matter in practice.
Thanks to Greg Onufer for helping me track this down.
Signed-off-by: David S. Miller <davem@davemloft.net>
In the rds_iw_mr_pool struct the free_pinned field keeps track of
memory pinned by free MRs. While this field is incremented properly
upon allocation, it is never decremented upon unmapping. This would
cause the rds_rdma module to crash the kernel upon unloading, by
triggering the BUG_ON in the rds_iw_destroy_mr_pool function.
This change keeps track of the MRs that become unpinned, so that
free_pinned can be decremented appropriately.
Signed-off-by: Jonathan Lallinger <jonathan@ogc.us> Signed-off-by: Steve Wise <swise@ogc.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Brian King [Wed, 28 Sep 2011 05:33:43 +0000 (05:33 +0000)]
ibmveth: Fix oops on request_irq failure
If request_irq fails, the ibmveth driver will overwrite
the rc and end up returning a successful rc on its open
function, resulting in an oops later when a packet gets
sent and buffers are not allocated due to the failed open.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Oliver Hartkopp [Fri, 23 Sep 2011 08:23:47 +0000 (08:23 +0000)]
can bcm: fix tx_setup off-by-one errors
This patch fixes two off-by-one errors that canceled each other out.
Checking for the same condition two times in bcm_tx_timeout_tsklet() reduced
the count of frames to be sent by one. This did not show up the first time
tx_setup is invoked as an additional frame is sent due to TX_ANNONCE.
Invoking a second tx_setup on the same item led to a reduced (by 1) number of
sent frames.
Reported-by: Andre Naujoks <nautsch@gmail.com> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Ian Campbell [Wed, 21 Sep 2011 22:08:26 +0000 (22:08 +0000)]
MAINTAINERS: tehuti: Alexander Indenbaum's address bounces
I got:
Generating server: Tehuti.onmicrosoft.com
baum@tehutinetworks.net
#< #5.1.1 smtp;550 5.1.1 RESOLVER.ADR.RecipNotFound; not found> #SMTP#
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Alexander Indenbaum <baum@tehutinetworks.net> Cc: Andy Gospodarek <andy@greyhouse.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
Richard Cochran [Tue, 20 Sep 2011 01:25:42 +0000 (01:25 +0000)]
dp83640: reduce driver noise
The driver has two warning messages that might be triggered
by normal use cases. When they appear, the messages give the
impression of a never ending series of errors.
This commit changes them to debug messages instead.
Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Signed-off-by: David S. Miller <davem@davemloft.net>
Richard Cochran [Tue, 20 Sep 2011 01:25:41 +0000 (01:25 +0000)]
ptp: fix L2 event message recognition
The IEEE 1588 standard defines two kinds of messages, event and general
messages. Event messages require time stamping, and general do not. When
using UDP transport, two separate ports are used for the two message
types.
The BPF designed to recognize event messages incorrectly classifies L2
general messages as event messages. This commit fixes the issue by
extending the filter to check the message type field for L2 PTP packets.
Event messages are be distinguished from general messages by testing
the "general" bit.
Signed-off-by: Richard Cochran <richard.cochran@omicron.at> Cc: <stable@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
bootup: move 'usermodehelper_enable()' to the end of do_basic_setup()
Doing it just before starting to call into cpu_idle() made a sick kind
of sense only because the original bug we fixed (see commit 288d5abec831: "Boot up with usermodehelper disabled") was about problems
with some scheduler data structures not being initialized, and they had
better be initialized at that point.
But it really didn't make any other conceptual sense, and doing it after
the initial "schedule()" call for the idle thread actually opened up a
race: what if the main initialization thread did everything without
needing to sleep, and got all the way into user land too? Without
actually having scheduled back to the idle thread?
Now, in normal circumstances that doesn't ever happen, but it looks like
Richard Cochran triggered exactly that on his ARM IXP4xx machines:
"I have some ARM IXP4xx based machines that use the two on chip MAC
ports (aka NPEs). The NPE needs a firmware in order to function.
Ever since the following commit [that 288d5abec831 one], it is no
longer possible to bring up the interfaces during the init scripts."
with a call trace showing an ioctl coming from user space. Richard says:
"The init is busybox, and the startup script does mount, syslogd, and
then ifup, so that all can go by quickly."
The fix is to move the usermodehelper_enable() into the main 'init'
thread, and just put it after we've done all our initcalls. By then,
everything really should be up, but we've obviously not actually started
the user-mode portion of init yet.
Reported-and-tested-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sage Weil [Wed, 28 Sep 2011 17:11:04 +0000 (10:11 -0700)]
libceph: fix pg_temp mapping update
The incremental map updates have a record for each pg_temp mapping that is
to be add/updated (len > 0) or removed (len == 0). The old code was
written as if the updates were a complete enumeration; that was just wrong.
Update the code to remove 0-length entries and drop the rbtree traversal.
This avoids misdirected (and hung) requests that manifest as server
errors like
[WRN] client4104 10.0.1.219:0/275025290 misdirected client4104.1:129 0.1 to osd0 not [1,0] in e11/11