Chris Mason [Tue, 31 Mar 2009 17:27:11 +0000 (13:27 -0400)]
Btrfs: add extra flushing for renames and truncates
Renames and truncates are both common ways to replace old data with new
data. The filesystem can make an effort to make sure the new data is
on disk before actually replacing the old data.
This is especially important for rename, which many application use as
though it were atomic for both the data and the metadata involved. The
current btrfs code will happily replace a file that is fully on disk
with one that was just created and still has pending IO.
If we crash after transaction commit but before the IO is done, we'll end
up replacing a good file with a zero length file. The solution used
here is to create a list of inodes that need special ordering and force
them to disk before the commit is done. This is similar to the
ext3 style data=ordering, except it is only done on selected files.
Btrfs is able to get away with this because it does not wait on commits
very often, even for fsync (which use a sub-commit).
For renames, we order the file when it wasn't already
on disk and when it is replacing an existing file. Larger files
are sent to filemap_flush right away (before the transaction handle is
opened).
For truncates, we order if the file goes from non-zero size down to
zero size. This is a little different, because at the time of the
truncate the file has no dirty bytes to order. But, we flag the inode
so that it is added to the ordered list on close (via release method). We
also immediately add it to the ordered list of the current transaction
so that we can try to flush down any writes the application sneaks in
before commit.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Wed, 25 Mar 2009 13:55:11 +0000 (09:55 -0400)]
Btrfs: make sure btrfs_update_delayed_ref doesn't increase ref_mod
btrfs_update_delayed_ref is optimized to add and remove different
references in one pass through the delayed ref tree. It is a zero
sum on the total number of refs on a given extent.
But, the code was recording an extra ref in the head node. This
never made it down to the disk but was used when deciding if it was
safe to free the extent while dropping snapshots.
The fix used here is to make sure the ref_mod count is unchanged
on the head ref when btrfs_update_delayed_ref is called.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Tue, 24 Mar 2009 14:24:31 +0000 (10:24 -0400)]
Btrfs: optimize fsyncs on old files
The fsync log has code to make sure all of the parents of a file are in the
log along with the file. It uses a minimal log of the parent directory
inodes, just enough to get the parent directory on disk.
If the transaction that originally created a file is fully on disk,
and the file hasn't been renamed or linked into other directories, we
can safely skip the parent directory walk. We know the file is on disk
somewhere and we can go ahead and just log that single file.
This is more important now because unrelated unlinks in the parent directory
might make us force a commit if we try to log the parent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Tue, 24 Mar 2009 14:24:20 +0000 (10:24 -0400)]
Btrfs: tree logging unlink/rename fixes
The tree logging code allows individual files or directories to be logged
without including operations on other files and directories in the FS.
It tries to commit the minimal set of changes to disk in order to
fsync the single file or directory that was sent to fsync or O_SYNC.
The tree logging code was allowing files and directories to be unlinked
if they were part of a rename operation where only one directory
in the rename was in the fsync log. This patch adds a few new rules
to the tree logging.
1) on rename or unlink, if the inode being unlinked isn't in the fsync
log, we must force a full commit before doing an fsync of the directory
where the unlink was done. The commit isn't done during the unlink,
but it is forced the next time we try to log the parent directory.
Solution: record transid of last unlink/rename per directory when the
directory wasn't already logged. For renames this is only done when
renaming to a different directory.
The fsync above will unlink the original some_dir without recording
it in its new location (foo2). After a crash, some_dir will be gone
unless the fsync of some_file forces a full commit
2) we must log any new names for any file or dir that is in the fsync
log. This way we make sure not to lose files that are unlinked during
the same transaction.
2a) we must log any new names for any file or dir during rename
when the directory they are being removed from was logged.
2a is actually the more important variant. Without the extra logging
a crash might unlink the old name without recreating the new one
3) after a crash, we must go through any directories with a link count
of zero and redo the rm -rf
mkdir f1/foo
normal commit
rm -rf f1/foo
fsync(f1)
The directory f1 was fully removed from the FS, but fsync was never
called on f1, only its parent dir. After a crash the rm -rf must
be replayed. This must be able to recurse down the entire
directory tree. The inode link count fixup code takes care of the
ugly details.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Tue, 24 Mar 2009 14:24:13 +0000 (10:24 -0400)]
Btrfs: Make sure i_nlink doesn't hit zero too soon during log replay
During log replay, inodes are copied from the log to the main filesystem
btrees. Sometimes they have a zero link count in the log but they actually
gain links during the replay or have some in the main btree.
This patch updates the link count to be at least one after copying the
inode out of the log. This makes sure the inode is deleted during an
iput while the rest of the replay code is still working on it.
The log replay has fixup code to make sure that link counts are correct
at the end of the replay, so we could use any non-zero number here and
it would work fine.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Mon, 16 Mar 2009 14:59:57 +0000 (10:59 -0400)]
Btrfs: limit balancing work while flushing delayed refs
The delayed reference mechanism is responsible for all updates to the
extent allocation trees, including those updates created while processing
the delayed references.
This commit tries to limit the amount of work that gets created during
the final run of delayed refs before a commit. It avoids cowing new blocks
unless it is required to finish the commit, and so it avoids new allocations
that were not really required.
The goal is to avoid infinite loops where we are always making more work
on the final run of delayed refs. Over the long term we'll make a
special log for the last delayed ref updates as well.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 13 Mar 2009 15:41:46 +0000 (11:41 -0400)]
Btrfs: readahead checksums during btrfs_finish_ordered_io
This reads in blocks in the checksum btree before starting the
transaction in btrfs_finish_ordered_io. It makes it much more likely
we'll be able to do operations inside the transaction without
needing any btree reads, which limits transaction latencies overall.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 13 Mar 2009 15:00:37 +0000 (11:00 -0400)]
Btrfs: leave btree locks spinning more often
btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
for the buffers it was dirtying. This may require a kmalloc and it
was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to
set any btree locks they were holding to blocking first.
This commit changes dirty tracking for extent buffers to just use a flag
in the extent buffer. Now that we have one and only one extent buffer
per page, this can be safely done without losing dirty bits along the way.
This also introduces a path->leave_spinning flag that callers of
btrfs_search_slot can use to indicate they will properly deal with a
path returned where all the locks are spinning instead of blocking.
Many of the btree search callers now expect spinning paths,
resulting in better btree concurrency overall.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 13 Mar 2009 00:12:45 +0000 (20:12 -0400)]
Btrfs: Only let very young transactions grow during commit
Commits are fairly expensive, and so btrfs has code to sit around for a while
during the commit and let new writers come in.
But, while we're sitting there, new delayed refs might be added, and those
can be expensive to process as well. Unless the transaction is very very
young, it makes sense to go ahead and let the commit finish without hanging
around.
The commit grow loop isn't as important as it used to be, the fsync logging
code handles most performance critical syncs now.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 13 Mar 2009 00:12:45 +0000 (20:12 -0400)]
Btrfs: reduce stack in cow_file_range
The fs/btrfs/inode.c code to run delayed allocation during writout
needed some stack usage optimization. This is the first pass, it does
the check for compression earlier on, which allows us to do the common
(no compression) case higher up in the call chain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 13 Mar 2009 00:12:45 +0000 (20:12 -0400)]
Btrfs: reduce stalls during transaction commit
To avoid deadlocks and reduce latencies during some critical operations, some
transaction writers are allowed to jump into the running transaction and make
it run a little longer, while others sit around and wait for the commit to
finish.
This is a bit unfair, especially when the callers that jump in do a bunch
of IO that makes all the others procs on the box wait. This commit
reduces the stalls this produces by pre-reading file extent pointers
during btrfs_finish_ordered_io before the transaction is joined.
It also tunes the drop_snapshot code to politely wait for transactions
that have started writing out their delayed refs to finish. This avoids
new delayed refs being flooded into the queue while we're trying to
close off the transaction.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 13 Mar 2009 14:17:05 +0000 (10:17 -0400)]
Btrfs: process the delayed reference queue in clusters
The delayed reference queue maintains pending operations that need to
be done to the extent allocation tree. These are processed by
finding records in the tree that are not currently being processed one at
a time.
This is slow because it uses lots of time searching through the rbtree
and because it creates lock contention on the extent allocation tree
when lots of different procs are running delayed refs at the same time.
This commit changes things to grab a cluster of refs for processing,
using a cursor into the rbtree as the starting point of the next search.
This way we walk smoothly through the rbtree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 13 Mar 2009 14:11:24 +0000 (10:11 -0400)]
Btrfs: try to cleanup delayed refs while freeing extents
When extents are freed, it is likely that we've removed the last
delayed reference update for the extent. This checks the delayed
ref tree when things are freed, and if no ref updates area left it
immediately processes the delayed ref.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 13 Mar 2009 14:10:06 +0000 (10:10 -0400)]
Btrfs: do extent allocation and reference count updates in the background
The extent allocation tree maintains a reference count and full
back reference information for every extent allocated in the
filesystem. For subvolume and snapshot trees, every time
a block goes through COW, the new copy of the block adds a reference
on every block it points to.
If a btree node points to 150 leaves, then the COW code needs to go
and add backrefs on 150 different extents, which might be spread all
over the extent allocation tree.
These updates currently happen during btrfs_cow_block, and most COWs
happen during btrfs_search_slot. btrfs_search_slot has locks held
on both the parent and the node we are COWing, and so we really want
to avoid IO during the COW if we can.
This commit adds an rbtree of pending reference count updates and extent
allocations. The tree is ordered by byte number of the extent and byte number
of the parent for the back reference. The tree allows us to:
1) Modify back references in something close to disk order, reducing seeks
2) Significantly reduce the number of modifications made as block pointers
are balanced around
3) Do all of the extent insertion and back reference modifications outside
of the performance critical btrfs_search_slot code.
#3 has the added benefit of greatly reducing the btrfs stack footprint.
The extent allocation tree modifications are done without the deep
(and somewhat recursive) call chains used in the past.
These delayed back reference updates must be done before the transaction
commits, and so the rbtree is tied to the transaction. Throttling is
implemented to help keep the queue of backrefs at a reasonable size.
Since there was a similar mechanism in place for the extent tree
extents, that is removed and replaced by the delayed reference tree.
Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Chris Mason [Fri, 13 Mar 2009 14:24:59 +0000 (10:24 -0400)]
Btrfs: don't preallocate metadata blocks during btrfs_search_slot
In order to avoid doing expensive extent management with tree locks held,
btrfs_search_slot will preallocate tree blocks for use by COW without
any tree locks held.
A later commit moves all of the extent allocation work for COW into
a delayed update mechanism, and this preallocation will no longer be
required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Kyle McMartin [Mon, 23 Mar 2009 19:25:49 +0000 (15:25 -0400)]
Build with -fno-dwarf2-cfi-asm
With a sufficiently new compiler and binutils, code which wasn't
previously generating .eh_frame sections has begun to. Certain
architectures (powerpc, in this case) may generate unexpected relocation
formats in response to this, preventing modules from loading.
While the new relocation types should probably be handled, revert to the
previous behaviour with regards to generation of .eh_frame sections.
(This was reported against Fedora, which appears to be the only distro
doing any building against gcc-4.4 at present: RH bz#486545.)
Signed-off-by: Kyle McMartin <kyle@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Alexandre Oliva <aoliva@redhat.com> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (32 commits)
ucc_geth: Fix oops when using fixed-link support
dm9000: locking bugfix
net: update dnet.c for bus_id removal
dnet: DNET should depend on HAS_IOMEM
dca: add missing copyright/license headers
nl80211: Check that function pointer != NULL before using it
sungem: missing net_device_ops
be2net: fix to restore vlan ids into BE2 during a IF DOWN->UP cycle
be2net: replenish when posting to rx-queue is starved in out of mem conditions
bas_gigaset: correctly allocate USB interrupt transfer buffer
smsc911x: reset last known duplex and carrier on open
sh_eth: Fix mistake of the address of SH7763
sh_eth: Change handling of IRQ
netns: oops in ip[6]_frag_reasm incrementing stats
net: kfree(napi->skb) => kfree_skb
net: fix sctp breakage
ipv6: fix display of local and remote sit endpoints
net: Document /proc/sys/net/core/netdev_budget
tulip: fix crash on iface up with shirq debug
virtio_net: Make virtio_net support carrier detection
...
Miklos Szeredi [Mon, 23 Mar 2009 15:07:24 +0000 (16:07 +0100)]
fix ptrace slowness
This patch fixes bug #12208:
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=12208
Subject : uml is very slow on 2.6.28 host
This turned out to be not a scheduler regression, but an already
existing problem in ptrace being triggered by subtle scheduler
changes.
The problem is this:
- task A is ptracing task B
- task B stops on a trace event
- task A is woken up and preempts task B
- task A calls ptrace on task B, which does ptrace_check_attach()
- this calls wait_task_inactive(), which sees that task B is still on the runq
- task A goes to sleep for a jiffy
- ...
Since UML does lots of the above sequences, those jiffies quickly add
up to make it slow as hell.
This patch solves this by not rescheduling in read_unlock() after
ptrace_stop() has woken up the tracer.
Thanks to Oleg Nesterov and Ingo Molnar for the feedback.
Anton Vorontsov [Mon, 23 Mar 2009 04:30:52 +0000 (21:30 -0700)]
ucc_geth: Fix oops when using fixed-link support
commit b1c4a9dddf09fe99b8f88252718ac5b357363dc4 ("ucc_geth: Change
uec phy id to the same format as gianfar's") introduced a regression
in the ucc_geth driver that causes this oops when fixed-link is used:
This patch fixes the issue by removing offending (and somewhat
duplicate) code from init_phy() routine, and changes _probe()
function to use uec_mdio_bus_name().
Also, since we fully construct phy_bus_id in the _probe() routine,
we no longer need ->phy_address and ->mdio_bus fields in
ucc_geth_info structure.
I wish the patch would be a bit shorter, but it seems like the only
way to fix the issue in a sane way. Luckily, the patch has been
tested with real PHYs and fixed-link, so no further regressions
expected.
Reported-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com> Tested-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se> Signed-off-by: David S. Miller <davem@davemloft.net>
David Brownell [Mon, 23 Mar 2009 04:28:39 +0000 (21:28 -0700)]
dm9000: locking bugfix
This fixes a locking bug in the dm9000 driver. It calls
request_irq() without setting IRQF_DISABLED ... which is
correct for handlers that support IRQ sharing, since that
behavior is not guaranteed for shared IRQs. However, its
IRQ handler then wrongly assumes that IRQs are blocked.
So the fix just uses the right spinlock primitives in the
IRQ handler.
NOTE: this is a classic example of the type of bug which
lockdep currently masks by forcibly setting IRQF_DISABLED
on IRQ handlers that did not request that flag.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 22 Mar 2009 18:38:57 +0000 (11:38 -0700)]
Merge branch 'fix-includes' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
* 'fix-includes' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
m68k: merge the non-MMU and MMU versions of siginfo.h
m68k: use the MMU version of unistd.h for all m68k platforms
m68k: merge the non-MMU and MMU versions of signal.h
m68k: merge the non-MMU and MMU versions of ptrace.h
m68k: use MMU version of setup.h for both MMU and non-MMU
m68k: merge the non-MMU and MMU versions of sigcontext.h
m68k: merge the non-MMU and MMU versions of swab.h
m68k: merge the non-MMU and MMU versions of param.h
Tyler Hicks [Fri, 20 Mar 2009 07:23:57 +0000 (02:23 -0500)]
eCryptfs: NULL crypt_stat dereference during lookup
If ecryptfs_encrypted_view or ecryptfs_xattr_metadata were being
specified as mount options, a NULL pointer dereference of crypt_stat
was possible during lookup.
This patch moves the crypt_stat assignment into
ecryptfs_lookup_and_interpose_lower(), ensuring that crypt_stat
will not be NULL before we attempt to dereference it.
Thanks to Dan Carpenter and his static analysis tool, smatch, for
finding this bug.
Tyler Hicks [Fri, 20 Mar 2009 06:25:09 +0000 (01:25 -0500)]
eCryptfs: Allocate a variable number of pages for file headers
When allocating the memory used to store the eCryptfs header contents, a
single, zeroed page was being allocated with get_zeroed_page().
However, the size of an eCryptfs header is either PAGE_CACHE_SIZE or
ECRYPTFS_MINIMUM_HEADER_EXTENT_SIZE (8192), whichever is larger, and is
stored in the file's private_data->crypt_stat->num_header_bytes_at_front
field.
ecryptfs_write_metadata_to_contents() was using
num_header_bytes_at_front to decide how many bytes should be written to
the lower filesystem for the file header. Unfortunately, at least 8K
was being written from the page, despite the chance of the single,
zeroed page being smaller than 8K. This resulted in random areas of
kernel memory being written between the 0x1000 and 0x1FFF bytes offsets
in the eCryptfs file headers if PAGE_SIZE was 4K.
This patch allocates a variable number of pages, calculated with
num_header_bytes_at_front, and passes the number of allocated pages
along to ecryptfs_write_metadata_to_contents().
Thanks to Florian Streibelt for reporting the data leak and working with
me to find the problem. 2.6.28 is the only kernel release with this
vulnerability. Corresponds to CVE-2009-0787
radeonfb: Whack the PCI PM register until it sticks
This fixes a regression introduced when we switched to using the core
pci_set_power_state(). The chip seems to need the state to be written
over and over again until it sticks, so we do that.
Note that the code is a bit blunt, without timeout, etc... but that's
pretty much because I put back in there the code exactly as it used to
be before the regression. I still add a call to pci_set_power_state()
at the end so that ACPI gets called appropriately on x86.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Tested-by: Raymond Wooninck <tittiatcoke@gmail.com> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jouni Malinen [Fri, 20 Mar 2009 15:57:36 +0000 (17:57 +0200)]
nl80211: Check that function pointer != NULL before using it
NL80211_CMD_GET_MESH_PARAMS and NL80211_CMD_SET_MESH_PARAMS handlers
did not verify whether a function pointer is NULL (not supported by
the driver) before trying to call the function. The former nl80211
command is available for unprivileged users, too, so this can
potentially allow normal users to kill networking (or worse..) if
mac80211 is built without CONFIG_MAC80211_MESH=y.
Signed-off-by: Jouni Malinen <jouni.malinen@atheros.com> Signed-off-by: John W. Linville <linville@tuxdriver.com>
Jeff Moyer [Thu, 19 Mar 2009 00:04:21 +0000 (17:04 -0700)]
aio: lookup_ioctx can return the wrong value when looking up a bogus context
The libaio test harness turned up a problem whereby lookup_ioctx on a
bogus io context was returning the 1 valid io context from the list
(harness/cases/3.p).
Because of that, an extra put_iocontext was done, and when the process
exited, it hit a BUG_ON in the put_iocontext macro called from exit_aio
(since we expect a users count of 1 and instead get 0).
Thanks to Zach for pointing out that hlist_for_each_entry_rcu will not
return with a NULL tpos at the end of the loop, even if the entry was
not found.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Zach Brown <zach.brown@oracle.com> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Davide Libenzi [Thu, 19 Mar 2009 00:04:19 +0000 (17:04 -0700)]
eventfd: remove fput() call from possible IRQ context
Remove a source of fput() call from inside IRQ context. Myself, like Eric,
wasn't able to reproduce an fput() call from IRQ context, but Jeff said he was
able to, with the attached test program. Independently from this, the bug is
conceptually there, so we might be better off fixing it. This patch adds an
optimization similar to the one we already do on ->ki_filp, on ->ki_eventfd.
Playing with ->f_count directly is not pretty in general, but the alternative
here would be to add a brand new delayed fput() infrastructure, that I'm not
sure is worth it.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Cc: Zach Brown <zach.brown@oracle.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 19 Mar 2009 21:56:35 +0000 (14:56 -0700)]
Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
[S390] make page table upgrade work again
[S390] make page table walking more robust
[S390] Dont check for pfn_valid() in uaccess_pt.c
[S390] ftrace/mcount: fix kernel stack backchain
[S390] topology: define SD_MC_INIT to fix performance regression
[S390] __div64_31 broken for CONFIG_MARCH_G5
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: Clear space_info full when adding new devices
Btrfs: Fix locking around adding new space_info
Linus Torvalds [Thu, 19 Mar 2009 18:32:05 +0000 (11:32 -0700)]
Fix race in create_empty_buffers() vs __set_page_dirty_buffers()
Nick Piggin noticed this (very unlikely) race between setting a page
dirty and creating the buffers for it - we need to hold the mapping
private_lock until we've set the page dirty bit in order to make sure
that create_empty_buffers() might not build up a set of buffers without
the dirty bits set when the page is dirty.
I doubt anybody has ever hit this race (and it didn't solve the issue
Nick was looking at), but as Nick says: "Still, it does appear to solve
a real race, which we should close."
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mikulas Patocka [Thu, 19 Mar 2009 06:53:16 +0000 (23:53 -0700)]
sparc64: Fix crash with /proc/iomem
When you compile kernel on Sparc64 with heap memory checking and type
"cat /proc/iomem", you get a crash, because pointers in struct
resource are uninitialized.
Most code fills struct resource with zeros, so I assume that it is
responsibility of the caller of request_resource to initialized it,
not the responsibility of request_resource functuion.
After 2.6.29 is out, there could be a check for uninitialized fields
added to request_resource to avoid crashes like this.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tilman Schmidt [Thu, 19 Mar 2009 06:44:23 +0000 (23:44 -0700)]
bas_gigaset: correctly allocate USB interrupt transfer buffer
Every USB transfer buffer has to be allocated individually by kmalloc.
Impact: bugfix, no functional change
Signed-off-by: Tilman Schmidt <tilman@imap.cc> Tested-by: Kolja Waschk <kawk@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
smsc911x: reset last known duplex and carrier on open
smsc911x_phy_adjust_link is called periodically by the phy layer (as
it's run in polling mode), and it only updates the hardware when it sees
a change in duplex or carrier. This patch clears the last known values
every time the interface is brought up, instead of only when the module
is loaded.
Without this patch the adjust_link function never updates the hardware
after an ifconfig down; ifconfig up. On a full duplex link this causes
the tx error counter to increment, even though packets are correctly
transmitted, as the default MAC_CR register setting is for half duplex.
The tx errors are "no carrier" errors, which should be ignored in
full-duplex mode. When MAC_CR is set to "full duplex" mode they are
correctly ignored by the hardware.
Note that even with this patch the tx error counter can increment if
packets are transmitted between "ifconfig up" and the first phy poll
interval. An improved solution would use the phy interrupt with phylib,
but I haven't managed to make this work 100% robustly yet.
Signed-off-by: Steve Glendinning <steve.glendinning@smsc.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Aced-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The NAPI poll parameter netdev_budget is not documented in
kernel-docs. Since it may have a substantial effect on at least some
network loads, it should be.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kyle McMartin [Thu, 19 Mar 2009 01:49:01 +0000 (18:49 -0700)]
tulip: fix crash on iface up with shirq debug
Tulip is currently doing request_irq before it has done its
initialization. This is usually not a problem because it hasn't
enable interrupts yet, but with DEBUG_SHIRQ on, we call the irq handler
when registering the interrupt as a sanity check.
This can result in a NULL ptr dereference, so call tulip_init_ring
before request_irq, and add a free_ring function to do the freeing
now shared with tulip_close.
Tested with a shell loop running ifup, ifdown in a loop a few hundred
times with DEBUG_SHIRQ on.
Signed-off-by: Kyle McMartin <kyle@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
virtio_net: Make virtio_net support carrier detection
Impact: Make NetworkManager work with virtio_net
For now the semantics are simple: There is always carrier.
This allows a seamless experience with e.g., qemu/kvm
where NetworkManager just configures and sets up
everything automagically.
If/when a generally agreed-upon way to control
carrier on/off in the emulator/hypervisor level
emerges, it will be trivial to extend the driver
to support that too, but for now even this 2-liner
makes user experience that much better.
Signed-off-by: Pantelis Koukousoulas <pktoss@gmail.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
The un-refactored code checked the link speed and duplex of
every slave on every pass; the refactored code did not do so.
The 802.3ad and balance-alb/tlb modes utilize the speed and
duplex information, and require it to be kept up to date. This patch
adds a notifier check to perform the appropriate updating when the slave
device speed changes.
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michael Chan [Thu, 19 Mar 2009 01:11:51 +0000 (18:11 -0700)]
bnx2: Fix problem of using wrong IRQ handler.
The MSI-X handler was chosen before the call to pci_enable_msix().
If MSI-X was not available, the wrong MSI-X handler would be used in
INTA mode. This would cause a screaming interrupt problem because
INTA would not be cleared by the MSI-X handler.
Fixed by assigning MSI-X handler after pci_enable_msix() returns
successfully. Also update version to 1.9.3.
Thomas Chenault <thomas_chenault@dell.com> helped us find this problem.
Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 18 Mar 2009 14:39:11 +0000 (07:39 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
ALSA: Fix vunmap and free order in snd_free_sgbuf_pages()
ALSA: mixart, fix lock imbalance
ALSA: pcm_oss, fix locking typo
ALSA: oss-mixer - Fixes recording gain control
ALSA: hda - Workaround for buggy DMA position on ATI controllers
ALSA: hda - Fix DMA mask for ATI controllers
ALSA: opl3sa2 - Fix NULL dereference when suspending snd_opl3sa2
After TASK_SIZE now gives the current size of the address space the
upgrade of a 64 bit process from 3 to 4 levels of page table needs
to use the arch_mmap_check hook to catch large mmap lengths. The
get_unmapped_area* functions need to check for -ENOMEM from the
arch_get_unmapped_area*, upgrade the page table and retry.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Make page table walking on s390 more robust. The current code requires
that the pgd/pud/pmd/pte loop is only done for address ranges that are
below the end address of the last vma of the address space. But this
is not always true, e.g. the generic page table walker does not guarantee
this. Change TASK_SIZE/TASK_SIZE_OF to reflect the current size of the
address space. This makes the generic page table walker happy but it
breaks the upgrade of a 3 level page table to a 4 level page table.
To make the upgrade work again another fix is required.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Gerald Schaefer [Wed, 18 Mar 2009 12:27:35 +0000 (13:27 +0100)]
[S390] Dont check for pfn_valid() in uaccess_pt.c
pfn_valid() actually checks for a valid struct page and not for a
valid pfn. Using xip mappings w/o struct pages, this will result in
-EFAULT returned by the (page table walk) user copy functions,
even though there is valid memory. Those user copy functions don't
need a struct page, so this patch just removes the pfn_valid() check.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Heiko Carstens [Wed, 18 Mar 2009 12:27:33 +0000 (13:27 +0100)]
[S390] topology: define SD_MC_INIT to fix performance regression
The default values for SD_MC_INIT cause an additional cpu usage of up
to 40% on some network benchmarks compared to the plain SD_CPU_INIT
values. So just define SD_MC_INIT to SD_CPU_INIT.
More tuning needs to be done.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The implementation of __div64_31 for G5 machines is broken. The comments
in __div64_31 are correct, only the code does not do what the comments
say. The part "If the remainder has overflown subtract base and increase
the quotient" is only partially realized, the base is subtracted correctly
but the quotient is only increased if the dividend had the last bit set.
Using the correct instruction fixes the problem.
Cc: stable@kernel.org Reported-by: Frans Pop <elendil@planet.nl> Tested-by: Frans Pop <elendil@planet.nl> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Takashi Iwai [Tue, 17 Mar 2009 13:00:06 +0000 (14:00 +0100)]
ALSA: Fix vunmap and free order in snd_free_sgbuf_pages()
In snd_free_sgbuf_pags(), vunmap() is called after releasing the SG
pages, and it causes errors on Xen as Xen manages the pages
differently. Although no significant errors have been reported on
the actual hardware, this order should be fixed other way round,
first vunmap() then free pages.
Cc: Jan Beulich <jbeulich@novell.com> Cc: <stable@kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Viral Mehta [Tue, 10 Mar 2009 14:43:18 +0000 (15:43 +0100)]
ALSA: oss-mixer - Fixes recording gain control
At the time of initialization, SNDRV_MIXER_OSS_PRESENT_PVOLUME bit is not
set for MIC (slot 7).
So, the same should not be checked when an application tries to do gain
control for audio recording devices.
Just check slot->present for SNDRV_MIXER_OSS_PRESENT_CVOLUME independently.
Verified with a simple application which opens /dev/dsp for recording and
/dev/mixer for volume control.
Takashi Iwai [Tue, 17 Mar 2009 06:49:14 +0000 (07:49 +0100)]
ALSA: hda - Workaround for buggy DMA position on ATI controllers
The position-buffer on ATI controllers are unreliable as well as
on VIA chips, thus the same workaround for DMA position reading as
VIA is useful for ATI.
Takashi Iwai [Tue, 17 Mar 2009 06:47:18 +0000 (07:47 +0100)]
ALSA: hda - Fix DMA mask for ATI controllers
ATI controllers (at least some SB0600 models) appear buggy to handle
64bit DMA. As a workaround, reset GCAP bit0 and let the driver to
use only 32bit DMA on these controllers.
Linus Torvalds [Wed, 18 Mar 2009 03:55:40 +0000 (20:55 -0700)]
Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix bb_prealloc_list corruption due to wrong group locking
ext4: fix bogus BUG_ONs in in mballoc code
ext4: Print the find_group_flex() warning only once
ext4: fix header check in ext4_ext_search_right() for deep extent trees.
Masami Hiramatsu [Mon, 16 Mar 2009 22:13:36 +0000 (18:13 -0400)]
module: fix refptr allocation and release order
Impact: fix ref-after-free crash on failed module load
Fix refptr bug: Change refptr allocation and release order not to access a module
data structure pointed by 'mod' after freeing mod->module_core.
This bug will cause kernel panic(e.g. failed to find undefined symbols).
This bug was reported on systemtap bugzilla.
http://sources.redhat.com/bugzilla/show_bug.cgi?id=9927
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com> Cc: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Thomas Bartosik [Mon, 16 Mar 2009 15:04:38 +0000 (16:04 +0100)]
USB: storage: Unusual USB device Prolific 2507 variation added
The "c-enter" USB to Toshiba 1.8" IDE enclosure needs special treatment
to work flawlessly. This patch is absolutely trivial, as the integrated
USB-IDE bridge is already identified to be an "unusual" device, only the
bcdDevice is different (lower) to the bcdDeviceMin already included in
the kernel.
It is a Prolific 2507 bridge.
Dirk Hohndel [Sun, 15 Mar 2009 03:47:39 +0000 (20:47 -0700)]
USB: Add Vendor/Product ID for new CDMA U727 to option driver
* newer versions of the Novatel Wireless U727 CDMA 3G USB stick
have a different Product ID (0x5010); adding this ID makes them
work just fine with the option driver
Signed-off-by: Moritz Muehlenhoff <jmm@debian.org> Tested-by: Jan Heitkoetter <devnull@heitkoetter.net> Cc: stable <stable@kernel.org> Signed-off-by: Phil Dibowitz <phil@ipom.com>
Dan Williams [Thu, 12 Mar 2009 10:53:00 +0000 (06:53 -0400)]
USB: Option: let cdc-acm handle Sony Ericsson F3507g / Dell 5530
The generic cdc-acm driver is now the best one to handle Sony Ericsson
F3507g-based devices (which the Dell 5530 is a rebrand of), now that all
the pieces are in place (ie, cac477e8f1038c41b6f29d3161ce351462ef3df7).
Removing the IDs from option allows cdc-acm to handle the device.
Signed-off-by: Dan Williams <dcbw@redhat.com> Cc: stable <stable@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Alan Stern [Mon, 16 Mar 2009 18:21:56 +0000 (14:21 -0400)]
USB: EHCI: expedite unlinks when the root hub is suspended
This patch (as1225) fixes a bug in ehci-hcd. The condition for
whether unlinked QHs can become IDLE should not be that the controller
is halted, but rather that the controller isn't running. In other
words when the root hub is suspended, the hardware doesn't own any
QHs.
This fixes a problem that can show up during hibernation: If a QH is
only partially unlinked when the root hub is frozen, then when the
root hub is thawed the QH won't be in the IDLE state. As a result it
can't be used properly for new URB submissions.
Karsten Wiese [Thu, 26 Feb 2009 00:47:48 +0000 (01:47 +0100)]
USB: EHCI: Fix isochronous URB leak
ehci-hcd uses usb_get_urb() and usb_put_urb() in an unbalanced way causing
isochronous URB's kref.counts incrementing once per usb_submit_urb() call.
The culprit is *usb being set to NULL when usb_put_urb() is called after URB
is given back.
Due to other fixes there is no need for ehci-hcd to deal with usb_get_urb()
nor usb_put_urb() anymore, so patch removes their usages in ehci-hcd.
Patch also makes ehci_to_hcd(ehci)->self.bandwidth_allocated adjust, if a
stream finishes.