Jiaying Zhang [Tue, 22 Mar 2011 01:38:05 +0000 (21:38 -0400)]
ext4: add more tracepoints and use dev_t in the trace buffer
- Add more ext4 tracepoints.
- Change ext4 tracepoints to use dev_t field with MAJOR/MINOR macros
so that we can save 4 bytes in the ring buffer on some platforms.
- Add sync_mode to ext4_da_writepages, ext4_da_write_pages, and
ext4_da_writepages_result tracepoints. Also remove for_reclaim
field from ext4_da_writepages since it is usually not very useful.
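A minimal userspace sketch of the dev_t packing mentioned above. The macros mirror the kernel-internal layout (20 minor bits); the two structs are hypothetical and exist only to compare sizes. Whether a real trace record shrinks also depends on the alignment of its neighbouring fields, which is presumably why the saving only shows up on some platforms.

```c
#include <stdint.h>

/* Kernel-internal dev_t layout: major in the high bits, 20 bits of minor. */
#define MINORBITS	20
#define MINORMASK	((1U << MINORBITS) - 1)

#define MAJOR(dev)	((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev)	((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi)	(((uint32_t)(ma) << MINORBITS) | (uint32_t)(mi))

/* Hypothetical layout: major and minor recorded as separate fields. */
struct dev_split {
	int32_t dev_major;
	int32_t dev_minor;
};

/* Hypothetical layout: one packed field, decoded only when printing. */
struct dev_packed {
	uint32_t dev;
};
```

With this layout, /dev/sda7 (major 8, minor 7) is stored as MKDEV(8, 7) and decoded again with MAJOR() and MINOR() at print time.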
Eric Sandeen [Tue, 22 Mar 2011 01:25:13 +0000 (21:25 -0400)]
ext4: don't kfree uninitialized s_group_info members
We can call kfree on uninitialized members of the s_group_info array
on the error path. We can avoid this by kzalloc'ing the array.
This doesn't entirely solve the oops on mount if we fail down this
path; failed_mount4: frees the sbi, for one, which gets referenced
later in the failed mount paths - I haven't worked that out yet.
https://bugzilla.kernel.org/show_bug.cgi?id=30872
Reported-by: Eugene A. Shatokhin <dame_eugene@mail.ru> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
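A userspace analogue of the idea, as a sketch only: calloc() stands in for kzalloc() and free() for kfree(). alloc_groups() and its fail_at parameter are hypothetical and are not taken from the ext4 code. Because the pointer array starts out zeroed, the cleanup loop can walk every slot even after a failure part-way through.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: fill an array of ngroups pointers, simulating a
 * failure at index fail_at. Returns 0 on success, -1 on failure.
 */
int alloc_groups(size_t ngroups, size_t fail_at)
{
	/* Zeroed allocation: every slot is NULL until it is filled in. */
	char **info = calloc(ngroups, sizeof(*info));
	size_t i;
	int ret = 0;

	if (!info)
		return -1;

	for (i = 0; i < ngroups; i++) {
		if (i == fail_at) {	/* simulated mid-loop failure */
			ret = -1;
			break;
		}
		info[i] = malloc(16);
		if (!info[i]) {
			ret = -1;
			break;
		}
	}

	/* Safe on every path: freeing a NULL slot is a no-op. */
	for (i = 0; i < ngroups; i++)
		free(info[i]);
	free(info);
	return ret;
}
```

Had the array come from an allocator that does not zero memory, the slots past the failure point would hold garbage and the cleanup loop would free invalid pointers.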
Robin Dong [Tue, 22 Mar 2011 00:39:22 +0000 (20:39 -0400)]
ext4: add missing space in printk's in __ext4_grp_locked_error()
When doing performance testing on an ext4 filesystem, we observed a
warning like this:
EXT4-fs error (device sda7): ext4_mb_generate_buddy:718: group 259825901 blocks in bitmap, 26057 in gd
instead, it should be
"group 2598, 25901 blocks in bitmap, 26057 in gd"
Reviewed-by: Coly Li <bosong.ly@taobao.com> Cc: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Robin Dong <sanbai@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
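The run-together numbers are what adjacent format strings produce when no separator sits between them. A small sketch with snprintf(); the helper names and the exact format strings are illustrative assumptions, not the kernel's printk calls.

```c
#include <stdio.h>

/* Two format fragments with nothing between them: the numbers fuse. */
int fmt_without_sep(char *buf, size_t len, unsigned int group,
		    int bitmap_free, int gd_free)
{
	return snprintf(buf, len, "group %u" "%d blocks in bitmap, %d in gd",
			group, bitmap_free, gd_free);
}

/* Same fragments with ", " appended to the first one. */
int fmt_with_sep(char *buf, size_t len, unsigned int group,
		 int bitmap_free, int gd_free)
{
	return snprintf(buf, len, "group %u, " "%d blocks in bitmap, %d in gd",
			group, bitmap_free, gd_free);
}
```

Called with group 2598 and counts 25901 and 26057, the first helper yields the confusing "group 259825901 ..." text and the second yields the intended message.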
Amir Goldstein [Mon, 21 Mar 2011 02:59:02 +0000 (22:59 -0400)]
ext4: handle errors in ext4_clear_blocks()
Checking return code from ext4_journal_get_write_access() is important
with snapshots, because this function invokes COW, so may return new
errors, such as ENOSPC.
ext4_clear_blocks() now returns < 0 for fatal errors, in which case,
ext4_free_data() is aborted.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Amir Goldstein [Mon, 21 Mar 2011 02:57:02 +0000 (22:57 -0400)]
ext4: unify the ext4_handle_release_buffer() api
There are two wrapper functions which do exactly the same thing:
ext4_journal_release_buffer(), and ext4_handle_release_buffer(). In
addition, ext4_xattr_block_set() calls jbd2_journal_release_buffer()
directly.
Unify all of the code to use ext4_handle_release_buffer(), and get rid
of ext4_journal_release_buffer().
Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Amir Goldstein [Mon, 21 Mar 2011 01:18:44 +0000 (21:18 -0400)]
ext4: handle errors in ext4_rename
Checking return code from ext4_journal_get_write_access() is important
with snapshots, because this function invokes COW, so may return new
errors, such as ENOSPC.
We move the call to ext4_journal_get_write_access earlier in the
function, to simplify error handling in the case that this function
returns an error.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Amir Goldstein [Mon, 21 Mar 2011 01:13:43 +0000 (21:13 -0400)]
jbd2: add COW fields to struct jbd2_journal_handle
Add fields needed for the copy-on-write ext4 development work.
The h_cowing flag is used by ext4 snapshots code to mark the task in
COWING state.
The h_XXX_credits fields are used to track buffer credits usage
(accounted by COW and non-COW operations).
The h_cow_XXX fields are used as per task debugging counters.
Merging this commit into mainline will allow users to test ext4
snapshots as a standalone module, without the need to patch and
install a development kernel.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Amir Goldstein [Mon, 21 Mar 2011 00:08:48 +0000 (20:08 -0400)]
jbd2: add the b_cow_tid field to journal_head struct
The b_cow_tid field will be used by the ext4 snapshots code to store
the transaction id when the buffer was last cowed.
Merging this patch to mainline will allow users to test ext4 snapshots
as a standalone module, without the need to patch and install a
development kernel.
On 64bit machines this field fills in a padding "hole" and does
not increase the size of the struct. On a 32bit machine this patch
increases the size of the struct from 60 to 64 bytes.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
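The padding effect can be reproduced with a toy struct. The two layouts below are hypothetical and much smaller than the real journal_head; they only show how a 4-byte field placed in front of a pointer can land in alignment padding on a 64-bit ABI.

```c
#include <stddef.h>

/* Hypothetical "before" layout: 4-byte field followed by a pointer. */
struct head_before {
	int refcount;
	void *next;
};

/* Hypothetical "after" layout: one more 4-byte field ahead of the pointer. */
struct head_after {
	int refcount;
	unsigned int cow_tid;
	void *next;
};

/* Nonzero when the new field was absorbed by existing padding. */
int new_field_fits_in_padding(void)
{
	return sizeof(struct head_after) == sizeof(struct head_before);
}
```

On common 64-bit ABIs the pointer is 8-byte aligned, so both structs occupy 16 bytes; on a typical 32-bit ABI there is no hole and the second struct grows by 4 bytes.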
Theodore Ts'o [Wed, 16 Mar 2011 21:16:31 +0000 (17:16 -0400)]
ext4: Initialize fsync transaction ids in ext4_new_inode()
When allocating a new inode, we need to make sure i_sync_tid and
i_datasync_tid are initialized. Otherwise, one or both of these two
values could be left initialized to zero, which could potentially
result in BUG_ON in jbd2_journal_commit_transaction.
(This could happen by having journal->commit_request getting set to
zero, which could wake up the kjournald process even though there is
no running transaction, which then causes a BUG_ON via the
J_ASSERT(j_running_transaction != NULL) statement.)
Mingming Cao [Sat, 5 Mar 2011 16:52:45 +0000 (11:52 -0500)]
ext4: Use single thread to perform DIO unwritten conversion
While testing ext4 on a multi-core machine, we found per-CPU
ext4-dio-unwritten threads converting unwritten extents to written
for I/Os completed via the async direct I/O path. One thread per
filesystem is enough; we don't need per-CPU threads to do the
conversion.
Theodore Ts'o [Mon, 28 Feb 2011 18:12:38 +0000 (13:12 -0500)]
ext4: optimize ext4_bio_write_page() when no extent conversion is needed
If no extent conversion is required, wake up any processes waiting for
the page's writeback to be complete and free the ext4_io_end structure
directly in ext4_end_bio() instead of dropping it on the linked list
(which requires taking a spinlock to queue and dequeue the io_end
structure), and waiting for the workqueue to do this work.
This removes an extra scheduling delay before a process waiting for an
fsync() to complete gets woken up, and it also reduces the CPU
overhead for a random write workload.
Amir Goldstein [Mon, 28 Feb 2011 05:53:45 +0000 (00:53 -0500)]
ext4: skip orphan cleanup if fs has unknown ROCOMPAT features
Orphan cleanup is currently executed even if the file system has some
number of unknown ROCOMPAT features, which deletes inodes and frees
blocks, which could be very bad for some RO_COMPAT features,
especially the SNAPSHOT feature.
This patch skips the orphan cleanup if the file system contains
read-only compatible features not known to this ext4 implementation;
such features prevent the fs from being mounted (or remounted) readwrite.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Amir Goldstein [Mon, 28 Feb 2011 04:32:12 +0000 (23:32 -0500)]
ext4: use the nblocks arg to ext4_truncate_restart_trans()
nblocks is passed into ext4_truncate_restart_trans() from
ext4_ext_truncate_extend_restart() with a value different from the default
blocks_for_truncate(), but is being ignored.
The two other calls to ext4_truncate_restart_trans() already pass the
default value, which is then being recalculated inside the function.
Fix the problem by using the passed argument.
Signed-off-by: Amir Goldstein <amir73il@users.sf.net>
Manish Katiyar [Mon, 28 Feb 2011 01:42:06 +0000 (20:42 -0500)]
ext4: fix missing iput of root inode for some mount error paths
This assures that the root inode is not leaked, and that sb->s_root is
NULL, which will prevent generic_shutdown_super() from doing extra
work, including calling sync_filesystem(), which ultimately results in
ext4_sync_fs() getting called with an uninitialized struct super,
which is the cause of the crash noted in Kernel Bugzilla #26752.
Yongqiang Yang [Sun, 27 Feb 2011 22:25:47 +0000 (17:25 -0500)]
ext4: make FIEMAP and delayed allocation play well together
Fix the FIEMAP ioctl so that it returns all of the page ranges which
are still subject to delayed allocation. We were missing some cases
if the file was sparse.
Reported by Chris Mason <chris.mason@oracle.com>:
>We've had reports on btrfs that cp is giving us files full of zeros
>instead of actually copying them. It was tracked down to a bug with
>the btrfs fiemap implementation where it was returning holes for
>delalloc ranges.
>
>Newer versions of cp are trusting fiemap to tell it where the holes
>are, which does seem like a pretty neat trick.
>
>I decided to give xfs and ext4 a shot with a few tests cases too, xfs
>passed with all the ones btrfs was getting wrong, and ext4 got the basic
>delalloc case right.
>$ mkfs.ext4 /dev/xxx
>$ mount /dev/xxx /mnt
>$ dd if=/dev/zero of=/mnt/foo bs=1M count=1
>$ fiemap-test foo
>ext: 0 logical: [ 0.. 255] phys: 0.. 255
>flags: 0x007 tot: 256
>
>Horray! But once we throw a hole in, things go bad:
>$ mkfs.ext4 /dev/xxx
>$ mount /dev/xxx /mnt
>$ dd if=/dev/zero of=/mnt/foo bs=1M count=1 seek=1
>$ fiemap-test foo
>< no output >
>
>We've got a delalloc extent after the hole and ext4 fiemap didn't find
>it. If I run sync to kick the delalloc out:
>$sync
>$ fiemap-test foo
>ext: 0 logical: [ 256.. 511] phys: 34048.. 34303
>flags: 0x001 tot: 256
>
>fiemap-test is sitting in my /usr/local/bin, and I have no idea how it
>got there. It's full of pretty comments so I know it isn't mine, but
>you can grab it here:
>
>http://oss.oracle.com/~mason/fiemap-test.c
>
>xfsqa has a fiemap program too.
After the fix, test results are as follows:
ext: 0 logical: [ 256.. 511] phys: 0.. 255
flags: 0x007 tot: 256
ext: 0 logical: [ 256.. 511] phys: 33280.. 33535
flags: 0x001 tot: 256
Theodore Ts'o [Sun, 27 Feb 2011 22:23:47 +0000 (17:23 -0500)]
ext4: suppress verbose debugging information if malloc-debug is off
If CONFIG_EXT4_DEBUG is enabled, then if a block allocation fails due
to disk being full, a verbose debugging message is printed, even if
the malloc-debug switch has not been enabled. Suppress the debugging
message so that nothing is printed unless malloc-debug has been turned
on.
Theodore Ts'o [Sun, 27 Feb 2011 21:43:24 +0000 (16:43 -0500)]
ext4: don't leave PageWriteback set after memory failure
In ext4_bio_write_page(), if the memory allocation for the struct
ext4_io_page fails, it returns with the page's PageWriteback flag set.
This will end up causing the page not to skip writeback in
WB_SYNC_NONE mode, and in WB_SYNC_ALL mode (i.e., on a sync, fsync, or
umount) the writeback daemon will get stuck forever on the
wait_on_page_writeback() function in write_cache_pages_da().
Or, if journalling is enabled and the file gets deleted, the
journal thread can get stuck in journal_finish_inode_data_buffers()'s
call to filemap_fdatawait().
Another place where things can get hung up is in
truncate_inode_pages(), called out of ext4_evict_inode().
Fix this by not setting PageWriteback until after we have successfully
allocated the struct ext4_io_page.
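The ordering described above, as a userspace sketch. struct fake_page, write_page_sketch() and the simulate_alloc_failure switch are hypothetical stand-ins and do not reflect the real ext4_bio_write_page() signature. The point is only that the flag is set after the allocation has succeeded, so the failure path has nothing to undo.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a page with a writeback flag. */
struct fake_page {
	bool writeback;
};

int write_page_sketch(struct fake_page *page, bool simulate_alloc_failure)
{
	void *io_page = simulate_alloc_failure ? NULL : malloc(32);

	if (!io_page)
		return -ENOMEM;	/* flag was never set, nobody can wait on it */

	page->writeback = true;	/* marked only once the allocation succeeded */
	/* ... the I/O would be submitted here; completion clears the flag ... */
	page->writeback = false;

	free(io_page);
	return 0;
}
```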
Theodore Ts'o [Sat, 26 Feb 2011 19:09:14 +0000 (14:09 -0500)]
ext4: don't lock the next page in write_cache_pages if not needed
If we have accumulated a contiguous region of memory to be written
out, and the next page can be added to this region, don't bother locking
(and then unlocking the page) before writing out the memory. In the
unlikely event that the next page was being written back by some other
CPU, we can also skip waiting for that page to finish writeback.
Theodore Ts'o [Sat, 26 Feb 2011 19:08:11 +0000 (14:08 -0500)]
ext4: remove page_skipped hackery in ext4_da_writepages()
Because the ext4 page writeback codepath had been prematurely calling
clear_page_dirty_for_io(), if it turned out that a particular page
couldn't be written out during a particular pass of
write_cache_pages_da(), the page would have to get redirtied by
calling redirty_page_for_writepage(). Not only was this wasted work,
but redirty_page_for_writepage() would increment wbc->pages_skipped to
signal to writeback_sb_inodes() that buffers were locked, and that it
should skip this inode until later.
Since this signal was incorrect in ext4's case --- which was caused by
ext4's historically incorrect use of write_cache_pages() ---
ext4_da_writepages() saved and restored wbc->pages_skipped to avoid
confusing writeback_sb_inodes().
Now that we've fixed ext4 to call clear_page_dirty_for_io() right
before initiating the page I/O, we can nuke the page_skipped
save/restore hackery, and breathe a sigh of relief.
Theodore Ts'o [Sat, 26 Feb 2011 19:08:01 +0000 (14:08 -0500)]
ext4: clear the dirty bit for a page in writeback at the last minute
Move when we call clear_page_dirty_for_io() to just before we actually
write the page. This simplifies the code somewhat, and avoids marking
pages as clean and then needing to remark them as dirty later.
Curt Wohlgemuth [Sat, 26 Feb 2011 17:27:52 +0000 (12:27 -0500)]
ext4: fix ext4_da_block_invalidatepages() to handle page range properly
If ext4_da_block_invalidatepages() is called because of a
failure from ext4_map_blocks() in mpage_da_map_and_submit(),
it's supposed to clean up -- including unlock -- all the
pages in the mpd structure. But these values may not match
up, even on a system in which block size == page size:
ext4_da_block_invalidatepages() has been using b_blocknr and
b_size; this patch changes it to use first_page and
next_page.
Tested: I injected a small number (5%) of failures in
ext4_map_blocks() in the case that the flags contain
EXT4_GET_BLOCKS_DELALLOC_RESERVE, and ran fsstress on this
kernel. Without this patch, I got hung tasks every time.
With this patch, I see no hangs in many runs of fsstress.
Curt Wohlgemuth [Sat, 26 Feb 2011 17:25:52 +0000 (12:25 -0500)]
ext4: mark multi-page IO complete on mapping failure
In mpage_da_map_and_submit(), if we have a delayed block
allocation failure from ext4_map_blocks(), we need to mark
the IO as complete, by setting
mpd->io_done = 1;
Otherwise, we could end up submitting the pages in an outer
loop; since they are unlocked on mapping failure in
ext4_da_block_invalidatepages(), this will cause a bug check
in mpage_da_submit_io().
I tested this by injecting failures into ext4_map_blocks().
Without this patch, a simple fsstress run will bug check;
with the patch, it works fine.
Coly Li [Thu, 24 Feb 2011 19:10:05 +0000 (14:10 -0500)]
ext4: mballoc: don't replace the current preallocation group unnecessarily
In ext4_mb_check_group_pa(), the current preallocation space is
replaced with a new preallocation space when the two have the same
distance from the goal block.
This doesn't actually gain us anything, so change things so that the
function only switches to the new preallocation group if its distance
from the goal block is strictly smaller than the current preallocation
group's distance from the goal block.
Signed-off-by: Coly Li <bosong.ly@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
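The selection rule can be sketched as follows. struct pa_sketch, distance() and pick_pa() are hypothetical simplifications of the mballoc structures; only the strict comparison is the point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down preallocation descriptor. */
struct pa_sketch {
	uint64_t pstart;	/* first physical block of the preallocation */
};

/* Absolute distance between two block numbers. */
uint64_t distance(uint64_t a, uint64_t b)
{
	return a > b ? a - b : b - a;
}

/* Keep cur unless candidate is strictly closer to goal. */
const struct pa_sketch *pick_pa(uint64_t goal, const struct pa_sketch *cur,
				const struct pa_sketch *candidate)
{
	if (cur == NULL)
		return candidate;
	if (distance(goal, candidate->pstart) < distance(goal, cur->pstart))
		return candidate;
	return cur;	/* a tie keeps the current preallocation */
}
```

With a goal of block 100, preallocations starting at 90 and 110 are equally far away, so the current one is kept; a candidate starting at 105 would replace it.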
Coly Li [Thu, 24 Feb 2011 18:24:18 +0000 (13:24 -0500)]
ext4: remove unnecessary call to mb_find_buddy() in debugging code
In __mb_check_buddy(), look at the code below:
591         fstart = -1;
592         buddy = mb_find_buddy(e4b, 0, &max);
593         for (i = 0; i < max; i++) {
594                 if (!mb_test_bit(i, buddy)) {
595                         MB_CHECK_ASSERT(i >= e4b->bd_info->bb_first_free);
596                         if (fstart == -1) {
597                                 fragments++;
598                                 fstart = i;
599                         }
600                         continue;
601                 }
602                 fstart = -1;
603                 /* check used bits only */
604                 for (j = 0; j < e4b->bd_blkbits + 1; j++) {
605                         buddy2 = mb_find_buddy(e4b, j, &max2);
606                         k = i >> j;
607                         MB_CHECK_ASSERT(k < max2);
608                         MB_CHECK_ASSERT(mb_test_bit(k, buddy2));
609                 }
610         }
611         MB_CHECK_ASSERT(!EXT4_MB_GRP_NEED_INIT(e4b->bd_info));
612         MB_CHECK_ASSERT(e4b->bd_info->bb_fragments == fragments);
613
614         grp = ext4_get_group_info(sb, e4b->bd_group);
615         buddy = mb_find_buddy(e4b, 0, &max);
On line 592, buddy is fetched by mb_find_buddy() with order 0. Between
line 593 and line 615, buddy is not changed, so there is no need to
fetch it again from mb_find_buddy() with order 0.
We can safely remove the second mb_find_buddy() on line 615.
Signed-off-by: Coly Li <bosong.ly@taobao.com> Cc: Alex Tomas <alex@clusterfs.com> Cc: Theodore Tso <tytso@google.com>
Coly Li [Thu, 24 Feb 2011 17:51:59 +0000 (12:51 -0500)]
ext4: code cleanup in mb_find_buddy()
The current code calculates max regardless of whether order is zero,
which is unnecessary. This cleanup patch sets max to
"1 << (e4b->bd_blkbits + 3)" only when order == 0.
Signed-off-by: Coly Li <bosong.ly@taobao.com> Cc: Alex Tomas <alex@clusterfs.com> Cc: Theodore Tso <tytso@google.com>
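The expression is simply the number of bits in one block: a block of 2^blkbits bytes holds 2^(blkbits + 3) bits. A sketch of the resulting control flow, where buddy_max() and the per_order_max table are hypothetical stand-ins (the commit message does not say where the value for other orders comes from):

```c
/*
 * Hypothetical helper: upper bound on valid bit indexes for a given order.
 * per_order_max is an assumed caller-supplied table for orders > 0.
 */
int buddy_max(int order, unsigned int blkbits, const int *per_order_max)
{
	if (order == 0)
		return 1 << (blkbits + 3);	/* bytes per block * 8 bits */
	return per_order_max[order];
}
```

For a 4096-byte block (blkbits == 12) the order-0 bound works out to 1 << 15 == 32768 bits.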
Eric Sandeen [Wed, 23 Feb 2011 22:51:51 +0000 (17:51 -0500)]
ext4: enable acls and user_xattr by default
There's no good reason to require the extra step of providing
a mount option for acl or user_xattr once the feature is configured
on; no other filesystem that I know of requires this.
Userspace patches have set these options in default mount options,
and this patch makes them default in the kernel. At some point
we can start to deprecate the options, perhaps.
For now I've removed default mount option checks in show_options()
to be explicit about what's set, since it's changing the default,
but I'm open to alternatives if desired.
Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Lukas Czerner [Wed, 23 Feb 2011 22:49:51 +0000 (17:49 -0500)]
ext4: Adjust minlen with discard_granularity in the FITRIM ioctl
Discard granularity tells us the minimum size of extent that can be
discarded by the device. If the user supplies a minimum extent that
should be discarded (range.minlen) which is smaller than the discard
granularity, increase minlen to the discard granularity, since there's
no point submitting trim requests that the device will reject anyway.
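The adjustment amounts to taking the larger of the two values. A sketch, assuming both quantities are already expressed in the same unit; adjust_minlen() is a hypothetical helper, and any unit conversion the real ioctl needs is left out.

```c
#include <stdint.h>

/* Raise the caller's minimum extent length to the device granularity. */
uint64_t adjust_minlen(uint64_t minlen, uint64_t discard_granularity)
{
	return minlen < discard_granularity ? discard_granularity : minlen;
}
```

A request with minlen 1 on a device with granularity 8 is treated as minlen 8, while a request with minlen 16 is left alone.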
Lukas Czerner [Wed, 23 Feb 2011 17:42:32 +0000 (12:42 -0500)]
ext4: check if device supports discard in FITRIM ioctl
For a device that does not support discard, the FITRIM ioctl returns
-EOPNOTSUPP when blkdev_issue_discard() returns this error code, which
is how the user is informed that the device does not support discard.
If there are no suitable free extents to be trimmed, then FITRIM will
return success even though the device does not support discard, which
could confuse the user. So check explicitly if the device supports
discard and return an error code at the beginning of the FITRIM ioctl
processing.
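The difference between the two behaviours, as a sketch. fitrim_old() and fitrim_new() are hypothetical models in which each free extent triggers one discard attempt; they are not the ext4 functions.

```c
#include <errno.h>
#include <stdbool.h>

/* Old behaviour: the error only surfaces if a discard is actually issued. */
int fitrim_old(bool supports_discard, int free_extents)
{
	int i;

	for (i = 0; i < free_extents; i++)
		if (!supports_discard)
			return -EOPNOTSUPP;
	return 0;
}

/* New behaviour: check the device up front, before looking at extents. */
int fitrim_new(bool supports_discard, int free_extents)
{
	if (!supports_discard)
		return -EOPNOTSUPP;
	return fitrim_old(true, free_extents);
}
```

With no free extents and no discard support, the old model reports success while the new one reports -EOPNOTSUPP.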
On a server with lots of small files and random access this read-ahead makes
performance worse, and I'd like to disable it. I work around this problem
by using a value of 1, but it still reads an extra block.
This patch fixes the problem by checking for zero explicitly.
Signed-off-by: Alexander V. Lukyanov <lav@netis.ru> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Theodore Ts'o [Tue, 22 Feb 2011 01:39:58 +0000 (20:39 -0500)]
ext4: fix compile warnings with EXT4FS_DEBUG enabled
Compiling 2.6.38-rc1 with EXT4FS_DEBUG turned on produces the
following compile warnings. This patch fixes them.
CC fs/ext4/hash.o
CC fs/ext4/resize.o
fs/ext4/resize.c: In function 'setup_new_group_blocks':
fs/ext4/resize.c:233:2: warning: format '%#04llx' expects type 'long long
unsigned int', but argument 3 has type 'long unsigned int'
fs/ext4/resize.c:251:2: warning: format '%#04llx' expects type 'long long
unsigned int', but argument 3 has type 'long unsigned int'
CC fs/ext4/extents.o
CC fs/ext4/ext4_jbd2.o
CC fs/ext4/migrate.o
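One common way to silence this class of warning is to make the argument match the specifier with an explicit cast (the alternative is to change the specifier to match the argument). Which of the two the patch uses is not stated here, so the helper below is only an illustration; format_block() is a hypothetical name.

```c
#include <stdio.h>

/* Format an unsigned long block number with a long long specifier. */
int format_block(char *buf, size_t len, unsigned long block)
{
	/* The cast makes the argument type agree with %llx on every ABI. */
	return snprintf(buf, len, "%#04llx", (unsigned long long)block);
}
```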
Lukas Czerner [Tue, 22 Feb 2011 01:16:21 +0000 (20:16 -0500)]
ext4: update ext4 documentation
Add documentation for mount options and ioctls to
Documentation/filesystems/ext4.txt, which has not been updated for some
time. Also add documentation for ext4 sysfs tunables to the
Documentation/ABI/testing/sysfs-fs-ext4 file, and fix a few
typographical errors in that file.
Linus Torvalds [Wed, 16 Feb 2011 16:56:55 +0000 (08:56 -0800)]
vfs: fix BUG_ON() in fs/namei.c:1461
When Al moved the nameidata_dentry_drop_rcu_maybe() call into the
do_follow_link function in commit 844a391799c2 ("nothing in
do_follow_link() is going to see RCU"), he mistakenly left the
BUG_ON(inode != path->dentry->d_inode);
behind. Which would otherwise be ok, but that BUG_ON() really needs to
be _after_ dropping RCU, since the dentry isn't necessarily stable
otherwise.
So complete the code movement in that commit, and move the BUG_ON() into
do_follow_link() too. This means that we need to pass in 'inode' as an
argument (just for this one use), but that's a small thing. And
eventually we may be confident enough in our path lookup that we can
just remove the BUG_ON() and the unnecessary inode argument.
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 16 Feb 2011 01:51:18 +0000 (17:51 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
m68knommu: set flow handler for secondary interrupt controller of 5249
m68knommu: remove use of IRQ_FLG_LOCK from 68360 platform support
m68knommu: fix dereference of port.tty
m68knommu: add missing linker __modver section
m68knommu: fix mis-named variable int set_irq_chip loop
m68knommu: add optimize memmove() function
m68k: remove arch specific non-optimized memcmp()
m68knommu: fix use of un-defined _TIF_WORK_MASK
m68knommu: Rename m548x_wdt.c to m54xx_wdt.c
m68knommu: fix m548x_wdt.c compilation after headers renaming
m68knommu: Remove dependencies on nonexistent M68KNOMMU
Greg Ungerer [Tue, 8 Feb 2011 04:40:44 +0000 (14:40 +1000)]
m68knommu: fix mis-named variable int set_irq_chip loop
Compiling for 68360 targets gives:
CC arch/m68knommu/platform/68360/ints.o
arch/m68knommu/platform/68360/ints.c: In function ‘init_IRQ’:
arch/m68knommu/platform/68360/ints.c:135:16: error: ‘irq’ undeclared (first use in this function)
arch/m68knommu/platform/68360/ints.c:135:16: note: each undeclared identifier is reported only once for each function it appears in
Greg Ungerer [Thu, 3 Feb 2011 11:58:39 +0000 (21:58 +1000)]
m68knommu: add optimize memmove() function
Add an m68k/coldfire optimized memmove() function for the m68knommu arch.
This is the same function as used by m68k. Simple speed tests show this
is faster once buffers are larger than 4 bytes, and significantly faster
on much larger buffers (4 times faster above about 100 bytes).
This also goes part of the way to fixing a regression caused by commit
ea61bc461d09e8d331a307916530aaae808c72a2 ("m68k/m68knommu: merge MMU and
non-MMU string.h"), which breaks non-coldfire non-mmu builds (which is
the 68x328 and 68360 families). They currently have no memmove() function
defined, since there was none in the m68knommu/lib functions.
Greg Ungerer [Thu, 3 Feb 2011 11:31:20 +0000 (21:31 +1000)]
m68k: remove arch specific non-optimized memcmp()
The m68k arch implements its own memcmp() function. It is not optimized
in any way (it is the most straightforward coding of memcmp you can get).
Remove it and use the kernel's standard memcmp() implementation.
This also goes part of the way to fixing a regression caused by commit ea61bc461d09e8d331a307916530aaae808c72a2 ("m68k/m68knommu: merge MMU and
non-MMU string.h"), which breaks non-coldfire non-mmu builds (which is
the 68x328 and 68360 families). They currently have no memcmp() function
defined, since there is none in the m68knommu/lib functions.
Linus Torvalds [Tue, 15 Feb 2011 23:25:33 +0000 (15:25 -0800)]
Merge branch 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6
* 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6: (27 commits)
drm/radeon/kms: hopefully fix pll issues for real (v3)
drm/radeon/kms: add bounds checking to avivo pll algo
drm: fix wrong usages of drm_device in DRM Developer's Guide
drm/radeon/kms: fix a few more atombios endian issues
drm/radeon/kms: improve 6xx/7xx CS error output
drm/radeon/kms: check AA resolve registers on r300
drm/radeon/kms: fix tracking of BLENDCNTL, COLOR_CHANNEL_MASK, and GB_Z on r300
drm/radeon/kms: use linear aligned for evergreen/ni bo blits
drm/radeon/kms: use linear aligned for 6xx/7xx bo blits
drm/radeon: fix race between GPU reset and TTM delayed delete thread.
drm/radeon/kms: evergreen/ni big endian fixes (v2)
drm/radeon/kms: 6xx/7xx big endian fixes
drm/radeon/kms: atombios big endian fixes
drm/radeon: 6xx/7xx non-kms endian fixes
drm/radeon/kms: optimize CS state checking for r100->r500
drm: do not leak kernel addresses via /proc/dri/*/vma
drm/radeon/kms: add connector table for mac g5 9600
radeon mkregtable: Add missing fclose() calls
drm/radeon/kms: fix interlaced modes on dce4+
drm/radeon: fix memory debugging since d961db75ce86a84f1f04e91ad1014653ed7d9f46
...
Linus Torvalds [Tue, 15 Feb 2011 23:25:11 +0000 (15:25 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6:
pci: use security_capable() when checking capablities during config space read
Andrea Arcangeli [Tue, 15 Feb 2011 18:02:45 +0000 (19:02 +0100)]
thp: prevent hugepages during args/env copying into the user stack
Transparent hugepages can only be created if rmap is fully
functional. So we must prevent hugepages from being created while
is_vma_temporary_stack() is true.
This also optimizes away some harmless but unnecessary setting of
khugepaged_scan.address and it switches some BUG_ON to VM_BUG_ON.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Tue, 15 Feb 2011 23:19:45 +0000 (15:19 -0800)]
Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6:
ACPI / Video: Probe for output switch method when searching video devices.
ACPI / Wakeup: Enable button GPEs unconditionally during initialization
ACPI / ACPICA: Avoid crashing if _PRW is defined for the root object
ACPI: Fix acpi_os_read_memory() and acpi_os_write_memory() (v2)
Linus Torvalds [Tue, 15 Feb 2011 20:07:35 +0000 (12:07 -0800)]
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx: (21 commits)
dmaengine: add slave-dma maintainer
dma: ipu_idmac: do not lose valid received data in the irq handler
dmaengine: imx-sdma: fix up param for the last BD in sdma_prep_slave_sg()
dmaengine: imx-sdma: correct sdmac->status in sdma_handle_channel_loop()
dmaengine: imx-sdma: return sdmac->status in sdma_tx_status()
dmaengine: imx-sdma: set sdmac->status to DMA_ERROR in err_out of sdma_prep_slave_sg()
dmaengine: imx-sdma: remove IMX_DMA_SG_LOOP handling in sdma_prep_slave_sg()
dmaengine i.MX dma: initialize dma capabilities outside channel loop
dmaengine i.MX DMA: do not initialize chan_id field
dmaengine i.MX dma: check sg entries for valid addresses and lengths
dmaengine i.MX dma: set maximum segment size for our device
dmaengine i.MX SDMA: reserve channel 0 by not registering it
dmaengine i.MX SDMA: initialize dma capabilities outside channel loop
dmaengine i.MX SDMA: do not initialize chan_id field
dmaengine i.MX sdma: check sg entries for valid addresses and lengths
dmaengine i.MX sdma: set maximum segment size for our device
DMA: PL08x: fix channel pausing to timeout rather than lockup
DMA: PL08x: fix infinite wait when terminating transfers
dmaengine: imx-sdma: fix inconsistent naming in sdma_assign_cookie()
dmaengine: imx-sdma: propagate error in sdma_probe() instead of returning 0
...
Linus Torvalds [Tue, 15 Feb 2011 20:06:38 +0000 (12:06 -0800)]
Merge branch 'for-2.6.38' of git://linux-nfs.org/~bfields/linux
* 'for-2.6.38' of git://linux-nfs.org/~bfields/linux:
nfsd: break lease on unlink due to rename
nfsd4: acquire only one lease per file
nfsd4: modify fi_delegations under recall_lock
nfsd4: remove unused deleg dprintk's.
nfsd4: split lease setting into separate function
nfsd4: fix leak on allocation error
nfsd4: add helper function for lease setup
nfsd4: split up nfsd_break_deleg_cb
NFSD: memory corruption due to writing beyond the stat array
NFSD: use nfserr for status after decode_cb_op_status
nfsd: don't leak dentry count on mnt_want_write failure
Linus Torvalds [Tue, 15 Feb 2011 18:19:18 +0000 (10:19 -0800)]
Merge branches 'core-fixes-for-linus' and 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
Revert "lockdep, timer: Fix del_timer_sync() annotation"
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timer debug: Hide kernel addresses via %pK in /proc/timer_list
Linus Torvalds [Tue, 15 Feb 2011 18:18:48 +0000 (10:18 -0800)]
Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Fix text_poke_smp_batch() deadlock
perf tools: Fix thread_map event synthesizing in top and record
watchdog, nmi: Lower the severity of error messages
ARM: oprofile: Fix backtraces in timer mode
oprofile: Fix usage of CONFIG_HW_PERF_EVENTS for oprofile_perf_init and friends
Linus Torvalds [Tue, 15 Feb 2011 18:18:29 +0000 (10:18 -0800)]
Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, dmi, debug: Log board name (when present) in dmesg/oops output
x86, ioapic: Don't warn about non-existing IOAPICs if we have none
x86: Fix mwait_usable section mismatch
x86: Readd missing irq_to_desc() in fixup_irq()
x86: Fix section mismatch in LAPIC initialization
Linus Torvalds [Tue, 15 Feb 2011 16:06:36 +0000 (08:06 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
get rid of nameidata_dentry_drop_rcu() calling nameidata_drop_rcu()
drop out of RCU in return_reval
split do_revalidate() into RCU and non-RCU cases
in do_lookup() split RCU and non-RCU cases of need_revalidate
nothing in do_follow_link() is going to see RCU
task_show_regs used to be a debugging aid in the early bringup days
of Linux on s390. /proc/<pid>/status is a world readable file, it
is not a good idea to show the registers of a process. The only
correct fix is to remove task_show_regs.
Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Chris Wright [Tue, 15 Feb 2011 01:21:49 +0000 (17:21 -0800)]
pci: use security_capable() when checking capablities during config space read
This reintroduces commit 47970b1b which was subsequently reverted
as f00eaeea. The original change was broken and caused X startup
failures and generally made privileged processes incapable of reading
device dependent config space. The normal capable() interface returns
true on success, but the LSM interface returns 0 on success. This thinko
is now fixed in this patch, and has been confirmed to work properly.
So, once again...Eric Paris noted that commit de139a3 ("pci: check caps
from sysfs file open to read device dependent config space") caused the
capability check to bypass security modules and potentially auditing.
Rectify this by calling security_capable() when checking the open file's
capabilities for config space reads.
Reported-by: Eric Paris <eparis@redhat.com> Tested-by: Dave Young <hidave.darkstar@gmail.com> Acked-by: James Morris <jmorris@namei.org> Cc: Dave Airlie <airlied@gmail.com> Cc: Alex Riesen <raa.lkml@gmail.com> Cc: Sedat Dilek <sedat.dilek@googlemail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: James Morris <jmorris@namei.org>
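The mix-up between the two return conventions can be shown with stand-in functions. capable_style() and lsm_style() below only model the conventions described above (true on success versus 0 on success); they are not the kernel interfaces, and the may_read_* helpers are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

/* Models the capable() convention: true means permitted. */
bool capable_style(bool has_cap)
{
	return has_cap;
}

/* Models the LSM convention: 0 means permitted, negative means denied. */
int lsm_style(bool has_cap)
{
	return has_cap ? 0 : -EPERM;
}

/* Broken: treats the LSM-style result as if it were a boolean. */
bool may_read_buggy(bool has_cap)
{
	return lsm_style(has_cap);
}

/* Fixed: success is a return value of exactly 0. */
bool may_read_fixed(bool has_cap)
{
	return lsm_style(has_cap) == 0;
}
```

In the broken variant a privileged caller is refused and an unprivileged one is let through, which matches the symptom of privileged processes being unable to read the config space.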
Naga Chumbalkar [Mon, 14 Feb 2011 22:47:17 +0000 (22:47 +0000)]
x86, dmi, debug: Log board name (when present) in dmesg/oops output
The "Type 2" SMBIOS record that contains Board Name is not
strictly required and may be absent in the SMBIOS on some
platforms.
( Please note that Type 2 is not listed in Table 3 in Sec 6.2
("Required Structures and Data") of the SMBIOS v2.7
Specification. )
Use the Manufacturer Name (aka System Vendor).
Print Board Name only when it is present.
Before the fix:
(i) dmesg output: DMI: /ProLiant DL380 G6, BIOS P62 01/29/2011
(ii) oops output: Pid: 2170, comm: bash Not tainted 2.6.38-rc4+ #3 /ProLiant DL380 G6
After the fix:
(i) dmesg output: DMI: HP ProLiant DL380 G6, BIOS P62 01/29/2011
(ii) oops output: Pid: 2278, comm: bash Not tainted 2.6.38-rc4+ #4 HP ProLiant DL380 G6
Signed-off-by: Naga Chumbalkar <nagananda.chumbalkar@hp.com> Reviewed-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: <stable@kernel.org> # .3x - good for debugging, please apply as far back as it applies cleanly
LKML-Reference: <20110214224423.2182.13929.sendpatchset@nchumbalkar.americas.hpqcorp.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
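A minimal sketch of "print the board name only when present" follows. The function name, argument order and the "/board" suffix layout are assumptions made for illustration; they are not taken from the patch itself.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: format "<vendor> <product>" and append "/<board>"
 * only when a non-empty board name is available.
 * Returns the number of characters snprintf() would have written.
 */
int format_dmi_ident(char *buf, size_t len, const char *vendor,
                     const char *product, const char *board)
{
    assert(buf != NULL && vendor != NULL && product != NULL);

    if (board != NULL && board[0] != '\0')
        return snprintf(buf, len, "%s %s/%s", vendor, product, board);

    /* Board Name absent: no dangling separator is printed. */
    return snprintf(buf, len, "%s %s", vendor, product);
}
```

Called with vendor "HP", product "ProLiant DL380 G6" and a NULL board, the buffer holds "HP ProLiant DL380 G6", in line with the "after the fix" output quoted above.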
Paul Bolle [Mon, 14 Feb 2011 21:52:38 +0000 (22:52 +0100)]
x86, ioapic: Don't warn about non-existing IOAPICs if we have none
mp_find_ioapic() prints errors like:
ERROR: Unable to locate IOAPIC for GSI 13
if it can't find the IOAPIC that manages that specific GSI. I
see errors like that at every boot of a laptop that apparently
doesn't have any IOAPICs.
But if there are no IOAPICs it doesn't seem to be an error that
none can be found. A solution that gets rid of this message is
to directly return if nr_ioapics (still) is zero. (But keep
returning -1 in that case, so nothing breaks from this change.)
The call chain that generates this error is:
pnpacpi_allocated_resource()
case ACPI_RESOURCE_TYPE_IRQ:
pnpacpi_parse_allocated_irqresource()
acpi_get_override_irq()
mp_find_ioapic()
Signed-off-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
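The early return described above can be sketched in userspace as follows. The struct and function names are hypothetical and the lookup is simplified; only the shape of the check (return -1 silently when there are no IOAPICs) follows the commit message.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical model of the GSI range handled by one IOAPIC. */
struct gsi_range {
    unsigned int gsi_base;
    unsigned int gsi_end;
};

/*
 * Return the index of the IOAPIC that manages @gsi, or -1 if none does.
 * The error message is printed only when IOAPICs exist but none matches.
 */
int find_ioapic(const struct gsi_range *ranges, int nr_ioapics,
                unsigned int gsi)
{
    int i;

    assert(nr_ioapics == 0 || ranges != NULL);

    if (nr_ioapics == 0)
        return -1; /* nothing to search: same return value, no message */

    for (i = 0; i < nr_ioapics; i++) {
        if (gsi >= ranges[i].gsi_base && gsi <= ranges[i].gsi_end)
            return i;
    }

    fprintf(stderr, "ERROR: Unable to locate IOAPIC for GSI %u\n", gsi);
    return -1;
}
```

Callers see -1 in both failure cases, so nothing downstream changes; only the message is suppressed on machines without IOAPICs.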
Alex Deucher [Mon, 14 Feb 2011 16:43:11 +0000 (11:43 -0500)]
drm/radeon/kms: hopefully fix pll issues for real (v3)
The problematic boards have a recommended reference divider
to be used when spread spectrum is enabled on the laptop panel.
Enable the use of the recommended reference divider along with
the new pll algo.
v2: testing options
v3: When using the fixed reference divider with LVDS, prefer
min m to max p and use fractional feedback dividers.
Linus Torvalds [Mon, 14 Feb 2011 22:49:29 +0000 (14:49 -0800)]
Merge branch 'fixes' of master.kernel.org:/home/rmk/linux-2.6-arm
* 'fixes' of master.kernel.org:/home/rmk/linux-2.6-arm:
ARM: 6657/1: hw_breakpoint: fix ptrace breakpoint advertising on unsupported arch
ARM: 6656/1: hw_breakpoint: avoid UNPREDICTABLE behaviour when reading DBGDSCR
ARM: 6658/1: collie: do actually pass locomo_info to locomo driver
ARM: 6659/1: Thumb-2: Make CONFIG_OABI_COMPAT depend on !CONFIG_THUMB2_KERNEL
ARM: 6654/1: perf/oprofile: fix off-by-one in stack check
ARM: fixup SMP alternatives in modules
ARM: make SWP emulation explicit on !CPU_USE_DOMAINS
ARM: Avoid building unsafe kernels on OMAP2 and MX3
ARM: pxa: Properly configure PWM period for palm27x
ARM: pxa: only save/restore registers when pm functions are defined
ARM: pxa/colibri: use correct SD detect pin
ARM: pxa: fix mfpr_sync to read from valid offset
Tsutomu Itoh [Mon, 14 Feb 2011 00:45:29 +0000 (00:45 +0000)]
Btrfs: check return value of alloc_extent_map()
I added checks on the return value of alloc_extent_map() in several places.
In addition, alloc_extent_map() returns only a valid address or NULL,
so checking with IS_ERR() is unnecessary. I removed the IS_ERR() checks.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
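Why an error-pointer test does not help with an allocator that returns NULL can be shown with a userspace model. Everything below is a hypothetical sketch: is_err_ptr() only imitates the idea of an IS_ERR()-style range check, and alloc_map_model() stands in for an allocator that returns a valid pointer or NULL.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_MAX_ERRNO 4095

/* Model of an error-pointer test: true only for the top 4095 addresses. */
bool is_err_ptr(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

struct extent_map_model {
    int refs;
};

/* Hypothetical allocator: returns a valid pointer or NULL, never an error pointer. */
struct extent_map_model *alloc_map_model(bool simulate_failure)
{
    if (simulate_failure)
        return NULL;
    return calloc(1, sizeof(struct extent_map_model));
}

/* Caller with the right check: NULL means the allocation failed. */
int use_map(bool simulate_failure)
{
    struct extent_map_model *em = alloc_map_model(simulate_failure);

    assert(!is_err_ptr(em)); /* never true here, so this test catches nothing */
    if (em == NULL)
        return -ENOMEM;

    free(em);
    return 0;
}
```

Since is_err_ptr(NULL) is false, a caller relying on that test alone would go on to dereference NULL after a failed allocation.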
space_args.space_slots is an unsigned 64-bit type controlled by a
possibly unprivileged caller. The comparison as a signed int type
allows providing values that are treated as negative and cause the
subsequent allocation size calculation to wrap, or be truncated to 0.
By providing a size that's truncated to 0, kmalloc() will return
ZERO_SIZE_PTR. It's also possible to provide a value smaller than the
slot count. The subsequent loop ignores the allocation size when
copying data in, resulting in a heap overflow or write to ZERO_SIZE_PTR.
The fix changes the slot count type and comparison typecast to u64,
which prevents truncation or signedness errors, and also ensures that we
don't copy more data than we've allocated in the subsequent loop. Note
that zero-size allocations are no longer possible since there is already
an explicit check for space_args.space_slots being 0 and truncation of
this value is no longer an issue.
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com> Signed-off-by: Josef Bacik <josef@redhat.com> Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
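The truncation and signedness problem described above can be reproduced in a few lines of userspace C. The function names are hypothetical, and the narrowing conversion relies on the usual two's-complement behaviour of common compilers (it is implementation-defined in ISO C).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Buggy clamp: the caller-supplied 64-bit count is compared as a signed int.
 * Values above INT_MAX are truncated and may become 0 or negative.
 */
int clamp_slots_signed(uint64_t requested, int available)
{
    int req = (int)requested; /* keeps only the low 32 bits */

    assert(available >= 0);
    return req < available ? req : available;
}

/* Fixed clamp: both sides stay unsigned 64-bit, so nothing is lost. */
uint64_t clamp_slots_u64(uint64_t requested, uint64_t available)
{
    return requested < available ? requested : available;
}
```

With 8 slots available, a request of 0x100000000 clamps to 0 in the signed version and 0xFFFFFFFF clamps to -1, while the 64-bit version returns 8 in both cases.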
Linus Torvalds [Mon, 14 Feb 2011 18:10:07 +0000 (10:10 -0800)]
Merge branch 'rtc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'rtc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
RTC: Fix minor compile warning
RTC: Convert rtc drivers to use the alarm_irq_enable method
RTC: Fix rtc driver ioctl specific shortcutting
Chris Mason [Mon, 14 Feb 2011 17:52:08 +0000 (12:52 -0500)]
Btrfs: don't release pages when we can't clear the uptodate bits
Btrfs tracks uptodate state in an rbtree as well as in the
page bits. This is supposed to enable us to use block sizes other than
the page size, but there are a few parts still missing before that
completely works.
But, our readpage routine trusts this additional range based tracking
of uptodateness, much in the same way the buffer head up to date bits
are trusted for the other filesystems.
The problem is that sometimes we need to allocate memory in order to
split records in the rbtree, even when we are just clearing bits. This
can be difficult when our clearing function is called with GFP_ATOMIC, which
can happen in the releasepage path.
So, what happens today looks like this:
releasepage called with GFP_ATOMIC
btrfs_releasepage calls clear_extent_bit
clear_extent_bit fails to allocate ram, leaving the up to date bit set
btrfs_releasepage returns success
The end result is the page being gone, but btrfs thinking the range is
up to date. Later on if someone tries to read that same page, the
btrfs readpage code will return immediately thinking the page is already
up to date.
This commit fixes things to fail the releasepage when we can't clear the
extent state bits. It covers both data pages and metadata tree blocks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
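The behaviour change can be sketched with a simplified userspace model. The types and functions here are hypothetical and do not mirror the Btrfs code; they only show a release path that refuses to release when the tracking bits could not be cleared.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the range-based "up to date" tracking. */
struct range_state {
    bool uptodate;
};

/* Clearing may need memory; model a failed atomic allocation with a flag. */
int clear_uptodate(struct range_state *st, bool can_alloc)
{
    assert(st != NULL);
    if (!can_alloc)
        return -ENOMEM; /* bit stays set */
    st->uptodate = false;
    return 0;
}

/* Returns 1 if the page may be released, 0 if it must be kept. */
int release_page(struct range_state *st, bool can_alloc)
{
    if (clear_uptodate(st, can_alloc) < 0)
        return 0; /* keep the page: tracking still says "up to date" */
    return 1;
}
```

The page and the tracking state therefore never disagree: either both are dropped, or both are kept.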
Chris Mason [Thu, 10 Feb 2011 17:35:00 +0000 (12:35 -0500)]
Btrfs: fix page->private races
There is a race where btrfs_releasepage can drop the
page->private contents just as alloc_extent_buffer is setting
up pages for metadata. Because of how the Btrfs page flags work,
this results in us skipping the crc on the page during IO.
This patch solves the race by waiting until after the extent buffer
is inserted into the radix tree before it sets page private.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
J. Bruce Fields [Sun, 6 Feb 2011 21:46:30 +0000 (16:46 -0500)]
nfsd: break lease on unlink due to rename
4795bb37effb7b8fe77e2d2034545d062d3788a8 "nfsd: break lease on unlink,
link, and rename", only broke the lease on the file that was being
renamed, and didn't handle the case where the target path refers to an
already-existing file that will be unlinked by a rename--in that case
the target file should have any leases broken as well.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
J. Bruce Fields [Tue, 1 Feb 2011 00:20:39 +0000 (19:20 -0500)]
nfsd4: acquire only one lease per file
Instead of acquiring one lease each time another client opens a file,
nfsd can acquire just one lease to represent all of them, and reference
count it to determine when to release it.
This fixes a regression introduced by c45821d263a8a5109d69a9e8942b8d65bcd5f31a "locks: eliminate fl_mylease
callback": after that patch, only the struct file * is used to determine
who owns a given lease. But since we recently converted the server to
share a single struct file per open, if we acquire multiple leases on
the same file from nfsd, it then becomes impossible on unlocking a lease
to determine which of those leases (all of which share the same struct
file *) we meant to remove.
Thanks to Takashi Iwai <tiwai@suse.de> for catching a bug in a previous
version of this patch.
Tested-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
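The "one lease, reference counted" idea can be sketched as below. This is a hypothetical userspace model with made-up names, not the nfsd data structures: the lease is taken on the first reference and dropped on the last.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-file state shared by every client that has the file open. */
struct file_lease_ref {
    int refcount;    /* number of users relying on the lease */
    bool lease_held; /* true while the single lease exists */
};

/* First user acquires the lease; later users only bump the count. */
void lease_get(struct file_lease_ref *f)
{
    if (f->refcount++ == 0)
        f->lease_held = true;
}

/* Last user releases the lease; earlier puts only drop the count. */
void lease_put(struct file_lease_ref *f)
{
    assert(f->refcount > 0);
    if (--f->refcount == 0)
        f->lease_held = false;
}
```

Because there is only ever one lease per file, the release path no longer has to guess which of several identical-looking leases to remove.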
NFSD: memory corruption due to writing beyond the stat array
If nfsd fails to find a file exported via NFS in the readahead cache, it
should increment the corresponding nfsdstats counter (ra_depth[10]), but due
to a bug it may instead write to ra_depth[11], corrupting the following field.
In a kernel with NFSDv4 compiled in the corruption takes the form of an
increment of a counter of the number of NFSv4 operation 0's received; since
there is no operation 0, this is harmless.
In a kernel with NFSDv4 disabled it corrupts whatever happens to be in the
memory beyond nfsdstats.
Signed-off-by: Konstantin Khorenko <khorenko@openvz.org> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
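An out-of-range bucket index of this kind can be guarded against with a clamp, sketched below. The bucket formula and names are assumptions for illustration (an 11-entry array with index 10 as the last valid slot); the actual patch may fix the index computation differently.

```c
#include <assert.h>

#define RA_BUCKETS 11 /* assumption: valid indices are 0..10 */

/*
 * Map a search depth to a histogram bucket.
 * The clamp guarantees the result never exceeds RA_BUCKETS - 1,
 * so an increment can never land past the end of the array.
 */
unsigned int ra_bucket(unsigned int depth, unsigned int ra_size)
{
    unsigned int idx;

    assert(ra_size > 0);
    idx = depth * 10 / ra_size;
    return idx < RA_BUCKETS ? idx : RA_BUCKETS - 1;
}
```

Without the clamp, a depth slightly above ra_size would produce index 11 and the increment would hit whatever follows the array.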