Jan Kara [Thu, 5 May 2011 11:59:35 +0000 (13:59 +0200)]
jbd: Fix forever sleeping process in do_get_write_access()
In do_get_write_access() we wait on BH_Unshadow bit for buffer to get
from shadow state. The waking code in journal_commit_transaction() has
a bug because it does not issue a memory barrier after the buffer is moved
from the shadow state and before wake_up_bit() is called. Thus a waitqueue
check can happen before the buffer is actually moved from the shadow state
and waiting process may never be woken. Fix the problem by issuing proper
barrier.
CC: stable@kernel.org Reported-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz>
Robin Dong [Thu, 5 May 2011 02:44:04 +0000 (10:44 +0800)]
ext2: fix error msg when mounting fs with too-large blocksize
When ext2 mounts a filesystem, it attempts to set the block device
blocksize with a call to sb_set_blocksize, which can fail for
several reasons. The current failure message in ext2 prints:
EXT2-fs (loop1): error: blocksize is too small
which is not correct in all cases. This can be demonstrated
by creating a filesystem with
# mkfs.ext2 -b 8192
on a 4k page system, and attempting to mount it.
Change the error message to a more generic:
EXT2-fs (loop1): bad blocksize 8192
to match the error message in ext3.
Signed-off-by: Robin Dong <sanbai@taobao.com> Reviewed-by: Coly Li <bosong.ly@taobao.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz>
If an application program does not make any changes to the indirect
blocks or extent tree, i_datasync_tid will not get updated. If there
are enough commits (i.e., 2**31) such that tid_geq()'s calculations
wrap, and there isn't a currently active transaction at the time of
the fdatasync() call, this can end up triggering a BUG_ON in
fs/jbd/commit.c:
J_ASSERT(journal->j_running_transaction != NULL);
It's pretty rare that this can happen, since it requires the use of
fdatasync() plus *very* frequent and excessive use of fsync(). But
with the right workload, it can.
We fix this by replacing the use of tid_geq() with an equality test,
since there's only one valid transaction id that is valid for us to
start: namely, the currently running transaction (if it exists).
CC: stable@kernel.org Reported-by: Martin_Zielinski@McAfee.com Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz>
Jan Kara [Wed, 27 Apr 2011 16:20:44 +0000 (18:20 +0200)]
ext3: Fix fs corruption when make_indexed_dir() fails
When make_indexed_dir() fails (e.g. because of ENOSPC) after it has allocated
block for index tree root, we did not properly mark all changed buffers dirty.
This lead to only some of these buffers being written out and thus effectively
corrupting the directory.
Fix the issue by marking all changed data dirty even in the error failure case.
CC: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz>
Jan Kara [Thu, 21 Apr 2011 21:47:16 +0000 (23:47 +0200)]
ext3: Fix lock inversion in ext3_symlink()
ext3_symlink() cannot call __page_symlink() with transaction open.
__page_symlink() calls ext3_write_begin() which gets page lock which ranks
above transaction start (thus lock ordering is violated) and and also
ext3_write_begin() waits for a transaction commit when we run out of space
which never happens if we hold transaction open.
Fix the problem by stopping a transaction before calling __page_symlink()
(we have to be careful and put inode to orphan list so that it gets deleted
in case of crash) and starting another one after __page_symlink() returns
for addition of symlink into a directory.
Merge branch 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6
* 'drm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6:
drm/i915: restore only the mode of this driver on lastclose (v2)
drm/radeon/kms: add info query for tile pipes
drm/radeon/kms: add missing safe regs for 6xx/7xx
drm: select FRAMEBUFFER_CONSOLE_PRIMARY if we have FRAMEBUFFER_CONSOLE
Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
nfs: don't lose MS_SYNCHRONOUS on remount of noac mount
NFS: Return meaningful status from decode_secinfo()
NFSv4: Ensure we request the ordinary fileid when doing readdirplus
NFSv4: Ensure that clientid and session establishment can time out
SUNRPC: Allow RPC calls to return ETIMEDOUT instead of EIO
NFSv4.1: Don't loop forever in nfs4_proc_create_session
NFSv4: Handle NFS4ERR_WRONGSEC outside of nfs4_handle_exception()
NFSv4.1: Don't update sequence number if rpc_task is not sent
NFSv4.1: Ensure state manager thread dies on last umount
SUNRPC: Fix the SUNRPC Kerberos V RPCSEC_GSS module dependencies
NFS: Use correct variable for page bounds checking
NFS: don't negotiate when user specifies sec flavor
NFS: Attempt mount with default sec flavor first
NFS: flav_array honors NFS_MAX_SECFLAVORS
NFS: Fix infinite loop in gss_create_upcall()
Don't mark_inode_dirty_sync() while holding lock
NFS: Get rid of pointless test in nfs_commit_done
NFS: Remove unused argument from nfs_find_best_sec()
NFS: Eliminate duplicate call to nfs_mark_request_dirty
NFS: Remove dead code from nfs_fs_mount()
mm: check if PTE is already allocated during page fault
With transparent hugepage support, handle_mm_fault() has to be careful
that a normal PMD has been established before handling a PTE fault. To
achieve this, it used __pte_alloc() directly instead of pte_alloc_map as
pte_alloc_map is unsafe to run against a huge PMD. pte_offset_map() is
called once it is known the PMD is safe.
pte_alloc_map() is smart enough to check if a PTE is already present
before calling __pte_alloc but this check was lost. As a consequence,
PTEs may be allocated unnecessarily and the page table lock taken. Thi
useless PTE does get cleaned up but it's a performance hit which is
visible in page_test from aim9.
This patch simply re-adds the check normally done by pte_alloc_map to
check if the PTE needs to be allocated before taking the page table lock.
The effect is noticable in page_test from aim9.
(While this affects 2.6.38, it is a performance rather than a functional
bug and normally outside the rules -stable. While the big performance
differences are to a microbench, the difference in fork and exec
performance may be significant enough that -stable wants to consider the
patch)
Reported-by: Raz Ben Yehuda <raziebe@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/watchdog.c: disable nmi perf event in the error path of enabling watchdog
In corner cases where softlockup watchdog is not setup successfully, the
relevant nmi perf event for hardlockup watchdog could be disabled, then
the status of the underlying hardware remains unchanged.
Also, if the kthread doesn't start then the hrtimer won't run and the
hardlockup detector will falsely fire.
Signed-off-by: Hillf Danton <dhillf@gmail.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Simon Danner <danner.simon@gmail.com> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds support for 64 bit atomic operations on 32 bit UML systems. XFS
needs them since 2.6.38.
$ make ARCH=um SUBARCH=i386
...
LD .tmp_vmlinux1
fs/built-in.o: In function `xlog_regrant_reserve_log_space':
xfs_log.c:(.text+0xd8584): undefined reference to `atomic64_read_386'
xfs_log.c:(.text+0xd85ac): undefined reference to `cmpxchg8b_emu'
...
PTE pages eat up memory just like anything else, but we do not account for
them in any way in the OOM scores. They are also _guaranteed_ to get
freed up when a process is OOM killed, while RSS is not.
Kukjin Kim [Wed, 27 Apr 2011 22:26:49 +0000 (15:26 -0700)]
MAINTAINERS: add EXYNOS ARM architectures
Signed-off-by: Kukjin Kim <kgene.kim@samsung.com> Cc: Ben Dooks <ben-linux@fluff.org> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
memcg: update documentation to describe usage_in_bytes
Since 569b846d ("memcg: coalesce uncharge during unmap/truncate"), we do
batched (delayed) uncharge at truncation/unmap. And since cdec2e42(memcg:
coalesce charging via percpu storage), we have percpu cache for
res_counter.
These changes improved performance of memory cgroup very much, but made
res_counter->usage usually have a bigger value than the actual value of
memory usage. So, *.usage_in_bytes, which show res_counter->usage, are
not desirable for precise values of memory(and swap) usage anymore.
Instead of removing these files completely(because we cannot know
res_counter->usage without them), this patch updates the meaning of those
files.
Andrea Arcangeli [Wed, 27 Apr 2011 22:26:45 +0000 (15:26 -0700)]
mm: thp: fix /dev/zero MAP_PRIVATE and vm_flags cleanups
The huge_memory.c THP page fault was allowed to run if vm_ops was null
(which would succeed for /dev/zero MAP_PRIVATE, as the f_op->mmap wouldn't
setup a special vma->vm_ops and it would fallback to regular anonymous
memory) but other THP logics weren't fully activated for vmas with vm_file
not NULL (/dev/zero has a not NULL vma->vm_file).
So this removes the vm_file checks so that /dev/zero also can safely use
THP (the other albeit safer approach to fix this bug would have been to
prevent the THP initial page fault to run if vm_file was set).
After removing the vm_file checks, this also makes huge_memory.c stricter
in khugepaged for the DEBUG_VM=y case. It doesn't replace the vm_file
check with a is_pfn_mapping check (but it keeps checking for VM_PFNMAP
under VM_BUG_ON) because for a is_cow_mapping() mapping VM_PFNMAP should
only be allowed to exist before the first page fault, and in turn when
vma->anon_vma is null (so preventing khugepaged registration). So I tend
to think the previous comment saying if vm_file was set, VM_PFNMAP might
have been set and we could still be registered in khugepaged (despite
anon_vma was not NULL to be registered in khugepaged) was too paranoid.
The is_linear_pfn_mapping check is also I think superfluous (as described
by comment) but under DEBUG_VM it is safe to stay.
Andrew Morton [Wed, 27 Apr 2011 22:26:41 +0000 (15:26 -0700)]
vfs: avoid large kmalloc()s for the fdtable
Azurit reports large increases in system time after 2.6.36 when running
Apache. It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
to allocate fdmem if possible").
That patch caused the vfs to use kmalloc() for very large allocations and
this is causing excessive work (and presumably excessive reclaim) within
the page allocator.
Fix it by falling back to vmalloc() earlier - when the allocation attempt
would have been considered "costly" by reclaim.
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/parisc-2.6
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/parisc-2.6:
[PARISC] slub: fix panic with DISCONTIGMEM
[PARISC] set memory ranges in N_NORMAL_MEMORY when onlined
Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6: (42 commits)
[media] media: vb2: correct queue initialization order
[media] media: vb2: fix incorrect v4l2_buffer->flags handling
[media] s5p-fimc: Add support for the buffer timestamps and sequence
[media] s5p-fimc: Fix bytesperline and plane payload setup
[media] s5p-fimc: Do not allow changing format after REQBUFS
[media] s5p-fimc: Fix FIMC3 pixel limits on Exynos4
[media] tda18271: update tda18271c2_rf_cal as per NXP's rev.04 datasheet
[media] tda18271: update tda18271_rf_band as per NXP's rev.04 datasheet
[media] tda18271: fix bad calculation of main post divider byte
[media] tda18271: prog_cal and prog_tab variables should be s32, not u8
[media] tda18271: fix calculation bug in tda18271_rf_tracking_filters_init
[media] omap3isp: queue: Don't corrupt buf->npages when get_user_pages() fails
[media] v4l: Don't register media entities for subdev device nodes
[media] omap3isp: Don't increment node entity use count when poweron fails
[media] omap3isp: lane shifter support
[media] omap3isp: ccdc: support Y10/12, 8-bit bayer fmts
[media] media: add missing 8-bit bayer formats and Y12
[media] v4l: add V4L2_PIX_FMT_Y12 format
cx23885: Fix stv0367 Kconfig dependency
[media] omap3isp: Use isp xclk defines
...
Fix up trivial conflict (spelink errurs) in drivers/media/video/omap3isp/isp.c
Jeff Layton [Wed, 27 Apr 2011 15:49:09 +0000 (11:49 -0400)]
nfs: don't lose MS_SYNCHRONOUS on remount of noac mount
On a remount, the VFS layer will clear the MS_SYNCHRONOUS bit on the
assumption that the flags on the mount syscall will have it set if the
remounted fs is supposed to keep it.
In the case of "noac" though, MS_SYNCHRONOUS is implied. A remount of
such a mount will lose the MS_SYNCHRONOUS flag since "sync" isn't part
of the mount options.
Reported-by: Max Matveev <makc@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: stable@kernel.org Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFS: Return meaningful status from decode_secinfo()
When compiling, I was getting this warning:
fs/nfs/nfs4xdr.c: In function ‘decode_secinfo’:
fs/nfs/nfs4xdr.c:4839:6: warning: variable ‘status’ set but not used
[-Wunused-but-set-variable]
We were unconditionally returning 0 as long as there wasn't an error
coming out of xdr_inline_decode(). We probably want to check the error
status coming out of decode_op_hdr() and decode_secinfo_gss(), rather
than assuming that everything is OK all the time.
NFSv4: Ensure we request the ordinary fileid when doing readdirplus
When readdir() returns a directory entry for the root of a mounted
filesystem, Linux follows the old convention of returning the inode
number of the covered directory (despite newer versions of POSIX declaring
that this is a bug).
To ensure this continues to work, the NFSv4 readdir implementation requests
the 'mounted-on-fileid' from the server.
However, readdirplus also needs to instantiate an inode for this entry, and
for that, we also need to request the real fileid as per this patch.
Michael Schmitz [Tue, 26 Apr 2011 02:51:53 +0000 (14:51 +1200)]
m68k/mm: Set all online nodes in N_NORMAL_MEMORY
For m68k, N_NORMAL_MEMORY represents all nodes that have present memory
since it does not support HIGHMEM. This patch sets the bit at the time
node_present_pages has been set by free_area_init_node.
At the time the node is brought online, the node state would have to be
done unconditionally since information about present memory has not yet
been recorded.
If N_NORMAL_MEMORY is not accurate, slub may encounter errors since it
uses this nodemask to setup per-cache kmem_cache_node data structures.
This pach is an alternative to the one proposed by David Rientjes
<rientjes@google.com> attempting to set node state immediately when
bringing the node online.
Dave Airlie [Thu, 21 Apr 2011 21:18:32 +0000 (22:18 +0100)]
drm/i915: restore only the mode of this driver on lastclose (v2)
i915 calls the panic handler function on last close to reset the modes,
however this is a really bad idea for multi-gpu machines, esp shareable
gpus machines. So add a new entry point for the driver to just restore
its own fbcon mode.
v2: move code into fb helper, fix panic code to block mode change on
powered off GPUs.
[airlied: this hits drm core and I wrote it and it was reviewed on intel-gfx
so really I signed it off twice ;-).] Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Dave Airlie <airlied@redhat.com>
Randy Dunlap [Tue, 26 Apr 2011 19:33:21 +0000 (12:33 -0700)]
init/Kconfig: fix EXPERT menu list
The EXPERT menu list was recently broken by the insertion of a
kconfig symbol (EMBEDDED) at the beginning of the EXPERT list of
kconfig items. Broken by:
commit 6a108a14fa356ef607be308b68337939e56ea94e
Author: David Rientjes <rientjes@google.com>
Date: Thu Jan 20 14:44:16 2011 -0800
kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERT
Restore the EXPERT menu list -- don't inject a symbol (EMBEDDED)
that does not depend on EXPERT into the list.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Foley <pefoley2@verizon.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge branch 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6:
[S390] kvm-390: Let kernel exit SIE instruction on work
[S390] dasd: check sense type in device change handler
[S390] pfault: fix token handling
[S390] qdio: reset error states immediately
[S390] fix page table walk for changing page attributes
[S390] prng: prevent access beyond end of stack
[S390] dasd: fix race between open and offline
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: cleanup error handling in inode.c
Btrfs: put the right bio if we have an error
Btrfs: free bitmaps properly when evicting the cache
Btrfs: Free free_space item properly in btrfs_trim_block_group()
btrfs: add missing spin_unlock to a rare exit path
Btrfs: check return value of kmalloc()
btrfs: fix wrong allocating flag when reading page
Btrfs: fix missing mutex_unlock in btrfs_del_dir_entries_in_log()
Borislav Petkov [Wed, 30 Mar 2011 13:42:10 +0000 (15:42 +0200)]
amd64_edac: Erratum #637 workaround
F15h CPUs may report a non-DRAM address when reporting an error address
belonging to a CC6 state save area. Add a workaround to detect this
condition and compute the actual DRAM address of the error as documented
in the Revision Guide for AMD Family 15h Models 00h-0Fh Processors.
Borislav Petkov [Mon, 21 Mar 2011 19:45:06 +0000 (20:45 +0100)]
amd64_edac: Factor in CC6 save area
F15h and later use a portion of DRAM as a CC6 storage area. BIOS
programs D18F1x[17C:140,7C:40] DRAM Base/Limit accordingly by
subtracting the storage area from the DRAM limit setting. However, in
order for edac to consider that part of DRAM too, we need to include it
into the per-node range.
This warning was wrongfully added for a normal condition - intlvsel
actually selects the destination node when node interleaving is enabled
and it is not a mismatch. For a detailed example, see section 2.8.10.2
"Node Interleaving" in F10h BKDG.
ACPI / PM: Avoid infinite recurrence while registering power resources
There is at least one BIOS with a DSDT containing a power resource
object with a _PR0 entry pointing back to that power resource. In
consequence, while registering that power resource
acpi_bus_get_power_flags() sees that it depends on itself and tries
to register it again, which leads to an infinitely deep recurrence.
This problem was introduced by commit bf325f9538d8c89312be305b9779e
(ACPI / PM: Register power resource devices as soon as they are
needed).
To fix this problem use the observation that power resources cannot
be power manageable and prevent acpi_bus_get_power_flags() from
being called for power resource objects.
References: https://bugzilla.kernel.org/show_bug.cgi?id=31872 Reported-and-tested-by: Pascal Dormeau <pdormeau@free.fr> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Len Brown <lenb@kernel.org> Cc: stable@kernel.org
PM / Wakeup: Fix initialization of wakeup-related device sysfs files
It turns out that some PCI devices are only found to be
wakeup-capable during registration, in which case, when
device_set_wakeup_capable() is called, device_is_registered() already
returns 'true' for the given device, but dpm_sysfs_add() hasn't been
called for it yet. This leads to situations in which the device's
power.can_wakeup flag is not set as requested because of failing
wakeup_sysfs_add() and its wakeup-related sysfs files are not
created, although they should be present. This is a post-2.6.38
regression introduced by commit cb8f51bdadb7969139c2e39c2defd4cde98c1
(PM: Do not create wakeup sysfs files for devices that cannot wake
up).
To work around this problem initialize the device's power.entry
field to an empty list head and make device_set_wakeup_capable()
check if it is still empty before attempting to add the devices
wakeup-related sysfs files with wakeup_sysfs_add(). Namely, if
power.entry is still empty at this point, device_pm_add() hasn't been
called yet for the device and its wakeup-related files will be
created later, so device_set_wakeup_capable() doesn't have to create
them.
Reported-and-tested-by: Tino Keitel <tino.keitel@tikei.de> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ecryptfs/ecryptfs-2.6:
eCryptfs: Flush dirty pages in setattr
eCryptfs: Handle failed metadata read in lookup
eCryptfs: Add reference counting to lower files
eCryptfs: dput dentries returned from dget_parent
eCryptfs: Remove extra d_delete in ecryptfs_rmdir
Eric Paris [Mon, 25 Apr 2011 20:26:29 +0000 (16:26 -0400)]
SELINUX: Make selinux cache VFS RCU walks safe
Now that the security modules can decide whether they support the
dcache RCU walk or not it's possible to make selinux a bit more
RCU friendly. The SELinux AVC and security server access decision
code is RCU safe. A specific piece of the LSM audit code may not
be RCU safe.
This patch makes the VFS RCU walk retry if it would hit the non RCU
safe chunk of code. It will normally just work under RCU. This is
done simply by passing the VFS RCU state as a flag down into the
avc_audit() code and returning ECHILD there if it would have an issue.
Based-on-patch-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now that the whole dcache_hash_bucket crap is gone, go all the way and
also remove the weird locking layering violations for locking the hash
buckets. Add hlist_bl_lock/unlock helpers to move the locking into the
list abstraction instead of requiring each caller to open code it.
After all allowing for the bit locks is the whole point of these helpers
over the plain hlist variant.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
bit_spinlock: don't play preemption games inside the busy loop
When we are waiting for the bit-lock to be released, and are looping
over the 'cpu_relax()' should not be doing anything else - otherwise we
miss the point of trying to do the whole 'cpu_relax()'.
Do the preemption enable/disable around the loop, rather than inside of
it.
Noticed when I was looking at the code generation for the dcache
__d_drop usage, and the code just looked very odd.
After 57db4e8d73ef2b5e94a3f412108dff2576670a8a changed eCryptfs to
write-back caching, eCryptfs page writeback updates the lower inode
times due to the use of vfs_write() on the lower file.
To preserve inode metadata changes, such as 'cp -p' does with
utimensat(), we need to flush all dirty pages early in
ecryptfs_setattr() so that the user-updated lower inode metadata isn't
clobbered later in writeback.
Tyler Hicks [Tue, 15 Mar 2011 19:54:00 +0000 (14:54 -0500)]
eCryptfs: Handle failed metadata read in lookup
When failing to read the lower file's crypto metadata during a lookup,
eCryptfs must continue on without throwing an error. For example, there
may be a plaintext file in the lower mount point that the user wants to
delete through the eCryptfs mount.
If an error is encountered while reading the metadata in lookup(), the
eCryptfs inode's size could be incorrect. We must be sure to reread the
plaintext inode size from the metadata when performing an open() or
setattr(). The metadata is already being read in those paths, so this
adds minimal performance overhead.
This patch introduces a flag which will track whether or not the
plaintext inode size has been read so that an incorrect i_size can be
fixed in the open() or setattr() paths.
Josef Bacik [Mon, 25 Apr 2011 23:43:52 +0000 (19:43 -0400)]
Btrfs: free bitmaps properly when evicting the cache
If our space cache is wrong, we do the right thing and free up everything that
we loaded, however we don't reset the total_bitmaps counter or the thresholds or
anything. So in btrfs_remove_free_space_cache make sure to call free_bitmap()
if it's a bitmap, this will keep us from panicing when we check to make sure we
don't have too many bitmaps. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs: fix wrong allocating flag when reading page
the space cache use extent_readpages() to read free space information,
so we can not use GFP_KERNEL flag to allocate memory, or it may lead
to deadlock.
For any given lower inode, eCryptfs keeps only one lower file open and
multiplexes all eCryptfs file operations through that lower file. The
lower file was considered "persistent" and stayed open from the first
lookup through the lifetime of the inode.
This patch keeps the notion of a single, per-inode lower file, but adds
reference counting around the lower file so that it is closed when not
currently in use. If the reference count is at 0 when an operation (such
as open, create, etc.) needs to use the lower file, a new lower file is
opened. Since the file is no longer persistent, all references to the
term persistent file are changed to lower file.
Locking is added around the sections of code that opens the lower file
and assign the pointer in the inode info, as well as the code the fputs
the lower file when all eCryptfs users are done with it.
This patch is needed to fix issues, when mounted on top of the NFSv3
client, where the lower file is left silly renamed until the eCryptfs
inode is destroyed.
Call dput on the dentries previously returned by dget_parent() in
ecryptfs_rename(). This is needed for supported eCryptfs mounts on top
of the NFSv3 client.
vfs_rmdir() already calls d_delete() on the lower dentry. That was being
duplicated in ecryptfs_rmdir() and caused a NULL pointer dereference
when NFSv3 was the lower filesystem.
NFSv4: Ensure that clientid and session establishment can time out
The following patch ensures that we do not get permanently trapped in
the RPC layer when trying to establish a new client id or session.
This again ensures that the state manager can finish in a timely
fashion when the last filesystem to reference the nfs_client exits.
SUNRPC: Allow RPC calls to return ETIMEDOUT instead of EIO
On occasion, it is useful for the NFS layer to distinguish between
soft timeouts and other EIO errors due to (say) encoding errors,
or authentication errors.
The following patch ensures that the default behaviour of the RPC
layer remains to return EIO on soft timeouts (until we have
audited all the callers).
NFSv4.1: Don't loop forever in nfs4_proc_create_session
If a server for some reason keeps sending NFS4ERR_DELAY errors, we can end
up looping forever inside nfs4_proc_create_session, and so the usual
mechanisms for detecting if the nfs_client is dead don't work.
Fix this by ensuring that we loop inside the nfs4_state_manager thread
instead.
Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev:
libata: ahci_start_engine compliant to AHCI spec
ata: pata_at91.c bugfix for initial_timing initialisation
ata: pata_at91.c bugfix for high master clock
ahci: AHCI-mode SATA patch for Intel Panther Point DeviceIDs
ata_piix: IDE-mode SATA patch for Intel Panther Point DeviceIDs
libata: Pioneer DVR-216D can't do SETXFER
ahci: don't enable port irq before handler is registered
libata: Implement ATA_FLAG_NO_DIPM and apply it to mcp65
libata: Kill unused ATA_DFLAG_{H|D}IPM flags
ahci: EM supported message type sysfs attribute
Merge branch 'for-linus' of git://git.infradead.org/ubifs-2.6
* 'for-linus' of git://git.infradead.org/ubifs-2.6:
UBIFS: fix master node recovery
UBIFS: fix false assertion warning in case of I/O failures
UBIFS: fix false space checking failure
At the end of section 10.1 of AHCI spec (rev 1.3), it states
Software shall not set PxCMD.ST to 1 until it is determined that
a functoinal device is present on the port as determined by
PxTFD.STS.BSY=0, PxTFD.STS.DRQ=0 and PxSSTS.DET=3h
Even though most AHCI host controller works without this check,
specific controller will fail under this condition.
Signed-off-by: Jian Peng <jipeng2005@gmail.com> Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
Igor Plyatov [Mon, 28 Mar 2011 12:56:14 +0000 (16:56 +0400)]
ata: pata_at91.c bugfix for initial_timing initialisation
The "struct ata_timing" must contain 10 members, but ".dmack_hold" member was
forgotten for "initial_timing" initialisation. This patch fixes such a problem.
Signed-off-by: Igor Plyatov <plyatov@gmail.com> Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
Igor Plyatov [Mon, 28 Mar 2011 12:56:15 +0000 (16:56 +0400)]
ata: pata_at91.c bugfix for high master clock
The AT91SAM9 microcontrollers with master clock higher then 105 MHz
and PIO0, have overflow of the NCS_RD_PULSE value in the MSB. This
lead to "NCS_RD_PULSE" pulse longer then "NRD_CYCLE" pulse and driver
does not detect ATA device.
Signed-off-by: Igor Plyatov <plyatov@gmail.com> Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
Jeff Mahoney [Tue, 19 Apr 2011 15:13:32 +0000 (11:13 -0400)]
libata: Pioneer DVR-216D can't do SETXFER
Commit 4a5610a04d415ed94af75bb1159d2621d62c8328 fixed an issue with
the Pioneer DVR-212D not handling SETXFER correctly. An openSUSE user
reported a similar issue with his DVR-216D that the NOSETXFER horkage
worked around for him as well.
This patch adds the DVR-216D (1.08) to the horkage list for NOSETXFER.
The issue was reported at:
https://bugzilla.novell.com/show_bug.cgi?id=679143
Reported-by: Volodymyr Kyrychenko <vladimir.kirichenko@gmail.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
Maxime Bizon [Wed, 16 Mar 2011 13:58:32 +0000 (14:58 +0100)]
ahci: don't enable port irq before handler is registered
The ahci_pmp_attach() & ahci_pmp_detach() unmask port irqs, but they
are also called during port initialization, before ahci host irq
handler is registered. On ce4100 platform, this sometimes triggers
"irq 4: nobody cared" message when loading driver.
Fixed this by not touching the register if the port is in frozen
state, and mark all uninitialized port as frozen.
Signed-off-by: Maxime Bizon <mbizon@freebox.fr> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
Ben Hutchings [Sat, 23 Apr 2011 17:42:56 +0000 (18:42 +0100)]
kconfig: Avoid buffer underrun in choice input
Commit 40aee729b350 ('kconfig: fix default value for choice input')
fixed some cases where kconfig would select the wrong option from a
choice with a single valid option and thus enter an infinite loop.
However, this broke the test for user input of the form 'N?', because
when kconfig selects the single valid option the input is zero-length
and the test will read the byte before the input buffer. If this
happens to contain '?' (as it will in a mips build on Debian unstable
today) then kconfig again enters an infinite loop.
The dentry hashing rules have been really quite complicated for a long
while, in odd ways. That made functions like __d_drop() very fragile
and non-obvious.
In particular, whether a dentry was hashed or not was indicated with an
explicit DCACHE_UNHASHED bit. That's despite the fact that the hash
abstraction that the dentries use actually have a 'is this entry hashed
or not' model (which is a simple test of the 'pprev' pointer).
The reason that was done is because we used the normal 'is this entry
unhashed' model to mark whether the dentry had _ever_ been hashed in the
dentry hash tables, and that logic goes back many years (commit b3423415fbc2: "dcache: avoid RCU for never-hashed dentries").
That, in turn, meant that __d_drop had totally different unhashing logic
for the dentry hash table case and for the anonymous dcache case,
because in order to use the "is this dentry hashed" logic as a flag for
whether it had ever been on the RCU hash table, we had to unhash such a
dentry differently so that we'd never think that it wasn't 'unhashed'
and wouldn't be free'd correctly.
That's just insane. It made the logic really hard to follow, when there
were two different kinds of "unhashed" states, and one of them (the one
that used "list_bl_unhashed()") really had nothing at all to do with
being unhashed per se, but with a very subtle lifetime rule instead.
So turn all of it around, and make it logical.
Instead of having a DENTRY_UNHASHED bit in d_flags to indicate whether
the dentry is on the hash chains or not, use the hash chain unhashed
logic for that. Suddenly "d_unhashed()" just uses "list_bl_unhashed()",
and everything makes sense.
And for the lifetime rule, just use an explicit DENTRY_RCUACCEES bit.
If we ever insert the dentry into the dentry hash table so that it is
visible to RCU lookup, we mark it DENTRY_RCUACCESS to show that it now
needs the RCU lifetime rules. Now suddently that test at dentry free
time makes sense too.
And because unhashing now is sane and doesn't depend on where the dentry
got unhashed from (because the dentry hash chain details doesn't have
some subtle side effects), we can re-unify the __d_drop() logic and use
common code for the unhashing.
Also fix one more open-coded hash chain bit_spin_lock() that I missed in
the previous chain locking cleanup commit.
vfs: get rid of 'struct dcache_hash_bucket' abstraction
It's a useless abstraction for 'hlist_bl_head', and it doesn't actually
help anything - quite the reverse. All the users end up having to know
about the hlist_bl_head details anyway, using 'struct hlist_bl_node *'
etc. So it just makes the code look confusing.
And the cost of it is extra '&b->head' syntactic noise, but more
importantly it spuriously makes the hash table dentry list look
different from the per-superblock DCACHE_DISCONNECTED dentry list.
As a result, the code ended up using ad-hoc locking for one case and
special helper functions for what is really another totally identical
case in the very same function.
Merge branch 'tty-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6
* 'tty-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6:
tty/n_gsm: fix bug in CRC calculation for gsm1 mode
serial/imx: read cts state only after acking cts change irq
parport_pc.c: correctly release the requested region for the IT887x
SECURITY: Move exec_permission RCU checks into security modules
Right now all RCU walks fall back to reference walk when CONFIG_SECURITY
is enabled, even though just the standard capability module is active.
This is because security_inode_exec_permission unconditionally fails
RCU walks.
Move this decision to the low level security module. This requires
passing the RCU flags down the security hook. This way at least
the capability module and a few easy cases in selinux/smack work
with RCU walks with CONFIG_SECURITY=y
Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Eric Paris <eparis@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
ALSA: hda - Fix unused warnings when !SND_HDA_NEEDS_RESUME
ALSA: hda - Add a fix-up for Acer dmic with ALC271x codec
ASoC: add a module alias to the FSI driver
ALSA: emu10k1 - Fix "Music" controls to "Synth" controls in documents
ARM: s3c2440: gta02; Register dfbmcs320 device for BT audio interface
ASoC: codecs: JZ4740: Fix OOPS
ASoC: Fix output PGA enabling in wm_hubs CODECs
ASoC: sn95031: decorate function with __devexit_p()
ASoC: SAMSUNG: Fix the inverted clocks handling for pcm driver
ASoC: sst_platform: Fix lock acquring
ASoC: fsi: driver safely remove for against irq
ASoC: fsi: modify vague PM control on probe
ASoC: fsi: take care in failing case of dai register
MAINTAINERS: Update Samsung ASoC maintainer's id
ASoC: WM8903: HP and Line out PGA/mixer DAPM fixes
ASoC: Set left channel volume update bits for WM8994
ASoC: fix config error path
ASoC: check channel mismatch between cpu_dai and codec_dai
ASoC: Tegra: Suspend/resume support
James Bottomley [Tue, 19 Apr 2011 21:29:36 +0000 (16:29 -0500)]
[PARISC] slub: fix panic with DISCONTIGMEM
Slub makes assumptions about page_to_nid() which are violated by
DISCONTIGMEM and !NUMA. This violation results in a panic because
page_to_nid() can be non-zero for pages in the discontiguous ranges and
this leads to a null return by get_node(). The assertion by the
maintainer is that DISCONTIGMEM should only be allowed when NUMA is also
defined. However, at least six architectures: alpha, ia64, m32r, m68k,
mips, parisc violate this. The panic is a regression against slab, so
just mark slub broken in the problem configuration to prevent users
reporting these panics.
Cc: stable@kernel.org Acked-by: David Rientjes <rientjes@google.com> Acked-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: James Bottomley <James.Bottomley@suse.de>
Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf, x86: Update/fix Intel Nehalem cache events
perf, x86: P4 PMU - Don't forget to clear cpuc->active_mask on overflow
x86, perf event: Turn off unstructured raw event access to offcore registers
perf: Support Xeon E7's via the Westmere PMU driver
perf, x86: P4 PMU - Don't forget to clear cpuc->active_mask on overflow
It's not enough to simply disable event on overflow the
cpuc->active_mask should be cleared as well otherwise counter
may stall in "active" even in real being already disabled (which
potentially may lead to the situation that user may not use this
counter further).
Don pointed out that:
" I also noticed this patch fixed some unknown NMIs
on a P4 when I stressed the box".
Tested-by: Lin Ming <ming.m.lin@intel.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Link: http://lkml.kernel.org/r/1303398203-2918-3-git-send-email-dzickus@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
x86, perf event: Turn off unstructured raw event access to offcore registers
Andi Kleen pointed out that the Intel offcore support patches were merged
without user-space tool support to the functionality:
|
| The offcore_msr perf kernel code was merged into 2.6.39-rc*, but the
| user space bits were not. This made it impossible to set the extra mask
| and actually do the OFFCORE profiling
|
Andi submitted a preliminary patch for user-space support, as an
extension to perf's raw event syntax:
|
| Some raw events -- like the Intel OFFCORE events -- support additional
| parameters. These can be appended after a ':'.
|
| For example on a multi socket Intel Nehalem:
|
| perf stat -e r1b7:20ff -a sleep 1
|
| Profile the OFFCORE_RESPONSE.ANY_REQUEST with event mask REMOTE_DRAM_0
| that measures any access to DRAM on another socket.
|
But this kind of usability is absolutely unacceptable - users should not
be expected to type in magic, CPU and model specific incantations to get
access to useful hardware functionality.
The proper solution is to expose useful offcore functionality via
generalized events - that way users do not have to care which specific
CPU model they are using, they can use the conceptual event and not some
model specific quirky hexa number.
We already have such generalization in place for CPU cache events,
and it's all very extensible.
"Offcore" events measure general DRAM access patters along various
parameters. They are particularly useful in NUMA systems.
We want to support them via generalized DRAM events: either as the
fourth level of cache (after the last-level cache), or as a separate
generalization category.
That way user-space support would be very obvious, memory access
profiling could be done via self-explanatory commands like:
perf record -e dram ./myapp
perf record -e dram-remote ./myapp
... to measure DRAM accesses or more expensive cross-node NUMA DRAM
accesses.
These generalized events would work on all CPUs and architectures that
have comparable PMU features.
( Note, these are just examples: actual implementation could have more
sophistication and more parameter - as long as they center around
similarly simple usecases. )
Now we do not want to revert *all* of the current offcore bits, as they
are still somewhat useful for generic last-level-cache events, implemented
in this commit:
e994d7d23a0b: perf: Fix LLC-* events on Intel Nehalem/Westmere
But we definitely do not yet want to expose the unstructured raw events
to user-space, until better generalization and usability is implemented
for these hardware event features.
( Note: after generalization has been implemented raw offcore events can be
supported as well: there can always be an odd event that is marginally
useful but not useful enough to generalize. DRAM profiling is definitely
*not* such a category so generalization must be done first. )
Furthermore, PERF_TYPE_RAW access to these registers was not intended
to go upstream without proper support - it was a side-effect of the above e994d7d23a0b commit, not mentioned in the changelog.
As v2.6.39 is nearing release we go for the simplest approach: disable
the PERF_TYPE_RAW offcore hack for now, before it escapes into a released
kernel and becomes an ABI.
Once proper structure is implemented for these hardware events and users
are offered usable solutions we can revisit this issue.
Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
ide: unexport DISK_EVENT_MEDIA_CHANGE for ide-gd and ide-cd
block: don't propagate unlisted DISK_EVENTs to userland
elevator: check for ELEVATOR_INSERT_SORT_MERGE in !elvpriv case too
ide: unexport DISK_EVENT_MEDIA_CHANGE for ide-gd and ide-cd
check_events() implementations in both ide-gd and ide-cd are
inadequate for in-kernel event polling. Both generate media change
events continuously when certain conditions are met causing infinite
event loop between the driver and userland event handler.
As disk event now supports suppression of unlisted events, simply
de-listing DISK_EVENT_MEDIA_CHANGE from disk->events resolves the
problem. Internal handling around media revalidation will behave the
same while userland will fall back to userland event polling after
detecting the device doesn't support disk events.
block: don't propagate unlisted DISK_EVENTs to userland
DISK_EVENT_MEDIA_CHANGE is used for both userland visible event and
internal event for revalidation of removeable devices. Some legacy
drivers don't implement proper event detection and continuously
generate events under certain circumstances. For example, ide-cd
generates media changed continuously if there's no media in the drive,
which can lead to infinite loop of events jumping back and forth
between the driver and userland event handler.
This patch updates disk event infrastructure such that it never
propagates events not listed in disk->events to userland. Those
events are processed the same for internal purposes but uevent
generation is suppressed.
This also ensures that userland only gets events which are advertised
in the @events sysfs node lowering risk of confusion.
elevator: check for ELEVATOR_INSERT_SORT_MERGE in !elvpriv case too
The sort insert is the one that goes to the IO scheduler. With
the SORT_MERGE addition, we could bypass IO scheduler setup
but still ask the IO scheduler to insert the request. This would
cause an oops on switching IO schedulers through the sysfs
interface, unless the disk just happened to be idle while it
occured.