Bryan Schumaker [Thu, 10 May 2012 19:07:34 +0000 (15:07 -0400)]
NFS: Create a common fs_mount() function
The nfs4_remote_mount() function was only slightly different from the
nfs_fs_mount() function used by the generic client. I created a new
nfs_mount_info structure to set different parameters to help combine
these functions.
Bryan Schumaker [Thu, 10 May 2012 19:07:32 +0000 (15:07 -0400)]
NFS: Don't pass mount data to nfs_fscache_get_super_cookie()
I intend on creating a single nfs_fs_mount() function used by all our
mount paths. To avoid checking between new mounts and clone mounts, I
instead pass both structures to a new function in super.c that finds the
cache key and then looks up the super cookie.
Bryan Schumaker [Thu, 10 May 2012 19:07:31 +0000 (15:07 -0400)]
NFS: Create a single nfs_get_root()
This patch splits out the NFS v4 specific functionality of
nfs4_get_root() into its own rpc_op called by the generic client, and
leaves nfs4_proc_get_rootfh() as its own stand alone function. This
also allows me to change nfs4_remote_mount(), nfs4_xdev_mount() and
nfs4_remote_referral_mount() to use the generic client's nfs_get_root()
function. Later patches in this series will collapse these functions
into one common function, so using the same get_root() function
everywhere simplifies future changes.
Bryan Schumaker [Thu, 10 May 2012 19:07:30 +0000 (15:07 -0400)]
NFS: Rename nfs4_proc_get_root()
This function is really getting the root filehandle and not the root
dentry of the filesystem. I also removed the rpc_ops lookup from
nfs4_get_rootfh() under the assumption that if we reach this function
then we already know we are using NFS v4.
Trond Myklebust [Wed, 9 May 2012 18:04:55 +0000 (14:04 -0400)]
NFS: Clean up - Rename nfs_unlock_request and nfs_unlock_request_dont_release
Function rename to ensure that the functionality of nfs_unlock_request()
mirrors that of nfs_lock_request(). Then let nfs_unlock_and_release_request()
do the work of what used to be called nfs_unlock_request()...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com>
Trond Myklebust [Wed, 9 May 2012 17:37:43 +0000 (13:37 -0400)]
NFS: nfs_set_page_writeback no longer needs to reference the page
We now hold a reference to the nfs_page across the calls to
nfs_set_page_writeback and nfs_end_page_writeback, and that
means we already have a reference to the struct page.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com>
Trond Myklebust [Wed, 9 May 2012 18:30:35 +0000 (14:30 -0400)]
NFS: Prevent a deadlock in the new writeback code
We have to unlock the nfs_page before we call nfs_end_page_writeback
to avoid races with functions that expect the page to be unlocked
when PG_locked and PG_writeback are not set.
The problem is that nfs_unlock_request also releases the nfs_page,
causing a deadlock if the release of the nfs_open_context
triggers an iput() while the PG_writeback flag is still set...
The solution is to separate the unlocking and release of the nfs_page,
so that we can do the former before nfs_end_page_writeback and the
latter after.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com>
Trond Myklebust [Sun, 6 May 2012 23:46:30 +0000 (19:46 -0400)]
NFSv4: nfs_client_return_marked_delegations can't flush data
Since even filemap_flush() needs to lock pages that are dirty, we
cannot risk calling it from the state manager context. Therefore,
we need to move the call to filemap_flush() to
nfs_async_inode_return_delegation().
Trond Myklebust [Sun, 6 May 2012 23:10:59 +0000 (19:10 -0400)]
NFS: Don't do a full flush to disk on close() if we hold a delegation
If we hold a delegation then we know that it should be safe to continue
to cache the data beyond the close(). However since the process that wrote
the data may die after close(), we may still want to send the data to
server before those RPCSEC_GSS credentials expire. We therefore compromise
by starting writeback to the server, but don't wait for completion.
Trond Myklebust [Fri, 4 May 2012 17:54:24 +0000 (13:54 -0400)]
NFS: Fix sparse warnings
Fix the following sparse warnings:
fs/nfs/direct.c:221:6: warning: symbol 'nfs_direct_readpage_release' was
not declared. Should it be static?
fs/nfs/read.c:38:43: warning: non-ANSI function declaration of function
'nfs_readhdr_alloc'
fs/nfs/objlayout/objio_osd.c:214:5: warning: symbol '__alloc_objio_seg'
was not declared. Should it be static?
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com> Cc: Boaz Harrosh <bharrosh@panasas.com>
Trond Myklebust [Fri, 4 May 2012 17:47:16 +0000 (13:47 -0400)]
NFS: Fix O_DIRECT compile warnings
Fix the following compile warnings:
fs/nfs/direct.c: In function 'nfs_direct_read_schedule_segment':
fs/nfs/direct.c:325:11: warning: comparison of distinct pointer types
lacks a cast [enabled by default]
fs/nfs/direct.c:325:11: warning: comparison of distinct pointer types
lacks a cast [enabled by default]
fs/nfs/direct.c:325:11: warning: comparison of distinct pointer types
lacks a cast [enabled by default]
fs/nfs/direct.c:352:27: warning: comparison of distinct pointer types
lacks a cast [enabled by default]
fs/nfs/direct.c: In function 'nfs_direct_write_schedule_segment':
fs/nfs/direct.c:622:11: warning: comparison of distinct pointer types
lacks a cast [enabled by default]
fs/nfs/direct.c:622:11: warning: comparison of distinct pointer types
lacks a cast [enabled by default]
fs/nfs/direct.c:622:11: warning: comparison of distinct pointer types
lacks a cast [enabled by default]
fs/nfs/direct.c:650:27: warning: comparison of distinct pointer types
lacks a cast [enabled by default]
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Fred Isaman <iisaman@netapp.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Trond Myklebust [Tue, 1 May 2012 21:37:59 +0000 (17:37 -0400)]
NFS: Adapt readdirplus to application usage patterns
While the use of READDIRPLUS is significantly more efficient than
READDIR followed by many LOOKUP calls, it is still less efficient
than just READDIR if the attributes are not required.
This patch tracks when lookups are attempted on the directory,
and uses that information to selectively disable READDIRPLUS
on that directory.
The first 'readdir' call is always served using READDIRPLUS.
Subsequent calls only use READDIRPLUS if there was a successful
lookup or revalidation on a child in the mean time.
Credit for the original idea should go to Neil Brown. See:
http://www.spinics.net/lists/linux-nfs/msg19996.html
However, the implementation in this patch differs from Neil's
in that it focuses on tracking lookups rather than calls to
stat().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Neil Brown <neilb@suse.de>
Now that NFSv2 and NFSv3 have simulated change attributes,
instead of using all three of mtime, ctime and change attribute to
manage data cache consistency, we can simplify the code to just use
the change attribute.
If the inode is being initialised, there is no point in
setting flags such as NFS_INO_INVALID_ACCESS,
NFS_INO_INVALID_ACL or NFS_INO_INVALID_DATA since there are
no cached access calls, acls or data caches to invalidate.
In order to retrieve cache consistency attributes before
anyone else has a chance to change the inode, we need to
put the GETATTR op _before_ the DELEGRETURN op.
We can then use that as part of a 'nfs_post_op_update_inode_force_wcc()'
call, to ensure that we update the attributes without clearing our
cached data.
Fred Isaman [Fri, 20 Apr 2012 18:47:53 +0000 (14:47 -0400)]
NFS: create struct nfs_commit_info
It is COMMIT that is handled the most differently between
the paged and direct paths. Create a structure that encapsulates
everything either path needs to know about the commit state.
We could use void to hide some of the layout driver stuff, but
Trond suggests pulling it out to ensure type checking, given the
huge changes being made, and the fact that it doesn't interfere
with other drivers.
Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fred Isaman [Fri, 20 Apr 2012 23:55:31 +0000 (19:55 -0400)]
NFS: prepare coalesce testing for directio
The coalesce code made assumptions that will no longer be true once
non-page aligned io occurs. This introduces no change in
current behavior, but allows for more general situations to come.
Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fred Isaman [Fri, 20 Apr 2012 18:47:48 +0000 (14:47 -0400)]
NFS: create completion structure to pass into page_init functions
Factors out the code that will need to change when directio
starts using these code paths. This will allow directio to use
the generic pagein and flush routines
Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fred Isaman [Fri, 20 Apr 2012 18:47:47 +0000 (14:47 -0400)]
NFS: merge _full and _partial write rpc_ops
Decouple nfs_pgio_header and nfs_write_data, and have (possibly
multiple) nfs_write_datas each take a refcount on nfs_pgio_header.
For the moment keeps nfs_write_header as a way to preallocate a single
nfs_write_data with the nfs_pgio_header. The code doesn't need this,
and would be prettier without, but given the amount of churn I am
already introducing I didn't want to play with tuning new mempools.
This also fixes bug in pnfs_ld_handle_write_error. In the case of
desc->pg_bsize < PAGE_CACHE_SIZE, the pages list was empty, causing
replay attempt to do nothing.
Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fred Isaman [Fri, 20 Apr 2012 18:47:46 +0000 (14:47 -0400)]
NFS: merge _full and _partial read rpc_ops
Decouple nfs_pgio_header and nfs_read_data, and have (possibly
multiple) nfs_read_datas each take a refcount on nfs_pgio_header.
For the moment keeps nfs_read_header as a way to preallocate a single
nfs_read_data with the nfs_pgio_header. The code doesn't need this,
and would be prettier without, but given the amount of churn I am
already introducing I didn't want to play with tuning new mempools.
This also fixes bug in pnfs_ld_handle_read_error. In the case of
desc->pg_bsize < PAGE_CACHE_SIZE, the pages list was empty, causing
replay attempt to do nothing.
Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fred Isaman [Fri, 20 Apr 2012 18:47:44 +0000 (14:47 -0400)]
NFS: create common nfs_pgio_header for both read and write
In order to avoid duplicating all the data in nfs_read_data whenever we
split it up into multiple RPC calls (either due to a short read result
or due to rsize < PAGE_SIZE), we split out the bits that are the same
per RPC call into a separate "header" structure.
The goal this patch moves towards is to have a single header
refcounted by several rpc_data structures. Thus, want to always refer
from rpc_data to the header, and not the other way. This patch comes
close to that ideal, but the directio code currently needs some
special casing, isolated in the nfs_direct_[read_write]hdr_release()
functions. This will be dealt with in a future patch.
Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fred Isaman [Fri, 20 Apr 2012 18:47:38 +0000 (14:47 -0400)]
NFS4.1: Add lseg to struct nfs4_fl_commit_bucket
Also create a commit_info structure to hold the bucket array and push
it up from the lseg to the layout where it really belongs.
While we are at it, fix a refcounting bug due to an (incorrect)
implicit assumption that filelayout_scan_ds_commit_list always
completely emptied the src list.
This clarifies refcounting, removes the ugly find_only_write_lseg
functions, and pushes the file layout commit code along on the path to
supporting multiple lsegs.
Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fred Isaman [Fri, 20 Apr 2012 18:47:37 +0000 (14:47 -0400)]
NFS4.1: make pnfs_ld_[read|write]_done consistent
The two functions had diverged quite a bit, with the write function
being a bit more robust than the read.
However, these still break badly in the desc->pg_bsize < PAGE_CACHE_SIZE case,
as then there is nothing hanging on the data->pages list, and the resend
ends up doing nothing. This will be fixed in a patch later in the series.
Signed-off-by: Fred Isaman <iisaman@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFS: Honor the authflavor set in the clone mount data
The authflavor is set in an nfs_clone_mount structure and passed to the
xdev_mount() functions where it was promptly ignored. Instead, use it
to initialize an rpc_clnt for the cloned server.
NFS: Fix following referral mount points with different security
I create a new proc_lookup_mountpoint() to use when submounting an NFS
v4 share. This function returns an rpc_clnt to use for performing an
fs_locations() call on a referral's mountpoint.
Whenever lookup sees wrongsec do a secinfo and retry the lookup to find
attributes of the file or directory, such as "is this a referral
mountpoint?". This also allows me to remove handling -NFS4ERR_WRONSEC
as part of getattr xdr decoding.
I was using the same decoder function for SECINFO and SECINFO_NO_NAME,
so it was returning an error when it tried to decode an OP_SECINFO_NO_NAME
header as OP_SECINFO.
SUNRPC: set per-net PipeFS superblock before notification
There can be a case, when on MOUNT event RPC client (after it's dentries were
created) is not longer hold by anyone except notification callback.
I.e. on release this client will be destoroyed. And it's dentries have to be
destroyed as well. Which in turn requires per-net PipeFS superblock to be set.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When attempting to cache ACLs returned from the server, if the bitmap
size + the ACL size is greater than a PAGE_SIZE but the ACL size itself
is smaller than a PAGE_SIZE, we can read past the buffer page boundary.
When calling GETACL, if the size of the bitmap array, the length
attribute and the acl returned by the server is greater than the
allocated buffer(args.acl_len), we can Oops with a General Protection
fault at _copy_from_pages() when we attempt to read past the pages
allocated.
This patch allocates an extra PAGE for the bitmap and checks to see that
the bitmap + attribute_length + ACLs don't exceed the buffer space
allocated to it.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reported-by: Jian Li <jiali@redhat.com>
[Trond: Fixed a size_t vs unsigned int printk() warning] Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All referrals (IPv4 addr, IPv6 addr, and DNS) are broken on mounts of
IPv6 addresses, because validation code uses a path that is parsed
from the dev_name ("<server>:<path>") by splitting on the first colon and
colons are used in IPv6 addrs.
This patch ignores colons within IPv6 addresses that are escaped by '[' and ']'.
Merge tag 'nfs-for-3.4-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- Fix NFSv4 infinite loops on open(O_TRUNC)
- Fix an Oops and an infinite loop in the NFSv4 flock code
- Don't register the PipeFS filesystem until it has been set up
- Fix an Oops in nfs_try_to_update_request
- Don't reuse NFSv4 open owners: fixes a bad sequence id storm.
* tag 'nfs-for-3.4-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Keep dropped state owners on the LRU list for a while
NFSv4: Ensure that we don't drop a state owner more than once
NFSv4: Ensure we do not reuse open owner names
nfs: Enclose hostname in brackets when needed in nfs_do_root_mount
NFS: put open context on error in nfs_flush_multi
NFS: put open context on error in nfs_pagein_multi
NFSv4: Fix open(O_TRUNC) and ftruncate() error handling
NFSv4: Ensure that we check lock exclusive/shared type against open modes
NFSv4: Ensure that the LOCK code sets exception->inode
NFS: check for req==NULL in nfs_try_to_update_request cleanup
SUNRPC: register PipeFS file system after pernet sybsystem
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from H. Peter Anvin.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x32, siginfo: Provide proper overrides for x32 siginfo_t
asm-generic: Allow overriding clock_t and add attributes to siginfo_t
x32: Check __ILP32__ instead of __LP64__ for x32
x86, acpi: Call acpi_enter_sleep_state via an asmlinkage C function from assembler
ACPI: Convert wake_sleep_flags to a value instead of function
x86, apic: APIC code touches invalid MSR on P5 class machines
i387: ptrace breaks the lazy-fpu-restore logic
x86/platform: Remove incorrect error message in x86_default_fixup_cpu_id()
x86, efi: Add dedicated EFI stub entry point
x86/amd: Remove broken links from comment and kernel message
x86, microcode: Ensure that module is only loaded on supported AMD CPUs
x86, microcode: Fix sysfs warning during module unload on unsupported CPUs
Merge branch 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86
Pull x86 platform driver fixes from Matthew Garrett:
"One annoyance fix (make intel_ips stop complaining unnecessarily) and
one oops fix (unterminated list in dell-laptop). Both have been in
-next for a while with no complaints."
* 'for_linus' of git://cavan.codon.org.uk/platform-drivers-x86:
dell-laptop: Terminate quirks list properly
intel_ips: Hush the i915 symbols message
Johannes Weiner [Tue, 24 Apr 2012 18:22:33 +0000 (20:22 +0200)]
mm: memcg: move pc lookup point to commit_charge()
None of the callsites actually need the page_cgroup descriptor
themselves, so just pass the page and do the look up in there.
We already had two bugs (6568d4a 'mm: memcg: update the correct soft
limit tree during migration' and 'memcg: fix Bad page state after
replace_page_cache') where the passed page and pc were not referring
to the same page frame.
David Miller [Wed, 25 Apr 2012 20:10:50 +0000 (16:10 -0400)]
mm: nobootmem: Correct alloc_bootmem semantics.
The comments above __alloc_bootmem_node() claim that the code will
first try the allocation using 'goal' and if that fails it will
try again but with the 'goal' requirement dropped.
Unfortunately, this is not what the code does, so fix it to do so.
This is important for nobootmem conversions to architectures such
as sparc where MAX_DMA_ADDRESS is infinity.
On such architectures all of the allocations done by generic spots,
such as the sparse-vmemmap implementation, will pass in:
__pa(MAX_DMA_ADDRESS)
as the goal, and with the limit given as "-1" this will always fail
unless we add the appropriate fallback logic here.
Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'hsi_fixes_for_3.4' of git://gitorious.org/kernel-hsi/kernel-hsi
Pull HSI fixes and ABI documentation from Carlos Chinea
* tag 'hsi_fixes_for_3.4' of git://gitorious.org/kernel-hsi/kernel-hsi:
HSI: Add HSI ABI documentation
HSI: hsi_char: Remove max_data_size from sysfs
HSI: hsi: Rework hsi_event interface
HSI: hsi: Remove controllers and ports from the bus
HSI: hsi: Fix error path cleanup on client registration
HSI: hsi: Rework hsi_controller release
Bob Peterson [Tue, 10 Apr 2012 18:45:24 +0000 (14:45 -0400)]
GFS2: Instruct DLM to avoid queue convert slowdown
This patch instructs DLM to prevent an "in place" conversion, where the
lock just stays on the granted queue, and instead forces the conversion to
the back of the convert queue. This is done on upward conversions only.
This is useful in cases where, for example, a lock is frequently needed in
PR on one node, but another node needs it temporarily in EX to update it.
This may happen, for example, when the rindex is being updated by gfs2_grow.
The gfs2_grow needs to have the lock in EX, but the other nodes need to
re-read it to retrieve the updates. The glock is already granted in PR on
the non-growing nodes, so this prevents them from continually re-granting
the lock in PR, and forces the EX from gfs2_grow to go through.
Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bug fixes from Ted Ts'o:
"These are two low-risk bug fixes for ext4, fixing a compile warning
and a potential deadlock."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
super.c: unused variable warning without CONFIG_QUOTA
jbd2: use GFP_NOFS for blkdev_issue_flush
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rkuo/linux-hexagon-kernel
Pull Hexagon fixes from Richard Kuo:
"It's mostly compile fixes and the Hexagon portion of a CPU hotplug
patch set."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rkuo/linux-hexagon-kernel:
hexagon: add missing cpu.h include
hexagon/CPU hotplug: Add missing call to notify_cpu_starting()
hexagon: use renamed tick_nohz_idle_* functions
Hexagon: misc compile warning/error cleanup due to missing headers
Merge branch 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
Pull build system failure fix from Michal Marek:
"This fixes build failure with newer gcc that adds some internal
symbols that end in "__mod_*_device_table", but are not actually the
tables themselves."
* 'rc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
Fix modpost failures in fedora 17
Shaohua Li [Fri, 13 Apr 2012 02:27:35 +0000 (10:27 +0800)]
jbd2: use GFP_NOFS for blkdev_issue_flush
flush request is issued in transaction commit code path, so looks using
GFP_KERNEL to allocate memory for flush request bio falls into the classic
deadlock issue. I saw btrfs and dm get it right, but ext4, xfs and md are
using GFP.
Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org
Merge tag 'md-3.4-fixes' of git://neil.brown.name/md
Pull a few more md bug fixes from NeilBrown:
"2 are tagged for -stable, one being for a fairly serious bug that can
corrupt metadata and make it hard to recovery an array. The other is
for a more recent regression since 3.3"
* tag 'md-3.4-fixes' of git://neil.brown.name/md:
md: fix possible corruption of array metadata on shutdown.
md: don't call ->add_disk unless there is good reason.
DM RAID: Use safe version of rdev_for_each