Let filesystems handle waiting for direct I/O requests themselves instead
of doing it beforehand. This means filesystem-specific locks to prevent
new dio referenes from appearing can be held. This is important to allow
generalizing i_dio_count to non-DIO_LOCKING filesystems.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
i_alloc_sem is a rather special rw_semaphore. It's the last one that may
be released by a non-owner, and it's write side is always mirrored by
real exclusion. It's intended use it to wait for all pending direct I/O
requests to finish before starting a truncate.
Replace it with a hand-grown construct:
- exclusion for truncates is already guaranteed by i_mutex, so it can
simply fall way
- the reader side is replaced by an i_dio_count member in struct inode
that counts the number of pending direct I/O requests. Truncate can't
proceed as long as it's non-zero
- when i_dio_count reaches non-zero we wake up a pending truncate using
wake_up_bit on a new bit in i_flags
- new references to i_dio_count can't appear while we are waiting for
it to read zero because the direct I/O count always needs i_mutex
(or an equivalent like XFS's i_iolock) for starting a new operation.
This scheme is much simpler, and saves the space of a spinlock_t and a
struct list_head in struct inode (typically 160 bits on a non-debug 64-bit
system).
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara [Fri, 24 Jun 2011 18:29:41 +0000 (14:29 -0400)]
ext4: Rewrite ext4_page_mkwrite() to use generic helpers
Rewrite ext4_page_mkwrite() to use __block_page_mkwrite() helper. This
removes the need of using i_alloc_sem to avoid races with truncate which
seems to be the wrong locking order according to lock ordering documented in
mm/rmap.c. Also calling ext4_da_write_begin() as used by the old code seems to
be problematic because we can decide to flush delay-allocated blocks which
will acquire s_umount semaphore - again creating unpleasant lock dependency
if not directly a deadlock.
Also add a check for frozen filesystem so that we don't busyloop in page fault
when the filesystem is frozen.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:46 +0000 (14:14 +1000)]
xfs: make use of new shrinker callout for the inode cache
Convert the inode reclaim shrinker to use the new per-sb shrinker
operations. This allows much bigger reclaim batches to be used, and
allows the XFS inode cache to be shrunk in proportion with the VFS
dentry and inode caches. This avoids the problem of the VFS caches
being shrunk significantly before the XFS inode cache is shrunk
resulting in imbalances in the caches during reclaim.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:45 +0000 (14:14 +1000)]
vfs: increase shrinker batch size
Now that the per-sb shrinker is responsible for shrinking 2 or more
caches, increase the batch size to keep econmies of scale for
shrinking each cache. Increase the shrinker batch size to 1024
objects.
To allow for a large increase in batch size, add a conditional
reschedule to prune_icache_sb() so that we don't hold the LRU spin
lock for too long. This mirrors the behaviour of the
__shrink_dcache_sb(), and allows us to increase the batch size
without needing to worry about problems caused by long lock hold
times.
To ensure that filesystems using the per-sb shrinker callouts don't
cause problems, document that the object freeing method must
reschedule appropriately inside loops.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:44 +0000 (14:14 +1000)]
superblock: add filesystem shrinker operations
Now we have a per-superblock shrinker implementation, we can add a
filesystem specific callout to it to allow filesystem internal
caches to be shrunk by the superblock shrinker.
Rather than perpetuate the multipurpose shrinker callback API (i.e.
nr_to_scan == 0 meaning "tell me how many objects freeable in the
cache), two operations will be added. The first will return the
number of objects that are freeable, the second is the actual
shrinker call.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:43 +0000 (14:14 +1000)]
inode: remove iprune_sem
Now that we have per-sb shrinkers with a lifecycle that is a subset
of the superblock lifecycle and can reliably detect a filesystem
being unmounted, there is not longer any race condition for the
iprune_sem to protect against. Hence we can remove it.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
With context based shrinkers, we can implement a per-superblock
shrinker that shrinks the caches attached to the superblock. We
currently have global shrinkers for the inode and dentry caches that
split up into per-superblock operations via a coarse proportioning
method that does not batch very well. The global shrinkers also
have a dependency - dentries pin inodes - so we have to be very
careful about how we register the global shrinkers so that the
implicit call order is always correct.
With a per-sb shrinker callout, we can encode this dependency
directly into the per-sb shrinker, hence avoiding the need for
strictly ordering shrinker registrations. We also have no need for
any proportioning code for the shrinker subsystem already provides
this functionality across all shrinkers. Allowing the shrinker to
operate on a single superblock at a time means that we do less
superblock list traversals and locking and reclaim should batch more
effectively. This should result in less CPU overhead for reclaim and
potentially faster reclaim of items from each filesystem.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:41 +0000 (14:14 +1000)]
superblock: move pin_sb_for_writeback() to fs/super.c
The per-sb shrinker has the same requirement as the writeback
threads of ensuring that the superblock is usable and pinned for the
time it takes to run the work. Both need to take a passive reference
to the sb, take a read lock on the s_umount lock and then only
continue if an unmount is not in progress.
pin_sb_for_writeback() does this exactly, so move it to fs/super.c
and rename it to grab_super_passive() and exporting it via
fs/internal.h for all the VFS code to be able to use.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:40 +0000 (14:14 +1000)]
inode: move to per-sb LRU locks
With the inode LRUs moving to per-sb structures, there is no longer
a need for a global inode_lru_lock. The locking can be made more
fine-grained by moving to a per-sb LRU lock, isolating the LRU
operations of different filesytsems completely from each other.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:39 +0000 (14:14 +1000)]
inode: Make unused inode LRU per superblock
The inode unused list is currently a global LRU. This does not match
the other global filesystem cache - the dentry cache - which uses
per-superblock LRU lists. Hence we have related filesystem object
types using different LRU reclaimation schemes.
To enable a per-superblock filesystem cache shrinker, both of these
caches need to have per-sb unused object LRU lists. Hence this patch
converts the global inode LRU to per-sb LRUs.
The patch only does rudimentary per-sb propotioning in the shrinker
infrastructure, as this gets removed when the per-sb shrinker
callouts are introduced later on.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:38 +0000 (14:14 +1000)]
inode: convert inode_stat.nr_unused to per-cpu counters
Before we split up the inode_lru_lock, the unused inode counter
needs to be made independent of the global inode_lru_lock. Convert
it to per-cpu counters to do this.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:37 +0000 (14:14 +1000)]
vmscan: add customisable shrinker batch size
For shrinkers that have their own cond_resched* calls, having
shrink_slab break the work down into small batches is not
paticularly efficient. Add a custom batchsize field to the struct
shrinker so that shrinkers can use a larger batch size if they
desire.
A value of zero (uninitialised) means "use the default", so
behaviour is unchanged by this patch.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:36 +0000 (14:14 +1000)]
vmscan: reduce wind up shrinker->nr when shrinker can't do work
When a shrinker returns -1 to shrink_slab() to indicate it cannot do
any work given the current memory reclaim requirements, it adds the
entire total_scan count to shrinker->nr. The idea ehind this is that
whenteh shrinker is next called and can do work, it will do the work
of the previously aborted shrinker call as well.
However, if a filesystem is doing lots of allocation with GFP_NOFS
set, then we get many, many more aborts from the shrinkers than we
do successful calls. The result is that shrinker->nr winds up to
it's maximum permissible value (twice the current cache size) and
then when the next shrinker call that can do work is issued, it
has enough scan count built up to free the entire cache twice over.
This manifests itself in the cache going from full to empty in a
matter of seconds, even when only a small part of the cache is
needed to be emptied to free sufficient memory.
Under metadata intensive workloads on ext4 and XFS, I'm seeing the
VFS caches increase memory consumption up to 75% of memory (no page
cache pressure) over a period of 30-60s, and then the shrinker
empties them down to zero in the space of 2-3s. This cycle repeats
over and over again, with the shrinker completely trashing the inode
and dentry caches every minute or so the workload continues.
This behaviour was made obvious by the shrink_slab tracepoints added
earlier in the series, and made worse by the patch that corrected
the concurrent accounting of shrinker->nr.
To avoid this problem, stop repeated small increments of the total
scan value from winding shrinker->nr up to a value that can cause
the entire cache to be freed. We still need to allow it to wind up,
so use the delta as the "large scan" threshold check - if the delta
is more than a quarter of the entire cache size, then it is a large
scan and allowed to cause lots of windup because we are clearly
needing to free lots of memory.
If it isn't a large scan then limit the total scan to half the size
of the cache so that windup never increases to consume the whole
cache. Reducing the total scan limit further does not allow enough
wind-up to maintain the current levels of performance, whilst a
higher threshold does not prevent the windup from freeing the entire
cache under sustained workloads.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:35 +0000 (14:14 +1000)]
vmscan: shrinker->nr updates race and go wrong
shrink_slab() allows shrinkers to be called in parallel so the
struct shrinker can be updated concurrently. It does not provide any
exclusio for such updates, so we can get the shrinker->nr value
increasing or decreasing incorrectly.
As a result, when a shrinker repeatedly returns a value of -1 (e.g.
a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
sometimes updating with the scan count that wasn't used, sometimes
losing it altogether. Worse is when a shrinker does work and that
update is lost due to racy updates, which means the shrinker will do
the work again!
Fix this by making the total_scan calculations independent of
shrinker->nr, and making the shrinker->nr updates atomic w.r.t. to
other updates via cmpxchg loops.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Chinner [Fri, 8 Jul 2011 04:14:34 +0000 (14:14 +1000)]
vmscan: add shrink_slab tracepoints
It is impossible to understand what the shrinkers are actually doing
without instrumenting the code, so add a some tracepoints to allow
insight to be gained.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Thu, 7 Jul 2011 19:03:58 +0000 (15:03 -0400)]
Make ->d_sb assign-once and always non-NULL
New helper (non-exported, fs/internal.h-only): __d_alloc(sb, name).
Allocates dentry, sets its ->d_sb to given superblock and sets
->d_op accordingly. Old d_alloc(NULL, name) callers are converted
to that (all of them know what superblock they want). d_alloc()
itself is left only for parent != NULl case; uses __d_alloc(),
inserts result into the list of parent's children.
Note that now ->d_sb is assign-once and never NULL *and*
->d_parent is never NULL either.
Al Viro [Mon, 27 Jun 2011 20:37:12 +0000 (16:37 -0400)]
devtmpfs: get rid of bogus mkdir in create_path()
We do _NOT_ want to mkdir the path itself - we are preparing to
mknod it, after all. Normally it'll fail with -ENOENT and
just do nothing, but if somebody has created the parent in
the meanwhile, we'll get buggered...
Al Viro [Mon, 27 Jun 2011 20:25:29 +0000 (16:25 -0400)]
switch devtmpfs object creation/removal to separate kernel thread
... and give it a namespace where devtmpfs would be mounted on root,
thus avoiding abuses of vfs_path_lookup() (it was never intended to
be used with LOOKUP_PARENT). Games with credentials are also gone.
Al Viro [Sun, 26 Jun 2011 01:08:31 +0000 (21:08 -0400)]
don't pass nameidata to vfs_create() from ecryptfs_create()
Instead of playing with removal of LOOKUP_OPEN, mangling (and
restoring) nd->path, just pass NULL to vfs_create(). The whole
point of what's being done there is to suppress any attempts
to open file by underlying fs, which is what nd == NULL indicates.
Al Viro [Wed, 22 Jun 2011 22:53:18 +0000 (18:53 -0400)]
fix mknod() on nfs4 (hopefully)
a) check the right flags in ->create() (LOOKUP_OPEN, not LOOKUP_CREATE)
b) default (!LOOKUP_OPEN) open_flags is O_CREAT|O_EXCL|FMODE_READ, not 0
c) lookup_instantiate_filp() should be done only with LOOKUP_OPEN;
otherwise we need to issue CLOSE, lest we leak stateid on server.
Al Viro [Tue, 21 Jun 2011 12:51:28 +0000 (08:51 -0400)]
cifs: fix the type of cifs_demultiplex_thread()
... and get rid of a bogus typecast, while we are at it; it's not
just that we want a function returning int and not void, but cast
to pointer to function taking void * and returning void would be
(void (*)(void *)) and not (void *)(void *), TYVM...
Al Viro [Sun, 19 Jun 2011 16:49:47 +0000 (12:49 -0400)]
consolidate BINPRM_FLAGS_ENFORCE_NONDUMP handling
new helper: would_dump(bprm, file). Checks if we are allowed to
read the file and if we are not - sets ENFORCE_NODUMP. Exported,
used in places that previously open-coded the same logics.
Al Viro [Sun, 19 Jun 2011 05:50:08 +0000 (01:50 -0400)]
make exec_permission(dir) really equivalent to inode_permission(dir, MAY_EXEC)
capability overrides apply only to the default case; if fs has ->permission()
that does _not_ call generic_permission(), we have no business doing them.
Moreover, if it has ->permission() that does call generic_permission(), we
have no need to recheck capabilities.
Besides, the capability overrides should apply only if we got EACCES from
acl_permission_check(); any other value (-EIO, etc.) should be returned
to caller, capabilities or not capabilities.
Al Viro [Sat, 4 Jun 2011 00:16:57 +0000 (20:16 -0400)]
new helper: iterate_supers_type()
Call the given function for all superblocks of given type. Function
gets a superblock (with s_umount locked shared) and (void *) argument
supplied by caller of iterator.
Josef Bacik [Tue, 31 May 2011 15:58:49 +0000 (11:58 -0400)]
fs: add a DCACHE_NEED_LOOKUP flag for d_flags
Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order
for readdir, but unfortunately in the case of anything that stats the files in
order that readdir spits back (like oh say ls) that means we still have to do
the normal lookup of the file, which means looking up our other index and then
looking up the inode. What I want is a way to create dummy dentries when we
find them in readdir so that when ls or anything else subsequently does a
stat(), we already have the location information in the dentry and can go
straight to the inode itself. The lookup stuff just assumes that if it finds a
dentry it is done, it doesn't perform a lookup. So add a DCACHE_NEED_LOOKUP
flag so that the lookup code knows it still needs to run i_op->lookup() on the
parent to get the inode for the dentry. I have tested this with btrfs and I
went from something that looks like this
http://people.redhat.com/jwhiter/ls-noreada.png
To this
http://people.redhat.com/jwhiter/ls-good.png
Thats a savings of 1300 seconds, or 22 minutes. That is a significant savings.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Shaohua Li [Tue, 19 Jul 2011 15:49:26 +0000 (08:49 -0700)]
vmscan: fix a livelock in kswapd
I'm running a workload which triggers a lot of swap in a machine with 4
nodes. After I kill the workload, I found a kswapd livelock. Sometimes
kswapd3 or kswapd2 are keeping running and I can't access filesystem,
but most memory is free.
This looks like a regression since commit 08951e545918c159 ("mm: vmscan:
correct check for kswapd sleeping in sleeping_prematurely").
Node 2 and 3 have only ZONE_NORMAL, but balance_pgdat() will return 0
for classzone_idx. The reason is end_zone in balance_pgdat() is 0 by
default, if all zones have watermark ok, end_zone will keep 0.
Later sleeping_prematurely() always returns true. Because this is an
order 3 wakeup, and if classzone_idx is 0, both balanced_pages and
present_pages in pgdat_balanced() are 0. We add a special case here.
If a zone has no page, we think it's balanced. This fixes the livelock.
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A value larger than INT_MAX cannot be written to the debugfs file created
by debugfs_create_u64 or debugfs_create_x64 on 32bit machine. Because
simple_attr_write() uses simple_strtol() for the conversion.
To fix this, use simple_strtoll() instead.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't update *inode in __follow_mount_rcu() until we'd verified that
there is mountpoint there. Kudos to Hugh Dickins for catching that
one in the first place and eventually figuring out the solution (and
catching a braino in the earlier version of patch).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
pppoe: Must flush connections when MAC address changes too.
include/linux/sdla.h: remove the prototype of sdla()
tulip: dmfe: Remove old log spamming pr_debugs
Al Viro [Mon, 18 Jul 2011 17:50:40 +0000 (13:50 -0400)]
Fix cifs_get_root()
Add missing ->i_mutex, convert to lookup_one_len() instead of
(broken) open-coded analog, cope with getting something like
a//b as relative pathname. Simplify the hell out of it, while
we are there...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Jeff Layton <jlayton@redhat.com>
Joe Perches [Mon, 18 Jul 2011 17:44:44 +0000 (10:44 -0700)]
tulip: dmfe: Remove old log spamming pr_debugs
Commit 726b65ad444d ("tulip: Convert uses of KERN_DEBUG") enabled
some old previously inactive uses of pr_debug converted by
commit dde7c8ef1679 ("tulip/dmfe.c: Use dev_<level> and pr_<level>").
Remove these pr_debugs.
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
si4713-i2c: avoid potential buffer overflow on si4713
While compiling it with Fedora 15, I noticed this issue:
inlined from ‘si4713_write_econtrol_string’ at drivers/media/radio/si4713-i2c.c:1065:24:
arch/x86/include/asm/uaccess_32.h:211:26: error: call to ‘copy_from_user_overflow’ declared with attribute error: copy_from_user() buffer size is not provably correct
Al Viro [Mon, 18 Jul 2011 02:27:22 +0000 (22:27 -0400)]
hppfs_lookup(): don't open-code lookup_one_len()
... and it's getting it wrong, too - missing ->d_revalidate() calls when
it's dealing with filesystem (procfs) that has non-trivial ->d_revalidate()...