xen/blkback: Squash the checking for operation into dispatch_rw_block_io
We do a check for the operations right before calling dispatch_rw_block_io.
And then we do the same check in dispatch_rw_block_io. This patch
squashes those checks into the 'dispatch_rw_block_io' function.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen-blkfront: Provide for 'feature-flush-cache' the BLKIF_OP_WRITE_FLUSH_CACHE operation.
The operation BLKIF_OP_WRITE_FLUSH_CACHE has existed in the Xen
tree header file for years but it was never present in the Linux tree
because the frontend (nor the backend) supported this interface.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Revert "xen/blkback: Move the plugging/unplugging to a higher level."
This reverts commit 97961ef46b9b5a6a7c918a38b898a7b3e49869f4 b/c
we lose about 15% performance if we do the unplugging and the
end of the reading the ring buffer.
".. that difference is that sync vs async requests. In the case of
a kernel thread submitting IO, [..] all the WRITES might be being
considered as async and will go in a different queue. If you mix those
with some READS, they are always sync and will go in differnet queue.
In presence of sync queue, CFQ will idle and choke up WRITES in
an attempt to improve latencies of READs.
In case of AIO [note: this is what QEMU qdisk is doing] , [..]
it is direct IO and both READS and WRITES will be considered SYNC
and will go in a single queue and no choking of WRITES will take place."
The solution is quite simple, tack on REQ_SYNC (which is
what the WRITE_ODIRECT macro points to) and the numbers go
back up.
Suggested-by: Vivek Goyal <vgoyal@redhat.com Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen/blkback: Move the plugging/unplugging to a higher level.
We used to the plug/unplug on the submit_bio. But that means
if within a stream of WRITE, WRITE, WRITE,...,WRITE we have
one READ, it could stall the pipeline (as the 'submio_bio'
could trigger the unplug_fnc to be called and stall/sync
when doing the READ). Instead we want to move the unplugging
when the whole (or as a much as possible) ring buffer has been
processed. This also eliminates us doing plug/unplug for
each request.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
They were used to check if the queue does not have QUEUE_FLAG_DEAD
set. That is not necessary anymore as the 'submit_io' call
ends up doing that for us.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
After the commit 0faa8cca883bbc6a0919e3c89128672659b75820
(" xen/blkback: remove per-queue plugging") we forgot
to retrieve the 'struct request_queue' from the block device.
This puts the functionality back in and fixes a NULL pointer
bug.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen/blkback: Change fast_flush_area to xen_blkbk_unmap, and tweak xen_blk_map_seg.
The previous name ('fast_flush_area') had nothing to do with what it
does right now. Changing the names so that the code dealing with
mapping pages in and out of the guest is called xen_blkbk_[map|unmap].
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen/blkback: Shuffle code around (vbd_translate moved higher).
We take out the chunk of code dealing with mapping to the guest
of pages into the xen_blk_map_buf code. And we also move the
vbd_translate to be done much earlier.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen/blkback: Seperate the bio allocation and the bio submission.
We seperate the bio allocation (bio_alloc) from the bio submission so
that the error paths are much easier, and also so that the bio
submission can be done in one tight loop. It also makes the
plug/unplug calls much much easier.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
commit 7eaceaccab5f40bbfda044629a6298616aeaed50 ("block: remove per-queue plugging")
added two new interfaces to plug and unplug: blk_start_plug
and blk_finish_plug. Lets use those.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen/blkback: Use kzalloc's, and GFP_KERNEL for data structures.
The patch titled:"xen/blkback: Use 'vzalloc' for page arrays and pre-allocate pages."
allocates the structures and its member variables using the 'vzalloc'.
Daniel Stodden pointed out that vzalloc is good when we use
big number of pages - while these are at the max two pages.
We can do this using kzalloc. Also the GFP_HIGHMEM does not
work properly with Xen, so take that out.
We will have to revisit this when a "get_empty_pages_and_pagevec"
type API shows up to leverage that.
xen/blkback: Use 'vzalloc' for page arrays and pre-allocate pages.
Previously we would allocate the array for page using 'kmalloc' which
we can as easily do with 'vzalloc'. The pre-allocation of pages
was done a bit differently in the past - it used to be that
the balloon driver would export "alloc_empty_pages_and_pagevec"
which would have in one function created an array, allocated
the pages, balloned the pages out (so the memory behind those
pages would be non-present), and provide us those pages.
This was OK as those pages were shared between other guest and
the only thing we needed was to "swizzel" the MFN of those pages
to point to the other guest MFN. We can still "swizzel" the MFNs
using the M2P (and P2M) override API calls, but for the sake of
simplicity we are dropping the balloon API calls. We can return
to those later on.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen/blkback: Union the blkif_request request specific fields
Following in the steps of patch:
"xen: Union the blkif_request request specific fields" this patch
changes the blkback. Per the original patch:
"Prepare for extending the block device ring to allow request
specific fields, by moving the request specific fields for
reads, writes and barrier requests to a union member."
Cc: Owen Smith <owen.smith@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen/blkback: Move global/static variables into struct xen_blkbk.
Bundle the lot of discrete variables into a single structure.
This is based on what was done in the xen-netback driver:
xen: netback: Move global/static variables into struct xen_netbk.
(094944631cc5a9d6e623302c987f78117c0bf7ac)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
"There are quite a number of places where e.g. page->va->page
translations happen.
Besides yielding smaller code (source and binary), a second goal is to
make it easier to determine where virtual addresses of pages allocated
through alloc_empty_pages_and_pagevec() are really used (in turn in
order to determine whether using highmem pages would be possible
there)."
The second goal is not the purpose of this patch - it is just to
make it easier to read the code.
Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Daniel Stodden <daniel.stodden@citrix.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Keir Fraser [Thu, 25 Nov 2010 06:08:20 +0000 (22:08 -0800)]
blkback: Fix CVE-2010-3699
A guest can cause the backend driver to leak a kernel thread. Such
leaked threads hold references to the device, whichmakes the device
impossible to tear down. If shut down, the guest remains a zombie
domain, the xenwatch process hangs, and most xm commands will stop
working.
This patch tries to do the following for blkback:
- identify/extract idempotent teardown operations,
- add/move the invocation of said teardown operation
right before we're about to allocate new resources in the
Connected states.
K. Y. Srinivasan [Thu, 11 Mar 2010 21:39:50 +0000 (13:39 -0800)]
xen/blkback: Propagate changed size of VBDs
Support dynamic resizing of virtual block devices. This patch supports
both file backed block devices as well as physical devices that can be
dynamically resized on the host side.
Signed-off-by: K. Y. Srinivasan <ksrinivasan@novell.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Ian Campbell [Thu, 3 Dec 2009 21:56:18 +0000 (21:56 +0000)]
xen: rename blkbk module xen-blkback.
blkbk is rather generic for a modular distro style kernel.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Fix compile warnings: ignoring return value of 'xenbus_register_backend' ..
We neglect to check the return value of xenbus_register_backend
and take actions when that fails. This patch fixes that and adds
code to deal with those type of failures.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: use proper interfaces for on-stack plugging
xfs: fix xfs_debug warnings
xfs: fix variable set but not used warnings
xfs: convert log tail checking to a warning
xfs: catch bad block numbers freeing extents.
xfs: push the AIL from memory reclaim and periodic sync
xfs: clean up code layout in xfs_trans_ail.c
xfs: convert the xfsaild threads to a workqueue
xfs: introduce background inode reclaim work
xfs: convert ENOSPC inode flushing to use new syncd workqueue
xfs: introduce a xfssyncd workqueue
xfs: fix extent format buffer allocation size
xfs: fix unreferenced var error in xfs_buf.c
Also, applied patch from Tony Luck that fixes ia64:
xfs_destroy_workqueues() should not be tagged with__exit
in the branch before merging.
Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix data corruption regression by reverting commit 6de9843dab3f
ext4: Allow indirect-block file to grow the file size to max file size
ext4: allow an active handle to be started when freezing
ext4: sync the directory inode in ext4_sync_parent()
ext4: init timer earlier to avoid a kernel panic in __save_error_info
jbd2: fix potential memory leak on transaction commit
ext4: fix a double free in ext4_register_li_request
ext4: fix credits computing for indirect mapped files
ext4: remove unnecessary [cm]time update of quota file
jbd2: move bdget out of critical section
Merge branch 'spi/merge' of git://git.secretlab.ca/git/linux-2.6
* 'spi/merge' of git://git.secretlab.ca/git/linux-2.6:
dt/fsldma: fix build warning caused by of_platform_device changes
spi: Fix race condition in stop_queue()
gpio/pch_gpio: Fix output value of pch_gpio_direction_output()
gpio/ml_ioh_gpio: Fix output value of ioh_gpio_direction_output()
gpio/pca953x: fix error handling path in probe() call
In commit 13583b16592a ("PCI: refactor io size calculation code") Ram
had a thinko in the refactorization of the code: the end result used the
variable 'align' for the bus alignment, but the original code used
'min_align'.
Since then, another use of that 'align' variable got introduced by
commit c8adf9a3e873 ("PCI: pre-allocate additional resources to devices
only after successful allocation of essential resources.")
Fix both of those uses to use 'min_align' as they should.
Daniel Hellstrom <daniel@gaisler.com> Acked-by: Ram Pai <linuxram@us.ibm.com> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (34 commits)
net: Add support for SMSC LAN9530, LAN9730 and LAN89530
mlx4_en: Restoring RX buffer pointer in case of failure
mlx4: Sensing link type at device initialization
ipv4: Fix "Set rt->rt_iif more sanely on output routes."
MAINTAINERS: add entry for Xen network backend
be2net: Fix suspend/resume operation
be2net: Rename some struct members for clarity
pppoe: drop PPPOX_ZOMBIEs in pppoe_flush_dev
dsa/mv88e6131: add support for mv88e6085 switch
ipv6: Enable RFS sk_rxhash tracking for ipv6 sockets (v2)
be2net: Fix a potential crash during shutdown.
bna: Fix for handling firmware heartbeat failure
can: mcp251x: Allow pass IRQ flags through platform data.
smsc911x: fix mac_lock acquision before calling smsc911x_mac_read
iwlwifi: accept EEPROM version 0x423 for iwl6000
rt2x00: fix cancelling uninitialized work
rtlwifi: Fix some warnings/bugs
p54usb: IDs for two new devices
wl12xx: fix potential buffer overflow in testmode nvs push
zd1211rw: reset rx idle timer from tasklet
...
Ira W. Snyder [Thu, 7 Apr 2011 17:33:03 +0000 (10:33 -0700)]
dt/fsldma: fix build warning caused by of_platform_device changes
Commit 000061245a6797d542854106463b6b20fbdcb12e, "dt/powerpc:
Eliminate users of of_platform_{,un}register_driver" forgot to convert
the type of structure passed into platform_device_register() when it
was converted from of_platform_device_register. Fix it.
Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
ext4: fix data corruption regression by reverting commit 6de9843dab3f
Revert commit 6de9843dab3f2a1d4d66d80aa9e5782f80977d20, since it
caused a data corruption regression with BitTorrent downloads. Thanks
to Damien for discovering and bisecting to find the problem commit.
ext4: Allow indirect-block file to grow the file size to max file size
We can create 4402345721856 byte file with indirect block mapping.
However, if we grow an indirect-block file to the size with ftruncate(),
we can see an ext4 warning. The following patch fixes this problem.
How to reproduce:
# dd if=/dev/zero of=/mnt/mp1/hoge bs=1 count=0 seek=4402345721856
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000221428 s, 0.0 kB/s
# tail -n 1 /var/log/messages
Nov 25 15:10:27 test kernel: EXT4-fs warning (device sda8): ext4_block_to_path:345: block 1074791436 > max in inode 12
Yongqiang Yang [Mon, 11 Apr 2011 02:06:07 +0000 (22:06 -0400)]
ext4: allow an active handle to be started when freezing
ext4_journal_start_sb() should not prevent an active handle from being
started due to s_frozen. Otherwise, deadlock is easy to happen, below
is a situation.
================================================
freeze | truncate
================================================
| ext4_ext_truncate()
freeze_super() | starts a handle
sets s_frozen |
| ext4_ext_truncate()
| holds i_data_sem
ext4_freeze() |
waits for updates |
| ext4_free_blocks()
| calls dquot_free_block()
|
| dquot_free_blocks()
| calls ext4_dirty_inode()
|
| ext4_dirty_inode()
| trys to start an active
| handle
|
| block due to s_frozen
================================================
Signed-off-by: Yongqiang Yang <xiaoqiangnk@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reported-by: Amir Goldstein <amir73il@users.sf.net> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
ext4: sync the directory inode in ext4_sync_parent()
ext4 has taken the stance that, in the absence of a journal,
when an fsync/fdatasync of an inode is done, the parent
directory should be sync'ed if this inode entry is new.
ext4_sync_parent(), which implements this, does indeed sync
the dirent pages for parent directories, but it does not
sync the directory *inode*. This patch fixes this.
Also now return error status from ext4_sync_parent().
I tested this using a power fail test, which panics a
machine running a file server getting requests from a
client. Without this patch, on about every other test run,
the server is missing many, many files that had been synced.
With this patch, on > 6 runs, I see zero files being lost.
net: Add support for SMSC LAN9530, LAN9730 and LAN89530
This patch adds support for SMSC's LAN9530, LAN9730 and LAN89530 USB
ethernet controllers to the existing smsc95xx driver by adding
their new USB VID/PID pairs.
Signed-off-by: Steve Glendinning <steve.glendinning@smsc.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound-2.6:
ALSA: hda - Don't query connections for widgets have no connections
ALSA: HDA: Fix single internal mic on ALC275 (Sony Vaio VPCSB1C5E)
ALSA: hda - HDMI: Fix MCP7x audio infoframe checksums
ALSA: usb-audio: define another USB ID for a buggy USB MIDI cable
ALSA: HDA: Fix dock mic for Lenovo X220-tablet
ASoC: format_register_str: Don't clip register values
ASoC: PXA: Fix oops in __pxa2xx_pcm_prepare
ASoC: zylonite: set .codec_dai_name in initializer
J. Bruce Fields [Mon, 28 Mar 2011 07:15:09 +0000 (15:15 +0800)]
nfsd4: fix oops on lock failure
Lock stateid's can have access_bmap 0 if they were only partially
initialized (due to a failed lock request); handle that case in
free_generic_stateid.
Reported-by: Mi Jinlong <mijinlong@cn.fujitsu.com> Tested-by: Mi Jinlong <mijinlong@cn.fujitsu.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* git://git.infradead.org/mtd-2.6:
mtd: atmel_nand: use CPU I/O when buffer is in vmalloc(ed) region
mtd: atmel_nand: modify test case for using DMA operations
mtd: atmel_nand: fix support for CPUs that do not support DMA access
mtd: atmel_nand: trivial: change DMA usage information trace
mtd: mtdswap: fix printk format warning
Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Change initial mount authflavor only when server returns NFS4ERR_WRONGSEC
NFS: Fix a signed vs. unsigned secinfo bug
Revert "net/sunrpc: Use static const char arrays"
Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
quota: Don't write quota info in dquot_commit()
ext3: Fix writepage credits computation for ordered mode
Add proper blk_start_plug/blk_finish_plug pairs for the two places where
we issue buffer I/O, and remove the blk_flush_plug in xfs_buf_lock and
xfs_buf_iowait, given that context switches already flush the per-process
plugging lists.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
For a CONFIG_XFS_DEBUG=n build gcc complains about statements with no
effect in xfs_debug:
fs/xfs/quota/xfs_qm_syscalls.c: In function 'xfs_qm_scall_trunc_qfiles':
fs/xfs/quota/xfs_qm_syscalls.c:291:3: warning: statement with no effect
The reason for that is that the various new xfs message functions have a
return value which is never used, and in case of the non-debug build
xfs_debug the macro evaluates to a plain 0 which produces the above
warnings. This can be fixed by turning xfs_debug into an inline function
instead of a macro, but in addition to that I've also changed all the
message helpers to return void as we never use their return values.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Alex Elder <aelder@sgi.com>
When bringing the port up, performing a SENSE_PORT command
To try and check to which physical link type (IB or Ethernet) the physical
port is connected.
In case there is no valid link partner, the port will come up as its
supported default.
Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il> Signed-off-by: David S. Miller <davem@davemloft.net>
Dave Chinner [Fri, 8 Apr 2011 02:45:07 +0000 (12:45 +1000)]
xfs: convert log tail checking to a warning
On the Power platform, the log tail debug checks fire excessively
causing the system to panic early in testing. The debug checks are
known to be racy, though on x86_64 there is no evidence that they
trigger at all.
We want to keep the checks active on debug systems to alert us to
problems with log space accounting, but we need to reduce the impact
of a racy check on testing on the Power platform.
As a result, convert the ASSERT conditions to warnings, and
allow them to fire only once per filesystem mount. This will prevent
false positives from interfering with testing, whilst still
providing us with the indication that they may be a problem with log
space accounting should that occur.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
Dave Chinner [Fri, 8 Apr 2011 02:45:07 +0000 (12:45 +1000)]
xfs: catch bad block numbers freeing extents.
A fuzzed filesystem crashed a kernel when freeing an extent with a
block number beyond the end of the filesystem. Convert all the debug
asserts in xfs_free_extent() to active checks so that we catch bad
extents and return that the filesytsem is corrupted rather than
crashing.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
Dave Chinner [Fri, 8 Apr 2011 02:45:07 +0000 (12:45 +1000)]
xfs: push the AIL from memory reclaim and periodic sync
When we are short on memory, we want to expedite the cleaning of
dirty objects. Hence when we run short on memory, we need to kick
the AIL flushing into action to clean as many dirty objects as
quickly as possible. To implement this, sample the lsn of the log
item at the head of the AIL and use that as the push target for the
AIL flush.
Further, we keep items in the AIL that are dirty that are not
tracked any other way, so we can get objects sitting in the AIL that
don't get written back until the AIL is pushed. Hence to get the
filesystem to the idle state, we might need to push the AIL to flush
out any remaining dirty objects sitting in the AIL. This requires
the same push mechanism as the reclaim push.
This patch also renames xfs_trans_ail_tail() to xfs_ail_min_lsn() to
match the new xfs_ail_max_lsn() function introduced in this patch.
Similarly for xfs_trans_ail_push -> xfs_ail_push.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>
Dave Chinner [Fri, 8 Apr 2011 02:45:07 +0000 (12:45 +1000)]
xfs: clean up code layout in xfs_trans_ail.c
This patch rearranges the location of functions in xfs_trans_ail.c
to remove the need for forward declarations of those functions in
preparation for adding new functions without the need for forward
declarations.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>
Dave Chinner [Fri, 8 Apr 2011 02:45:07 +0000 (12:45 +1000)]
xfs: convert the xfsaild threads to a workqueue
Similar to the xfssyncd, the per-filesystem xfsaild threads can be
converted to a global workqueue and run periodically by delayed
works. This makes sense for the AIL pushing because it uses
variable timeouts depending on the work that needs to be done.
By removing the xfsaild, we simplify the AIL pushing code and
remove the need to spread the code to implement the threading
and pushing across multiple files.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
Dave Chinner [Fri, 8 Apr 2011 02:45:07 +0000 (12:45 +1000)]
xfs: introduce background inode reclaim work
Background inode reclaim needs to run more frequently that the XFS
syncd work is run as 30s is too long between optimal reclaim runs.
Add a new periodic work item to the xfs syncd workqueue to run a
fast, non-blocking inode reclaim scan.
Background inode reclaim is kicked by the act of marking inodes for
reclaim. When an AG is first marked as having reclaimable inodes,
the background reclaim work is kicked. It will continue to run
periodically untill it detects that there are no more reclaimable
inodes. It will be kicked again when the first inode is queued for
reclaim.
To ensure shrinker based inode reclaim throttles to the inode
cleaning and reclaim rate but still reclaim inodes efficiently, make it kick the
background inode reclaim so that when we are low on memory we are
trying to reclaim inodes as efficiently as possible. This kick shoul
d not be necessary, but it will protect against failures to kick the
background reclaim when inodes are first dirtied.
To provide the rate throttling, make the shrinker pass do
synchronous inode reclaim so that it blocks on inodes under IO. This
means that the shrinker will reclaim inodes rather than just
skipping over them, but it does not adversely affect the rate of
reclaim because most dirty inodes are already under IO due to the
background reclaim work the shrinker kicked.
These two modifications solve one of the two OOM killer invocations
Chris Mason reported recently when running a stress testing script.
The particular workload trigger for the OOM killer invocation is
where there are more threads than CPUs all unlinking files in an
extremely memory constrained environment. Unlike other solutions,
this one does not have a performance impact on performance when
memory is not constrained or the number of concurrent threads
operating is <= to the number of CPUs.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>