Darren Hart [Fri, 10 Apr 2009 16:50:05 +0000 (09:50 -0700)]
futex: fix futex_wait_setup key handling
If the get_futex_key() call were to fail, the existing code would
try and put_futex_key() prior to returning. This patch makes sure
we only put_futex_key() if get_futex_key() succeeded.
Reported-by: Clark Williams <williams@redhat.com>
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
LKML-Reference: <20090410165005.14342.16973.stgit@Aeon>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
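The error-path rule described in this fix (only release a key that was actually acquired) can be sketched in plain C. This is a minimal userspace illustration, not the kernel code: struct mock_key, mock_get_key(), mock_put_key() and mock_wait_setup() are hypothetical stand-ins that merely count references.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a futex key; it only counts references. */
struct mock_key {
	int refs;
};

/* Take a reference, or fail without taking one when `fail` is set. */
static int mock_get_key(struct mock_key *key, int fail)
{
	if (fail)
		return -EFAULT;		/* no reference was taken */
	key->refs++;
	return 0;
}

/* Drop a reference; dropping one we never took is a bug. */
static void mock_put_key(struct mock_key *key)
{
	assert(key->refs > 0);
	key->refs--;
}

/* Setup path: put the key only on paths where getting it succeeded. */
static int mock_wait_setup(struct mock_key *key, int get_fails,
			   int value_changed)
{
	int ret = mock_get_key(key, get_fails);

	if (ret != 0)
		return ret;		/* nothing to put */

	if (value_changed) {
		mock_put_key(key);	/* we hold a reference, so drop it */
		return -EWOULDBLOCK;
	}
	return 0;			/* caller now owns the reference */
}
```

In this sketch a failed mock_get_key() returns immediately, so the reference count never goes negative.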
Today's linux-next build (sparc64 defconfig) failed like this:
arch/sparc/kernel/built-in.o: In function `trap_init':
(.init.text+0x4): undefined reference to `thread_info_offsets_are_bolixed_dave'
Caused by commit 52400ba946759af28442dee6265c5c0180ac7122 ("futex: add
requeue_pi functionality") (from the tip-core tree) which changed the
size of struct restart_block.
Shift TI_KUNA_REGS and TI_KUNA_INSN up by 8 bytes to make space for the
larger restart block.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: "David S. Miller" <davem@davemloft.net>
Cc: Darren Hart <dvhltc@us.ibm.com>
LKML-Reference: <20090409151722.c8eabb56.sfr@canb.auug.org.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Darren Hart [Wed, 8 Apr 2009 06:23:50 +0000 (23:23 -0700)]
futex: fixup unlocked requeue pi case
Thomas's testing caught a problem when the requeue target futex is
unowned and multiple tasks are requeued to it. This patch ensures
the FUTEX_WAITERS bit gets set if futex_requeue() will requeue one
or more tasks in addition to the one acquiring the lock.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
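As a rough illustration of the rule above, the futex word written for the task that takes an unowned target could be built as follows. This is a userspace sketch under assumptions: mock_new_owner_word() and its parameters are hypothetical; only the FUTEX_WAITERS bit value matches <linux/futex.h>.

```c
#include <assert.h>
#include <stdint.h>

#define FUTEX_WAITERS 0x80000000u	/* same value as in <linux/futex.h> */

/*
 * Build the futex word for the task that acquires an unowned target.
 * `others_requeued` is the number of additional tasks that will be
 * requeued behind the new owner (a hypothetical parameter).
 */
static uint32_t mock_new_owner_word(uint32_t tid, int others_requeued)
{
	uint32_t word;

	assert((tid & FUTEX_WAITERS) == 0);	/* TID must not use the flag bit */
	word = tid;
	if (others_requeued > 0)
		word |= FUTEX_WAITERS;		/* tell the owner to wake via the kernel */
	return word;
}
```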
Darren Hart [Fri, 3 Apr 2009 20:40:49 +0000 (13:40 -0700)]
futex: add requeue_pi functionality
PI Futexes and their underlying rt_mutex cannot be left ownerless if
there are pending waiters as this will break the PI boosting logic, so
the standard requeue commands aren't sufficient. The new commands
properly manage pi futex ownership by ensuring a futex with waiters
has an owner at all times. This will allow glibc to properly handle
pi mutexes with pthread_condvars.
The approach taken here is to create two new futex op codes:
FUTEX_WAIT_REQUEUE_PI:
Tasks will use this op code to wait on a futex (such as a non-pi waitqueue)
and wake after they have been requeued to a pi futex. Prior to returning to
userspace, they will acquire this pi futex (and the underlying rt_mutex).
futex_wait_requeue_pi() is the result of a high speed collision between
futex_wait() and futex_lock_pi() (with the first part of futex_lock_pi() being
done by futex_proxy_trylock_atomic() on behalf of the top_waiter).
FUTEX_REQUEUE_PI (and FUTEX_CMP_REQUEUE_PI):
This call must be used to wake tasks waiting with FUTEX_WAIT_REQUEUE_PI,
regardless of how many tasks the caller intends to wake or requeue.
pthread_cond_broadcast() should call this with nr_wake=1 and
nr_requeue=INT_MAX. pthread_cond_signal() should call this with nr_wake=1 and
nr_requeue=0. The reason being we need both callers to get the benefit of the
futex_proxy_trylock_atomic() routine. futex_requeue() also enqueues the
top_waiter on the rt_mutex via rt_mutex_start_proxy_lock().
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
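The nr_wake/nr_requeue values described above can be written down as a small helper. This is only a sketch of the argument selection: enum cond_wake, struct requeue_args and requeue_pi_args() are hypothetical names, and no futex syscall is made here.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical names for the two condvar wake-up flavours. */
enum cond_wake { COND_SIGNAL, COND_BROADCAST };

struct requeue_args {
	int nr_wake;	/* tasks to wake; 1 in both cases per the message */
	int nr_requeue;	/* tasks to move over to the pi futex */
};

/* Pick the counts a condvar implementation would pass for each flavour. */
static struct requeue_args requeue_pi_args(enum cond_wake kind)
{
	struct requeue_args args = { 1, 0 };

	assert(kind == COND_SIGNAL || kind == COND_BROADCAST);
	if (kind == COND_BROADCAST)
		args.nr_requeue = INT_MAX;	/* requeue everyone else */
	return args;
}
```

Per the futex(2) manual page, FUTEX_CMP_REQUEUE_PI takes nr_wake in val and nr_requeue in the slot otherwise used for the timeout pointer, with uaddr2 naming the pi futex and val3 the expected value of the wait futex.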
Darren Hart [Fri, 3 Apr 2009 20:40:31 +0000 (13:40 -0700)]
futex: distangle futex_requeue()
futex_requeue() is getting a bit long-winded, and will be getting more
so after the requeue_pi patch. Factor out the actual requeueing into a
nicely contained inline function to reduce function length and improve
legibility.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Darren Hart [Fri, 3 Apr 2009 20:40:22 +0000 (13:40 -0700)]
futex: add FUTEX_HAS_TIMEOUT flag to restart.futex.flags
Currently restart is only used if there is a timeout. The requeue_pi
functionality requires restarting to futex_lock_pi() on signal after
wakeup in futex_wait_requeue_pi() regardless of if there was a timeout
or not. Using 0 for the timeout value is confusing as that could
indicate an expired timer. The flag makes this explicit. While the
check is not technically needed in futex_wait_restart(), doing so
makes the code consistent and will avoid confusion should the
need arise to restart wait without a timeout.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
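The ambiguity the flag removes can be shown with a small mock. This is a userspace sketch, not the kernel structure: struct mock_restart, MOCK_HAS_TIMEOUT (including its bit value) and mock_restart_timeout() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_HAS_TIMEOUT 0x04u	/* hypothetical flag bit */

struct mock_restart {
	uint32_t flags;
	uint64_t time;	/* expiry; only meaningful with MOCK_HAS_TIMEOUT */
};

/* Return the saved timeout, or NULL when the wait has no timeout at all. */
static const uint64_t *mock_restart_timeout(const struct mock_restart *r)
{
	assert(r != NULL);
	return (r->flags & MOCK_HAS_TIMEOUT) ? &r->time : NULL;
}
```

With the flag, a saved time of 0 and "no timeout" are distinct states: the first yields a pointer to 0, the second yields NULL.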
Darren Hart [Fri, 3 Apr 2009 20:40:12 +0000 (13:40 -0700)]
rt_mutex: add proxy lock routines
This patch is a prerequisite for futex requeue_pi. It basically splits
rt_mutex_slowlock() right down the middle, just before the first call
to schedule(). It further adds helper functions which make use of the
split and provide the rt-mutex preliminaries for futex requeue_pi.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Darren Hart [Fri, 3 Apr 2009 20:39:52 +0000 (13:39 -0700)]
futex: split out atomic logic from futex_lock_pi()
Refactor the atomic portion of futex_lock_pi() into futex_lock_pi_atomic().
This logic will be needed by requeue_pi, so modularize it to reduce
code duplication. The only significant change is passing of the task
to try and take the lock for. This simplifies the -EDEADLK test as if
the lock is owned by task t, it's a deadlock, regardless of if we are
doing requeue pi or not. This patch updates the corresponding comment
accordingly.
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
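The simplified -EDEADLK test can be sketched as a comparison between the owner TID stored in the futex word and the TID of the task the lock is being taken for. This is a userspace illustration: mock_deadlock_check() is a hypothetical name; only the FUTEX_TID_MASK value matches <linux/futex.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define FUTEX_TID_MASK 0x3fffffffu	/* same value as in <linux/futex.h> */

/*
 * Return -EDEADLK if the futex word says `tid` already owns the lock,
 * 0 otherwise. `tid` is the task the lock is being acquired for, which
 * need not be the caller in the requeue pi case.
 */
static int mock_deadlock_check(uint32_t futex_word, uint32_t tid)
{
	assert(tid != 0 && (tid & ~FUTEX_TID_MASK) == 0);
	if ((futex_word & FUTEX_TID_MASK) == tid)
		return -EDEADLK;
	return 0;
}
```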
Darren Hart [Fri, 3 Apr 2009 20:39:33 +0000 (13:39 -0700)]
futex: separate futex_wait_queue_me() logic from futex_wait()
Refactor futex_wait() in preparation for futex_wait_requeue_pi(). In
order to reuse a good chunk of the futex_wait() code for the upcoming
futex_wait_requeue_pi() function, this patch breaks out the
queue-to-wakeup section of futex_wait() into futex_wait_queue_me().
Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Merge branch 'stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
symbols, stacktrace: look up init symbols after module symbols
Merge branch 'rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
rcu: rcu_barrier VS cpu_hotplug: Ensure callbacks in dead cpu are migrated to online cpu
Merge branch 'ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'ipi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
s390: remove arch specific smp_send_stop()
panic: clean up kernel/panic.c
panic, smp: provide smp_send_stop() wrapper on UP too
panic: decrease oops_in_progress only after having done the panic
generic-ipi: eliminate WARN_ON()s during oops/panic
generic-ipi: cleanups
generic-ipi: remove CSD_FLAG_WAIT
generic-ipi: remove kmalloc()
generic IPI: simplify barriers and locking
Merge branch 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
locking: rename trace_softirq_[enter|exit] => lockdep_softirq_[enter|exit]
lockdep: remove duplicate CONFIG_DEBUG_LOCKDEP definitions
lockdep: require framepointers for x86
lockdep: remove extra "irq" string
lockdep: fix incorrect state name
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid:
HID: remove compat stuff
HID: constify arrays of struct apple_key_translation
HID: add support for Kye/Genius Ergo 525V
HID: Support Apple mini aluminum keyboard
HID: support for Kensington slimblade device
HID: DragonRise game controller force feedback driver
HID: add support for another version of 0e8f:0003 device in hid-pl
HID: fix race between usb_register_dev() and hiddev_open()
HID: bring back possibility to specify vid/pid ignore on module load
HID: make HID_DEBUG defaults consistent
HID: autosuspend -- fix lockup of hid on reset
HID: hid_reset_resume() needs to be defined only when CONFIG_PM is set
HID: fix USB HID devices after STD with autosuspend
HID: do not try to compile PM code with CONFIG_PM unset
HID: autosuspend support for USB HID
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits)
trivial: Update my email address
trivial: NULL noise: drivers/mtd/tests/mtd_*test.c
trivial: NULL noise: drivers/media/dvb/frontends/drx397xD_fw.h
trivial: Fix misspelling of "Celsius".
trivial: remove unused variable 'path' in alloc_file()
trivial: fix a pdlfush -> pdflush typo in comment
trivial: jbd header comment typo fix for JBD_PARANOID_IOFAIL
trivial: wusb: Storage class should be before const qualifier
trivial: drivers/char/bsr.c: Storage class should be before const qualifier
trivial: h8300: Storage class should be before const qualifier
trivial: fix where cgroup documentation is not correctly referred to
trivial: Give the right path in Documentation example
trivial: MTD: remove EOL from MODULE_DESCRIPTION
trivial: Fix typo in bio_split()'s documentation
trivial: PWM: fix of #endif comment
trivial: fix typos/grammar errors in Kconfig texts
trivial: Fix misspelling of firmware
trivial: cgroups: documentation typo and spelling corrections
trivial: Update contact info for Jochen Hein
trivial: fix typo "resgister" -> "register"
...
* git://git.kernel.org/pub/scm/linux/kernel/git/czankel/xtensa-2.6: (21 commits)
xtensa: we don't need to include asm/io.h
xtensa: only build platform or variant if they contain a Makefile
xtensa: make startup code discardable
xtensa: ccount clocksource
xtensa: remove platform rtc hooks
xtensa: use generic sched_clock()
xtensa: platform: s6105
xtensa: let platform override KERNELOFFSET
xtensa: s6000 variant
xtensa: s6000 variant core definitions
xtensa: variant irq set callbacks
xtensa: variant-specific code
xtensa: nommu support
xtensa: add flat support
xtensa: enforce slab alignment to maximum register width
xtensa: cope with ram beginning at higher addresses
xtensa: don't make bootmem bitmap larger than required
xtensa: fix init_bootmem_node() argument order
xtensa: use correct stack pointer for stack traces
xtensa: beat Kconfig into shape
...
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: BUG to BUG_ON changes
Btrfs: remove dead code
Btrfs: remove dead code
Btrfs: fix typos in comments
Btrfs: remove unused ftrace include
Btrfs: fix __ucmpdi2 compile bug on 32 bit builds
Btrfs: free inode struct when btrfs_new_inode fails
Btrfs: fix race in worker_loop
Btrfs: add flushoncommit mount option
Btrfs: notreelog mount option
Btrfs: introduce btrfs_show_options
Btrfs: rework allocation clustering
Btrfs: Optimize locking in btrfs_next_leaf()
Btrfs: break up btrfs_search_slot into smaller pieces
Btrfs: kill the pinned_mutex
Btrfs: kill the block group alloc mutex
Btrfs: clean up find_free_extent
Btrfs: free space cache cleanups
Btrfs: unplug in the async bio submission threads
Btrfs: keep processing bios for a given bdev if our proc is batching
x86, PAT: Remove duplicate memtype reserve in pci mmap
pci mmap code has been doing memtype reserve for a while now. Recently we
added memtype tracking in remap_pfn_range, and pci code indirectly calls
remap_pfn_range. So, we don't need separate tracking in pci code
anymore, which means a patch that removes ~50 lines of code :-).
Also, recently we found out that the pci tracking is not working as we expect
it to work in some cases. Specifically, userlevel X mmap of pci, with some
recent version of X, is having a problem with vm_page_prot getting reset.
The pci tracking uses vm_page_prot to pass on the protection type from parent
to child during fork.
a) Parent does a pci mmap
b) We look at PAT and get either UC_MINUS or WC mapping for parent
c) Store that mapping type in vma vm_page_prot for future use
d) This thread does a fork
e) Fork results in mmap_ops ->open for the child process
f) We get the vm_page_prot from vma and reserve that type for the child process
But, between c) and e) above, the vma vm_page_prot is getting reset to zero.
This results in PAT reserve failing at the time of fork as in here.
http://marc.info/?l=linux-kernel&m=123858163103240&w=2
This cleanup makes the above problem go away as we do not depend on
vm_page_prot in our PAT code anymore.
Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (32 commits)
ocfs2: recover orphans in offline slots during recovery and mount
ocfs2: Pagecache usage optimization on ocfs2
ocfs2: fix rare stale inode errors when exporting via nfs
ocfs2/dlm: Tweak mle_state output
ocfs2/dlm: Do not purge lockres that is being migrated dlm_purge_lockres()
ocfs2/dlm: Remove struct dlm_lock_name in struct dlm_master_list_entry
ocfs2/dlm: Show the number of lockres/mles in dlm_state
ocfs2/dlm: dlm_set_lockres_owner() and dlm_change_lockres_owner() inlined
ocfs2/dlm: Improve lockres counts
ocfs2/dlm: Track number of mles
ocfs2/dlm: Indent dlm_cleanup_master_list()
ocfs2/dlm: Activate dlm->master_hash for master list entries
ocfs2/dlm: Create and destroy the dlm->master_hash
ocfs2/dlm: Refactor dlm_clean_master_list()
ocfs2/dlm: Clean up struct dlm_lock_name
ocfs2/dlm: Encapsulate adding and removing of mle from dlm->master_list
ocfs2: Optimize inode group allocation by recording last used group.
ocfs2: Allocate inode groups from global_bitmap.
ocfs2: Optimize inode allocation by remembering last group
ocfs2: fix leaf start calculation in ocfs2_dx_dir_rebalance()
...
Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/async_tx:
dma: Add SoF and EoF debugging to ipu_idmac.c, minor cleanup
dw_dmac: add cyclic API to DW DMA driver
dmaengine: Add privatecnt to revert DMA_PRIVATE property
dmatest: add dma interrupts and callbacks
dmatest: add xor test
dmaengine: allow dma support for async_tx to be toggled
async_tx: provide __async_inline for HAS_DMA=n archs
dmaengine: kill some unused headers
dmaengine: initialize tx_list in dma_async_tx_descriptor_init
dma: i.MX31 IPU DMA robustness improvements
dma: improve section assignment in i.MX31 IPU DMA driver
dma: ipu_idmac driver cosmetic clean-up
dmaengine: fail device registration if channel registration fails
Srinivas Eeda [Fri, 6 Mar 2009 22:21:46 +0000 (14:21 -0800)]
ocfs2: recover orphans in offline slots during recovery and mount
During recovery, a node recovers orphans in its own slot and in the slots of
the dead node(s). But if the dead nodes were holding orphans in offline slots,
they will be left unrecovered.
If the dead node is the last one to die and is holding orphans in other slots
and is the first one to mount, then it only recovers its own slot, which
leaves orphans in offline slots.
This patch queues complete_recovery to clean orphans for all offline slots
during mount and node recovery.
Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Hisashi Hifumi [Thu, 5 Mar 2009 08:22:21 +0000 (17:22 +0900)]
ocfs2: Pagecache usage optimization on ocfs2
A page can have multiple buffers, and even if a page is not uptodate, some
buffers can be uptodate in a pagesize != blocksize environment.
This aop checks that all buffers which correspond to the part of a file
that we want to read are uptodate. If so, we do not have to issue actual
read IO to HDD even if the page is not uptodate, because the portion we
want to read is uptodate.
The "block_is_partially_uptodate" function is already used by ext2/3/4.
With this patch, random read/write mixed workloads or random read after
random write workloads can be optimized and we can get a performance improvement.
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
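The check described above (are all buffers covering the requested byte range uptodate?) can be sketched in userspace C. This is not the kernel's block_is_partially_uptodate(); mock_range_uptodate() and its flat array of per-block flags are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * uptodate[i] says whether block i of the page holds valid data.
 * Return true if every block overlapping the byte range
 * [from, from + count) is valid, so no read IO would be needed.
 */
static bool mock_range_uptodate(const bool *uptodate, size_t nr_blocks,
				size_t block_size, size_t from, size_t count)
{
	size_t first, last, i;

	assert(block_size > 0 && count > 0);
	assert(from + count <= nr_blocks * block_size);

	first = from / block_size;
	last = (from + count - 1) / block_size;
	for (i = first; i <= last; i++)
		if (!uptodate[i])
			return false;
	return true;
}
```

A page with one stale block still satisfies reads that touch only the other blocks, which is the case the patch optimizes.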
wengang wang [Fri, 6 Mar 2009 13:29:10 +0000 (21:29 +0800)]
ocfs2: fix rare stale inode errors when exporting via nfs
For nfs exporting, ocfs2_get_dentry() returns the dentry for fh.
ocfs2_get_dentry() may read from disk when the inode is not in memory,
without any cross cluster lock. This leads to the file system loading a
stale inode.
This patch fixes the above problem.
The solution is that when the inode is not in memory, we get the cluster
lock (PR) of the alloc inode from which the inode in question is allocated
(this causes the node on which the deletion is done to sync the alloc inode)
before reading out the inode itself. Then we check the bitmap in the group
(the one the inode in question is allocated from) to see if the bit is clear.
If it's clear then the inode is stale. If the bit is set, we then check the
generation as the existing code does.
We have to read out the inode in question from disk first to know its alloc
slot and alloc bit. And if it's not stale we read it out using ocfs2_iget().
The second read should then be from cache.
We also have to add a per superblock nfs_sync_lock to cover the lock for the
alloc inode and that for the inode in question. This is because ocfs2_get_dentry()
and ocfs2_delete_inode() lock them in reverse order. nfs_sync_lock is locked
in EX mode in ocfs2_get_dentry() and in PR mode in ocfs2_delete_inode(), so
that multiple ocfs2_delete_inode() calls can run concurrently in the normal case.
[mfasheh@suse.com: build warning fixes and comment cleanups]
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
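The two-step staleness test from this message (allocation bit first, then generation) might look roughly like this in userspace C. It is a sketch only: mock_check_handle(), the byte-array bitmap layout and the choice of -ESTALE for both failures are assumptions, and the cluster locking is left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return 0 if the file handle still refers to a live inode,
 * -ESTALE otherwise.
 */
static int mock_check_handle(const uint8_t *group_bitmap, unsigned int bit,
			     uint32_t disk_generation, uint32_t fh_generation)
{
	bool allocated;

	assert(group_bitmap != NULL);
	allocated = (group_bitmap[bit / 8] >> (bit % 8)) & 1u;
	if (!allocated)
		return -ESTALE;		/* bit clear: the inode was freed */
	if (disk_generation != fh_generation)
		return -ESTALE;		/* slot reused by a newer inode */
	return 0;
}
```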
Sunil Mushran [Thu, 26 Feb 2009 23:00:47 +0000 (15:00 -0800)]
ocfs2/dlm: Remove struct dlm_lock_name in struct dlm_master_list_entry
This patch removes struct dlm_lock_name and adds the entries directly
to struct dlm_master_list_entry. Under the new scheme, all mles, whether
or not they are backed by a lockres, will have the name populated in mle->mname.
This allows us to get rid of code that was figuring out the location of
the mle name.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Sunil Mushran [Thu, 26 Feb 2009 23:00:44 +0000 (15:00 -0800)]
ocfs2/dlm: Improve lockres counts
This patch replaces the lockres counts that tracked the number of
locally and remotely mastered lockres' with a current and total count. The
total count is the number of lockres' that have been created since the dlm
domain was created.
The locally and remotely mastered counts can be computed using
the locking_state output.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Sunil Mushran [Thu, 26 Feb 2009 23:00:43 +0000 (15:00 -0800)]
ocfs2/dlm: Track number of mles
The lifetime of a mle is limited to the duration of the lockres mastery
process. While typically this lifetime is fairly short, we have noticed
the number of mles explode under certain circumstances. This patch tracks
the number of each type of mle and should help us determine
how best to speed up the mastery process.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Sunil Mushran [Thu, 26 Feb 2009 23:00:41 +0000 (15:00 -0800)]
ocfs2/dlm: Activate dlm->master_hash for master list entries
With this patch, the mles are stored in a hash and not a simple list.
This should improve the mle lookup time when the number of outstanding
masteries is large.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Sunil Mushran [Thu, 26 Feb 2009 23:00:38 +0000 (15:00 -0800)]
ocfs2/dlm: Clean up struct dlm_lock_name
For master mle, the name is stored in the attached lockres in struct qstr.
For block and migration mle, the name is stored inline in struct dlm_lock_name.
This patch attempts to make struct dlm_lock_name look like a struct qstr. While
we could use struct qstr, we don't because we want to avoid having to malloc
and free the lockname string as the mle's lifetime is fairly short.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Sunil Mushran [Thu, 26 Feb 2009 23:00:37 +0000 (15:00 -0800)]
ocfs2/dlm: Encapsulate adding and removing of mle from dlm->master_list
This patch encapsulates adding and removing of the mle from the
dlm->master_list. This patch is part of the series of patches that
converts the mle list to a mle hash.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Tao Ma [Tue, 24 Feb 2009 16:53:25 +0000 (00:53 +0800)]
ocfs2: Optimize inode group allocation by recording last used group.
In ocfs2, the block group search looks for the "emptiest" group
to allocate from. So if the allocator has many equally (or almost
equally) empty groups, new block groups will tend to get spread
out amongst them.
So we add osb_inode_alloc_group in ocfs2_super to record the last
used inode allocation group.
For more details, please see
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.
I have done some basic tests and the results are a ten times improvement on
some cold-cache stat workloads.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Tao Ma [Tue, 24 Feb 2009 16:53:24 +0000 (00:53 +0800)]
ocfs2: Allocate inode groups from global_bitmap.
Inode groups used to be allocated from local alloc file,
but since we want all inodes to be contiguous enough, we
will try to allocate them directly from global_bitmap.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Tao Ma [Tue, 24 Feb 2009 16:53:23 +0000 (00:53 +0800)]
ocfs2: Optimize inode allocation by remembering last group
In ocfs2, the inode block search looks for the "emptiest" inode
group to allocate from. So if an inode alloc file has many equally
(or almost equally) empty groups, new inodes will tend to get
spread out amongst them, which in turn can put them all over the
disk. This is undesirable because directory operations on conceptually
"nearby" inodes force a large number of seeks.
So we add ip_last_used_group in core directory inodes which records
the last used allocation group. Another field named ip_last_used_slot
is also added in case inode stealing happens. When claiming a new inode,
we pass in the directory's inode so that the allocation can use this
information.
For more details, please see
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
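The effect of remembering the last used group can be illustrated with a toy allocator. This is a sketch under assumptions, not the ocfs2 allocator: struct mock_dir, mock_pick_group() and the free-count array are hypothetical, and slot stealing is not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_NO_GROUP ((size_t)-1)

/* Hypothetical per-directory state: the group used for the last inode. */
struct mock_dir {
	size_t last_used_group;
};

/*
 * Pick an inode group for a new inode in `dir`. Prefer the directory's
 * last used group while it still has room; otherwise fall back to the
 * emptiest group and remember it. Returns MOCK_NO_GROUP if all are full.
 */
static size_t mock_pick_group(struct mock_dir *dir,
			      const unsigned int *free_inodes,
			      size_t nr_groups)
{
	size_t i, best = MOCK_NO_GROUP;

	assert(dir != NULL && free_inodes != NULL);

	if (dir->last_used_group < nr_groups &&
	    free_inodes[dir->last_used_group] > 0)
		return dir->last_used_group;

	for (i = 0; i < nr_groups; i++)
		if (free_inodes[i] > 0 &&
		    (best == MOCK_NO_GROUP ||
		     free_inodes[i] > free_inodes[best]))
			best = i;

	if (best != MOCK_NO_GROUP)
		dir->last_used_group = best;
	return best;
}
```

Without the remembered group, the second allocation in a directory would jump to whichever group happened to be emptiest at that moment, scattering nearby inodes across the disk.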
Mark Fasheh [Thu, 19 Feb 2009 21:17:05 +0000 (13:17 -0800)]
ocfs2: fix leaf start calculation in ocfs2_dx_dir_rebalance()
ocfs2_dx_dir_rebalance() is passed the block offset of a dx leaf which needs
rebalancing. Since we rebalance an entire cluster at a time however, this
function needs to calculate the beginning of that cluster, in blocks. The
calculation was wrong, which would result in a read of non-leaf blocks. Fix
the calculation by adding ocfs2_block_to_cluster_start() which is a more
straight-forward way of determining this.
Reported-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
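Finding the first block of the cluster that contains a given block amounts to clearing the low bits of the block number. The following is a userspace sketch of that calculation, assuming power-of-two block and cluster sizes; mock_block_to_cluster_start() and its parameters are hypothetical and do not reproduce the signature of ocfs2_block_to_cluster_start().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the first block of the cluster containing `block`.
 * Sizes are given as log2 values, e.g. 12 for 4K blocks and
 * 15 for 32K clusters (8 blocks per cluster).
 */
static uint64_t mock_block_to_cluster_start(uint64_t block,
					    unsigned int blocksize_bits,
					    unsigned int clustersize_bits)
{
	unsigned int bits;

	assert(clustersize_bits >= blocksize_bits);
	bits = clustersize_bits - blocksize_bits;  /* log2(blocks per cluster) */
	assert(bits < 64);
	return (block >> bits) << bits;
}
```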
Mark Fasheh [Wed, 18 Feb 2009 19:41:38 +0000 (11:41 -0800)]
ocfs2: re-order ocfs2_empty_dir checks
ocfs2_empty_dir() is far more expensive than checking link count. Since both
need to be checked at the same time, we can improve performance by checking
link count first.
Mark Fasheh [Fri, 21 Nov 2008 01:54:57 +0000 (17:54 -0800)]
ocfs2: Increase max links count
Since we've now got a directory format capable of handling a large number of
entries, we can increase the maximum link count supported. This only gets
increased if the directory indexing feature is turned on.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Mark Fasheh [Fri, 30 Jan 2009 02:17:46 +0000 (18:17 -0800)]
ocfs2: Introduce dir free space list
The only operation which doesn't get faster with directory indexing is
insert, which still has to walk the entire unindexed directory portion to
find a free block. This patch provides an improvement in directory insert
performance by maintaining a singly linked list of directory leaf blocks
which have space for additional dirents.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
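The idea of a singly linked list of leaf blocks with free space can be sketched as follows. This is an in-memory illustration only: struct mock_leaf and mock_find_space() are hypothetical and say nothing about the on-disk format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory view of a directory leaf block. */
struct mock_leaf {
	unsigned int free_bytes;	/* room left for new dirents */
	struct mock_leaf *next_free;	/* next leaf with room, or NULL */
};

/*
 * Walk only the leaves on the free list and return the first one that
 * can hold a dirent of `need` bytes, or NULL if none can.
 */
static struct mock_leaf *mock_find_space(struct mock_leaf *free_head,
					 unsigned int need)
{
	struct mock_leaf *leaf;

	assert(need > 0);
	for (leaf = free_head; leaf != NULL; leaf = leaf->next_free)
		if (leaf->free_bytes >= need)
			return leaf;
	return NULL;
}
```

The walk visits only leaves known to have some free space, rather than every block of the unindexed directory.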
Mark Fasheh [Tue, 25 Nov 2008 01:02:08 +0000 (17:02 -0800)]
ocfs2: Store dir index records inline
Allow us to store a small number of directory index records in the
ocfs2_dx_root_block. This saves us a disk read on small to medium sized
directories (less than about 250 entries). The inline root is automatically
turned into a root block with extents if the directory size increases beyond
its capacity.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Mark Fasheh [Thu, 13 Nov 2008 00:27:44 +0000 (16:27 -0800)]
ocfs2: Add a name indexed b-tree to directory inodes
This patch makes use of Ocfs2's flexible btree code to add an additional
tree to directory inodes. The new tree stores an array of small,
fixed-length records in each leaf block. Each record stores a hash value,
and pointer to a block in the traditional (unindexed) directory tree where a
dirent with the given name hash resides. Lookup exclusively uses this tree
to find dirents, thus providing us with constant time name lookups.
Some of the hashing code was copied from ext3. Unfortunately, it has lots of
unfixed checkpatch errors. I left that as-is so that tracking changes would
be easier.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
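The fixed-length records described above (a name hash plus a pointer to the block holding the dirent) can be illustrated with a small scan over one leaf's record array. This is a sketch under assumptions: struct mock_dx_entry and mock_dx_scan() are hypothetical, field widths are guesses, and the b-tree descent that selects the leaf is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-length index record. */
struct mock_dx_entry {
	uint32_t hash;		/* hash of the entry name */
	uint64_t dirent_blk;	/* block in the unindexed tree with the dirent */
};

/*
 * Return the index of the next record at or after `start` whose hash
 * matches, or `n` if there is none. Different names can share a hash,
 * so a caller would still compare the name stored in dirent_blk.
 */
static size_t mock_dx_scan(const struct mock_dx_entry *entries, size_t n,
			   uint32_t hash, size_t start)
{
	size_t i;

	assert(entries != NULL || n == 0);
	for (i = start; i < n; i++)
		if (entries[i].hash == hash)
			return i;
	return n;
}
```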
Mark Fasheh [Wed, 12 Nov 2008 23:43:34 +0000 (15:43 -0800)]
ocfs2: Introduce dir lookup helper struct
Many directory manipulation calls pass around a tuple of dirent and its
containing buffer_head. Dir indexing has a bit more state, but instead of
adding yet more arguments to functions, we introduce 'struct
ocfs2_dir_lookup_result'. In this patch, it simply holds the same tuple, but
future patches will add more state.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Sunil Mushran [Wed, 17 Dec 2008 22:17:43 +0000 (14:17 -0800)]
ocfs2: Expose the file system state via debugfs
This patch creates a per mount debugfs file, fs_state, which exposes
information such as the cluster stack in use, states of the downconvert,
recovery and commit threads, number of journal txns, some allocation stats,
list of all slots, etc.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Merge branch 'ext3-latency-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
* 'ext3-latency-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext3: Add replace-on-rename hueristics for data=writeback mode
ext3: Add replace-on-truncate hueristics for data=writeback mode
ext3: Use WRITE_SYNC for commits which are caused by fsync()
block_write_full_page: Use synchronous writes for WBC_SYNC_ALL writebacks
Joseph Cihula [Mon, 30 Mar 2009 21:03:01 +0000 (14:03 -0700)]
x86: disable stack-protector for __restore_processor_state()
The __restore_processor_state() fn restores %gs on resume from S3. As
such, it cannot be protected by the stack-protector guard since %gs will
not be correct on function entry.
There are only a few other fns in this file and it should not negatively
impact kernel security that they will also have the stack-protector
guard removed (and so it's not worth moving them to another file).
Without this change, S3 resume on a kernel built with
CONFIG_CC_STACKPROTECTOR_ALL=y will fail.
Signed-off-by: Joseph Cihula <joseph.cihula@intel.com>
Tested-by: Chris Wright <chrisw@sous-sol.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Tejun Heo <tj@kernel.org>
LKML-Reference: <49D13385.5060900@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lrg/voltage-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lrg/voltage-2.6: (32 commits)
regulator: twl4030 VAUX3 supports 3.0V
regulator: Support disabling of unused regulators by machines
regulator: Don't increment use_count for boot_on regulators
twl4030-regulator: expose VPLL2
regulator: refcount fixes
regulator: Don't warn if we failed to get a regulator
regulator: Allow boot_on regulators to be disabled by clients
regulator: Implement list_voltage for WM835x LDOs and DCDCs
twl4030-regulator: list more VAUX4 voltages
regulator: Don't warn on omitted voltage constraints
regulator: Implement list_voltage() for WM8400 DCDCs and LDOs
MMC: regulator utilities
regulator: twl4030 voltage enumeration (v2)
regulator: twl4030 regulators
regulator: get_status() grows kerneldoc
regulator: enumerate voltages (v2)
regulator: Fix get_mode() for WM835x DCDCs
regulator: Allow regulators to set the initial operating mode
regulator: Suggest use of datasheet supply or pin names for consumers
regulator: email - update email address and regulator webpage.
...
* git://git.infradead.org/iommu-2.6:
intel-iommu: Fix address wrap on 32-bit kernel.
intel-iommu: Enable DMAR on 32-bit kernel.
intel-iommu: fix PCI device detach from virtual machine
intel-iommu: VT-d page table to support snooping control bit
iommu: Add domain_has_cap iommu_ops
intel-iommu: Snooping control support
Fixed trivial conflicts in arch/x86/Kconfig and drivers/pci/intel-iommu.c
* git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-fscache: (41 commits)
NFS: Add mount options to enable local caching on NFS
NFS: Display local caching state
NFS: Store pages from an NFS inode into a local cache
NFS: Read pages from FS-Cache into an NFS inode
NFS: nfs_readpage_async() needs to be accessible as a fallback for local caching
NFS: Add read context retention for FS-Cache to call back with
NFS: FS-Cache page management
NFS: Add some new I/O counters for FS-Cache doing things for NFS
NFS: Invalidate FsCache page flags when cache removed
NFS: Use local disk inode cache
NFS: Define and create inode-level cache objects
NFS: Define and create superblock-level objects
NFS: Define and create server-level objects
NFS: Register NFS for caching and retrieve the top-level index
NFS: Permit local filesystem caching to be enabled for NFS
NFS: Add FS-Cache option bit and debug bit
NFS: Add comment banners to some NFS functions
FS-Cache: Make kAFS use FS-Cache
CacheFiles: A cache that backs onto a mounted filesystem
CacheFiles: Export things for CacheFiles
...
Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (61 commits)
Revert "xfs: increase the maximum number of supported ACL entries"
xfs: cleanup uuid handling
xfs: remove m_attroffset
xfs: fix various typos
xfs: pagecache usage optimization
xfs: remove m_litino
xfs: kill ino64 mount option
xfs: kill mutex_t typedef
xfs: increase the maximum number of supported ACL entries
xfs: factor out code to find the longest free extent in the AG
xfs: kill VN_BAD
xfs: kill vn_atime_* helpers.
xfs: cleanup xlog_bread
xfs: cleanup xlog_recover_do_trans
xfs: remove another leftover of the old inode log item format
xfs: cleanup log unmount handling
Fix xfs debug build breakage by pushing xfs_error.h after
xfs: include header files for prototypes
xfs: make symbols static
xfs: move declaration to header file
...
Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
udf: Don't write integrity descriptor too often
udf: Try anchor in block 256 first
udf: Some type fixes and cleanups
udf: use hardware sector size
udf: fix novrs mount option
udf: Fix oops when invalid character in filename occurs
udf: return f_fsid for statfs(2)
udf: Add checks to not underflow sector_t
udf: fix default mode and dmode options handling
udf: fix sparse warnings:
udf: unsigned last[i] cannot be less than 0
udf: implement mode and dmode mounting options
udf: reduce stack usage of udf_get_filename
udf: reduce stack usage of udf_load_pvoldesc
Fix the udf code not to pass structs on stack where possible.
Remove struct typedefs from fs/udf/ecma_167.h et al.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/rcu-doc-2.6:
Doc: Fix spelling in RCU/rculist_nulls.txt.
Doc: Fix wrong API example usage of call_rcu().
Doc: Fix missing whitespaces in RCU documentation.
CC init/main.o
In file included from include/linux/highmem.h:25,
from include/linux/pagemap.h:11,
from include/linux/mempolicy.h:63,
from init/main.c:53:
arch/powerpc/include/asm/highmem.h: In function 'kmap_atomic_prot':
arch/powerpc/include/asm/highmem.h:98: error: implicit declaration of function 'debug_kmap_atomic'
In file included from include/linux/pagemap.h:11,
from include/linux/mempolicy.h:63,
from init/main.c:53:
include/linux/highmem.h: At top level:
include/linux/highmem.h:196: warning: conflicting types for 'debug_kmap_atomic'
include/linux/highmem.h:196: error: static declaration of 'debug_kmap_atomic' follows non-static declaration
include/asm/highmem.h:98: error: previous implicit declaration of 'debug_kmap_atomic' was here
make[1]: *** [init/main.o] Error 1
make: *** [init] Error 2
Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Acked-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: ixp4xx - Fix handling of chained sg buffers
crypto: shash - Fix unaligned calculation with short length
hwrng: timeriomem - Use phys address rather than virt
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu: (41 commits)
m68knommu: improve compile arch switch settings
m68knommu: fix 5407 ColdFire UART vector setup
m68knommu: fix 5307 ColdFire UART vector setup
m68knommu: fix 5249 ColdFire UART vector setup
m68knommu: fix 5249 ColdFire UART setup
m68knommu: fix end of uart table marker
m68knommu: switch to using generic_handle_irq()
m68k: merge the mmu and non-mmu versions of tlbflush.h
m68knommu: introduce basic clk infrastructure
m68k: merge the mmu and non-mmu versions of module.h
m68knommu: add missing interrupt line definition for UART 2
m68k: merge the mmu and non-mmu versions of mmu_context.h
m68k: merge the mmu and non-mmu versions of current.h
m68k: merge the mmu and non-mmu versions of div64.h
m68k: merge the mmu and non-mmu versions of bugs.h
m68k: merge the mmu and non-mmu versions of bug.h
m68k: use the mmu version of cache.h for m68knommu as well
m68k: use the mmu version of bootinfo.h for m68knommu as well
m68k: merge the mmu and non-mmu versions of fb.h
m68k: merge the mmu and non-mmu versions of segment.h
...
Andrew Morton [Thu, 2 Apr 2009 23:44:38 +0000 (16:44 -0700)]
x86: fix is_io_mapping_possible() build warning on i386 allnoconfig
i386 allnoconfig:
arch/x86/mm/iomap_32.c: In function 'is_io_mapping_possible':
arch/x86/mm/iomap_32.c:27: warning: comparison is always false due to limited range of data type
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge branch 'for-linus' of git://neil.brown.name/md
* 'for-linus' of git://neil.brown.name/md: (53 commits)
md/raid5 revise rules for when to update metadata during reshape
md/raid5: minor code cleanups in make_request.
md: remove CONFIG_MD_RAID_RESHAPE config option.
md/raid5: be more careful about write ordering when reshaping.
md: don't display meaningless values in sysfs files resync_start and sync_speed
md/raid5: allow layout and chunksize to be changed on active array.
md/raid5: reshape using largest of old and new chunk size
md/raid5: prepare for allowing reshape to change layout
md/raid5: prepare for allowing reshape to change chunksize.
md/raid5: clearly differentiate 'before' and 'after' stripes during reshape.
Documentation/md.txt update
md: allow number of drives in raid5 to be reduced
md/raid5: change reshape-progress measurement to cope with reshaping backwards.
md: add explicit method to signal the end of a reshape.
md/raid5: enhance raid5_size to work correctly with negative delta_disks
md/raid5: drop qd_idx from r6_state
md/raid6: move raid6 data processing to raid6_pq.ko
md: raid5 run(): Fix max_degraded for raid level 4.
md: 'array_size' sysfs attribute
md: centralize ->array_sectors modifications
...
* master.kernel.org:/home/rmk/linux-2.6-arm:
[ARM] fix build-breaking 7a192ec commit
ARM: Add SMSC911X support to Overo platform (V2)
arm: update omap_ldp defconfig to use smsc911x
arm: update realview defconfigs to use smsc911x
arm: update pcm037 defconfig to use smsc911x
arm: convert omap ldp platform to use smsc911x
arm: convert realview platform to use smsc911x
arm: convert pcm037 platform to use smsc911x
[ARM] 5444/1: ARM: Realview: Fix event-device multiplicators in localtimer.c
[ARM] 5442/1: pxa/cm-x255: fix reverse RDY gpios in PCMCIA driver
[ARM] 5441/1: Use pr_err on error paths in at91 pm
[ARM] 5440/1: Fix VFP state corruption due to preemption during VFP exceptions
[ARM] 5439/1: Do not clear bit 10 of DFSR during abort handling on ARMv6
[ARM] 5437/1: Add documentation for "nohlt" kernel parameter
[ARM] 5436/1: ARM: OMAP: Fix compile for rx51
[ARM] arch_reset() now takes a second parameter
[ARM] Kirkwood: small L2 code cleanup
[ARM] Kirkwood: invalidate L2 cache before enabling it
* git://git.kernel.org/pub/scm/linux/kernel/git/bart/linux-hdreg-h-cleanup:
remove <linux/ata.h> include from <linux/hdreg.h>
include/linux/hdreg.h: remove unused defines
isd200: use ATA_* defines instead of *_STAT and *_ERR ones
include/linux/hdreg.h: cover WIN_* and friends with #ifndef/#endif __KERNEL__
aoe: WIN_* -> ATA_CMD_*
isd200: WIN_* -> ATA_CMD_*
include/linux/hdreg.h: cover struct hd_driveid with #ifndef/#endif __KERNEL__
xsysace: make it 'struct hd_driveid'-free
ubd_kern: make it 'struct hd_driveid'-free
isd200: make it 'struct hd_driveid'-free
David Howells [Fri, 3 Apr 2009 15:42:48 +0000 (16:42 +0100)]
NFS: Add mount options to enable local caching on NFS
Add NFS mount options to allow the local caching support to be enabled.
The attached patch makes it possible for the NFS filesystem to be told to make
use of the network filesystem local caching service (FS-Cache).
To be able to use this, a recent nfsutils package is required.
There are three variant NFS mount options that can be added to a mount command
to control caching for a mount. Only the last one specified takes effect:
(*) Adding "fsc" will request caching.
(*) Adding "fsc=<string>" will request caching and also specify a uniquifier.
(*) Adding "nofsc" will disable caching.
For example:
mount warthog:/ /a -o fsc
The cache of a particular superblock (NFS FSID) will be shared between all
mounts of that volume, provided they have the same connection parameters and
are not marked 'nosharecache'.
Where it is otherwise impossible to distinguish superblocks because all the
parameters are identical, but the 'nosharecache' option is supplied, a
uniquifying string must be supplied, else only the first mount will be
permitted to use the cache.
If there's a key collision, then the second mount will disable caching and give
a warning into the kernel log.
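The "last one specified takes effect" rule can be sketched as follows. This is only an illustration in Python, not the kernel's option parser; the function name resolve_fsc_option is hypothetical, and it assumes caching is off unless requested.

```python
def resolve_fsc_option(options):
    """Return (caching_enabled, uniquifier) for a comma-separated option string.

    Hypothetical helper: the last of fsc, fsc=<string> and nofsc wins.
    """
    enabled, uniquifier = False, None  # assumption: caching is off by default
    for opt in options.split(","):
        opt = opt.strip()
        if opt == "fsc":
            enabled, uniquifier = True, None
        elif opt.startswith("fsc="):
            enabled, uniquifier = True, opt[len("fsc="):]
        elif opt == "nofsc":
            enabled, uniquifier = False, None
        # any other option is ignored by this sketch
    return enabled, uniquifier
```

With this sketch, "fsc=xyz,nofsc" resolves to caching disabled, while "nofsc,fsc=xyz" resolves to caching enabled with the uniquifier "xyz".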
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
David Howells [Fri, 3 Apr 2009 15:42:44 +0000 (16:42 +0100)]
NFS: nfs_readpage_async() needs to be accessible as a fallback for local caching
nfs_readpage_async() needs to be non-static so that it can be used as a
fallback for the local on-disk caching should an EIO crop up when reading the
cache.
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
David Howells [Fri, 3 Apr 2009 15:42:44 +0000 (16:42 +0100)]
NFS: Add read context retention for FS-Cache to call back with
Add read context retention so that FS-Cache can call back into NFS when a read
operation on the cache fails with EIO rather than reading data. This permits NFS to
then fetch the data from the server instead using the appropriate security
context.
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
David Howells [Fri, 3 Apr 2009 15:42:44 +0000 (16:42 +0100)]
NFS: FS-Cache page management
FS-Cache page management for NFS. This includes hooking the releasing and
invalidation of pages marked with PG_fscache (aka PG_private_2) and waiting for
completion of the write-to-cache flag (PG_fscache_write aka PG_owner_priv_2).
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
David Howells [Fri, 3 Apr 2009 15:42:43 +0000 (16:42 +0100)]
NFS: Add some new I/O counters for FS-Cache doing things for NFS
Add some new NFS I/O counters for FS-Cache doing things for NFS. A new line is
emitted into /proc/pid/mountstats if caching is enabled that looks like:
fsc: <rok> <rfl> <wok> <wfl> <unc>
Where <rok> is the number of pages read successfully from the cache, <rfl> is
the number of failed page reads against the cache, <wok> is the number of
successful page writes to the cache, <wfl> is the number of failed page writes
to the cache, and <unc> is the number of NFS pages that have been disconnected
from the cache.
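A userspace tool could pick the five counters out of that line roughly as follows. This is a sketch only; the FscCounters type and parse_fsc_line function are hypothetical names, and the field order is taken from the description above.

```python
from typing import NamedTuple, Optional


class FscCounters(NamedTuple):
    rok: int  # pages read successfully from the cache
    rfl: int  # failed page reads against the cache
    wok: int  # successful page writes to the cache
    wfl: int  # failed page writes to the cache
    unc: int  # NFS pages disconnected from the cache


def parse_fsc_line(line: str) -> Optional[FscCounters]:
    """Parse one 'fsc: ...' line; return None for any other line."""
    fields = line.split()
    if len(fields) != 6 or fields[0] != "fsc:":
        return None
    return FscCounters(*(int(f) for f in fields[1:]))
```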
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
David Howells [Fri, 3 Apr 2009 15:42:43 +0000 (16:42 +0100)]
NFS: Define and create inode-level cache objects
Define and create inode-level cache data storage objects (as managed by
nfs_inode structs).
Each inode-level object is created in a superblock-level index object and is
itself a data storage object into which pages from the inode are stored.
The inode object key is the NFS file handle for the inode.
The inode object is given coherency data to carry in the auxiliary data
permitted by the cache. This is a sequence made up of:
(1) i_mtime from the NFS inode.
(2) i_ctime from the NFS inode.
(3) i_size from the NFS inode.
(4) change_attr from the NFSv4 attribute data.
As the cache is a persistent cache, the auxiliary data is checked when a new
NFS in-memory inode is set up that matches an already existing data storage
object in the cache. If the coherency data is the same, the on-disk object is
retained and used; if not, it is scrapped and a new one created.
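The retain-or-scrap decision can be illustrated with a small Python model. The Coherency tuple and check_cached_object function are hypothetical names for this sketch; the real check is done in the kernel against the auxiliary data stored with the object.

```python
from typing import NamedTuple


class Coherency(NamedTuple):
    mtime: int        # i_mtime from the NFS inode
    ctime: int        # i_ctime from the NFS inode
    size: int         # i_size from the NFS inode
    change_attr: int  # change_attr from the NFSv4 attribute data


def check_cached_object(stored: Coherency, current: Coherency) -> str:
    """Return 'retain' if the on-disk object is still coherent, else 'discard'."""
    return "retain" if stored == current else "discard"
```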
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
David Howells [Fri, 3 Apr 2009 15:42:42 +0000 (16:42 +0100)]
NFS: Define and create superblock-level objects
Define and create superblock-level cache index objects (as managed by
nfs_server structs).
Each superblock object is created in a server level index object and is itself
an index into which inode-level objects are inserted.
Ideally there would be one superblock-level object per server, and the former
would be folded into the latter; however, since the "nosharecache" option
exists this isn't possible.
The superblock object key is a sequence consisting of:
(1) Certain superblock s_flags.
(2) Various connection parameters that serve to distinguish superblocks for
sget().
(3) The volume FSID.
(4) The security flavour.
(5) The uniquifier length.
(6) The uniquifier text. This is normally an empty string, unless the fsc=xyz
mount option was used to explicitly specify a uniquifier.
The key blob is of variable length, depending on the length of (6).
The superblock object is given no coherency data to carry in the auxiliary data
permitted by the cache. It is assumed that the superblock is always coherent.
This patch also adds uniquification handling such that two otherwise identical
superblocks, at least one of which is marked "nosharecache", won't end up
trying to share the on-disk cache. It will be possible to manually provide a
uniquifier through a mount option with a later patch to avoid the error
otherwise produced.
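To illustrate why the key is of variable length and how a uniquifier separates otherwise identical superblocks, here is a sketch in Python. The field widths, byte order and both names (superblock_key, UnsharedKeyRegistry) are assumptions made for the example and do not describe the kernel's actual layout.

```python
import struct


def superblock_key(s_flags, conn_params, fsid, flavour, uniquifier=""):
    """Build an illustrative key blob from the six components listed above."""
    uniq = uniquifier.encode("utf-8")
    return b"".join([
        struct.pack("<I", s_flags),              # (1) selected s_flags
        conn_params,                             # (2) connection parameters (opaque here)
        struct.pack("<QQ", *fsid),               # (3) volume FSID as two 64-bit values
        struct.pack("<IH", flavour, len(uniq)),  # (4) flavour, (5) uniquifier length
        uniq,                                    # (6) uniquifier text
    ])


class UnsharedKeyRegistry:
    """Tracks keys claimed by mounts that must not share the on-disk cache."""

    def __init__(self):
        self._keys = set()

    def claim(self, key: bytes) -> bool:
        """Return False on a key collision, True if the key is now claimed."""
        if key in self._keys:
            return False
        self._keys.add(key)
        return True
```

In this model a second mount with the same key gets False from claim(), while the same mount with a distinct uniquifier produces a different, longer key and succeeds.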
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
David Howells [Fri, 3 Apr 2009 15:42:42 +0000 (16:42 +0100)]
NFS: Define and create server-level objects
Define and create server-level cache index objects (as managed by nfs_client
structs).
Each server object is created in the NFS top-level index object and is itself
an index into which superblock-level objects are inserted.
Ideally there would be one superblock-level object per server, and the former
would be folded into the latter; however, since the "nosharecache" option
exists this isn't possible.
The server object key is a sequence consisting of:
(1) NFS version
(2) Server address family (eg: AF_INET or AF_INET6)
(3) Server port.
(4) Server IP address.
The key blob is of variable length, depending on the length of (4).
The server object is given no coherency data to carry in the auxiliary data
permitted by the cache.
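The dependence of the key length on the address can be shown with a short Python sketch. The packing format and the server_key name are assumptions for illustration only; the kernel's real key layout may differ.

```python
import socket
import struct


def server_key(nfs_version: int, family: int, port: int, address: str) -> bytes:
    """Build an illustrative server key: version, family, port, then raw address."""
    packed_addr = socket.inet_pton(family, address)  # 4 bytes for IPv4, 16 for IPv6
    header = struct.pack("!HHH", nfs_version, int(family), port)
    return header + packed_addr
```

Under these assumptions an AF_INET key is 10 bytes long and an AF_INET6 key is 22 bytes long.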
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
David Howells [Fri, 3 Apr 2009 15:42:41 +0000 (16:42 +0100)]
FS-Cache: Make kAFS use FS-Cache
The attached patch makes the kAFS filesystem in fs/afs/ use FS-Cache, and
through it any attached caches. The kAFS filesystem will use caching
automatically if it's available.
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
David Howells [Fri, 3 Apr 2009 15:42:41 +0000 (16:42 +0100)]
CacheFiles: A cache that backs onto a mounted filesystem
Add an FS-Cache cache-backend that permits a mounted filesystem to be used as a
backing store for the cache.
CacheFiles uses a userspace daemon to do some of the cache management - such as
reaping stale nodes and culling. This is called cachefilesd and lives in
/sbin. The source for the daemon can be downloaded from:
The filesystem and data integrity of the cache are only as good as those of the
filesystem providing the backing services. Note that CacheFiles does not
attempt to journal anything since the journalling interfaces of the various
filesystems are very specific in nature.
CacheFiles creates a misc character device - "/dev/cachefiles" - that is used
to communicate with the daemon. Only one thing may have this open at once,
and whilst it is open, a cache is at least partially in existence. The daemon
opens this and sends commands down it to control the cache.
CacheFiles is currently limited to a single cache.
CacheFiles attempts to maintain at least a certain percentage of free space on
the filesystem, shrinking the cache by culling the objects it contains to make
space if necessary - see the "Cache Culling" section. This means it can be
placed on the same medium as a live set of data, and will expand to make use of
spare space and automatically contract when the set of data requires more
space.
============
REQUIREMENTS
============
The use of CacheFiles and its daemon requires the following features to be
available in the system and in the cache filesystem:
- dnotify.
- extended attributes (xattrs).
- openat() and friends.
- bmap() support on files in the filesystem (FIBMAP ioctl).
- The use of bmap() to detect a partial page at the end of the file.
It is strongly recommended that the "dir_index" option is enabled on Ext3
filesystems being used as a cache.
=============
CONFIGURATION
=============
The cache is configured by a script in /etc/cachefilesd.conf. These commands
set up the cache ready for use. The following script commands are available:
Configure the culling limits. Optional. See the section on culling.
The defaults are 7% (run), 5% (cull) and 1% (stop) respectively.
The commands beginning with a 'b' are file space (block) limits, those
beginning with an 'f' are file count limits.
(*) dir <path>
Specify the directory containing the root of the cache. Mandatory.
(*) tag <name>
Specify a tag to FS-Cache to use in distinguishing multiple caches.
Optional. The default is "CacheFiles".
(*) debug <mask>
Specify a numeric bitmask to control debugging in the kernel module.
Optional. The default is zero (all off). The following values can be
OR'd into the mask to collect various information:
1 Turn on trace of function entry (_enter() macros)
2 Turn on trace of function exit (_leave() macros)
4 Turn on trace of internal debug points (_debug())
This mask can also be set through sysfs, eg:
echo 5 >/sys/module/cachefiles/parameters/debug
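The value 5 in that example is bits 1 and 4 OR'd together. A small Python sketch of composing and decoding the mask follows; the flag names are hypothetical labels for the three documented bits.

```python
import enum


class CacheFilesDebug(enum.IntFlag):
    ENTER = 1  # trace of function entry (_enter() macros)
    LEAVE = 2  # trace of function exit (_leave() macros)
    DEBUG = 4  # trace of internal debug points (_debug())


# Entry tracing plus internal debug points gives the mask used above.
mask = CacheFilesDebug.ENTER | CacheFilesDebug.DEBUG
print(int(mask))
```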
==================
STARTING THE CACHE
==================
The cache is started by running the daemon. The daemon opens the cache device,
configures the cache and tells it to begin caching. At that point the cache
binds to fscache and the cache becomes live.
Increase the debugging level. This can be specified multiple times and
is cumulative with itself.
(*) -s
Send messages to stderr instead of syslog.
(*) -n
Don't daemonise and go into background.
(*) -f <configfile>
Use an alternative configuration file rather than the default one.
===============
THINGS TO AVOID
===============
Do not mount other things within the cache as this will cause problems. The
kernel module contains its own very cut-down path walking facility that ignores
mountpoints, but the daemon can't avoid them.
Do not create, rename or unlink files and directories in the cache whilst the
cache is active, as this may cause the state to become uncertain.
Renaming files in the cache might make objects appear to be other objects (the
filename is part of the lookup key).
Do not change or remove the extended attributes attached to cache files by the
cache as this will cause the cache state management to get confused.
Do not create files or directories in the cache, lest the cache get confused or
serve incorrect data.
Do not chmod files in the cache. The module creates things with minimal
permissions to prevent random users being able to access them directly.
=============
CACHE CULLING
=============
The cache may need culling occasionally to make space. This involves
discarding objects from the cache that have been used less recently than
anything else. Culling is based on the access time of data objects. Empty
directories are culled if not in use.
Cache culling is done on the basis of the percentage of blocks and the
percentage of files available in the underlying filesystem. There are six
"limits":
(*) brun
(*) frun
If the amount of free space and the number of available files in the cache
rise above both these limits, then culling is turned off.
(*) bcull
(*) fcull
If the amount of available space or the number of available files in the
cache falls below either of these limits, then culling is started.
(*) bstop
(*) fstop
If the amount of available space or the number of available files in the
cache falls below either of these limits, then no further allocation of
disk space or files is permitted until culling has raised things above
these limits again.
Note that these are percentages of available space and available files, and do
_not_ appear as 100 minus the percentage displayed by the "df" program.
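How the six limits interact can be sketched as a small state update in Python. This is a model of the rules above and not the kernel code; the Limits type and update_culling function are hypothetical, and the defaults are the 7%, 5% and 1% values given in the configuration section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    run: float = 7.0   # brun / frun
    cull: float = 5.0  # bcull / fcull
    stop: float = 1.0  # bstop / fstop


def update_culling(culling, bfree, ffree, blocks=Limits(), files=Limits()):
    """Return (culling, allocation_allowed) for the given free percentages.

    bfree and ffree are the percentages of available blocks and files.
    """
    if bfree < blocks.cull or ffree < files.cull:
        culling = True    # either resource fell below its cull limit
    elif bfree > blocks.run and ffree > files.run:
        culling = False   # both resources rose above their run limits
    # between cull and run the previous state is kept
    allocation_allowed = not (bfree < blocks.stop or ffree < files.stop)
    return culling, allocation_allowed
```

Note that in this model culling, once started, continues while either value sits between its cull and run limits.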
The userspace daemon scans the cache to build up a table of cullable objects.
These are then culled in least recently used order. A new scan of the cache is
started as soon as space is made in the table. Objects will be skipped if
their atimes have changed or if the kernel module says it is still using them.
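The selection order can be sketched in Python as below. The function name and the two callbacks are hypothetical stand-ins: current_atime represents a fresh stat of the object and in_use represents asking the kernel module whether it still holds the object.

```python
def pick_cull_candidates(table, current_atime, in_use):
    """Yield object names from the scan table, least recently used first.

    table is a list of (name, atime) pairs recorded when the cache was scanned.
    """
    for name, scanned_atime in sorted(table, key=lambda entry: entry[1]):
        if current_atime(name) != scanned_atime:
            continue  # object was touched since the scan, so skip it
        if in_use(name):
            continue  # kernel module says it is still using the object
        yield name
```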
===============
CACHE STRUCTURE
===============
The CacheFiles module will create two directories in the directory it was
given:
(*) cache/
(*) graveyard/
The active cache objects all reside in the first directory. The CacheFiles
kernel module moves any retired or culled objects that it can't simply unlink
to the graveyard from which the daemon will actually delete them.
The daemon uses dnotify to monitor the graveyard directory, and will delete
anything that appears therein.
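The daemon's side of this can be approximated by the following Python sketch. It is not the cachefilesd implementation: it empties the directory once when called instead of reacting to dnotify events, and the empty_graveyard name is hypothetical.

```python
import os
import shutil


def empty_graveyard(graveyard: str) -> int:
    """Delete everything directly inside the graveyard; return the entry count."""
    with os.scandir(graveyard) as it:
        entries = list(it)  # snapshot first so deletion does not disturb iteration
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)  # retired object represented as a directory
        else:
            os.unlink(entry.path)      # retired object represented as a file
    return len(entries)
```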
The module represents index objects as directories with the filename "I..." or
"J...". Note that the "cache/" directory is itself a special index.
Data objects are represented as files if they have no children, or directories
if they do. Their filenames all begin "D..." or "E...". If represented as a
directory, data objects will have a file in the directory called "data" that
actually holds the data.
Special objects are similar to data objects, except their filenames begin
"S..." or "T...".
If an object has children, then it will be represented as a directory.
Immediately in the representative directory are a collection of directories
named for hash values of the child object keys with an '@' prepended. Into
this directory, if possible, will be placed the representations of the child
objects:
INDEX     INDEX      INDEX                             DATA FILES
========= ========== ================================= ================
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry
cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry
If the key is so long that it exceeds NAME_MAX with the decorations added on to
it, then it will be cut into pieces, the first few of which will be used to
make a nest of directories, and the last one of which will be the objects
inside the last directory. The names of the intermediate directories will have
'+' prepended:
J1223/@23/+xy...z/+kl...m/Epqr
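The cutting of an over-long key can be sketched in Python as follows. The chunk size (NAME_MAX minus one character of decoration) and the split_key name are assumptions for illustration; only the '+' prefix on intermediate directories and the type prefix on the final component come from the description above.

```python
def split_key(type_char: str, key: str, name_max: int = 255) -> list:
    """Return the path components used to store a possibly over-long key."""
    chunk = name_max - 1  # assumption: leave room for the one-character prefix
    pieces = [key[i:i + chunk] for i in range(0, len(key), chunk)] or [""]
    # intermediate directories get '+', the final object gets its type prefix
    return ["+" + piece for piece in pieces[:-1]] + [type_char + pieces[-1]]
```

A short key yields a single component such as "Epqr"; a long key yields a nest of '+' directories followed by the object itself.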
Note that keys are raw data, and not only may they exceed NAME_MAX in size,
they may also contain things like '/' and NUL characters, and so they may not
be suitable for turning directly into a filename.
To handle this, CacheFiles will use a suitably printable filename directly and
"base-64" encode ones that aren't directly suitable. The two versions of
object filenames indicate the encoding:
OBJECT TYPE     PRINTABLE       ENCODED
=============== =============== ===============
Index           "I..."          "J..."
Data            "D..."          "E..."
Special         "S..."          "T..."
Intermediate directories are always "@" or "+" as appropriate.
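The choice between the printable and encoded prefixes can be illustrated in Python. This sketch uses the standard URL-safe base64 alphabet as a stand-in for the "base-64" encoding CacheFiles actually uses, and the set of characters treated as printable is an assumption; object_filename is a hypothetical name.

```python
import base64
import string

PRINTABLE_PREFIX = {"index": "I", "data": "D", "special": "S"}
ENCODED_PREFIX = {"index": "J", "data": "E", "special": "T"}
# Assumption: which characters count as "suitably printable" for a filename.
SAFE_CHARS = set(string.ascii_letters + string.digits + "-_.")


def object_filename(obj_type: str, key: bytes) -> str:
    """Return a filename for the key, encoding it if it is not directly usable."""
    try:
        text = key.decode("ascii")
    except UnicodeDecodeError:
        text = None
    if text is not None and all(c in SAFE_CHARS for c in text):
        return PRINTABLE_PREFIX[obj_type] + text
    # stand-in encoding; never produces '/' or NUL
    return ENCODED_PREFIX[obj_type] + base64.urlsafe_b64encode(key).decode("ascii")
```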
Each object in the cache has an extended attribute label that holds the object
type ID (required to distinguish special objects) and the auxiliary data from
the netfs. The latter is used to detect stale objects in the cache and update
or retire them.
Note that CacheFiles will erase from the cache any file it doesn't recognise or
any file of an incorrect type (such as a FIFO file or a device file).
==========================
SECURITY MODEL AND SELINUX
==========================
CacheFiles is implemented to deal properly with the LSM security features of
the Linux kernel and the SELinux facility.
One of the problems that CacheFiles faces is that it is generally acting on
behalf of a process, and running in that process's context, and that includes a
security context that is not appropriate for accessing the cache - either
because the files in the cache are inaccessible to that process, or because if
the process creates a file in the cache, that file may be inaccessible to other
processes.
The way CacheFiles works is to temporarily change the security context (fsuid,
fsgid and actor security label) that the process acts as - without changing the
security context of the process when it is the target of an operation performed by
some other process (so signalling and suchlike still work correctly).
When the CacheFiles module is asked to bind to its cache, it:
(1) Finds the security label attached to the root cache directory and uses
that as the security label with which it will create files. By default,
this is:
cachefiles_var_t
(2) Finds the security label of the process which issued the bind request
(presumed to be the cachefilesd daemon), which by default will be:
cachefilesd_t
and asks LSM to supply a security ID as which it should act given the
daemon's label. By default, this will be:
cachefiles_kernel_t
SELinux transitions the daemon's security ID to the module's security ID
based on a rule of this form in the policy.
type_transition <daemon's-ID> kernel_t : process <module's-ID>;
For instance:
type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t;
The module's security ID gives it permission to create, move and remove files
and directories in the cache, to find and access directories and files in the
cache, to set and access extended attributes on cache objects, and to read and
write files in the cache.
The daemon's security ID gives it only a very restricted set of permissions: it
may scan directories, stat files and erase files and directories. It may
not read or write files in the cache, and so it is precluded from accessing the
data cached therein; nor is it permitted to create new files in the cache.
and later versions. In that tarball, see the files:
cachefilesd.te
cachefilesd.fc
cachefilesd.if
They are built and installed directly by the RPM.
If a non-RPM based system is being used, then copy the above files to their own
directory and run:
make -f /usr/share/selinux/devel/Makefile
semodule -i cachefilesd.pp
You will need checkpolicy and selinux-policy-devel installed prior to the
build.
By default, the cache is located in /var/fscache, but if it is desirable that
it should be elsewhere, then either the above policy files must be altered, or
an auxiliary policy must be installed to label the alternate location of the
cache.
For instructions on how to add an auxiliary policy to enable the cache to be
located elsewhere when SELinux is in enforcing mode, please see:
/usr/share/doc/cachefilesd-*/move-cache.txt
when the cachefilesd rpm is installed; alternatively, the document can be found
in the sources.
==================
A NOTE ON SECURITY
==================
CacheFiles makes use of the split security in the task_struct. It allocates
its own task_security structure, and redirects current->act_as to point to it
when it acts on behalf of another process, in that process's context.
The reason it does this is that it calls vfs_mkdir() and suchlike rather than
bypassing security and calling inode ops directly. Therefore the VFS and LSM
may deny the CacheFiles access to the cache data because under some
circumstances the caching code is running in the security context of whatever
process issued the original syscall on the netfs.
Furthermore, should CacheFiles create a file or directory, the security
parameters with which that object is created (UID, GID, security label) would be
derived from that process that issued the system call, thus potentially
preventing other processes from accessing the cache - including CacheFiles's
cache management daemon (cachefilesd).
What is required is to temporarily override the security of the process that
issued the system call. We can't, however, just do an in-place change of the
security data as that affects the process as an object, not just as a subject.
This means it may lose signals or ptrace events for example, and affects what
the process looks like in /proc.
So CacheFiles makes use of a logical split in the security between the
objective security (task->sec) and the subjective security (task->act_as). The
objective security holds the intrinsic security properties of a process and is
never overridden. This is what appears in /proc, and is what is used when a
process is the target of an operation by some other process (SIGKILL for
example).
The subjective security holds the active security properties of a process, and
may be overridden. This is not seen externally, and is used when a process
acts upon another object, for example SIGKILLing another process or opening a
file.
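The split can be modelled in a few lines of Python. This is a conceptual sketch only and not kernel code: the Task and Security classes and the override_subjective helper are hypothetical, with sec standing for task->sec and act_as for task->act_as.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass(frozen=True)
class Security:
    fsuid: int
    fsgid: int
    label: str


class Task:
    def __init__(self, sec: Security):
        self.sec = sec     # objective security: what others see, never overridden
        self.act_as = sec  # subjective security: used when acting on other objects


@contextmanager
def override_subjective(task: Task, cache_sec: Security):
    """Temporarily act with the cache's security, leaving task.sec untouched."""
    saved = task.act_as
    task.act_as = cache_sec
    try:
        yield task
    finally:
        task.act_as = saved  # always restore the caller's subjective security
```

Inside the with block the task creates cache files as cache_sec, while signals aimed at the task are still checked against its unchanged objective security.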
LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request
for CacheFiles to run in a context of a specific security label, or to create
files and directories with another security label.
This documentation is added by the patch to:
Documentation/filesystems/caching/cachefiles.txt
Signed-Off-By: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>