Since commit 4bdadb9785696439c6e2b3efe34aa76df1149c83 ("drm/i915:
Selectively enable self-reclaim"), we've been passing GFP_MOVABLE to the
i915 page allocator where we weren't before due to some over-eager
removal of the page mapping gfp_flags games the code used to play.
This caused hibernation on Intel hardware to produce a lot of memory
corruption on resume. See for example
http://bugzilla.kernel.org/show_bug.cgi?id=13811
Reported-by: Evengi Golov (in bugzilla) Signed-off-by: Dave Airlie <airlied@redhat.com> Tested-by: M. Vefa Bicakci <bicave@superonline.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Like the mlock() change previously, this makes the stack guard check
code use vma->vm_prev to see what the mapping below the current stack
is, rather than have to look it up with find_vma().
Also, accept an abutting stack segment, since that happens naturally if
you split the stack with mlock or mprotect.
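A minimal sketch of what the check becomes with vm_prev available (the
surrounding fault-path code is assumed):

	/* the mapping below the stack is one pointer dereference away */
	struct vm_area_struct *prev = vma->vm_prev;

	/* an abutting segment is fine only if it is itself a split-off stack */
	if (prev && prev->vm_end == address)
		return (prev->vm_flags & VM_GROWSDOWN) ? 0 : -ENOMEM;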
It's a really simple list, and several of the users want to go backwards
in it to find the previous vma. So rather than have to look up the
previous entry with 'find_vma_prev()' or something similar, just make it
doubly linked instead.
This patch changes dm_hash_remove_all() to release _hash_lock when
removing a device. After removing the device, dm_hash_remove_all()
takes _hash_lock and searches the hash from scratch again.
This patch is a preparation for the next patch, which changes device
deletion code to wait for md reference to be 0. Without this patch,
the wait in the next patch may cause AB-BA deadlock:
CPU0                                CPU1
------------------------------------------------------------------------
dm_hash_remove_all()
  down_write(_hash_lock)
                                    table_status()
                                      md = find_device()
                                        dm_get(md)
                                          <increment md->holders>
                                      dm_get_live_or_inactive_table()
                                        dm_get_inactive_table()
                                          down_write(_hash_lock)
  <in the md deletion code>
    <wait for md->holders to be 0>
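A rough sketch of the reworked loop (the helper names here are
illustrative, not the real ones; _hash_lock is the rwsem named above):

retry:
	down_write(&_hash_lock);
	hc = any_device_in_hash();	/* illustrative helper */
	if (hc) {
		up_write(&_hash_lock);	/* drop the lock before removal */
		remove_one_device(hc);	/* may wait on md->holders */
		goto retry;		/* rescan the hash from scratch */
	}
	up_write(&_hash_lock);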
Tested on a PXA310 platform with Samsung K9F2G08X0B NAND flash,
with tCH=5 and a 156MHz clock, ns2cycle(5, 156000000) returns -1.
A negative return value from ns2cycle() breaks the NDTR0_tXX macros.
After checking the commit log, I found the problem was introduced by
commit 5b0d4d7c8a67c5ba3d35e6ceb0c5530cc6846db7
"[MTD] [NAND] pxa3xx: convert from ns to clock ticks more accurately"
To get the number of clock cycles, we use the following equation:
num of clock cycles = time (ns) / one clock cycle (ns) + 1
We need to add 1 cycle here because integer division truncates the result.
The timing values may be the minimum values from the spec, so truncation
could violate them; adding an extra cycle is the safe choice.
The various fields in NDTR{01} are in units of clock ticks minus one,
so we should then subtract 1 cycle.
Thus the correct equation is:
num of clock cycles = time (ns) / one clock cycle (ns) + 1 - 1
                    = time (ns) / one clock cycle (ns)
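A sketch of the resulting conversion, assuming clk is in Hz and
following the driver's original helper shape:

static int ns2cycle(int ns, unsigned long clk)
{
	/* ns / (1e9 / clk) == ns * (clk / 1e6) / 1000; the round-up (+1)
	 * and the hardware's minus-one bias cancel out, so for example
	 * ns2cycle(5, 156000000) now yields 0 instead of -1.
	 */
	return (int)(ns * (clk / 1000000) / 1000);
}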
Signed-off-by: Axel Lin <axel.lin@gmail.com> Signed-off-by: Lei Wen <leiwen@marvell.com> Acked-by: Eric Miao <eric.y.miao@gmail.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Atheros PCIe wireless cards handled by ath5k do require L0s disabled.
For distributions shipping with CONFIG_PCIEASPM (this will be enabled
by default in the future in 2.6.36) this will also mean both L1 and L0s
will be disabled when a pre 1.1 PCIe device is detected. We do know L1
works correctly on all ath5k pre 1.1 PCIe devices, but we cannot
currently undo the effect of the blacklist; for details, read
pcie_aspm_sanity_check() and see how it adjusts the device link
capability.
It may be possible in the future to implement some PCI API to allow
drivers to override blacklists for pre 1.1 PCIe but for now it is
best to accept that both L0s and L1 will be disabled completely for
distributions shipping with CONFIG_PCIEASPM rather than having this
issue present. Motivation for adding this new API will be to help
with power consumption for some of these devices.
Example of issues you'd see:
- The Acer Aspire One (AOA150, Atheros Communications Inc. AR5001
Wireless Network Adapter [168c:001c] (rev 01)) doesn't work well
with ASPM enabled; the card will eventually stall on heavy traffic,
often with 'unsupported jumbo' warnings appearing. Disabling
ASPM L0s in ath5k fixes these problems.
- On the same card you would see a storm of RXORN interrupts
even though the medium is idle.
Credit for root causing and fixing the bug goes to Jussi Kivilinna.
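A minimal sketch of the driver-side workaround, using the standard
pci_disable_link_state() helper:

#include <linux/pci.h>
#include <linux/pci-aspm.h>

/* disable only L0s at attach time; L1 policy stays with the PCIe core */
static void ath5k_disable_aspm_l0s(struct pci_dev *pdev)
{
	pci_disable_link_state(pdev, PCIE_LINK_STATE_L0S);
}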
Cc: David Quan <David.Quan@atheros.com> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Tim Gardner <tim.gardner@canonical.com> Cc: Jussi Kivilinna <jussi.kivilinna@mbnet.fi> Signed-off-by: Luis R. Rodriguez <lrodriguez@atheros.com> Signed-off-by: Maxim Levitsky <maximlevitsky@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
It's not OK to call platform_device_add_resources() multiple times
in a row. Despite its name, this function sets the resources, it
doesn't add them. So we have to prepare an array with all the
resources, and then call platform_device_add_resources() once.
Before this fix, only the last I/O resource would be actually
registered. The other I/O resources were leaked.
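A sketch of the correct pattern (the resources shown are made up for
illustration):

#include <linux/ioport.h>
#include <linux/platform_device.h>

static struct resource dev_res[] = {	/* hypothetical I/O resources */
	{ .start = 0x0390, .end = 0x0393, .flags = IORESOURCE_IO },
	{ .start = 0x0400, .end = 0x0403, .flags = IORESOURCE_IO },
};

/* one call with the whole array; a second call would overwrite,
 * not extend, the set registered by the first one
 */
static int register_io(struct platform_device *pdev)
{
	return platform_device_add_resources(pdev, dev_res,
					     ARRAY_SIZE(dev_res));
}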
Signed-off-by: Jean Delvare <khali@linux-fr.org> Cc: Jim Cromie <jim.cromie@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Thanks a lot for all the review and comments so far;) I'd like to send
the improved (V4) version of this patch.
This patch fixes a deadlock in OCFS2 ACL. We found this bug in an
OCFS2 and Samba integration scenario; the symptom is that several smbd
processes hang under heavy workload. We finally found out that it is
the nested PR lock requests that lead to this deadlock:
node1                                    node2
gr PR
  |
  V
PR(EX) --------------------------------> BAST:OCFS2_LOCK_BLOCKED
  |
  V
rq PR
  |
  V
wait=1
After requesting the 2nd PR lock, the process "smbd" went into D
state. It can only be woken up when the 1st PR lock's RO holder equals
zero. There should be an ocfs2_inode_unlock in the calling path later
on, which can decrement the RO holder. But since it has been in
uninterruptible sleep, the unlock function has no chance to be called.
When testing cpu hotplug code on 32-bit we kept hitting the "CPU%d:
Stuck ??" message due to multiple cores concurrently accessing the
cpu_callin_mask, among others.
These codepaths are not protected from concurrent access; there is no
sane reason to make already complex code unnecessarily more complex,
since we hit the issue only when insanely switching cores off- and
online. So serialize core hotplugging at the sysfs level and be done
with it.
[ v2.1: fix !HOTPLUG_CPU build ]
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <20100819181029.GC17171@aftab> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When we need to take both dlm_domain_lock and dlm->spinlock, we should take
them in order of: dlm_domain_lock then dlm->spinlock.
There are paths that disobey this order: dlm_run_purge_list calls
dlm_lockres_put() with dlm->spinlock held. dlm_lockres_put() calls
dlm_put() on the last ref, and dlm_put() locks dlm_domain_lock.
Fix:
Don't grab/put the dlm when initialising/releasing a lockres.
That grab is not required because we don't call dlm_unregister_domain()
based on the refcount.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
In the following situation, there remains an incorrect bit in refmap on the
recovery master. Finally the recovery master will fail at purging the lockres
due to the incorrect bit in refmap.
1) node A has no interest on lockres A any longer, so it is purging it.
2) the owner of lockres A is node B, so node A is sending de-ref message
to node B.
3) at this time, node B crashed. node C becomes the recovery master. it recovers
lockres A(because the master is the dead node B).
4) node A migrated lockres A to node C with a refbit there.
5) node A failed to send the deref message to node B because node B had
crashed. The failure is ignored; no other action is taken for lockres A
any more.
Normally, re-sending the deref message to the recovery master would fix
this. But ignoring the failed deref to the original master and not
recovering the lockres to the recovery master has the same effect, and
the latter is simpler.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Acked-by: Srinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The refcount record calculation in ocfs2_calc_refcount_meta_credits
is too optimistic: it assumes we can always allocate contiguous
clusters and handle an already existing refcount rec as a whole.
Actually, because of file system fragmentation, we may have to split
a refcount record into 3 parts during the transaction. So consider
the worst case in the record calculation.
Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch fixes two problems in dlm_run_purgelist
1. If a lockres is found to be in use, dlm_run_purgelist keeps trying to purge
the same lockres instead of trying the next lockres.
2. When a lockres is found unused, dlm_run_purgelist releases the lockres
spinlock before setting DLM_LOCK_RES_DROPPING_REF and calls dlm_purge_lockres.
The spinlock is reacquired, but in this window the lockres can get reused.
This leads to a BUG.
This patch modifies dlm_run_purgelist to skip lockres if it's in use and purge
next lockres. It also sets DLM_LOCK_RES_DROPPING_REF before releasing the
lockres spinlock protecting it from getting reused.
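A condensed sketch of the corrected ordering (names follow the dlm
code):

spin_lock(&res->spinlock);
if (!__dlm_lockres_unused(res)) {
	spin_unlock(&res->spinlock);
	/* in use: skip this lockres and move on to the next one */
} else {
	/* mark it before dropping the lock so it cannot be reused */
	res->state |= DLM_LOCK_RES_DROPPING_REF;
	spin_unlock(&res->spinlock);
	dlm_purge_lockres(dlm, res);
}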
When we have to take both dlm->master_lock and lockres->spinlock,
take them in order
lockres->spinlock and then dlm->master_lock.
This patch fixes a violation of that rule.
We can simply move the taking of dlm->master_lock to where we have dropped
res->spinlock, since we don't need master_lock's protection while accessing
res->state and freeing the mle memory.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Setting the acl while creating a new inode depends on
the error codes of posix_acl_create_masq. This patch fixes
an issue where those error codes were overwritten.
Reported-by: Pawel Zawora <pzawora@gmail.com> Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
I discovered tonight that ALSA no longer sets up a stream for the second ADC
provided by the Realtek ALC260 HDA codec. At some point alc_build_pcms()
started using stream_analog_alt_capture when constructing the second ADC
stream, but patch_alc260() was never updated accordingly. I have no idea
when this regression occurred. The trivial patch to patch_alc260() given
below fixes the problem as far as I can tell. The patch is against 2.6.35.
With some hardware combinations, the PCM interrupts are acknowledged
before the period boundary from the emu10k1 chip. The midlevel PCM code
gets confused and the playback stream is interrupted.
It seems that shifting interrupt processing by 2 samples is enough
to fix this issue. This default value does not harm other,
non-affected hardware.
The detection and loading of firmware in the riptide driver has been
broken by a rewrite of some code that checks for its presence incorrectly.
This patch fixes the logic again.
mspro_block_remove() is called from the detect thread, which first calls
mspro_block_stop() to stop the request queue. If we call
del_gendisk() with the queue stopped we get a deadlock.
Signed-off-by: Maxim Levitsky <maximlevitsky@gmail.com> Cc: Alex Dubov <oakad@yahoo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This commit makes the stack guard page somewhat less visible to user
space. It does this by:
- not showing the guard page in /proc/<pid>/maps
It looks like lvm-tools will actually read /proc/self/maps to figure
out where all its mappings are, and effectively do a specialized
"mlockall()" in user space. By not showing the guard page as part of
the mapping (by just adding PAGE_SIZE to the start for grows-up
pages), lvm-tools ends up not being aware of it.
- by also teaching the _real_ mlock() functionality not to try to lock
the guard page.
That would just expand the mapping down to create a new guard page,
so there really is no point in trying to lock it in place.
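A sketch of the /proc side of this (the show_map_vma() adjustment
described in the first point):

	/* we don't show the stack guard page in /proc/maps */
	start = vma->vm_start;
	if (vma->vm_flags & VM_GROWSDOWN)
		start += PAGE_SIZE;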
It would perhaps be nice to show the guard page specially in
/proc/<pid>/maps (or at least mark grow-down segments some way), but
let's not open ourselves up to more breakage by user space from programs
that depend on the exact details of the 'maps' file.
Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools
source code to see what was going on with the whole new warning.
Reported-and-tested-by: François Valenduc <francois.valenduc@tvcablenet.be> Reported-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We do in fact need to unmap the page table _before_ doing the whole
stack guard page logic, because if it is needed (mainly 32-bit x86 with
PAE and CONFIG_HIGHPTE, but other architectures may use it too) then it
will do a kmap_atomic/kunmap_atomic.
And those kmaps will create an atomic region that we cannot do
allocations in. However, the whole stack expand code will need to do
anon_vma_prepare() and vma_lock_anon_vma() and they cannot do that in an
atomic region.
Now, a better model might actually be to do the anon_vma_prepare() when
_creating_ a VM_GROWSDOWN segment, and not have to worry about any of
this at page fault time. But in the meantime, this is the
straightforward fix for the issue.
See https://bugzilla.kernel.org/show_bug.cgi?id=16588 for details.
Reported-by: Wylda <wylda@volny.cz> Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Reported-by: Mike Pagano <mpagano@gentoo.org> Reported-by: François Valenduc <francois.valenduc@tvcablenet.be> Tested-by: Ed Tomlinson <edt@aei.ca> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
It's wrong for several reasons, but the most direct one is that the
fault may be for the stack accesses to set up a previous SIGBUS. When
we have a kernel exception, the kernel exception handler does all the
fixups, not some user-level signal handler.
Even apart from the nested SIGBUS issue, it's also wrong to give out
kernel fault addresses in the signal handler info block, or to send a
SIGBUS when a system call already returns EFAULT.
.. which didn't show up in my tests because it's a no-op on x86-64 and
most other architectures. But we enter the function with the last-level
page table mapped, and should unmap it at exit.
This is a rather minimally invasive patch to solve the problem of the
user stack growing into a memory mapped area below it. Whenever we fill
the first page of the stack segment, expand the segment down by one
page.
Now, admittedly some odd application might _want_ the stack to grow down
into the preceding memory mapping, and so we may at some point need to
make this a process tunable (some people might also want to have more
than a single page of guarding), but let's try the minimal approach
first.
Tested with trivial application that maps a single page just below the
stack, and then starts recursing. Without this, we will get a SIGSEGV
_after_ the stack has smashed the mapping. With this patch, we'll get a
nice SIGBUS just as the stack touches the page just above the mapping.
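A sketch of the fault-path hook, assuming helpers of roughly this
shape:

/* grows-down segment, first page touched: expand down by one page */
static inline int check_stack_guard_page(struct vm_area_struct *vma,
					 unsigned long address)
{
	address &= PAGE_MASK;
	if ((vma->vm_flags & VM_GROWSDOWN) && address == vma->vm_start)
		return expand_stack(vma, address - PAGE_SIZE);
	return 0;
}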
Since 2.6.31, swap_map[]'s refcounting was changed to indicate that a
used swap entry may be held only by the swap-cache, and can then be
reused. While scanning for a free entry in swap_map[], such a swap entry
could be reclaimed and reused. This was introduced by commit
c9e444103b5e7a5 ("mm: reuse unused swap entry if necessary").
But this caused data corruption at resume. The scenario is:
- Assume a clean-swap cache, but mapped.
- at hibernation_snapshot[], clean-swap-cache is saved as
clean-swap-cache and swap_map[] is marked as SWAP_HAS_CACHE.
- then, save_image() is called. It reuses the SWAP_HAS_CACHE entry to
save the image, breaking its contents.
After resume:
- the memory reclaim runs and finds the clean, not-referenced swap-cache
and discards it because it's marked as clean. But here, the contents on
disk and in the swap-cache are inconsistent.
Hence memory is corrupted.
This patch avoids the bug by not reclaiming swap-entry during hibernation.
This is a quick fix for backporting.
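A sketch of the guard, assuming the system_entering_hibernation()
helper, applied where a cache-only swap entry would otherwise be
reclaimed for reuse:

/* reuse a swap-cache-only entry, but never while hibernating: the
 * image being written to swap must not be overwritten
 */
if (vm_swap_full() && !system_entering_hibernation() &&
    si->swap_map[offset] == SWAP_HAS_CACHE) {
	/* reclaim the entry and hand it out again */
}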
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Reported-by: Ondreg Zary <linux@rainbow-software.org> Tested-by: Ondreg Zary <linux@rainbow-software.org> Tested-by: Andrea Gelmini <andrea.gelmini@gmail.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When a raid1 array is configured to support write-behind
on some devices, it normally only reads from other devices.
If all devices are write-behind (because the rest have failed)
it is possible for a read request to be serviced before a
behind-write request, which would appear as data corruption.
So when forced to read from a WriteMostly device, wait for any
write-behind to complete, and don't start any more behind-writes.
If a command times out resulting in EH getting invoked, we wait for the
aborted commands to come back after sending the abort. Shorten
the amount of time we wait for these responses, to ensure we don't
get stuck in EH for several minutes.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commands which are completed by the VIOS are placed on a CRQ
in kernel memory for the ibmvfc driver to process. Each CRQ
entry is 16 bytes. The ibmvfc driver reads the first 8 bytes
to check if the entry is valid, then reads the next 8 bytes to get
the handle, which is a pointer to the completed command. This fixes
an issue seen on Power 7 where the processor reordered the
loads from memory, resulting in processing command completion
with a stale handle. This could result in command timeouts,
and also early completion of commands.
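A sketch of the fix (the CRQ field names here are assumed from the
layout described above):

if (crq->valid & 0x80) {
	rmb();			/* order the valid-byte load before... */
	handle = crq->ioba;	/* ...the handle load, so it isn't stale */
}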
Signed-off-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When removing several devices aic79xx will occasionally Oops
in ahd_handle_nonpkt_busfree during rescan. Looking at the
code I found that we're indeed not checking if the scb in
question is NULL. So check for it before accessing it.
Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
ocfs2_lock() will skip locks on files whose mode is set to 02666. This
is a problem in cases where the mode of the file is changed after a
process has obtained a lock on the file.
ocfs2_lock() should skip the check for mandatory locks when unlocking a
file.
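A sketch of the change in ocfs2_lock(), using the VFS
__mandatory_lock() helper:

if (!(fl->fl_flags & FL_POSIX))
	return -ENOLCK;
/* only enforce the mandatory-lock check when actually setting a lock */
if (__mandatory_lock(inode) && fl->fl_type != F_UNLCK)
	return -ENOLCK;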
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We have to set MS_POSIXACL on remount as well. Otherwise the VFS
would not know we started supporting ACLs after the remount and
thus ACLs would not work.
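A sketch of the remount fix (the option flag name follows ocfs2's
mount-option convention):

/* re-derive MS_POSIXACL so ACLs enabled by remount take effect */
sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
	      ((osb->s_mount_opt & OCFS2_MOUNT_POSIX_ACL) ?
	       MS_POSIXACL : 0);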
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
ocfs2 refcount tree is stored as an extent tree while
the leaf ocfs2_refcount_rec points to a refcount block.
The following steps can trip a kernel panic.
mkfs.ocfs2 -b 512 -C 1M --fs-features=refcount $DEVICE
mount -t ocfs2 $DEVICE $MNT_DIR
FILE_NAME=$RANDOM
FILE_NAME_1=$RANDOM
FILE_REF="${FILE_NAME}_ref"
FILE_REF_1="${FILE_NAME}_ref_1"
for((i=0;i<305;i++))
do
# /mnt/1048576 is a file of 1048576 bytes.
cat /mnt/1048576 >> $MNT_DIR/$FILE_NAME
cat /mnt/1048576 >> $MNT_DIR/$FILE_NAME_1
done
for((i=0;i<3;i++))
do
cat /mnt/1048576 >> $MNT_DIR/$FILE_NAME
done
for((i=0;i<11;i++))
do
cat /mnt/1048576 >> $MNT_DIR/$FILE_NAME
cat /mnt/1048576 >> $MNT_DIR/$FILE_NAME_1
done
reflink $MNT_DIR/$FILE_NAME $MNT_DIR/$FILE_REF
# write_f is a program which will write some bytes to a file at offset.
# write_f -f file_name -l offset -w write_bytes.
./write_f -f $MNT_DIR/$FILE_REF -l $[310*1048576] -w 4096
./write_f -f $MNT_DIR/$FILE_REF -l $[306*1048576] -w 4096
./write_f -f $MNT_DIR/$FILE_REF -l $[311*1048576] -w 4096
./write_f -f $MNT_DIR/$FILE_NAME -l $[310*1048576] -w 4096
./write_f -f $MNT_DIR/$FILE_NAME -l $[311*1048576] -w 4096
reflink $MNT_DIR/$FILE_NAME $MNT_DIR/$FILE_REF_1
./write_f -f $MNT_DIR/$FILE_NAME -l $[311*1048576] -w 4096
#kernel panic here.
The reason is that if the ocfs2_extent_rec is the last record
in a leaf extent block, the old code fails to find the
suitable end cpos. So this patch walks through the b-tree,
finds the next sub root and gets the cpos where the next sub-tree
starts.
By the way, I have run Tristan's test case against the patched kernel
for several days and this type of kernel panic has not happened again.
Signed-off-by: Tao Ma <tao.ma@oracle.com> Signed-off-by: Joel Becker <joel.becker@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When the lock master processes a successful operation (request,
convert, cancel, or unlock), it will process the effects of the
change before sending the reply for the operation. The "effects"
of the operation are:
- blocking callbacks (basts) for any newly granted locks
- waiting or converting locks that can now be granted
The cast is queued on the local node when the reply from the lock
master is received. This means that a lock holder can receive a
bast for a lock mode that it doesn't yet know has been granted.
Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When both blocking and completion callbacks are queued for a lock,
the dlm would always deliver the completion callback (cast) first.
In some cases the blocking callback (bast) is queued before the
cast, though, and should be delivered first. This patch keeps
track of the order in which they were queued and delivers them
in that order.
This patch also keeps track of the granted mode in the last cast
and eliminates the following bast if the bast mode is compatible
with the preceding cast mode. This happens when a remotely mastered
lock is demoted, e.g. EX->NL, in which case the local node queues
a cast immediately after sending the demote message. In this way
a cast can be queued for a mode, e.g. NL, that makes an in-transit
bast extraneous.
Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Replace all GFP_KERNEL and ls_allocation with GFP_NOFS.
ls_allocation would be GFP_KERNEL for userland lockspaces
and GFP_NOFS for file system lockspaces.
It was discovered that any lockspace on the system can
affect all others by triggering memory reclaim in the
file system which could in turn call back into the dlm
to acquire locks, deadlocking dlm threads that were
shared by all lockspaces, like dlm_recv.
Signed-off-by: David Teigland <teigland@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Commit 57fe60df ("reiserfs: add atomic addition of selinux attributes
during inode creation") contains a bug that will cause it to oops when
mounting a file system that didn't previously contain extended attributes
on a system using security.* xattrs.
The issue is that while creating the privroot during mount
reiserfs_security_init calls reiserfs_xattr_jcreate_nblocks which
dereferences the xattr root. The xattr root doesn't exist, so we get an
oops.
The reiserfs journal behaves inconsistently when determining whether to
allow a mount of a read-only device.
This is due to the use of the continue_replay variable to short circuit
the journal scanning. If it's set, it's assumed that there are
transactions to replay, but there may not be. If it's unset, it's assumed
that there aren't any, and that may not be the case either.
I've observed two failure cases:
1) Where a clean file system on a read-only device refuses to mount
2) Where a clean file system on a read-only device passes the
optimization and then tries writing the journal header to update
the latest mount id.
The former is easily observable by using a freshly created file system on
a read-only loopback device.
This patch moves the check into journal_read_transaction, where it can
bail out before it's about to replay a transaction. That way it can go
through and skip transactions where appropriate, yet still refuse to mount
a file system with outstanding transactions.
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We have 2 mount options, "barrier" and "auto_da_alloc" which may or
may not take a 1/0 argument. This causes the ext4 superblock mount
code to subtract uninitialized pointers and pass the result to
kmalloc, which results in very noisy failures.
Per Ted's suggestion, initialize the args struct so that
we know whether match_token() found an argument for the
option, and skip match_int() if not.
Also, return error (0) from parse_options if we thought
we found an argument, but match_int() fails.
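A sketch of the hardened parsing, using the standard
match_token()/match_int() helpers:

args[0].to = args[0].from = NULL;	/* so "no argument" is detectable */
token = match_token(p, tokens, args);

switch (token) {
case Opt_barrier:
	if (args[0].from) {		/* an argument was matched */
		if (match_int(&args[0], &option))
			return 0;	/* bad argument: fail the mount */
	} else {
		option = 1;		/* bare "barrier" means enabled */
	}
	break;
}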
Reported-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Dan Rosenberg has reported a problem with the MOVE_EXT ioctl. If the
donor file is an append-only file, we should not allow the operation
to proceed, lest we end up overwriting the contents of an append-only
file.
Earlier, Ingo Molnar posted a patch to make it so that the kernel would avoid
reading _PPC on his broken T60. Unfortunately, it seems that with Thomas
Renninger's patch last July to eliminate _PPC evaluations when the processor
driver loads, the kernel never actually reads _PPC at all! This is problematic
if you happen to boot your non-T60 computer in a state where the BIOS _wants_
_PPC to be something other than zero.
So, put the _PPC evaluation back into acpi_processor_get_performance_info if
ignore_ppc isn't 1.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Len Brown <len.brown@intel.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
During EEH recovery, the pci_dev structure can be null, mainly if an
EEH event is detected during a PCI config operation. In this case, the
pci_dev will not be known (and will be null) and the kernel will crash
with the following message:
Unable to handle kernel paging request for data at address 0x000000a0
Faulting instruction address: 0xc00000000006b8b4
Oops: Kernel access of bad area, sig: 11 [#1]
It turns out that the system has three io apic controllers, but
boot ioapic routing is in the second one, and that gsi_base is
not 0 - it is using a bunch of INT_SRC_OVR...
So these recent changes:
1. only set routing for the first io apic controller
2. assume irq == gsi
... will break that system.
So try to remap those gsis: we need to separate the boot_ioapic_idx
detection out of enable_IO_APIC() and call it early.
So introduce boot_ioapic_idx, and remap_ioapic_gsi()...
-v2: shift gsi with delta instead of gsi_base of boot_ioapic_idx
-v3: double check with find_isa_irq_apic(0, mp_INT) to get right
boot_ioapic_idx
-v4: nr_legacy_irqs
-v5: add print out for boot_ioapic_idx, and also make it could be
applied for current kernel and previous kernel
-v6: add bus_irq in acpi_sci_ioapic_setup, so we can get the override
for the sci's right mapping...
-v7: looks like pnpacpi gets irq instead of gsi, so we need to revert
them back...
-v8: split into two patches
-v9: according to Eric, use fixed 16 for shifting instead of remap
-v10: still need to touch rsparser.c
-v11: just revert back to way Eric suggest...
anyway the ioapic in first ioapic is blocked by second...
-v12: two patches, this one will add more loop but check apic_id and irq > 16
Reported-by: Iranna D Ankad <iranna.ankad@in.ibm.com> Bisected-by: Iranna D Ankad <iranna.ankad@in.ibm.com> Tested-by: Gary Hade <garyhade@us.ibm.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Thomas Renninger <trenn@suse.de> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: len.brown@intel.com
LKML-Reference: <4B8A321A.1000008@kernel.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When loading the aesni-intel and ghash_clmulni-intel drivers, the kernel
complains that there is no test for some internally used algorithms.
The messages look like the following:
alg: No test for __aes-aesni (__driver-aes-aesni)
alg: No test for __ecb-aes-aesni (__driver-ecb-aes-aesni)
alg: No test for __cbc-aes-aesni (__driver-cbc-aes-aesni)
alg: No test for __ecb-aes-aesni (cryptd(__driver-ecb-aes-aesni))
alg: No test for __ghash (__ghash-pclmulqdqni)
alg: No test for __ghash (cryptd(__ghash-pclmulqdqni))
This patch adds null test entries for these algorithms and drivers.
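A sketch of the added entries, following the alg_test_descs[]
convention in crypto/testmgr.c (alg_test_null is the existing no-op
test; the names mirror the warnings above):

{
	.alg = "__driver-cbc-aes-aesni",
	.test = alg_test_null,
}, {
	.alg = "__driver-ecb-aes-aesni",
	.test = alg_test_null,
}, {
	.alg = "__ghash-pclmulqdqni",
	.test = alg_test_null,
},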
Signed-off-by: Song Youquan <youquan.song@intel.com> Signed-off-by: Hang Ying <ying.huang@intel.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
It's possible that the SBA IOMMU might fail to find I/O space under heavy
I/O. The SBA IOMMU panics on allocation failure but it shouldn't; drivers
can handle the failure. The majority of other IOMMU drivers don't panic
on allocation failure.
This patch fixes SBA IOMMU path to handle allocation failure properly.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Acked-by: Leonardo Chiquitto <lchiquitto@novell.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Due to recent load-balancer changes that delay the task migration to
the next wakeup, the adaptive mutex spinning ends up in a live lock
when the owner's CPU gets offlined because the cpu_online() check
lives before the owner running check.
This patch changes mutex_spin_on_owner() to return 0 (don't spin) in
any case where we aren't sure about the owner struct validity or CPU
number, or if the said CPU is offline. There is no point going back and
re-evaluating spinning in corner cases like that; let's just go to
sleep.
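A sketch of the bail-out checks in mutex_spin_on_owner() (surrounding
code assumed):

/* if the owner's cpu number can't be trusted, just go to sleep */
if (cpu >= nr_cpumask_bits)
	return 0;
/* ... and likewise if that cpu has gone offline */
if (!cpu_online(cpu))
	return 0;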
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1271212509.13059.135.camel@pasglop> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This is a real fix for problem of utime/stime values decreasing
described in the thread:
http://lkml.org/lkml/2009/11/3/522
Now cputime is accounted in the following way:
- {u,s}time in task_struct are increased every time when the thread
is interrupted by a tick (timer interrupt).
- When a thread exits, its {u,s}time are added to signal->{u,s}time,
after adjusted by task_times().
- When all threads in a thread_group exits, accumulated {u,s}time
(and also c{u,s}time) in signal struct are added to c{u,s}time
in signal struct of the group's parent.
So {u,s}time in task struct are "raw" tick count, while
{u,s}time and c{u,s}time in signal struct are "adjusted" values.
And accounted values are used by:
- task_times(), to get cputime of a thread:
This function returns adjusted values that originates from raw
{u,s}time and scaled by sum_exec_runtime that accounted by CFS.
- thread_group_cputime(), to get cputime of a thread group:
This function returns sum of all {u,s}time of living threads in
the group, plus {u,s}time in the signal struct that is sum of
adjusted cputimes of all exited threads belonged to the group.
The problem is the return value of thread_group_cputime(),
because it is a mixed sum of "raw" and "adjusted" values.
This misbehavior can break {u,s}time monotonicity.
Assume there is a thread whose raw values are greater
than its adjusted values (e.g. interrupted by 1000Hz ticks 50 times
but only run for 45ms); if it exits, cputime will decrease (e.g.
by 5ms).
But task_times() contains hard divisions, so applying it for
every thread should be avoided.
This patch fixes the above problem in the following way:
- Modify thread's exit (= __exit_signal()) not to use task_times().
It means {u,s}time in signal struct accumulates raw values instead
of adjusted values. As a result, thread_group_cputime()
returns the pure sum of "raw" values.
- Introduce a new function thread_group_times(*task, *utime, *stime)
that converts "raw" values of thread_group_cputime() to "adjusted"
values, in same calculation procedure as task_times().
- Modify group's exit (= wait_task_zombie()) to use this introduced
thread_group_times(). It makes c{u,s}time in the signal struct
have adjusted values as before this patch.
- Replace some thread_group_cputime() by thread_group_times().
These replacements are only applied where we convey the "adjusted"
cputime to users, and where task_times() is already used nearby.
(i.e. sys_times(), getrusage(), and /proc/<PID>/stat.)
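A sketch of the introduced helper, following the description above
(the prev_utime/prev_stime fields in signal_struct come from the v2
note below):

void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
{
	struct signal_struct *sig = p->signal;
	struct task_cputime cputime;
	cputime_t rtime, utime, total;

	thread_group_cputime(p, &cputime);	/* raw {u,s}time sums */

	total = cputime_add(cputime.utime, cputime.stime);
	rtime = nsecs_to_cputime(cputime.sum_exec_runtime);

	if (total) {
		u64 temp = (u64)(rtime * cputime.utime);

		do_div(temp, total);		/* scale utime by runtime */
		utime = (cputime_t)temp;
	} else {
		utime = rtime;
	}

	/* keep the reported values monotonic */
	sig->prev_utime = max(sig->prev_utime, utime);
	sig->prev_stime = max(sig->prev_stime,
			      cputime_sub(rtime, sig->prev_utime));

	*ut = sig->prev_utime;
	*st = sig->prev_stime;
}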
This patch has a positive side effect:
- Before this patch, if a group contains many short-lived threads
(e.g. each runs 0.9ms and is not interrupted by ticks), the group's
cputime could be invisible, since each thread's cputime was accumulated
after adjustment: imagine the adjustment function as adj(ticks, runtime),
{adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
After this patch it will not happen, because the adjustment is
applied after accumulation.
v2:
- remove if()s, put new variables into signal_struct.
It only changed the type of return value, but not the
implementation. As the result the granularity of task_s/utime()
is still that of clock_t, not that of cputime_t.
So using task_s/utime() in __exit_signal() makes the values
accumulated into the signal struct rounded and coarse
grained.
This patch removes casts to clock_t in task_u/stime(), to keep
granularity of cputime_t over the calculation.
v2:
Use div_u64() to avoid error "undefined reference to `__udivdi3`"
on some 32bit systems.
Since commit 0a544198 "timekeeping: Move NTP adjusted clock multiplier
to struct timekeeper" the clock multiplier of vsyscall is updated with
the unmodified clock multiplier of the clock source and not with the
NTP adjusted multiplier of the timekeeper.
Add a new argument "mult" to update_vsyscall() and hand in the
timekeeping internal NTP adjusted multiplier.
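A sketch of the widened hook and its timekeeping call site, with the
signatures as described:

/* arch code now receives the NTP-adjusted multiplier directly */
void update_vsyscall(struct timespec *wall_time, struct clocksource *clock,
		     u32 mult);

/* timekeeping core: */
update_vsyscall(&xtime, timekeeper.clock, timekeeper.mult);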
Signed-off-by: Lin Ming <ming.m.lin@intel.com> Cc: "Zhang Yanmin" <yanmin_zhang@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Tony Luck <tony.luck@intel.com>
LKML-Reference: <1258436990.17765.83.camel@minggr.sh.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Kurt Garloff <garloff@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
On a system with NOHZ=y, tick_check_idle calls tick_nohz_stop_idle and
tick_nohz_update_jiffies. Given the right conditions (ts->idle_active
and/or ts->tick_stopped) both functions get a time stamp with ktime_get.
The same time stamp can be reused if both functions require one.
On s390 this change has the additional benefit that gcc inlines the
tick_nohz_stop_idle function into tick_check_idle. The number of
instructions to execute tick_check_idle drops from 225 to 144
(without the ktime_get optimization it is 367 vs 215 instructions).
before:
0) | tick_check_idle() {
0) | tick_nohz_stop_idle() {
0) | ktime_get() {
0) | read_tod_clock() {
0) 0.601 us | }
0) 1.765 us | }
0) 3.047 us | }
0) | ktime_get() {
0) | read_tod_clock() {
0) 0.570 us | }
0) 1.727 us | }
0) | tick_do_update_jiffies64() {
0) 0.609 us | }
0) 8.055 us | }
after:
0) | tick_check_idle() {
0) | ktime_get() {
0) | read_tod_clock() {
0) 0.617 us | }
0) 1.773 us | }
0) | tick_do_update_jiffies64() {
0) 0.593 us | }
0) 4.477 us | }
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: john stultz <johnstul@us.ibm.com>
LKML-Reference: <20090929122533.206589318@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Jolly <jjolly@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Allow the architecture to request a normal jiffy tick when the system
goes idle and tick_nohz_stop_sched_tick is called. On s390 the hook is
used to prevent the system from going fully idle if there has been an
interrupt other than a clock comparator interrupt since the last wakeup.
On s390 the HiperSockets response time for 1 connection ping-pong goes
down from 42 to 34 microseconds. The CPU cost decreases by 27%.
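A sketch of the hook wiring (generic fallback assumed; s390 supplies
its own implementation):

/* default: the architecture has no objection to stopping the tick */
#ifndef arch_needs_cpu
#define arch_needs_cpu(cpu)	(0)
#endif

/* in tick_nohz_stop_sched_tick(): a non-zero return keeps a normal
 * jiffy tick programmed instead of stopping the tick entirely
 */
if (rcu_needs_cpu(cpu) || printk_needs_cpu(cpu) || arch_needs_cpu(cpu)) {
	next_jiffies = last_jiffies + 1;
	delta_jiffies = 1;
}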
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
LKML-Reference: <20090929122533.402715150@de.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Jolly <jjolly@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
My test does the following: fallocate a big file, then write to it. The
file is 512M, but after the file write is done btrfs-debug-tree shows:
item 6 key (257 EXTENT_DATA 0) itemoff 3516 itemsize 53
extent data disk byte 1103101952 nr 536870912
extent data offset 0 nr 399634432 ram 536870912
extent compression 0
This looks like a regression introduced by 6c7d54ac87f338c479d9729e8392eca3f76e11e1, where we set the wrong slot.
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Acked-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
With a random-write fio job you would see long stalls where no work was
being done. That is because we were doing all this extra work to read in
the file extent outside of the transaction; in the random io case this
ends up hurting us because the file extents are not there to begin with.
So axe this logic, since we end up reading in the file extent when we go
to update it anyway. This took the fio job from 11 mb/s with several ~10
second stalls to 24 mb/s with a couple of 1-2 second stalls.
Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Mounting a bad filesystem caused a BUG_ON(). The following is steps to
reproduce it.
# mkfs.btrfs /dev/sda2
# mount /dev/sda2 /mnt
# mkfs.btrfs /dev/sda1 /dev/sda2
(the program says that /dev/sda2 was mounted, and then exits.)
# umount /mnt
# mount /dev/sda1 /mnt
At the third step, mkfs.btrfs exited in the middle of making the
filesystem, so the initialization of the filesystem didn't finish and the
filesystem was bad. This caused a BUG_ON() when mounting it. But a
BUG_ON() should be triggered by buggy kernel code, not by a user's
operation, so I think it is a bug of btrfs.
This patch fixes it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If you have a disk failure in RAID1 and then add a new disk to the
array, and then try to remove the missing volume, it will fail. The
reason is the sanity check only looks at the total number of rw devices,
which is just 2 because we have 2 good disks and 1 bad one. Instead
check the total number of devices in the array to make sure we can
actually remove the device. Tested this with a failed disk setup and
with this test we can now run
btrfs-vol -r missing /mount/point
and it works fine.
Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Hit this problem while testing RAID1 failure stuff: open_bdev_exclusive
returns ERR_PTR(), not NULL. So check the return value properly. This
is important: if you accidentally specify a device that doesn't exist
when trying to add a new device to an array, you will panic the box
dereferencing bdev.
Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
If a RAID setup has chunks that span multiple disks, and one of those
disks has failed, btrfs_chunk_readonly will return 1 since one of the
disks in that chunk's stripes is dead and therefore not writeable. So
instead if we are in degraded mode, return 0 so we can go ahead and
allocate stuff. Without this patch all of the block groups in a RAID1
setup will end up read-only, which will mean we can't add new disks to
the array since we won't be able to make allocations.
Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
My original fix introduced a problem: we could run orphan cleanup on a
volume that can have orphan entries re-added. Instead of that original
fix, Yan Zheng pointed out that we can just revert it and
then run the orphan cleanup in open_ctree after we look up the fs_root.
I have tested this with all the tests that gave me problems and this
patch fixes both problems. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We can race with the unmount of an fs and the stopping of a kthread where we
will free the block group before we're done using it. The reason for this is
because we do not hold a reference on the block group while it's caching, since
the allocator drops its reference once it exits or moves on to the next block
group. This patch fixes the problem by taking a reference to the block group
before we start caching and dropping it when we're done to make sure all
accesses to the block group are safe. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
It is legal for btrfs_set_acl to be sent a NULL acl. This
makes sure we don't dereference it. A similar patch was sent by
Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
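A sketch of the NULL-safe path (the helper calls follow the posix acl
xattr API; the surrounding function is assumed):

size_t size = 0;
char *value = NULL;
int ret = 0;

if (acl) {	/* a NULL acl simply removes the xattr */
	size = posix_acl_xattr_size(acl->a_count);
	value = kmalloc(size, GFP_NOFS);
	if (!value)
		return -ENOMEM;
	ret = posix_acl_to_xattr(acl, value, size);
}
if (ret >= 0)
	ret = __btrfs_setxattr(inode, name, value, size, 0);
kfree(value);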
Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Currently orphan cleanup only ever gets triggered if we cross subvolumes
during a lookup, which means that if we just mount a plain-jane fs that
has orphans in it, they will never get cleaned up. This results in panics
where adding an orphan entry returns -EEXIST and we panic. In order to
fix this, we check on lookup whether our root has had the orphan cleanup
done, and if not, go ahead and do it. This is easily reproducible by
running the testcase below
/* opening reconstructed; the original excerpt began mid-program */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
int main(void)
{
	char newdata[4096] = { 0 };
	int i, fd2;
	for (;;) {	/* repeat until the power is pulled */
		fd2 = open("file2", O_CREAT | O_WRONLY, 0644);
		for (i = 0; i < 512; i++)
			write(fd2, newdata, 4096);
		close(fd2);
		i = rename("file2", "file1");
		unlink("file1");
	}
	return 0;
}
and then pulling the power on the box, and then trying to run that test again
when the box comes back up. I've tested this locally and it fixes the problem.
Thanks to Tomas Carnecky for helping me track this down initially.
Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Fix a bug reported by Johannes Hirte. The reason for the bug
is that btrfs_del_items is called after btrfs_duplicate_item, and
btrfs_del_items triggers a tree balance. The fix is to check for that
case and call btrfs_search_slot when needed.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Some callers of btrfs_ordered_update_i_size can now pass in
a NULL for the ordered extent to update against. This makes
sure we properly align the offset they pass in when deciding
how much to bump the on disk i_size.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
When one does a 32-bit readdir(3), the last entry of a directory is
missing. This is however not due to passing a large value to filldir,
but it seems to have to do with glibc doing telldir or something
quirky. In any case, this patch fixes it in practice.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The recent patch to make fallocate enospc friendly would send
down a NULL trans handle to the allocator. This moves the
transaction start to properly fix things.
Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch makes us a bit less zealous about making sure we have enough
free metadata space by paring down the size of new metadata chunks to
256mb instead of 1gb. Also, we used to try to allocate metadata chunks
when allocating data, but that sort of thing is done elsewhere now so we
can just remove it. With my
-ENOSPC test I used to have 3gb reserved for metadata out of 75gb, now I have
1.7gb. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Christoph's patch e244a0aeb6a599c19a7c802cda6e2d67c847b154 doesn't display
the discard option in /proc/mounts, leading to some confusion for me.
Here's the missing bit.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
I rebased Christian Parpart's patch to deny hard links across
subvolumes. The original patch also modifies btrfs_rename, but
I excluded that part because we can now move across subvolumes
and it causes no problem.
-----------------
Hard links across subvolumes should not be allowed in Btrfs.
btrfs_link checks that the root of the 'to' directory is the same as
the root of the 'from' file. If they differ, btrfs_link returns -EPERM.
Signed-off-by: TARUISI Hiroaki <taruishi.hiroak@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The bytes_used field in root item was originally planned to
trace the amount of used data and tree blocks. But it never
worked right since we can't trace freeing of data accurately.
This patch changes it to only trace the amount of tree blocks.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
iput() can trigger new transactions if we are dropping the
final reference, so calling it in btrfs_commit_transaction
may end up deadlocking. This patch adds delayed iput to avoid
the issue.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Truncating and deleting regular files are unbounded operations,
so it's not good to do them in a single transaction. This
patch makes btrfs_truncate and btrfs_delete_inode start a
new transaction after all the items in a tree leaf are deleted.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
fallocate(2) may allocate a large number of file extents, so it's not
good to do it in a single transaction. This patch makes fallocate(2)
start a new transaction for each file extent it allocates.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
We do log replay in a single transaction, so it's not good to do
unbounded operations. This patch cleans up orphan inodes after replaying
the log. It also avoids doing other unbounded operations, such as
truncating a file, during log replay; these unbounded operations are
postponed to the orphan inode cleanup stage.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
There are some cases where file extents are inserted without involving
an ordered struct. In these cases, we update disk_i_size directly,
without checking the pending ordered extent and the DELALLOC bit. This
patch extends btrfs_ordered_update_i_size() to handle these cases.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>