Sage Weil [Mon, 10 May 2010 23:12:25 +0000 (16:12 -0700)]
ceph: throw out dirty caps metadata, data on session teardown
The remove_session_caps() helper is called when an MDS closes out our
session (either normally, or as a result of a failed reconnect), and when
we tear down state for umount. If we remove the last cap, and there are
no cap migrations in progress, then there is little hope of us flushing
out that data to the mds (without heroic efforts to reconnect and flush).
So, to avoid leaving inodes pinned (due to dirty state) and crashing after
umount, throw out dirty caps state and unpin the inodes. Print a warning
to the console so we know something was lost.
NOTE: Although we drop wrbuffer refs, we don't actually mark pages clean;
maybe a truncate should be queued?
Sage Weil [Thu, 18 Mar 2010 20:59:12 +0000 (13:59 -0700)]
ceph: attempt mds reconnect if mds closes our session
Currently, if our session is closed (due to a timeout, or explicit close,
or whatever), we just sit there doing nothing unless/until the MDS
restarts, at which point we try to reconnect.
Change client to attempt an immediate reconnect if our session is closed.
Note that currently the MDS doesn't support this, and our attempt will
fail. We'll get a session CLOSE, our caps and dirty cap state will be
dropped, and the client will be free to attempt to reconnect. That's
clearly not as nice as a successful reconnect, but it at least allows us
to try to carry on, and in the future the MDS will support a reconnect
and we will fare better.
Sage Weil [Mon, 10 May 2010 23:31:25 +0000 (16:31 -0700)]
ceph: clean up send_mds_reconnect interface
Pass a ceph_mds_session, since the caller has it.
Remove the dead code for sending empty reconnects. It used to be used
when the MDS contacted _us_ to solicit a reconnect, and we could reply
saying "go away, I have no session." Now we only send reconnects based
on the mds map, and only when we do in fact have an open session.
Sage Weil [Thu, 18 Mar 2010 21:45:05 +0000 (14:45 -0700)]
ceph: wait for mds OPEN reply to indicate reconnect success
We used to infer reconnect success by watching the MDS state, essentially
assuming that hearing nothing meant things were ok. That wasn't
particularly reliable. Instead, the MDS replies with an explicit OPEN
message to indicate success.
Strictly speaking, this is a protocol change, but it is a backwards
compatible one that does not break new clients + old servers or old
clients + new servers. At least not yet.
Drop unused @all argument from kick_requests while we're at it.
Sage Weil [Wed, 17 Mar 2010 23:30:21 +0000 (16:30 -0700)]
ceph: only send cap releases when mds is OPEN|HUNG
On OPENING we shouldn't have any caps (or releases).
On CLOSING, we should wait until we succeed (and throw it all out), or
don't (and are OPEN again).
On RECONNECTING we can wait until we are OPEN.
Sage Weil [Mon, 10 May 2010 22:36:44 +0000 (15:36 -0700)]
ceph: dicard cap releases on mds restart
If the MDS restarts, the expire caps state is no longer shared, and can be
thrown out. Caps state will be rebuilt on the MDS during the reconnect
process that follows. Zero out any release messages and adjust the
release counter accordingly.
Sage Weil [Thu, 25 Mar 2010 22:45:38 +0000 (15:45 -0700)]
ceph: drop src address(es) from message header [new protocol feature]
The CEPH_FEATURE_NOSRCADDR protocol feature avoids putting the full source
address in each message header (twice). This patch switches the client to
the new scheme, and _requires_ this feature on the server. The server
will support both the old and new schemes. That means an old client will
work with a new server, but a new client will not work with an old server.
Sage Weil [Tue, 4 May 2010 05:08:02 +0000 (22:08 -0700)]
ceph: set dn offset when spliced
We want to assign an offset when the dentry goes from null to linked, which
is always done by splice_dentry(). Notably, we should NOT assign an
offset when a dentry is first created and is still null.
BUG if we try to splice a non-null dentry (we shouldn't).
Sage Weil [Thu, 1 Apr 2010 22:23:14 +0000 (15:23 -0700)]
ceph: rewrite msgpool using mempool_t
Since we don't need to maintain large pools of messages, we can just
use the standard mempool_t. We maintain a msgpool 'wrapper' because we
need the mempool_t* in the alloc function, and mempool gives us only
pool_data.
Cheng Renquan [Fri, 26 Mar 2010 09:40:33 +0000 (17:40 +0800)]
ceph: use ceph_sb_to_client instead of ceph_client
ceph_sb_to_client and ceph_client are really identical, we need to dump
one; while function ceph_client is confusing with "struct ceph_client",
ceph_sb_to_client's definition is more clear; so we'd better switch all
call to ceph_sb_to_client.
Sage Weil [Fri, 14 May 2010 16:35:38 +0000 (09:35 -0700)]
ceph: invalidate affected dentry leases on aborted requests
If we abort a request, we return to caller, but the request may still
complete. And if we hold the dir FILE_EXCL bit, we may not release a
lease when sending a request. A simple un-tar, control-c, un-tar again
will reproduce the bug (manifested as a 'Cannot open: File exists').
Ensure we invalidate affected dentry leases (as well dir I_COMPLETE) so
we don't have valid (but incorrect) leases. Do the same, consistently, at
other sites where I_COMPLETE is similarly cleared.
Sage Weil [Thu, 13 May 2010 19:01:13 +0000 (12:01 -0700)]
ceph: fix race between aborted requests and fill_trace
When we abort requests we need to prevent fill_trace et al from doing
anything that relies on locks held by the VFS caller. This fixes a race
between the reply handler and the abort code, ensuring that continue
holding the dir mutex until the reply handler completes.
Sage Weil [Thu, 13 May 2010 18:19:06 +0000 (11:19 -0700)]
ceph: clean up mds reply, error handling
We would occasionally BUG out in the reply handler because r_reply was
nonzero, due to a race with ceph_mdsc_do_request temporarily setting
r_reply to an ERR_PTR value. This is unnecessary, messy, and also wrong
in the EIO case.
Clean up by consistently using r_err for errors and r_reply for messages.
Also fix the abort logic to trigger consistently for all errors that return
to the caller early (e.g., EIO from timeout case). If an abort races with
a reply, use the result from the reply.
Also fix locking for r_err, r_reply update in the reply handler.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
rtnetlink: make SR-IOV VF interface symmetric
sctp: delete active ICMP proto unreachable timer when free transport
tcp: fix MD5 (RFC2385) support
Linus Torvalds [Sun, 16 May 2010 18:11:31 +0000 (11:11 -0700)]
Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus
* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus:
MIPS: Oprofile: Fix Loongson irq handler
MIPS: N32: Use compat version for sys_ppoll.
MIPS FPU emulator: allow Cause bits of FCSR to be writeable by ctc1
Wei Yongjun [Sun, 9 May 2010 16:56:07 +0000 (16:56 +0000)]
sctp: delete active ICMP proto unreachable timer when free transport
transport may be free before ICMP proto unreachable timer expire, so
we should delete active ICMP proto unreachable timer when transport
is going away.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Sun, 16 May 2010 07:34:04 +0000 (00:34 -0700)]
tcp: fix MD5 (RFC2385) support
TCP MD5 support uses percpu data for temporary storage. It currently
disables preemption so that same storage cannot be reclaimed by another
thread on same cpu.
We also have to make sure a softirq handler wont try to use also same
context. Various bug reports demonstrated corruptions.
Fix is to disable preemption and BH.
Reported-by: Bhaskar Dutta <bhaskie@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Wu Zhangjin [Thu, 6 May 2010 16:59:46 +0000 (00:59 +0800)]
MIPS: Oprofile: Fix Loongson irq handler
The interrupt enable bit for the performance counters is in the Control
Register $24, not in the counter register.
loongson2_perfcount_handler(), we need to use
Shane McDonald [Fri, 7 May 2010 05:26:57 +0000 (23:26 -0600)]
MIPS FPU emulator: allow Cause bits of FCSR to be writeable by ctc1
In the FPU emulator code of the MIPS, the Cause bits of the FCSR register
are not currently writeable by the ctc1 instruction. In odd corner cases,
this can cause problems. For example, a case existed where a divide-by-zero
exception was generated by the FPU, and the signal handler attempted to
restore the FPU registers to their state before the exception occurred. In
this particular setup, writing the old value to the FCSR register would
cause another divide-by-zero exception to occur immediately. The solution
is to change the ctc1 instruction emulator code to allow the Cause bits of
the FCSR register to be writeable. This is the behaviour of the hardware
that the code is emulating.
This problem was found by Shane McDonald, but the credit for the fix goes
to Kevin Kissell. In Kevin's words:
I submit that the bug is indeed in that ctc_op: case of the emulator. The
Cause bits (17:12) are supposed to be writable by that instruction, but the
CTC1 emulation won't let them be updated by the instruction. I think that
actually if you just completely removed lines 387-388 [...] things would
work a good deal better. At least, it would be a more accurate emulation of
the architecturally defined FPU. If I wanted to be really, really pedantic
(which I sometimes do), I'd also protect the reserved bits that aren't
necessarily writable.
Signed-off-by: Shane McDonald <mcdonald.shane@gmail.com>
To: anemo@mba.ocn.ne.jp
To: kevink@paralogos.com
To: sshtylyov@mvista.com
Patchwork: http://patchwork.linux-mips.org/patch/1205/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
---
Nicolas Ferre [Sat, 15 May 2010 16:32:31 +0000 (12:32 -0400)]
mmc: at91_mci: modify cache flush routines
As we were using an internal dma flushing routine, this patch changes to
the DMA API flush_kernel_dcache_page(). Driver is able to compile now.
[akpm@linux-foundation.org: flush_kernel_dcache_page() comes before kunmap_atomic()] Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sat, 15 May 2010 16:03:15 +0000 (09:03 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
JFS: Free sbi memory in error path
fs/sysv: dereferencing ERR_PTR()
Fix double-free in logfs
Fix the regression created by "set S_DEAD on unlink()..." commit
Linus Torvalds [Sat, 15 May 2010 16:03:02 +0000 (09:03 -0700)]
Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf record: Add a fallback to the reference relocation symbol
Jan Blunck [Mon, 12 Apr 2010 23:44:08 +0000 (16:44 -0700)]
JFS: Free sbi memory in error path
I spotted the missing kfree() while removing the BKL.
[akpm@linux-foundation.org: avoid multiple returns so it doesn't happen again] Signed-off-by: Jan Blunck <jblunck@suse.de> Cc: Dave Kleikamp <shaggy@austin.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro [Fri, 30 Apr 2010 21:17:09 +0000 (17:17 -0400)]
Fix the regression created by "set S_DEAD on unlink()..." commit
1) i_flags simply doesn't work for mount/unlink race prevention;
we may have many links to file and rm on one of those obviously
shouldn't prevent bind on top of another later on. To fix it
right way we need to mark _dentry_ as unsuitable for mounting
upon; new flag (DCACHE_CANT_MOUNT) is protected by d_flags and
i_mutex on the inode in question. Set it (with dont_mount(dentry))
in unlink/rmdir/etc., check (with cant_mount(dentry)) in places
in namespace.c that used to check for S_DEAD. Setting S_DEAD
is still needed in places where we used to set it (for directories
getting killed), since we rely on it for readdir/rmdir race
prevention.
2) rename()/mount() protection has another bogosity - we unhash
the target before we'd checked that it's not a mountpoint. Fixed.
3) ancient bogosity in pivot_root() - we locked i_mutex on the
right directory, but checked S_DEAD on the different (and wrong)
one. Noticed and fixed.
Linus Torvalds [Sat, 15 May 2010 04:28:42 +0000 (21:28 -0700)]
Merge master.kernel.org:/home/rmk/linux-2.6-arm
* master.kernel.org:/home/rmk/linux-2.6-arm:
ARM: 6126/1: ARM mpcore_wdt: fix build failure and other fixes
ARM: 6125/1: ARM TWD: move TWD registers to common header
ARM: 6110/1: Fix Thumb-2 kernel builds when UACCESS_WITH_MEMCPY is enabled
ARM: 6112/1: Use the Inner Shareable I-cache and BTB ops on ARMv7 SMP
ARM: 6111/1: Implement read/write for ownership in the ARMv6 DMA cache ops
ARM: 6106/1: Implement copy_to_user_page() for noMMU
ARM: 6105/1: Fix the __arm_ioremap_caller() definition in nommu.c
Hugh Dickins [Sat, 15 May 2010 02:44:10 +0000 (19:44 -0700)]
profile: fix stats and data leakage
If the kernel is large or the profiling step small, /proc/profile
leaks data and readprofile shows silly stats, until readprofile -r
has reset the buffer: clear the prof_buffer when it is vmalloc()ed.
H. Peter Anvin [Fri, 14 May 2010 20:55:57 +0000 (13:55 -0700)]
x86, mrst: Don't blindly access extended config space
Do not blindly access extended configuration space unless we actively
know we're on a Moorestown platform. The fixed-size BAR capability
lives in the extended configuration space, and thus is not applicable
if the configuration space isn't appropriately sized.
This fixes booting certain VMware configurations with CONFIG_MRST=y.
Moorestown will add a fake PCI-X 266 capability to advertise the
presence of extended configuration space.
Reported-and-tested-by: Petr Vandrovec <petr@vandrovec.name> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Acked-by: Jacob Pan <jacob.jun.pan@intel.com> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
LKML-Reference: <AANLkTiltKUa3TrKR1M51eGw8FLNoQJSLT0k0_K5X3-OJ@mail.gmail.com>
Linus Torvalds [Fri, 14 May 2010 19:20:09 +0000 (12:20 -0700)]
Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, cacheinfo: Turn off L3 cache index disable feature in virtualized environments
x86, k8: Fix build error when K8_NB is disabled
x86, amd: Check X86_FEATURE_OSVW bit before accessing OSVW MSRs
x86: Fix fake apicid to node mapping for numa emulation
The L3 cache index disable feature of AMD CPUs has to be disabled if the
kernel is running as guest on top of a hypervisor because northbridge
devices are not available to the guest. Currently, this fixes a boot
crash on top of Xen. In the future this will become an issue on KVM as
well.
Check if northbridge devices are present and do not enable the feature
if there are none.
[ hpa: backported to 2.6.34 ]
Signed-off-by: Frank Arnold <frank.arnold@amd.com>
LKML-Reference: <1271945222-5283-3-git-send-email-bp@amd64.org> Acked-by: Borislav Petkov <borislav.petkov@amd.com> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: <stable@kernel.org>
K8_NB depends on PCI and when the last is disabled (allnoconfig) we fail
at the final linking stage due to missing exported num_k8_northbridges.
Add a header stub for that.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <20100503183036.GJ26107@aftab> Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: <stable@kernel.org>
Linus Torvalds [Fri, 14 May 2010 18:49:42 +0000 (11:49 -0700)]
Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
inotify: don't leak user struct on inotify release
inotify: race use after free/double free in inotify inode marks
inotify: clean up the inotify_add_watch out path
Inotify: undefined reference to `anon_inode_getfd'
Manual merge to remove duplicate "select ANON_INODES" from Kconfig file
Sergei Shtylyov [Thu, 13 May 2010 18:51:51 +0000 (22:51 +0400)]
DA830: fix USB 2.0 clock entry
DA8xx OHCI driver fails to load due to failing clk_get() call for the USB 2.0
clock. Arrange matching USB 2.0 clock by the clock name instead of the device.
(Adding another CLK() entry for "ohci.0" device won't do -- in the future I'll
also have to enable USB 2.0 clock to configure CPPI 4.1 module, in which case
I won't have any device at all.)
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com>
Pavel Emelyanov [Wed, 12 May 2010 22:34:07 +0000 (15:34 -0700)]
inotify: don't leak user struct on inotify release
inotify_new_group() receives a get_uid-ed user_struct and saves the
reference on group->inotify_data.user. The problem is that free_uid() is
never called on it.
Issue seem to be introduced by 63c882a0 (inotify: reimplement inotify
using fsnotify) after 2.6.30.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Eric Paris <eparis@parisplace.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Eric Paris <eparis@redhat.com>
Eric Paris [Tue, 11 May 2010 21:17:40 +0000 (17:17 -0400)]
inotify: race use after free/double free in inotify inode marks
There is a race in the inotify add/rm watch code. A task can find and
remove a mark which doesn't have all of it's references. This can
result in a use after free/double free situation.
Task A Task B
------------ -----------
inotify_new_watch()
allocate a mark (refcnt == 1)
add it to the idr
inotify_rm_watch()
inotify_remove_from_idr()
fsnotify_put_mark()
refcnt hits 0, free
take reference because we are on idr
[at this point it is a use after free]
[time goes on]
refcnt may hit 0 again, double free
The fix is to take the reference BEFORE the object can be found in the
idr.
Signed-off-by: Eric Paris <eparis@redhat.com> Cc: <stable@kernel.org>
Linus Torvalds [Fri, 14 May 2010 14:29:29 +0000 (07:29 -0700)]
Merge branch 'for-linus' of git://git.monstr.eu/linux-2.6-microblaze
* 'for-linus' of git://git.monstr.eu/linux-2.6-microblaze:
microblaze: Fix module loading on system with WB cache
microblaze: export assembly functions used by modules
microblaze: Remove powerpc code from Microblaze port
microblaze: Remove compilation warnings in cache macro
microblaze: export assembly functions used by modules
microblaze: fix get_user/put_user side-effects
microblaze: re-enable interrupts before calling schedule
Redirecting directly to lsm, here's the patch discussed on lkml:
http://lkml.org/lkml/2010/4/22/219
The mmap_min_addr value is useful information for an admin to see without
being root ("is my system vulnerable to kernel NULL pointer attacks?") and
its setting is trivially easy for an attacker to determine by calling
mmap() in PAGE_SIZE increments starting at 0, so trying to keep it private
has no value.
Only require CAP_SYS_RAWIO if changing the value, not reading it.
Comment from Serge :
Me, I like to write my passwords with light blue pen on dark blue
paper, pasted on my window - if you're going to get my password, you're
gonna get a headache.
Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
(cherry picked from commit 822cceec7248013821d655545ea45d1c6a9d15b3)
Linus Torvalds [Thu, 13 May 2010 21:36:19 +0000 (14:36 -0700)]
Merge branch 'kvm-updates/2.6.34' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.34' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: PPC: Keep index within boundaries in kvmppc_44x_emul_tlbwe()
KVM: VMX: blocked-by-sti must not defer NMI injections
KVM: x86: Call vcpu_load and vcpu_put in cpuid_update
KVM: SVM: Fix wrong intercept masks on 32 bit
KVM: convert ioapic lock to spinlock
serial: imx.c: fix CTS trigger level lower to avoid lost chars
The imx CTS trigger level is left at its reset value that is 32
chars. Since the RX FIFO has 32 entries, when CTS is raised, the
FIFO already is full. However, some serial port devices first empty
their TX FIFO before stopping when CTS is raised, resulting in lost
chars.
This patch sets the trigger level lower so that other chars arrive
after CTS is raised, there is still room for 16 of them.