Xiao Guangrong [Wed, 20 Jun 2012 07:59:18 +0000 (15:59 +0800)]
KVM: MMU: fast path of handling guest page fault
If the present bit of the page fault error code is set, it indicates
the shadow page is populated on all levels, which means all we need to do
is modify the access bit, and that can be done outside of mmu-lock.
Currently, in order to simplify the code, we only fix the page fault
caused by write-protect on the fast path
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Xiao Guangrong [Wed, 20 Jun 2012 07:58:58 +0000 (15:58 +0800)]
KVM: MMU: introduce SPTE_MMU_WRITEABLE bit
This bit indicates whether the spte can be writable on MMU, that means
the corresponding gpte is writable and the corresponding gfn is not
protected by shadow page protection
In a later patch, SPTE_MMU_WRITEABLE will indicate whether the spte
can be locklessly updated
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Avi Kivity [Tue, 12 Jun 2012 17:30:18 +0000 (20:30 +0300)]
KVM: VMX: Emulate invalid guest state by default
Our emulation should be complete enough that we can emulate guests
while they are in big real mode, or in a mode transition that is not
virtualizable without unrestricted guest support.
Avi Kivity [Wed, 13 Jun 2012 13:30:53 +0000 (16:30 +0300)]
KVM: x86 emulator: make loading TR set the busy bit
Guest software doesn't actually depend on it, but vmx will refuse us
entry if we don't. Set the bit in both the cached segment and memory,
just to be nice.
Avi Kivity [Tue, 12 Jun 2012 17:03:23 +0000 (20:03 +0300)]
KVM: x86 emulator: implement ENTER
Opcode C8.
Only ENTER with lexical nesting depth 0 is implemented, since others are
very rare. We'll fail emulation if nonzero lexical depth is used so data
is not corrupted.
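As a rough sketch of what ENTER with nesting depth 0 does (push the frame
pointer, point it at the new frame, reserve the locals), assuming a 64-bit
operand size and a flat byte array standing in for guest stack memory. The
struct, the return codes and emulate_enter() are hypothetical names, and rsp
and rbp are offsets into the array rather than real addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { EMULATION_OK = 0, EMULATION_UNHANDLED = 1 };  /* hypothetical codes */

struct frame_regs {
    uint64_t rsp;  /* offset into the stack buffer */
    uint64_t rbp;
};

static int emulate_enter(struct frame_regs *r, uint8_t *stack,
                         uint16_t frame_size, uint8_t nesting)
{
    assert(r != NULL && stack != NULL);

    /* Nonzero lexical depth is refused before any state is touched. */
    if (nesting != 0)
        return EMULATION_UNHANDLED;

    r->rsp -= sizeof(r->rbp);                         /* push rbp */
    memcpy(stack + r->rsp, &r->rbp, sizeof(r->rbp));
    r->rbp = r->rsp;                                  /* mov rbp, rsp */
    r->rsp -= frame_size;                             /* sub rsp, size */
    return EMULATION_OK;
}
```

Checking the nesting depth first is what keeps an unsupported form from
leaving the stack half-updated.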
Avi Kivity [Mon, 11 Jun 2012 16:40:15 +0000 (19:40 +0300)]
KVM: x86 emulator: fix byte-sized MOVZX/MOVSX
Commit 2adb5ad9fe1 removed ByteOp from MOVZX/MOVSX, replacing them by
SrcMem8, but neglected to fix the dependency in the emulation code
on ByteOp. This caused the instruction not to have any effect in
some circumstances.
Fix by replacing the check for ByteOp with the equivalent src.op_bytes == 1.
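A minimal illustration of keying the extension on the source operand size
rather than on a ByteOp flag. The helper names and signatures are assumptions
for this sketch and do not mirror the emulator's internal structures.

```c
#include <assert.h>
#include <stdint.h>

/* Zero-extend a 1- or 2-byte source operand to 64 bits. */
static uint64_t emulate_movzx(uint64_t src, unsigned src_bytes)
{
    assert(src_bytes == 1 || src_bytes == 2);
    return src_bytes == 1 ? (uint8_t)src : (uint16_t)src;
}

/* Sign-extend a 1- or 2-byte source operand to 64 bits. */
static uint64_t emulate_movsx(uint64_t src, unsigned src_bytes)
{
    assert(src_bytes == 1 || src_bytes == 2);
    int64_t v = src_bytes == 1 ? (int64_t)(int8_t)src
                               : (int64_t)(int16_t)src;
    return (uint64_t)v;
}
```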
Avi Kivity [Sun, 10 Jun 2012 15:07:57 +0000 (18:07 +0300)]
KVM: VMX: Fix interrupt exit condition during emulation
Checking EFLAGS.IF is incorrect as we might be in interrupt shadow. If
that is the case, the main loop will notice that and not inject the interrupt,
causing an endless loop.
Fix by using vmx_interrupt_allowed() to check if we can inject an interrupt
instead.
Avi Kivity [Sun, 10 Jun 2012 14:11:00 +0000 (17:11 +0300)]
KVM: x86 emulator: initialize memop
memop is not initialized; this can lead to a two-byte operation
following a 4-byte operation to see garbage values. Usually
truncation fixes things for us later on, but at least in one case
(call abs) it doesn't.
Fix by moving memop to the auto-initialized field area.
Avi Kivity [Thu, 7 Jun 2012 14:06:10 +0000 (17:06 +0300)]
KVM: VMX: Relax check on unusable segment
Some userspace (e.g. QEMU 1.1) munge the d and g bits of segment
descriptors, causing us not to recognize them as unusable segments
with emulate_invalid_guest_state=1. Relax the check by testing for
segment not present (a non-present segment cannot be usable).
Avi Kivity [Thu, 7 Jun 2012 11:11:36 +0000 (14:11 +0300)]
KVM: x86 emulator: emulate cpuid
Opcode 0F A2.
Used by Linux during the mode change trampoline while in a state that is
not virtualizable on vmx without unrestricted_guest, so we need to emulate
it if emulate_invalid_guest_state=1.
Avi Kivity [Thu, 7 Jun 2012 11:10:16 +0000 (14:10 +0300)]
KVM: x86 emulator: change ->get_cpuid() accessor to use the x86 semantics
Instead of getting an exact leaf, follow the spec and fall back to the last
main leaf instead. This lets us easily emulate the cpuid instruction in the
emulator.
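The fallback rule can be sketched over a small leaf table. This is a
simplified model: the struct layout, find_exact() and cpuid_lookup() are
hypothetical, and the choice to return NULL for an in-range leaf that is
simply missing from the table is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct cpuid_leaf {
    uint32_t function;
    uint32_t eax, ebx, ecx, edx;
};

static const struct cpuid_leaf *find_exact(const struct cpuid_leaf *tab,
                                           size_t n, uint32_t function)
{
    for (size_t i = 0; i < n; i++)
        if (tab[i].function == function)
            return &tab[i];
    return NULL;
}

/* Exact match first; an out-of-range leaf falls back to the last basic leaf. */
static const struct cpuid_leaf *cpuid_lookup(const struct cpuid_leaf *tab,
                                             size_t n, uint32_t function)
{
    assert(tab != NULL);
    const struct cpuid_leaf *e = find_exact(tab, n, function);
    if (e)
        return e;

    /* Leaf 0 (or 0x80000000) reports the highest leaf of its range in eax. */
    const struct cpuid_leaf *limit =
        find_exact(tab, n, function & 0x80000000u);
    if (limit && function <= limit->eax)
        return NULL;  /* in range but not populated (sketch assumption) */

    const struct cpuid_leaf *basic = find_exact(tab, n, 0);
    return basic ? find_exact(tab, n, basic->eax) : NULL;
}
```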
Avi Kivity [Thu, 7 Jun 2012 11:07:48 +0000 (14:07 +0300)]
KVM: Split cpuid register access from computation
Introduce kvm_cpuid() to perform the leaf limit check and calculate
register values, and let kvm_emulate_cpuid() just handle reading and
writing the registers from/to the vcpu. This allows us to reuse
kvm_cpuid() in a context where directly reading and writing registers
is not desired.
Avi Kivity [Wed, 6 Jun 2012 15:36:48 +0000 (18:36 +0300)]
KVM: VMX: Return correct CPL during transition to protected mode
In protected mode, the CPL is defined as the lower two bits of CS, as set by
the last far jump. But during the transition to protected mode, there is no
last far jump, so we need to return zero (the inherited real mode CPL).
Fix by reading CPL from the cache during the transition. This isn't 100%
correct since we don't set the CPL cache on a far jump, but since protected
mode transition will always jump to a segment with RPL=0, it will always
work.
Avi Kivity [Sun, 8 Jul 2012 14:16:30 +0000 (17:16 +0300)]
KVM: MMU: Force cr3 reload with two dimensional paging on mov cr3 emulation
Currently the MMU's ->new_cr3() callback does nothing when guest paging
is disabled or when two-dimensional paging (e.g. EPT on Intel) is active.
This means that an emulated write to cr3 can be lost; kvm_set_cr3() will
write vcpu->arch.cr3, but the GUEST_CR3 field in the VMCS will retain its
old value and this is what the guest sees.
This bug did not have any effect until now because:
- with unrestricted guest, or with svm, we never emulate a mov cr3 instruction
- without unrestricted guest, and with paging enabled, we also never emulate a
mov cr3 instruction
- without unrestricted guest, but with paging disabled, the guest's cr3 is
ignored until the guest enables paging; at this point the value from arch.cr3
is loaded correctly by the mov cr0 instruction which turns on paging
However, the patchset that enables big real mode causes us to emulate mov cr3
instructions in protected mode sometimes (when guest state is not virtualizable
by vmx); this mov cr3 is effectively ignored and will crash the guest.
The fix is to make nonpaging_new_cr3() call mmu_free_roots() to force a cr3
reload. This is awkward because now all the new_cr3 callbacks do the same
thing, and because mmu_free_roots() is somewhat of an overkill; but fixing
that is more complicated and will be done after this minimal fix.
Observed in the Windows XP 32-bit installer while bringing up secondary vcpus.
Rik van Riel [Tue, 19 Jun 2012 20:51:04 +0000 (16:51 -0400)]
KVM: handle last_boosted_vcpu = 0 case
If last_boosted_vcpu == 0, then we fall through all test cases and
may end up with all VCPUs pouncing on vcpu 0. With a large enough
guest, this can result in enormous runqueue lock contention, which
can prevent vcpu0 from running, leading to a livelock.
Changing < to <= makes sure we properly handle that case.
Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
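To see why the comparison matters, here is a simplified model of a two-pass
scan that starts after the last boosted vcpu and wraps around. It is a sketch
only: pick_boost_target() is a hypothetical name and every vcpu other than the
caller is assumed to be eligible.

```c
#include <assert.h>

/*
 * Pass 0 scans the vcpus after last_boosted, pass 1 wraps around to the
 * ones up to and including it. Returns the chosen index, or -1 if none.
 */
static int pick_boost_target(int nr_vcpus, int last_boosted, int me)
{
    assert(nr_vcpus > 0);
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < nr_vcpus; i++) {
            /* With '<' here and last_boosted == 0, vcpu 0 is never skipped. */
            if (pass == 0 && i <= last_boosted)
                continue;
            if (pass == 1 && i > last_boosted)
                break;
            if (i == me)
                continue;
            return i;
        }
    }
    return -1;
}
```

With last_boosted == 0 the first pass now begins at vcpu 1, so spinning vcpus
are no longer all steered at vcpu 0.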
Heiko Carstens [Tue, 26 Jun 2012 14:06:39 +0000 (16:06 +0200)]
KVM: s390: fix sigp set prefix status stored cases
If an invalid parameter is passed or the addressed cpu is in an
incorrect state sigp set prefix will store a status.
This status must only have bits set as defined by the architecture.
The current kvm implementation failed to clear bits and also did
not set the intended status bit ("and" instead of "or" operation).
Heiko Carstens [Tue, 26 Jun 2012 14:06:38 +0000 (16:06 +0200)]
KVM: s390: fix sigp sense running condition code handling
Only if the sensed cpu is not running a status is stored, which
is reflected by condition code 1. If the cpu is running, condition
code 0 should be returned.
Just the opposite of what the code is doing.
Heiko Carstens [Tue, 26 Jun 2012 14:06:37 +0000 (16:06 +0200)]
s390/smp/kvm: unifiy sigp definitions
The smp and the kvm code have different defines for the sigp order codes.
Let's just have a single place where these are defined.
Also move the sigp condition code and sigp cpu status bits to the new
sigp.h header file.
Heiko Carstens [Tue, 26 Jun 2012 14:06:36 +0000 (16:06 +0200)]
s390/smp: remove redundant check
condition code "status stored" for sigp sense running always implies
that only the "not running" status bit is set. Therefore no need to
check if it is set.
Marc Zyngier [Fri, 15 Jun 2012 19:07:24 +0000 (15:07 -0400)]
KVM: Guard mmu_notifier specific code with CONFIG_MMU_NOTIFIER
In order to avoid compilation failure when KVM is not compiled in,
guard the mmu_notifier specific sections with both CONFIG_MMU_NOTIFIER
and KVM_ARCH_WANT_MMU_NOTIFIER, like it is being done in the rest of
the KVM code.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
On UP i386, when APIC is disabled
# CONFIG_X86_UP_APIC is not set
# CONFIG_PCI_IOAPIC is not set
code looking at apicdrivers never has any effect but it
still gets compiled in. In particular, this causes
build failures with kvm, but it generally bloats the kernel
unnecessarily.
Fix by defining both __apicdrivers and __apicdrivers_end
to be NULL when CONFIG_X86_LOCAL_APIC is unset: I verified
that as the result any loop scanning __apicdrivers gets optimized out by
the compiler.
Warning: a .config with apic disabled doesn't seem to boot
for me (even without this patch). Still verifying why,
meanwhile this patch is compile-tested only.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Reported-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Implementation of PV EOI using shared memory.
This reduces the number of exits an interrupt
causes by as much as half.
The idea is simple: there's a bit, per APIC, in guest memory,
that tells the guest that it does not need EOI.
We set it before injecting an interrupt and clear
before injecting a nested one. Guest tests it using
a test and clear operation - this is necessary
so that host can detect interrupt nesting -
and if set, it can skip the EOI MSR.
There's a new MSR to set the address of said register
in guest memory. Otherwise not much changed:
- Guest EOI is not required
- Register is tested & ISR is automatically cleared on exit
For testing results see description of previous patch
'kvm_para: guest side for eoi avoidance'.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Each time we need to cancel injection we invoke same code
(cancel_injection callback). Move it towards the end of function using
the familiar goto on error pattern.
Will make it easier to do more cleanups for PV EOI.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Commit eb0dc6d0368072236dcd086d7fdc17fd3c4574d4 introduced apic
attention bitmask but kvm still syncs lapic unconditionally.
As that commit suggested and in anticipation of adding more attention
bits, only sync lapic if(apic_attention).
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
x86, bitops: note on __test_and_clear_bit atomicity
__test_and_clear_bit is actually atomic with respect
to the local CPU. Add a note saying that KVM on x86
relies on this behaviour so people don't accidentally break it.
Also warn not to rely on this in portable code.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
The idea is simple: there's a bit, per APIC, in guest memory,
that tells the guest that it does not need EOI.
Guest tests it using a single test and clear operation - this is
necessary so that host can detect interrupt nesting - and if set, it can
skip the EOI MSR.
I ran a simple microbenchmark to show exit reduction
(note: for testing, need to apply follow-up patch
'kvm: host side for eoi optimization' + a qemu patch
I posted separately, on host):
We perform ISR lookups twice: during interrupt
injection and on EOI. Typical workloads only have
a single bit set there. So we can avoid ISR scans by
1. counting bits as we set/clear them in ISR
2. on set, caching the injected vector number
3. on clear, invalidating the cache
The real purpose of this is enabling PV EOI
which needs to quickly validate the vector.
But non PV guests also benefit: with this patch,
and without interrupt nesting, apic_find_highest_isr
will always return immediately without scanning ISR.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
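The three steps can be sketched with a small structure holding the ISR bitmap,
a count and a cached vector. The names below are hypothetical and the bit
operations are plain C rather than the kernel's bitops.

```c
#include <assert.h>
#include <stdint.h>

#define NR_VECTORS 256

struct isr_sketch {
    uint32_t isr[NR_VECTORS / 32];
    int count;          /* number of bits currently set in isr[] */
    int highest_cache;  /* last injected vector, or -1 if invalid */
};

static void isr_set(struct isr_sketch *s, int vec)
{
    assert(vec >= 0 && vec < NR_VECTORS);
    uint32_t mask = 1u << (vec % 32);
    if (!(s->isr[vec / 32] & mask)) {
        s->isr[vec / 32] |= mask;
        s->count++;
    }
    s->highest_cache = vec;  /* an injected vector is the highest in service */
}

static void isr_clear(struct isr_sketch *s, int vec)
{
    assert(vec >= 0 && vec < NR_VECTORS);
    uint32_t mask = 1u << (vec % 32);
    if (s->isr[vec / 32] & mask) {
        s->isr[vec / 32] &= ~mask;
        s->count--;
    }
    s->highest_cache = -1;   /* invalidate; the next lookup may rescan */
}

static int isr_find_highest(const struct isr_sketch *s)
{
    if (s->count == 0)
        return -1;                   /* nothing in service: no scan */
    if (s->highest_cache != -1)
        return s->highest_cache;     /* cache hit: no scan */
    for (int vec = NR_VECTORS - 1; vec >= 0; vec--)
        if (s->isr[vec / 32] & (1u << (vec % 32)))
            return vec;
    return -1;
}
```

Only the case of clearing one vector while another is still in service
(interrupt nesting) reaches the scan loop.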
Christoffer Dall [Fri, 15 Jun 2012 19:07:13 +0000 (15:07 -0400)]
KVM: Introduce __KVM_HAVE_IRQ_LINE
This is a preparatory patch for the KVM/ARM implementation. KVM/ARM will use
the KVM_IRQ_LINE ioctl, which is currently conditional on
__KVM_HAVE_IOAPIC, but ARM obviously doesn't have any IOAPIC support and we
need a separate define.
Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Marc Zyngier [Fri, 15 Jun 2012 19:07:02 +0000 (15:07 -0400)]
KVM: use KVM_CAP_IRQ_ROUTING to protect the routing related code
The KVM code sometimes uses CONFIG_HAVE_KVM_IRQCHIP to protect
code that is related to IRQ routing, which not all in-kernel
irqchips may support.
Use KVM_CAP_IRQ_ROUTING instead.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com> Signed-off-by: Avi Kivity <avi@redhat.com>
KVM: s390: Set CPU in stopped state on initial cpu reset
The initial cpu reset sets the cpu in the stopped state.
Several places check for the cpu state (e.g. sigp set prefix) and
not setting the STOPPED state triggered errors with newer guest
kernels after reboot.
Avi Kivity [Wed, 6 Jun 2012 12:31:34 +0000 (15:31 +0300)]
Merge branch 'for-upstream' of git://github.com/agraf/linux-2.6 into next
Alex says:
"Changes this time include:
- Generalize KVM_GUEST support to overall ePAPR code
- Fix reset for Book3S HV
- Fix machine check deferral when CONFIG_KVM_GUEST=y
- Add support for BookE register DECAR"
* 'for-upstream' of git://github.com/agraf/linux-2.6:
KVM: PPC: Not optimizing MSR_CE and MSR_ME with paravirt.
KVM: PPC: booke: Added DECAR support
KVM: PPC: Book3S HV: Make the guest hash table size configurable
KVM: PPC: Factor out guest epapr initialization
static int kvm_test_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
unsigned long data)
{
- u64 *spte;
+ u64 *sptep;
+ struct rmap_iterator iter; <- line 1271
int young = 0;
/*
The reason I think is that the compiler assumes that
the rmap value could be 0, so
static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator
*iter)
{
if (!rmap)
return NULL;
Orit Wasserman [Thu, 31 May 2012 11:49:22 +0000 (14:49 +0300)]
KVM: VMX: Fix KVM_SET_SREGS with big real mode segments
For example migration between Westmere and Nehalem hosts, caught in big real mode.
The code that fixes the segments for real mode guest was moved from enter_rmode
to vmx_set_segments. enter_rmode calls vmx_set_segments for each segment.
Signed-off-by: Orit Wasserman <owasserm@rehdat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Gleb Natapov [Mon, 4 Jun 2012 11:53:23 +0000 (14:53 +0300)]
KVM: MMU: do not iterate over all VMs in mmu_shrink()
mmu_shrink() needlessly iterates over all VMs even though it will not
attempt to free mmu pages from more than one of them. Fix that and also
check used mmu pages count outside of VM lock to skip inactive VMs faster.
Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Avi Kivity [Thu, 17 May 2012 10:14:08 +0000 (13:14 +0300)]
KVM: ia64: Mark ia64 KVM as BROKEN
Practically all patches to ia64 KVM are build fixes; numerous warnings remain;
the last patch from the maintainer was committed more than three years ago. It
is clear that no one is using this thing.
Mark as BROKEN to ensure people don't get hit by pointless build problems.
Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
Takuya Yoshikawa [Sun, 20 May 2012 04:15:07 +0000 (13:15 +0900)]
KVM: Avoid wasting pages for small lpage_info arrays
lpage_info is created for each large level even when the memory slot is
not for RAM. This means that when we add one slot for a PCI device, we
end up allocating at least KVM_NR_PAGE_SIZES - 1 pages by vmalloc().
To make things worse, there is an increasing number of devices which
would result in more pages being wasted this way.
This patch mitigates this problem by using kvm_kvzalloc().
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>
Linus Torvalds [Mon, 4 Jun 2012 22:00:58 +0000 (15:00 -0700)]
Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French.
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Move get_next_mid to ops struct
CIFS: Make accessing is_valid_oplock/dump_detail ops struct field safe
CIFS: Improve identation in cifs_unlock_range
CIFS: Fix possible wrong memory allocation
Greg Ungerer [Mon, 4 Jun 2012 04:29:59 +0000 (14:29 +1000)]
nommu: fix compilation of nommu.c
Compiling 3.5-rc1 for nommu targets gives:
CC mm/nommu.o
mm/nommu.c: In function ‘sys_mmap_pgoff’:
mm/nommu.c:1489:2: error: ‘ret’ undeclared (first use in this function)
mm/nommu.c:1489:2: note: each undeclared identifier is reported only once for each function it appears in
It is trivially fixed by replacing 'ret' with the local variable that is
already defined for the return value 'retval'.
Signed-off-by: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Linus Torvalds [Mon, 4 Jun 2012 19:28:45 +0000 (12:28 -0700)]
Merge tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm
Pull frontswap feature from Konrad Rzeszutek Wilk:
"Frontswap provides a "transcendent memory" interface for swap pages.
In some environments, dramatic performance savings may be obtained
because swapped pages are saved in RAM (or a RAM-like device) instead
of a swap disk. This tag provides the basic infrastructure along with
some changes to the existing backends."
Fix up trivial conflict in mm/Makefile due to removal of swap token code
changing a line next to the new frontswap entry.
This pull request came in before the merge window even opened, it got
delayed to after the merge window by me just wanting to make sure it had
actual users. Apparently IBM is using this on their embedded side, and
Jan Beulich says that it's already made available for SLES and OpenSUSE
users.
Also acked by Rik van Riel, and Konrad points to other people liking it
too. So in it goes.
By Dan Magenheimer (4) and Konrad Rzeszutek Wilk (2)
via Konrad Rzeszutek Wilk
* tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm:
frontswap: s/put_page/store/g s/get_page/load
MAINTAINER: Add myself for the frontswap API
mm: frontswap: config and doc files
mm: frontswap: core frontswap functionality
mm: frontswap: core swap subsystem hooks and headers
mm: frontswap: add frontswap header file
Linus Torvalds [Mon, 4 Jun 2012 18:36:51 +0000 (11:36 -0700)]
Merge branches 'irq-urgent-for-linus' and 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq and smpboot updates from Thomas Gleixner:
"Just cleanup patches with no functional change and a fix for suspend
issues."
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Introduce irq_do_set_affinity() to reduce duplicated code
genirq: Add IRQS_PENDING for nested and simple irq
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smpboot, idle: Fix comment mismatch over idle_threads_init()
smpboot, idle: Optimize calls to smp_processor_id() in idle_threads_init()
Linus Torvalds [Mon, 4 Jun 2012 18:25:31 +0000 (11:25 -0700)]
Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
"The clocksource driver is pure hardware enablement and the skew option
is default off, well tested and non dangerous."
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick: Move skew_tick option into the HIGH_RES_TIMER section
clocksource: em_sti: Add DT support
clocksource: em_sti: Emma Mobile STI driver
clockevents: Make clockevents_config() a global symbol
tick: Add tick skew boot option
Linus Torvalds [Mon, 4 Jun 2012 18:00:45 +0000 (11:00 -0700)]
vfs: Fix /proc/<tid>/fdinfo/<fd> file handling
Cyrill Gorcunov reports that I broke the fdinfo files with commit 30a08bf2d31d ("proc: move fd symlink i_mode calculations into
tid_fd_revalidate()"), and he's quite right.
The tid_fd_revalidate() function is not just used for the <tid>/fd
symlinks, it's also used for the <tid>/fdinfo/<fd> files, and the
permission model for those is different.
So do the dynamic symlink permission handling just for symlinks, making
the fdinfo files once more appear as the proper regular files they are.
Of course, Al Viro argued (probably correctly) that we shouldn't do the
symlink permission games at all, and make the symlinks always just be
the normal 'lrwxrwxrwx'. That would have avoided this issue too, but
since somebody noticed that the permissions had changed (which was the
reason for that original commit 30a08bf2d31d in the first place), people
do apparently use this feature.
[ Basically, you can use the symlink permission data as a cheap "fdinfo"
replacement, since you see whether the file is open for reading and/or
writing by just looking at st_mode of the symlink. So the feature
does make sense, even if the pain it has caused means we probably
shouldn't have done it to begin with. ]
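As an illustration of the symlink-only handling, the sketch below derives a
mode for fd symlinks from the open mode and leaves any other file type alone.
The SKETCH_FMODE_* flags and both function names are hypothetical, and the
exact permission bits chosen for readers and writers are an assumption of this
sketch rather than a quote from the patch.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <sys/stat.h>

#define SKETCH_FMODE_READ  0x1u  /* hypothetical stand-ins for open modes */
#define SKETCH_FMODE_WRITE 0x2u

/* Encode "open for reading/writing" in the owner bits of a symlink mode. */
static mode_t fd_symlink_mode(unsigned f_mode)
{
    /* Only the two access bits are modelled here. */
    assert((f_mode & ~(SKETCH_FMODE_READ | SKETCH_FMODE_WRITE)) == 0);
    mode_t mode = S_IFLNK;
    if (f_mode & SKETCH_FMODE_READ)
        mode |= S_IRUSR | S_IXUSR;
    if (f_mode & SKETCH_FMODE_WRITE)
        mode |= S_IWUSR | S_IXUSR;
    return mode;
}

/* Apply the dynamic mode to symlinks only; regular files keep their mode. */
static mode_t revalidate_mode(mode_t current, unsigned f_mode)
{
    return S_ISLNK(current) ? fd_symlink_mode(f_mode) : current;
}
```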
That commit seems to be the cause of the mm compaction list corruption
issues that Dave Jones reported. The locking (or rather, absence
thereof) is dubious, as is the use of the 'page' variable once it has
been found to be outside the pageblock range.
So revert it for now, we can re-visit this for 3.6. If we even need to:
as Minchan Kim says, "The patch wasn't a bug fix and even test workload
was very theoretical".
Reported-and-tested-by: Dave Jones <davej@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Sat, 2 Jun 2012 07:27:47 +0000 (00:27 -0700)]
mm: fix warning in __set_page_dirty_nobuffers
New tmpfs use of !PageUptodate pages for fallocate() is triggering the
WARNING: at mm/page-writeback.c:1990 when __set_page_dirty_nobuffers()
is called from migrate_page_copy() for compaction.
It is anomalous that migration should use __set_page_dirty_nobuffers()
on an address_space that does not participate in dirty and writeback
accounting; and this has also been observed to insert surprising dirty
tags into a tmpfs radix_tree, despite tmpfs not using tags at all.
We should probably give migrate_page_copy() a better way to preserve the
tag and migrate accounting info, when mapping_cap_account_dirty(). But
that needs some more work: so in the interim, avoid the warning by using
a simple SetPageDirty on PageSwapBacked pages.
Reported-and-tested-by: Dave Jones <davej@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 3 Jun 2012 21:50:19 +0000 (14:50 -0700)]
vfs: move inode stat information closer together
The comment above it says "Stat data, not accessed from path walking",
but in fact some of inode fields we use for the common stat data was way
down at the end of the inode, causing unnecessary cache misses for the
common stat operations.
The inode structure is pretty big, and this can change padding depending
on field width, but at least on the common 64-bit configurations this
doesn't change the size. Some of our inode layout has historically been
to try to avoid unnecessary padding fields, but cache locality is at
least as important for layout, if not more.
Noticed by looking at kernel profiles, and noticing that the "i_blkbits"
access stood out like a sore thumb.
Linus Torvalds [Sun, 3 Jun 2012 00:39:40 +0000 (17:39 -0700)]
Merge tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm
Pull device-mapper updates from Alasdair G Kergon:
"Improve multipath's retrying mechanism in some defined circumstances
and provide a simple reserve/release mechanism for userspace tools to
access thin provisioning metadata while the pool is in use."
* tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
dm thin: provide userspace access to pool metadata
dm thin: use slab mempools
dm mpath: allow ioctls to trigger pg init
dm mpath: delay retry of bypassed pg
dm mpath: reduce size of struct multipath
Joe Thornber [Sat, 2 Jun 2012 23:30:01 +0000 (00:30 +0100)]
dm thin: provide userspace access to pool metadata
This patch implements two new messages that can be sent to the thin
pool target allowing it to take a snapshot of the _metadata_. This
read-only snapshot can be accessed by userland, concurrently with the
live target.
Only one metadata snapshot can be held at a time. The pool's status
line will give the block location for the current msnap.
Since version 0.1.5 of the userland thin provisioning tools, the
thin_dump program displays the msnap as follows:
thin_dump -m <msnap root> <metadata dev>
Available here: https://github.com/jthornber/thin-provisioning-tools
Now that userland can access the metadata we can do various things
that have traditionally been kernel side tasks:
i) Incremental backups.
By using metadata snapshots we can work out what blocks have
changed over time. Combined with data snapshots we can ensure
the data doesn't change while we back it up.
A short proof of concept script can be found here:
Mike Snitzer [Sat, 2 Jun 2012 23:30:00 +0000 (00:30 +0100)]
dm thin: use slab mempools
Use dedicated caches prefixed with a "dm_" name rather than relying on
kmalloc mempools backed by generic slab caches so the memory usage of
thin provisioning (and any leaks) can be accounted for independently.
Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mikulas Patocka [Sat, 2 Jun 2012 23:29:58 +0000 (00:29 +0100)]
dm mpath: allow ioctls to trigger pg init
After the failure of a group of paths, any alternative paths that
need initialising do not become available until further I/O is sent to
the device. Until this has happened, ioctls return -EAGAIN.
With this patch, new paths are made available in response to an ioctl
too. The processing of the ioctl gets delayed until this has happened.
Instead of returning an error, we submit a work item to kmultipathd
(that will potentially activate the new path) and retry in ten
milliseconds.
Note that the patch doesn't retry an ioctl if the ioctl itself fails due
to a path failure. Such retries should be handled intelligently by the
code that generated the ioctl in the first place, noting that some SCSI
commands should not be retried because they are not idempotent (XOR write
commands). For commands that could be retried, there is a danger that
if the device rejected the SCSI command, the path could be erroneously
marked as failed, and the request would be retried on another path which
might fail too. It can be determined if the failure happens on the
device or on the SCSI controller, but there is no guarantee that all
SCSI drivers set these flags correctly.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Christie [Sat, 2 Jun 2012 23:29:45 +0000 (00:29 +0100)]
dm mpath: delay retry of bypassed pg
If I/O needs retrying and only bypassed priority groups are available,
set the pg_init_delay_retry flag to wait before retrying.
If, for example, the reason for the bypass is that the controller is
getting reset or there is a firmware upgrade happening, retrying right
away would cause a flood of log messages and retries for what could be a
few seconds or even several minutes.
Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Mike Snitzer [Sat, 2 Jun 2012 23:29:43 +0000 (00:29 +0100)]
dm mpath: reduce size of struct multipath
Move multipath structure's 'lock' and 'queue_size' members to eliminate
two 4-byte holes. Also use a bit within a single unsigned int for each
existing flag (saves 8-bytes). This allows future flags to be added
without each consuming an unsigned int.
Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
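The saving from packing flags into bits can be checked with sizeof. The two
structs below are illustrative only: the member names are placeholders, not
the full struct multipath, and the 8-byte figure assumes a 4-byte unsigned int.

```c
#include <assert.h>
#include <stddef.h>

/* One unsigned int per flag: three flags cost three words. */
struct flags_as_ints {
    unsigned queue_io;
    unsigned queue_if_no_path;
    unsigned saved_queue_if_no_path;
};

/* One bit per flag: all three share a single unsigned int. */
struct flags_as_bits {
    unsigned queue_io:1;
    unsigned queue_if_no_path:1;
    unsigned saved_queue_if_no_path:1;
};

static_assert(sizeof(struct flags_as_bits) <= sizeof(struct flags_as_ints),
              "bit-fields should not be larger than separate ints");

/* Bytes saved by the bit-field layout on this platform. */
static size_t flag_bytes_saved(void)
{
    return sizeof(struct flags_as_ints) - sizeof(struct flags_as_bits);
}
```

Further flags added as bits keep fitting in the same word until its width is
used up.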
1) Make syn floods consume significantly less resources by
a) Not pre-COW'ing routing metrics for SYN/ACKs
b) Mirroring the device queue mapping of the SYN for the SYN/ACK
reply.
Both from Eric Dumazet.
2) Fix calculation errors in Byte Queue Limiting, from Hiroaki SHIMODA.
3) Validate the length requested when building a paged SKB for a
socket, so we don't overrun the page vector accidentally. From Jason
Wang.
4) When netlabel is disabled, we abort all IP option processing when we
see a CIPSO option. This isn't the right thing to do, we should
simply skip over it and continue processing the remaining options
(if any). Fix from Paul Moore.
5) SRIOV fixes for the Mellanox driver from Jack Morgenstein and Marcel
Apfelbaum.
6) 8139cp enables the receiver before the ring address is properly
programmed, which potentially lets the device crap over random
memory. Fix from Jason Wang.
7) e1000/e1000e fixes for i217 RST handling, and an improper buffer
address reference in jumbo RX frame processing from Bruce Allan and
Sebastian Andrzej Siewior, respectively.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
fec_mpc52xx: fix timestamp filtering
mcs7830: Implement link state detection
e1000e: fix Rapid Start Technology support for i217
e1000: look into the page instead of skb->data for e1000_tbi_adjust_stats()
r8169: call netif_napi_del at errpaths and at driver unload
tcp: reflect SYN queue_mapping into SYNACK packets
tcp: do not create inetpeer on SYNACK message
8139cp/8139too: terminate the eeprom access with the right opmode
8139cp: set ring address before enabling receiver
cipso: handle CIPSO options correctly when NetLabel is disabled
net: sock: validate data_len before allocating skb in sock_alloc_send_pskb()
bql: Avoid possible inconsistent calculation.
bql: Avoid unneeded limit decrement.
bql: Fix POSDIFF() to integer overflow aware.
net/mlx4_core: Fix obscure mlx4_cmd_box parameter in QUERY_DEV_CAP
net/mlx4_core: Check port out-of-range before using in mlx4_slave_cap
net/mlx4_core: Fixes for VF / Guest startup flow
net/mlx4_en: Fix improper use of "port" parameter in mlx4_en_event
net/mlx4_core: Fix number of EQs used in ICM initialisation
net/mlx4_core: Fix the slave_id out-of-range test in mlx4_eq_int
Linus Torvalds [Sat, 2 Jun 2012 23:17:03 +0000 (16:17 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull straggler x86 fixes from Peter Anvin:
"Three groups of patches:
- EFI boot stub documentation and the ability to print error messages;
- Removal for PTRACE_ARCH_PRCTL for x32 (obsolete interface which
should never have been ported, and the port is broken and
potentially dangerous.)
- ftrace stack corruption fixes. I'm not super-happy about the
technical implementation, but it is probably the least invasive in
the short term. In the future I would like a single method for
nesting the debug stack, however."
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, x32, ptrace: Remove PTRACE_ARCH_PRCTL for x32
x86, efi: Add EFI boot stub documentation
x86, efi; Add EFI boot stub console support
x86, efi: Only close open files in error path
ftrace/x86: Do not change stacks in DEBUG when calling lockdep
x86: Allow nesting of the debug stack IDT setting
x86: Reset the debug_stack update counter
ftrace: Use breakpoint method to update ftrace caller
ftrace: Synchronize variable setting with breakpoints
Linus Torvalds [Sat, 2 Jun 2012 22:21:43 +0000 (15:21 -0700)]
tty: Revert the tty locking series, it needs more work
This reverts the tty layer change to use per-tty locking, because it's
not correct yet, and fixing it will require some more deep surgery.
The main revert is d29f3ef39be4 ("tty_lock: Localise the lock"), but
there are several smaller commits that built upon it, they also get
reverted here. The list of reverted commits is:
fde86d310886 - tty: add lockdep annotations
8f6576ad476b - tty: fix ldisc lock inversion trace
d3ca8b64b97e - pty: Fix lock inversion
b1d679afd766 - tty: drop the pty lock during hangup
abcefe5fc357 - tty/amiserial: Add missing argument for tty_unlock()
fd11b42e3598 - cris: fix missing tty arg in wait_event_interruptible_tty call
d29f3ef39be4 - tty_lock: Localise the lock
The revert had a trivial conflict in the 68360serial.c staging driver
that got removed in the meantime.
Stephan Gatzka [Sat, 2 Jun 2012 03:04:06 +0000 (03:04 +0000)]
fec_mpc52xx: fix timestamp filtering
skb_defer_rx_timestamp was called with a freshly allocated skb but must
be called with rskb instead.
Signed-off-by: Stephan Gatzka <stephan@gatzka.org> Cc: stable <stable@vger.kernel.org> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ondrej Zary [Fri, 1 Jun 2012 10:29:08 +0000 (10:29 +0000)]
mcs7830: Implement link state detection
Add .status callback that detects link state changes.
Tested with MCS7832CV-AA chip (9710:7830, identified as rev.C by the driver).
Fixes https://bugzilla.kernel.org/show_bug.cgi?id=28532
Signed-off-by: Ondrej Zary <linux@rainbow-software.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sat, 2 Jun 2012 16:03:54 +0000 (09:03 -0700)]
Merge 'for-linus' branches from git://git.kernel.org/pub/scm/linux/kernel/git/viro/{vfs,signal}
Pull vfs fix and a fix from the signal changes for frv from Al Viro.
The __kernel_nlink_t for powerpc got scrogged because 64-bit powerpc
actually depended on the default "unsigned long", while 32-bit powerpc
had an explicit override to "unsigned short". Al didn't notice, and
made both of them be the unsigned short.
The frv signal fix is fallout from simplifying the do_notify_resume()
code, and leaving an extra parenthesis.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
powerpc: Fix size of st_nlink on 64bit
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
frv: Remove bogus closing parenthesis