Steven Rostedt [Sat, 14 Feb 2009 06:15:39 +0000 (01:15 -0500)]
ftrace: convert ftrace_lock from a spinlock to mutex
Impact: clean up
The older versions of ftrace required doing the ftrace list
search under atomic context. Now all the calls are in non-atomic
context. There is no reason to keep the ftrace_lock as a spinlock.
This patch converts it to a mutex.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Steven Rostedt [Sat, 14 Feb 2009 05:40:25 +0000 (00:40 -0500)]
ftrace: add command interface for function selection
Allow for other tracers to add their own commands for function
selection. This interface gives a trace the ability to name a
command for function selection. Right now it is pretty limited
in what it offers, but this is a building step for more features.
The :mod: command is converted to this interface and also serves
as a template for other implementations.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Steven Rostedt [Sat, 14 Feb 2009 01:53:42 +0000 (20:53 -0500)]
ftrace: enable filtering only when a function is filtered on
Impact: fix to prevent empty set_ftrace_filter and no ftrace output
The function filter is used to only trace a given set of functions.
The filter is enabled when a function name is echoed into the
set_ftrace_filter file. But if the name has a typo and the function
is not found, the filter is enabled, but no function is listed.
This makes a confusing situation where set_ftrace_filter is empty
but no functions ever get enabled for tracing.
This patch changes that to only enable filtering if a function
is set to be filtered on. Now, the filter is not enabled if
a bad name is echoed into set_ftrace_filter.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Steven Rostedt [Fri, 13 Feb 2009 20:56:43 +0000 (15:56 -0500)]
ftrace: break up ftrace_match_records into smaller components
Impact: clean up
ftrace_match_records does a lot of things that other features
can use. This patch breaks up ftrace_match_records and pulls
out ftrace_setup_glob and ftrace_match_record.
ftrace_setup_glob prepares a simple glob expression for use with
ftrace_match_record. ftrace_match_record compares a single record
with a glob type.
Breaking this up will allow for more features to run on individual
records.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Steven Rostedt [Fri, 13 Feb 2009 19:37:33 +0000 (14:37 -0500)]
ftrace: rename ftrace_match to ftrace_match_records
Impact: clean up
ftrace_match is too generic of a name. What it really does is
search all records and matches the records with the given string,
and either sets or unsets the functions to be traced depending
on if the parameter 'enable' is set or not.
This allows us to make another function called ftrace_match that
can be used to test a single record.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Steven Rostedt [Fri, 13 Feb 2009 17:43:56 +0000 (12:43 -0500)]
ftrace: add do_for_each_ftrace_rec and while_for_each_ftrace_rec
Impact: clean up
To iterate over all the functions that dynamic trace knows about
it requires two for loops. One to iterate over the pages and the
other to iterate over the records within the page.
There are several duplications of these loops in ftrace.c. This
patch creates the macros do_for_each_ftrace_rec and
while_for_each_ftrace_rec to handle this logic, and removes the
duplicate code.
While making this change, I also discovered and fixed a small
bug that one of the iterations should exit the loop after it found the
record it was searching for. This used a break when it should have
used a goto, since there were two loops it needed to break out
from. No real harm was done by this bug since it would only continue
to search the other records, and the code was in a slow path anyway.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Steven Rostedt [Thu, 12 Feb 2009 19:16:46 +0000 (14:16 -0500)]
sched: do not account for NMIs
Impact: avoid corruption in system time accounting
Martin Schwidefsky told me that there was an issue with NMIs and
system accounting. The problem is that the accounting code is
not reentrant, and if an NMI goes off after an interrupt it can
corrupt the accounting.
For now, the best we can do is to treat NMIs like SMIs and they
are not accounted for.
This patch changes nmi_enter to not call __irq_enter and to do
the preempt-count and tracing calls directly.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Steven Rostedt [Thu, 12 Feb 2009 15:53:37 +0000 (10:53 -0500)]
preempt-count: force hardirq-count to max of 10
To add a bit in the preempt_count to be set when in NMI context, we
found that some archs did not have enough bits to spare. This is
due to the hardirq_count being a mask that can hold NR_IRQS.
Some archs allow for over 16000 IRQs, and that would require a mask
of 14 bits. The sofitrq mask is 8 bits and the preempt disable mask
is also 8 bits. The PREEMP_ACTIVE bit is bit 30, and bit 31 would
make the preempt_count (which is type int) a negative number.
A negative preempt_count is a sign of failure.
Add them up 14+8+8+1+1 you get 32 bits. No room for the NMI bit.
But the hardirq_count is to track the number of nested IRQs, not
the number of total IRQs. This originally took the paranoid approach
of setting the max nesting to NR_IRQS. But when we have archs with
over 1000 IRQs, it is not practical to think they will ever all
nest on a single CPU. Not to mention that this would most definitely
cause a stack overflow.
This patch sets a max of 10 bits to be used for IRQ nesting.
I did a 'git grep HARDIRQ' to examine all users of HARDIRQ_BITS and
HARDIRQ_MASK, and found that making it a max of 10 would not hurt
anyone. I did find that the m68k expected it to be 8 bits, so
I allow for the archs to set the number to be less than 10.
I removed the setting of HARDIRQ_BITS from the archs that set it
to more than 10. This includes ALPHA, ia64 and avr32.
This will always allow room for the NMI bit, and if we need to allow
for NMI nesting, we have 4 bits to play with.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Nick Piggin [Thu, 12 Feb 2009 03:34:23 +0000 (04:34 +0100)]
Fix page writeback thinko, causing Berkeley DB slowdown
A bug was introduced into write_cache_pages cyclic writeout by commit 31a12666d8f0c22235297e1c1575f82061480029 ("mm: write_cache_pages cyclic
fix"). The intention (and comments) is that we should cycle back and
look for more dirty pages at the beginning of the file if there is no
more work to be done.
But the !done condition was dropped from the test. This means that any
time the page writeout loop breaks (eg. due to nr_to_write == 0), we
will set index to 0, then goto again. This will set done_index to
index, then find done is set, so will proceed to the end of the
function. When updating mapping->writeback_index for cyclic writeout,
we now use done_index == 0, so we're always cycling back to 0.
This seemed to be causing random mmap writes (slapadd and iozone) to
start writing more pages from the LRU and writeout would slowdown, and
caused bugzilla entry
http://bugzilla.kernel.org/show_bug.cgi?id=12604
about Berkeley DB slowing down dramatically.
With this patch, iozone random write performance is increased nearly
5x on my system (iozone -B -r 4k -s 64k -s 512m -s 1200m on ext2).
Signed-off-by: Nick Piggin <npiggin@suse.de> Reported-and-tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
When the temperature exceeds 32767 milli-degrees the temperature overflows
to -32768 millidegrees. These are bothe well within the -55 - +125 degree
range for the sensor.
Fix overflow in left-shift of a u8.
Signed-off-by: Ian Dall <ian@beware.dropbear.id.au> Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This can also occur when an nbd device loses its nbd-client/server
connection. Although we clear the queue of any outstanding I/Os after the
client/server connection fails, any additional I/Os that get queued later
will hang.
This bug may also be the problem reported in this bug report:
http://bugzilla.kernel.org/show_bug.cgi?id=12277
Testing would need to be performed to determine if the two issues are the
same.
This problem was introduced by the new request handling thread code ("NBD:
allow nbd to be used locally", 3/2008), which entered into mainline around
2.6.25.
The fix, which is fairly simple, is to restore the check for lo->sock
being NULL in do_nbd_request. This causes I/O to an uninitialized nbd to
immediately fail with an I/O error, as it did prior to the introduction of
this bug.
Signed-off-by: Paul Clements <paul.clements@steeleye.com> Reported-by: Jon Nelson <jnelson-kernel-bugzilla@jamponi.net> Acked-by: Pavel Machek <pavel@ucw.cz> Cc: <stable@kernel.org> [2.6.26.x, 2.6.27.x, 2.6.28.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: rearrange exit_mmap() to unlock before arch_exit_mmap
Christophe Saout reported [in precursor to:
http://marc.info/?l=linux-kernel&m=123209902707347&w=4]:
> Note that I also some a different issue with CONFIG_UNEVICTABLE_LRU.
> Seems like Xen tears down current->mm early on process termination, so
> that __get_user_pages in exit_mmap causes nasty messages when the
> process had any mlocked pages. (in fact, it somehow manages to get into
> the swapping code and produces a null pointer dereference trying to get
> a swap token)
Jeremy explained:
Yes. In the normal case under Xen, an in-use pagetable is "pinned",
meaning that it is RO to the kernel, and all updates must go via hypercall
(or writes are trapped and emulated, which is much the same thing). An
unpinned pagetable is not currently in use by any process, and can be
directly accessed as normal RW pages.
As an optimisation at process exit time, we unpin the pagetable as early
as possible (switching the process to init_mm), so that all the normal
pagetable teardown can happen with direct memory accesses.
This happens in exit_mmap() -> arch_exit_mmap(). The munlocking happens
a few lines below. The obvious thing to do would be to move
arch_exit_mmap() to below the munlock code, but I think we'd want to
call it even if mm->mmap is NULL, just to be on the safe side.
Thus, this patch:
exit_mmap() needs to unlock any locked vmas before calling arch_exit_mmap,
as the latter may switch the current mm to init_mm, which would cause the
former to fail.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christophe Saout <christophe@saout.de> Cc: Keir Fraser <keir.fraser@eu.citrix.com> Cc: Christophe Saout <christophe@saout.de> Cc: Alex Williamson <alex.williamson@hp.com> Cc: <stable@kernel.org> [2.6.28.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiri Slaby [Wed, 11 Feb 2009 21:04:40 +0000 (13:04 -0800)]
parport: parport_serial, don't bind netmos ibm 0299
Since netmos 9835 with subids 0x1014(IBM):0x0299 is now bound with
serial/8250_pci, because it has no parallel ports and subdevice id isn't
in the expected form, return -ENODEV from probe function.
Heiko Carstens [Wed, 11 Feb 2009 21:04:38 +0000 (13:04 -0800)]
syscall define: fix uml compile bug
With the new system call defines we get this on uml:
arch/um/sys-i386/built-in.o: In function `sys_call_table':
(.rodata+0x308): undefined reference to `sys_sigprocmask'
Reason for this is that uml passes the preprocessor option
-Dsigprocmask=kernel_sigprocmask to gcc when compiling the kernel.
This causes SYSCALL_DEFINE3(sigprocmask, ...) to be expanded to
SYSCALL_DEFINEx(3, kernel_sigprocmask, ...) and finally to a system
call named sys_kernel_sigprocmask. However sys_sigprocmask is missing
because of this.
To avoid macro expansion for the system call name just concatenate the
name at first define instead of carrying it through severel levels.
This was pointed out by Al Viro.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: WANG Cong <wangcong@zeuux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Carsten Otte [Wed, 11 Feb 2009 21:04:37 +0000 (13:04 -0800)]
ext2/xip: refuse to change xip flag during remount with busy inodes
For a reason that I was unable to understand in three months of debugging,
mount ext2 -o remount stopped working properly when remounting from
regular operation to xip, or the other way around. According to a git
bisect search, the problem was introduced with the VM_MIXEDMAP/PTE_SPECIAL
rework in the vm:
In the failing scenario, the filesystem is mounted read only via root=
kernel parameter on s390x. During remount (in rc.sysinit), the inodes of
the bash binary and its libraries are busy and cannot be invalidated (the
bash which is running rc.sysinit resides on subject filesystem).
Afterwards, another bash process (running ifup-eth) recurses into a
subshell, runs dup_mm (via fork). Some of the mappings in this bash
process were created from inodes that could not be invalidated during
remount.
Both parent and child process crash some time later due to inconsistencies
in their address spaces. The issue seems to be timing sensitive, various
attempts to recreate it have failed.
This patch refuses to change the xip flag during remount in case some
inodes cannot be invalidated. This patch keeps users from running into
that issue.
[akpm@linux-foundation.org: cleanup] Signed-off-by: Carsten Otte <cotte@de.ibm.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Jared Hulbert <jaredeh@gmail.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Zefan [Wed, 11 Feb 2009 21:04:36 +0000 (13:04 -0800)]
cgroups: fix lockdep subclasses overflow
I enabled all cgroup subsystems when compiling kernel, and then:
# mount -t cgroup -o net_cls xxx /mnt
# mkdir /mnt/0
This showed up immediately:
BUG: MAX_LOCKDEP_SUBCLASSES too low!
turning off the locking correctness validator.
It's caused by the cgroup hierarchy lock:
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
if (ss->root == root)
mutex_lock_nested(&ss->hierarchy_mutex, i);
}
Now we have 9 cgroup subsystems, and the above 'i' for net_cls is 8, but
MAX_LOCKDEP_SUBCLASSES is 8.
This patch uses different lockdep keys for different subsystems.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap [Wed, 11 Feb 2009 21:04:33 +0000 (13:04 -0800)]
kernel-doc: fix syscall wrapper processing
Fix kernel-doc processing of SYSCALL wrappers.
The SYSCALL wrapper patches played havoc with kernel-doc for
syscalls. Syscalls that were scanned for DocBook processing
reported warnings like this one, for sys_tgkill:
Warning(kernel/signal.c:2285): No description found for parameter 'tgkill'
Warning(kernel/signal.c:2285): No description found for parameter 'pid_t'
Warning(kernel/signal.c:2285): No description found for parameter 'int'
because the macro parameters all "look like" function parameters,
although they are not:
/**
* sys_tgkill - send signal to one specific thread
* @tgid: the thread group ID of the thread
* @pid: the PID of the thread
* @sig: signal to be sent
*
* This syscall also checks the @tgid and returns -ESRCH even if the PID
* exists but it's not belonging to the target process anymore. This
* method solves the problem of threads exiting and PIDs getting reused.
*/
SYSCALL_DEFINE3(tgkill, pid_t, tgid, pid_t, pid, int, sig)
{
...
This patch special-cases the handling SYSCALL_DEFINE* function
prototypes by expanding them to
long sys_foobar(type1 arg1, type1 arg2, ...)
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap [Wed, 11 Feb 2009 21:04:31 +0000 (13:04 -0800)]
kernel-doc: preferred ending marker and examples
Fix kernel-doc-nano-HOWTO.txt to use */ as the ending marker in kernel-doc
examples and state that */ is the preferred ending marker.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Reported-by: Robert Love <robert.w.love@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
MinChan Kim [Wed, 11 Feb 2009 21:04:27 +0000 (13:04 -0800)]
mm: fix mlocked page counter mismatch
When I tested following program, I found that the mlocked counter
is strange. It cannot free some mlocked pages.
It is because try_to_unmap_file() doesn't check real
page mappings in vmas.
That is because the goal of an address_space for a file is to find all
processes into which the file's specific interval is mapped. It is
related to the file's interval, not to pages.
Even if the page isn't really mapped by the vma, it returns SWAP_MLOCK
since the vma has VM_LOCKED, then calls try_to_mlock_page. After this the
mlocked counter is increased again.
COWed anon page in a file-backed vma could be a such case. This patch
resolves it.
Since journal_start_commit() is now fixed to return 1 when we started a
transaction commit, there's some transaction waiting to be committed or
there's a transaction already committing, we don't need to call
ext3_force_commit() in ext3_sync_fs(). Furthermore ext3_force_commit()
can unnecessarily create sync transaction which is expensive so it's
worthwhile to remove it when we can.
Cc: Eric Sandeen <sandeen@redhat.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jan Kara [Wed, 11 Feb 2009 21:04:25 +0000 (13:04 -0800)]
jbd: fix return value of journal_start_commit()
journal_start_commit() returns 1 if either a transaction is committing or
the function has queued a transaction commit. But it returns 0 if we
raced with somebody queueing the transaction commit as well. This
resulted in ext3_sync_fs() not functioning correctly (description from
Arthur Jones): In the case of a data=ordered umount with pending long
symlinks which are delayed due to a long list of other I/O on the backing
block device, this causes the buffer associated with the long symlinks to
not be moved to the inode dirty list in the second phase of fsync_super.
Then, before they can be dirtied again, kjournald exits, seeing the UMOUNT
flag and the dirty pages are never written to the backing block device,
causing long symlink corruption and exposing new or previously freed block
data to userspace.
This can be reproduced with a script created by Eric Sandeen
<sandeen@redhat.com>:
#!/bin/bash
umount /mnt/test2
mount /dev/sdb4 /mnt/test2
rm -f /mnt/test2/*
dd if=/dev/zero of=/mnt/test2/bigfile bs=1M count=512
touch /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
ln -s /mnt/test2/thisisveryveryveryveryveryveryveryveryveryveryveryveryveryveryveryverylongfilename
/mnt/test2/link
umount /mnt/test2
mount /dev/sdb4 /mnt/test2
ls /mnt/test2/
This patch fixes journal_start_commit() to always return 1 when there's
a transaction committing or queued for commit.
Cc: Eric Sandeen <sandeen@redhat.com> Cc: Mike Snitzer <snitzer@gmail.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sven Wegener [Wed, 11 Feb 2009 21:04:23 +0000 (13:04 -0800)]
mm: fix dirty_bytes/dirty_background_bytes sysctls on 64bit arches
We need to pass an unsigned long as the minimum, because it gets casted
to an unsigned long in the sysctl handler. If we pass an int, we'll
access four more bytes on 64bit arches, resulting in a random minimum
value.
[rientjes@google.com: fix type of `old_bytes'] Signed-off-by: Sven Wegener <sven.wegener@stealer.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andres Salomon [Wed, 11 Feb 2009 21:04:23 +0000 (13:04 -0800)]
gx1fb: properly alloc cmap and plug cmap leak
We weren't properly allocating the cmap for depths greater than 8bpp,
which caused pain for things like DirectFB. Also, we never freed the cmap
memory upon module unload..
Signed-off-by: Andres Salomon <dilinger@debian.org> Cc: Marco La Porta <marco-laporta@tiscali.it> Cc: Jordan Crouse <jordan@cosmicpenguin.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andres Salomon [Wed, 11 Feb 2009 21:04:22 +0000 (13:04 -0800)]
gxfb: properly alloc cmap and plug cmap leak
We weren't properly allocating the cmap for depths greater than 8bpp,
which caused pain for things like DirectFB. Also, we never freed the cmap
memory upon module unload..
Signed-off-by: Andres Salomon <dilinger@debian.org> Cc: Marco La Porta <marco-laporta@tiscali.it> Cc: Jordan Crouse <jordan@cosmicpenguin.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Marco La Porta [Wed, 11 Feb 2009 21:04:20 +0000 (13:04 -0800)]
lxfb: properly alloc cmap in all cases and don't leak the memory
We weren't properly allocating the cmap for depths greater than 8bpp,
which caused pain for things like DirectFB. Also, we never freed the cmap
memory upon module unload..
[dilinger@debian.org: dropped unnecessary code and clean up patch]
[dilinger@debian.org: add error checking and handling] Signed-off-by: Andres Salomon <dilinger@debian.org> Cc: Jordan Crouse <jordan@cosmicpenguin.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Wed, 11 Feb 2009 16:34:16 +0000 (16:34 +0000)]
Do not account for hugetlbfs quota at mmap() time if mapping [SHM|MAP]_NORESERVE
Commit 5a6fe125950676015f5108fb71b2a67441755003 brought hugetlbfs more
in line with the core VM by obeying VM_NORESERVE and not reserving
hugepages for both shared and private mappings when [SHM|MAP]_NORESERVE
are specified. However, it is still taking filesystem quota
unconditionally.
At fault time, if there are no reserves and attempt is made to allocate
the page and account for filesystem quota. If either fail, the fault
fails. The impact is that quota is getting accounted for twice. This
patch partially reverts 5a6fe125950676015f5108fb71b2a67441755003. To
help prevent this mistake happening again, it improves the documentation
of hugetlb_reserve_pages()
Reported-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 11 Feb 2009 16:24:32 +0000 (08:24 -0800)]
Merge branch 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
timers: fix TIMER_ABSTIME for process wide cpu timers
timers: split process wide cpu clocks/timers, fix
x86: clean up hpet timer reinit
timers: split process wide cpu clocks/timers, remove spurious warning
timers: split process wide cpu clocks/timers
signal: re-add dead task accumulation stats.
x86: fix hpet timer reinit for x86_64
sched: fix nohz load balancer on cpu offline
Linus Torvalds [Wed, 11 Feb 2009 16:23:22 +0000 (08:23 -0800)]
Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
ptrace, x86: fix the usage of ptrace_fork()
i8327: fix outb() parameter order
x86: fix math_emu register frame access
x86: math_emu info cleanup
x86: include correct %gs in a.out core dump
x86, vmi: put a missing paravirt_release_pmd in pgd_dtor
x86: find nr_irqs_gsi with mp_ioapic_routing
x86: add clflush before monitor for Intel 7400 series
x86: disable intel_iommu support by default
x86: don't apply __supported_pte_mask to non-present ptes
x86: fix grammar in user-visible BIOS warning
x86/Kconfig.cpu: make Kconfig help readable in the console
x86, 64-bit: print DMI info in the oops trace
Peter Zijlstra [Wed, 11 Feb 2009 10:30:27 +0000 (11:30 +0100)]
timers: fix TIMER_ABSTIME for process wide cpu timers
The POSIX timer interface allows for absolute time expiry values through the
TIMER_ABSTIME flag, therefore we have to synchronize the timer to the clock
every time we start it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra [Tue, 10 Feb 2009 15:37:31 +0000 (16:37 +0100)]
timers: split process wide cpu clocks/timers, fix
To decrease the chance of a missed enable, always enable the timer when we
sample it, we'll always disable it when we find that there are no active timers
in the jiffy tick.
This fixes a flood of warnings reported by Mike Galbraith.
Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Stefan Weinhuber [Wed, 11 Feb 2009 09:37:31 +0000 (10:37 +0100)]
[S390] dasd: fix race in dasd timer handling
In dasd_device_set_timer and dasd_block_set_timer we interpret the
return value of mod_timer in a wrong way. If the timer expires in
the small window between our check of timer_pending and the call to
mod_timer, then the timer will be set, mod_timer returns zero and
we will call add_timer for a timer that is already pending.
As del_timer and mod_timer do all the necessary checking themselves,
we can simplify our code and remove the race a the same time.
Signed-off-by: Stefan Weinhuber <wein@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The vdso_per_cpu_data entry in the lowcore structure uses __u32
instead of __u64. If the data page is above 4GB the pointer is
truncated and the kernel crashes.
Reported-by: Mijo Safradin <mijo@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Oleg Nesterov [Mon, 9 Feb 2009 01:02:33 +0000 (02:02 +0100)]
ptrace, x86: fix the usage of ptrace_fork()
I noticed by pure accident we have ptrace_fork() and friends. This was
added by "x86, bts: add fork and exit handling", commit bf53de907dfdaac178c92d774aae7370d7b97d20.
I can't test this, ds_request_bts() returns -EOPNOTSUPP, but I strongly
believe this needs the fix. I think something like this program
Hannes Eder [Tue, 10 Feb 2009 18:44:34 +0000 (19:44 +0100)]
tracing: fix sparse warnings: fix (un-)signedness
Fix these sparse warnings:
kernel/trace/ring_buffer.c:70:37: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:84:39: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:96:43: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2475:13: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2475:13: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2478:42: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2478:42: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2500:40: warning: incorrect type in argument 3 (different signedness)
kernel/trace/ring_buffer.c:2505:44: warning: incorrect type in argument 2 (different signedness)
kernel/trace/ring_buffer.c:2507:46: warning: incorrect type in argument 2 (different signedness)
kernel/trace/trace.c:2130:40: warning: incorrect type in argument 3 (different signedness)
kernel/trace/trace.c:2280:40: warning: incorrect type in argument 3 (different signedness)
Signed-off-by: Hannes Eder <hannes@hanneseder.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Hannes Eder [Tue, 10 Feb 2009 18:44:12 +0000 (19:44 +0100)]
tracing: fix sparse warnings: make symbols static
Impact: make global variables and a global function static
The function '__trace_userstack' does not seem to have a caller, so it
is commented out.
Fix this sparse warnings:
kernel/trace/trace.c:82:5: warning: symbol 'tracing_disabled' was not declared. Should it be static?
kernel/trace/trace.c:600:10: warning: symbol 'trace_record_cmdline_disabled' was not declared. Should it be static?
kernel/trace/trace.c:957:6: warning: symbol '__trace_userstack' was not declared. Should it be static?
kernel/trace/trace.c:1694:5: warning: symbol 'tracing_release' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Steven Rostedt [Tue, 10 Feb 2009 16:53:23 +0000 (11:53 -0500)]
tracing, x86: fix constraint for parent variable
The constraint used for retrieving and restoring the parent function
pointer is incorrect. The parent variable is a pointer, and the
address of the pointer is modified by the asm statement and not
the pointer itself. It is incorrect to pass it in as an output
constraint since the asm will never update the pointer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
powerpc/mm: Rework usage of _PAGE_COHERENT/NO_CACHE/GUARDED
broke setting of the _PAGE_COHERENT bit in the PPC HW PTE. Since we now
actually set _PAGE_COHERENT in the Linux PTE we shouldn't be clearing it
out before we propogate it to the PPC HW PTE.
Reported-by: Martyn Welch <martyn.welch@gefanuc.com> Signed-off-by: Kumar Gala <galak@kernel.crashing.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Linus Torvalds [Tue, 10 Feb 2009 23:54:50 +0000 (15:54 -0800)]
Merge master.kernel.org:/home/rmk/linux-2.6-arm
* master.kernel.org:/home/rmk/linux-2.6-arm:
[ARM] AACI: timeout will reach -1
[ARM] Storage class should be before const qualifier
[ARM] pxa: stop and disable IRQ for each DMA channels at startup
[ARM] pxa: make more SSCR0 bit definitions visible on multiple processors
[ARM] pxa: fix missing of __REG() definition for ac97 registers access
[ARM] pxa: fix NAND and MMC clock initialization for pxa3xx
Stefan Richter [Tue, 10 Feb 2009 22:27:32 +0000 (23:27 +0100)]
hugetlbfs: fix build failure with !CONFIG_HUGETLBFS
Fix regression due to 5a6fe125950676015f5108fb71b2a67441755003,
"Do not account for the address space used by hugetlbfs using VM_ACCOUNT"
which added an argument to the function hugetlb_file_setup() but not to
the macro hugetlb_file_setup().
Reported-by: Chris Clayton <chris2553@googlemail.com> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Tue, 10 Feb 2009 19:55:12 +0000 (11:55 -0800)]
Merge branch 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
* 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
powerpc: Add missing sparsemem.h include
powerpc/pci: mmap anonymous memory when legacy_mem doesn't exist
powerpc/cell: Add missing #include for oprofile
powerpc/ftrace: Fix math to calculate offset in TOC
powerpc: Don't emulate mr. instructions
powerpc/fsl-booke: Fix mapping functions to use phys_addr_t
arch/powerpc: Eliminate double sizeof
powerpc/cpm2: Fix set interrupt type
powerpc/83xx: Fix TSEC0 workability on MPC8313E-RDB boards
powerpc/83xx: Fix missing #{address,size}-cells in mpc8313erdb.dts
powerpc/83xx: Build breakage for CONFIG_PM but no CONFIG_SUSPEND
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (23 commits)
bridge: Fix LRO crash with tun
IPv6: fix to set device name when new IPv6 over IPv6 tunnel device is created.
gianfar: Fix boot hangs while bringing up gianfar ethernet
netfilter: xt_sctp: sctp chunk mapping doesn't work
netfilter: ctnetlink: fix echo if not subscribed to any multicast group
netfilter: ctnetlink: allow changing NAT sequence adjustment in creation
netfilter: nf_conntrack_ipv6: don't track ICMPv6 negotiation message
netfilter: fix tuple inversion for Node information request
netxen: fix msi-x interrupt handling
de2104x: force correct order when writing to rx ring
tun: Fix unicast filter overflow
drivers/isdn: introduce missing kfree
drivers/atm: introduce missing kfree
sunhme: Don't match PCI devices in SBUS probe.
9p: fix endian issues [attempt 3]
net_dma: call dmaengine_get only if NET_DMA enabled
3c509: Fix resume from hibernation for PnP mode.
sungem: Soft lockup in sungem on Netra AC200 when switching interface up
RxRPC: Fix a potential NULL dereference
r8169: Don't update statistics counters when interface is down
...
Mel Gorman [Tue, 10 Feb 2009 14:02:27 +0000 (14:02 +0000)]
Do not account for the address space used by hugetlbfs using VM_ACCOUNT
When overcommit is disabled, the core VM accounts for pages used by anonymous
shared, private mappings and special mappings. It keeps track of VMAs that
should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
with VM_NORESERVE.
Overcommit for hugetlbfs is much riskier than overcommit for base pages
due to contiguity requirements. It avoids overcommiting on both shared and
private mappings using reservation counters that are checked and updated
during mmap(). This ensures (within limits) that hugepages exist in the
future when faults occurs or it is too easy to applications to be SIGKILLed.
As hugetlbfs makes its own reservations of a different unit to the base page
size, VM_ACCOUNT should never be set. Even if the units were correct, we would
double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
be set because an application can request no reserves be made for hugetlbfs
at the risk of getting killed later.
With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
breaks the accounting for both the core VM and hugetlbfs, can trigger an
OOM storm when hugepage pools are too small lockups and corrupted counters
otherwise are used. This patch brings hugetlbfs more in line with how the
core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.
Steven Rostedt [Tue, 10 Feb 2009 18:07:13 +0000 (13:07 -0500)]
tracing, x86: fix fixup section to return to original code
Impact: fix to prevent a kernel crash on fault
If for some reason the pointer to the parent function on the
stack takes a fault, the fix up code will not return back to
the original faulting code. This can lead to unpredictable
results and perhaps even a kernel panic.
A fault should not happen, but if it does, we should simply
disable the tracer, warn, and continue running the kernel.
It should not lead to a kernel crash.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Steven Rostedt [Tue, 10 Feb 2009 16:53:23 +0000 (11:53 -0500)]
tracing, x86: fix constraint for parent variable
The constraint used for retrieving and restoring the parent function
pointer is incorrect. The parent variable is a pointer, and the
address of the pointer is modified by the asm statement and not
the pointer itself. It is incorrect to pass it in as an output
constraint since the asm will never update the pointer.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Lai Jiangshan [Mon, 9 Feb 2009 06:21:14 +0000 (14:21 +0800)]
ring_buffer: fix typing mistake
Impact: Fix bug
I found several very very curious line.
It's so curious that it may be brought by typing mistake.
When (cpu_buffer->reader_page == cpu_buffer->commit_page):
1) We haven't copied it for bpage is changed:
bpage = cpu_buffer->reader_page->page;
memcpy(bpage->data, cpu_buffer->reader_page->page->data + read ... )
2) We need update cpu_buffer->reader_page->read, but
"cpu_buffer->reader_page += read;" is not right.
[
This bug was a typo. The commit->reader_page is a page pointer
and not an index into the page. The line should have been
commit->reader_page->read += read. The other changes
by Lai are nice clean ups to the code. - SDR
]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Michael Neuling [Sun, 8 Feb 2009 14:49:39 +0000 (14:49 +0000)]
powerpc: Add missing sparsemem.h include
arch/powerpc/platforms/pseries/hotplug-memory.c uses
remove_section_mapping() but doesn't include sparsemem.h which defines
it. This can cause compilation fails for some configs.
Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
powerpc/pci: mmap anonymous memory when legacy_mem doesn't exist
The new legacy_mem file in sysfs is causing problems with X on machines
that don't support legacy memory access. The way I initially implemented
it, we would fail with -ENXIO when trying to mmap it, thus exposing to
X that we do support the API but there is no legacy memory.
Unfortunately, X poor error handling is causing it to fail to start when
it gets this error.
This implements a workaround hack that instead maps anonymous memory
instead (using shmem if VM_SHARED is set, just like /dev/zero does).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Michael Neuling [Sun, 8 Feb 2009 13:04:14 +0000 (13:04 +0000)]
powerpc/cell: Add missing #include for oprofile
arch/powerpc/oprofile/cell/spu_profiler.c is missing a asm/time.h
include which is required for ppc_proc_freq. This can cause compile
failures for some config combinations.
Signed-off-by: Michael Neuling <mikey@neuling.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Steven Rostedt [Sat, 7 Feb 2009 20:22:40 +0000 (20:22 +0000)]
powerpc/ftrace: Fix math to calculate offset in TOC
Impact: fix dynamic ftrace with large modules in PPC64
The math to calculate the offset into the TOC that is taken from reading
the trampoline is incorrect. The bottom half of the offset is a signed
extended short. The current code was using an OR to create the offset
when it should have been using an addition.
Signed-off-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Geoff Levand <geoffrey.levand@am.sony.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Kumar Gala [Tue, 10 Feb 2009 03:08:07 +0000 (21:08 -0600)]
powerpc/fsl-booke: Fix mapping functions to use phys_addr_t
Fixed v_mapped_by_tlbcam() and p_mapped_by_tlbcam() to use phys_addr_t
instead of unsigned long. In 36-bit physical mode we really need these
functions to deal with phys_addr_t when trying to match a physical
address or when returning one.
Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
Hugh Dickins [Mon, 9 Feb 2009 19:20:50 +0000 (19:20 +0000)]
profiling: fix broken profiling regression
Impact: fix broken /proc/profile on UP machines
Commit c309b917cab55799ea489d7b5f1b77025d9f8462 "cpumask: convert
kernel/profile.c" broke profiling. prof_cpu_mask was previously
initialized to CPU_MASK_ALL, but left uninitialized in that commit.
We need to copy cpu_possible_mask (cpu_online_mask is not enough).
Tejun Heo [Mon, 9 Feb 2009 13:17:39 +0000 (22:17 +0900)]
x86: fix math_emu register frame access
do_device_not_available() is the handler for #NM and it declares that
it takes a unsigned long and calls math_emu(), which takes a long
argument and surprisingly expects the stack frame starting at the zero
argument would match struct math_emu_info, which isn't true regardless
of configuration in the current code.
This patch makes do_device_not_available() take struct pt_regs like
other exception handlers and initialize struct math_emu_info with
pointer to it and pass pointer to the math_emu_info to math_emulate()
like normal C functions do. This way, unless gcc makes a copy of
struct pt_regs in do_device_not_available(), the register frame is
correctly accessed regardless of kernel configuration or compiler
used.
This doesn't fix all math_emu problems but it at least gets it
somewhat working.
This crashed when an LRO packet generated by bnx2x reached a
tun device through the bridge. We're supposed to drop it at
the bridge. However, because the check was placed in br_forward
instead of __br_forward, it's only effective if we are sending
the packet through a single port.
This patch fixes it by moving the check into __br_forward.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6: fix to set device name when new IPv6 over IPv6 tunnel device is created.
When the user creates IPv6 over IPv6 tunnel, the device name created
by the kernel isn't set to t->parm.name, which is referred as the
result of ioctl().
Signed-off-by: Noriaki TAKAMIYA <takamiya@po.ntts.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
Jarek Poplawski [Mon, 9 Feb 2009 22:59:30 +0000 (14:59 -0800)]
gianfar: Fix boot hangs while bringing up gianfar ethernet
Ira Snyder found that commit 8c7396aebb68994c0519e438eecdf4d5fa9c7844
"gianfar: Merge Tx and Rx interrupt for scheduling clean up ring" can
cause hangs. It's because there was removed clearing of interrupts in
gfar_schedule_cleanup() (which is called by an interrupt handler) in
case when netif scheduling has been disabled. This patch brings back
this action and a comment.
Reported-by: Ira Snyder <iws@ovro.caltech.edu> Reported-by: Peter Korsgaard <jacmet@sunsite.dk> Bisected-by: Ira Snyder <iws@ovro.caltech.edu> Tested-by: Peter Korsgaard <jacmet@sunsite.dk> Tested-by: Ira Snyder <iws@ovro.caltech.edu> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Qu Haoran [Mon, 9 Feb 2009 22:34:56 +0000 (14:34 -0800)]
netfilter: xt_sctp: sctp chunk mapping doesn't work
When user tries to map all chunks given in argument, kernel
works on a copy of the chunkmap, but at the end it doesn't
check the copy, but the orginal one.
Signed-off-by: Qu Haoran <haoran.qu@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
netfilter: ctnetlink: fix echo if not subscribed to any multicast group
This patch fixes echoing if the socket that has sent the request to
create/update/delete an entry is not subscribed to any multicast
group. With the current code, ctnetlink would not send the echo
message via unicast as nfnetlink_send() would be skip.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>