This patch moves the die notifier handling to common code. Previously
various architectures had exactly the same code for it. Note that the new
code is compiled unconditionally; this should be understood as an appeal to
the other architecture maintainers to implement support for it as well (aka
sprinkling a notify_die or two in the proper place).
arm had a notify_die that did something totally different; I renamed it to
arm_notify_die as part of the patch and made it static to the file it's
declared and used in. avr32 used to pass slightly less information through
this interface and I brought it into line with the other architectures.
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: fix vmalloc_sync_all bustage]
[bryan.wu@analog.com: fix vmalloc_sync_all in nommu] Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: <linux-arch@vger.kernel.org> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Bryan Wu <bryan.wu@analog.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap [Tue, 8 May 2007 07:26:59 +0000 (00:26 -0700)]
reiserfs: proc support requires PROC_FS
REISER_FS /proc option needs to depend on PROC_FS.
fs/reiserfs/procfs.c: In function 'show_super':
fs/reiserfs/procfs.c:134: error: 'reiserfs_proc_info_data_t' has no member named 'max_hash_collisions'
fs/reiserfs/procfs.c:134: error: 'reiserfs_proc_info_data_t' has no member named 'breads'
fs/reiserfs/procfs.c:135: error: 'reiserfs_proc_info_data_t' has no member named 'bread_miss'
fs/reiserfs/procfs.c:135: error: 'reiserfs_proc_info_data_t' has no member named 'search_by_key'
fs/reiserfs/procfs.c:136: error: 'reiserfs_proc_info_data_t' has no member named 'search_by_key_fs_changed'
fs/reiserfs/procfs.c:136: error: 'reiserfs_proc_info_data_t' has no member named 'search_by_key_restarted'
fs/reiserfs/procfs.c:137: error: 'reiserfs_proc_info_data_t' has no member named 'insert_item_restarted'
fs/reiserfs/procfs.c:137: error: 'reiserfs_proc_info_data_t' has no member named 'paste_into_item_restarted'
fs/reiserfs/procfs.c:138: error: 'reiserfs_proc_info_data_t' has no member named 'cut_from_item_restarted'
fs/reiserfs/procfs.c:139: error: 'reiserfs_proc_info_data_t' has no member named 'delete_solid_item_restarted'
fs/reiserfs/procfs.c:139: error: 'reiserfs_proc_info_data_t' has no member named 'delete_item_restarted'
fs/reiserfs/procfs.c:140: error: 'reiserfs_proc_info_data_t' has no member named 'leaked_oid'
fs/reiserfs/procfs.c:140: error: 'reiserfs_proc_info_data_t' has no member named 'leaves_removable'
fs/reiserfs/procfs.c: In function 'show_per_level':
fs/reiserfs/procfs.c:184: error: 'reiserfs_proc_info_data_t' has no member named 'balance_at'
fs/reiserfs/procfs.c:185: error: 'reiserfs_proc_info_data_t' has no member named 'sbk_read_at'
fs/reiserfs/procfs.c:186: error: 'reiserfs_proc_info_data_t' has no member named 'sbk_fs_changed'
fs/reiserfs/procfs.c:187: error: 'reiserfs_proc_info_data_t' has no member named 'sbk_restarted'
fs/reiserfs/procfs.c:188: error: 'reiserfs_proc_info_data_t' has no member named 'free_at'
fs/reiserfs/procfs.c:189: error: 'reiserfs_proc_info_data_t' has no member named 'items_at'
fs/reiserfs/procfs.c:190: error: 'reiserfs_proc_info_data_t' has no member named 'can_node_be_removed'
fs/reiserfs/procfs.c:191: error: 'reiserfs_proc_info_data_t' has no member named 'lnum'
fs/reiserfs/procfs.c:192: error: 'reiserfs_proc_info_data_t' has no member named 'rnum'
fs/reiserfs/procfs.c:193: error: 'reiserfs_proc_info_data_t' has no member named 'lbytes'
fs/reiserfs/procfs.c:194: error: 'reiserfs_proc_info_data_t' has no member named 'rbytes'
fs/reiserfs/procfs.c:195: error: 'reiserfs_proc_info_data_t' has no member named 'get_neighbors'
fs/reiserfs/procfs.c:196: error: 'reiserfs_proc_info_data_t' has no member named 'get_neighbors_restart'
fs/reiserfs/procfs.c:197: error: 'reiserfs_proc_info_data_t' has no member named 'need_l_neighbor'
fs/reiserfs/procfs.c:197: error: 'reiserfs_proc_info_data_t' has no member named 'need_r_neighbor'
fs/reiserfs/procfs.c: In function 'show_bitmap':
fs/reiserfs/procfs.c:224: error: 'reiserfs_proc_info_data_t' has no member named 'free_block'
fs/reiserfs/procfs.c:225: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
fs/reiserfs/procfs.c:226: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
fs/reiserfs/procfs.c:227: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
fs/reiserfs/procfs.c:228: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
fs/reiserfs/procfs.c:229: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
fs/reiserfs/procfs.c:230: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
fs/reiserfs/procfs.c:230: error: 'reiserfs_proc_info_data_t' has no member named 'scan_bitmap'
fs/reiserfs/procfs.c: In function 'show_journal':
fs/reiserfs/procfs.c:384: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:385: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:386: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:387: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:388: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:389: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:390: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:391: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:392: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:393: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:394: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:395: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:395: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c:395: error: 'reiserfs_proc_info_data_t' has no member named 'journal'
fs/reiserfs/procfs.c: In function 'reiserfs_proc_info_init':
fs/reiserfs/procfs.c:504: warning: implicit declaration of function '__PINFO'
fs/reiserfs/procfs.c:504: error: request for member 'lock' in something not a structure or union
fs/reiserfs/procfs.c: In function 'reiserfs_proc_info_done':
fs/reiserfs/procfs.c:544: error: request for member 'lock' in something not a structure or union
fs/reiserfs/procfs.c:545: error: request for member 'exiting' in something not a structure or union
fs/reiserfs/procfs.c:546: error: request for member 'lock' in something not a structure or union
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While researching the tty layer pid leaks I found a weird case in selinux: when
we drop a controlling tty because of inadequate permissions we don't do the
normal hangup processing. That is a problem if the session leader has
exec'd something that can no longer access the tty.
We already have code in the kernel to handle this case in the form of the
TIOCNOTTY ioctl. So this patch factors out a helper function that is the
essence of that ioctl and calls it from the selinux code.
This removes the inconsistency in handling dropping of a controlling tty and
who knows it might even make some part of user space happy because it received
a SIGHUP it was expecting.
In addition, since this removes the last user of proc_set_tty outside of
tty_io.c, proc_set_tty is made static and removed from tty.h.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: James Morris <jmorris@namei.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At some point I got confused and thought put_pid could not be called while a
spin lock was held. While it may be nice to avoid that to reduce lock hold
times put_pid can be safely called while we hold a spin lock.
This patch removes all of the complications from the code introduced by my
misunderstanding, making the code a little more readable.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All of the users of proc_clear_tty are compiled into the kernel so exporting
this symbol appears gratuitous.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gerd Hoffmann [Tue, 8 May 2007 07:26:49 +0000 (00:26 -0700)]
Fixes and cleanups for earlyprintk aka boot console
The console subsystem already has an idea of a boot console, using the
CON_BOOT flag. The implementation has some flaws though. The major
problem is that presence of a boot console makes register_console() ignore
any other console devices (unless explicitly specified on the kernel
command line).
This patch fixes the console selection code to *not* consider a boot
console a full-featured one, so the first non-boot console registering will
become the default console instead. This way the unregister call for the
boot console in the register_console() function actually triggers and the
handover from the boot console to the real console device works smoothly.
Added a printk for the handover, so you know which console device the
output goes to when the boot console stops printing messages.
The disable_early_printk() call is obsolete with this patch: explicitly
disabling the early console isn't needed any more, as the handover now
happens automagically.
I've walked through the tree, dropped all disable_early_printk() instances
found below arch/ and tagged the consoles with CON_BOOT if needed. The
code is tested on x86, sh (thanks to Paul) and mips (thanks to Ralf).
Changes to last version: Rediffed against -rc3, adapted to mips cleanups by
Ralf, fixed "udbg-immortal" cmd line arg on powerpc.
Signed-off-by: Gerd Hoffmann <kraxel@exsuse.de> Acked-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Ralf Baechle <ralf@linux-mips.org> Cc: Andi Kleen <ak@suse.de> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan [Tue, 8 May 2007 07:26:46 +0000 (00:26 -0700)]
/proc/*/oom_score oops re badness
Eternal quest to make
while true; do cat /proc/fs/xfs/stat >/dev/null 2>/dev/null; done
while true; do find /proc -type f 2>/dev/null | xargs cat >/dev/null 2>/dev/null; done
while true; do modprobe xfs; rmmod xfs; done
work reliably continues and now kernel oopses in the following way:
BUG: unable to handle ... at virtual address 6b6b6b6b
EIP is at badness
process: cat
proc_oom_score
proc_info_read
sys_fstat64
vfs_read
proc_info_read
sys_read
Failing code is prefetch hidden in list_for_each_entry() in badness().
badness() is reachable from two points. One is proc_oom_score, another
is out_of_memory() => select_bad_process() => badness().
Second path grabs tasklist_lock, while first doesn't.
Nick Piggin [Tue, 8 May 2007 07:26:43 +0000 (00:26 -0700)]
futex: restartable futex_wait
LTP test sigaction_16_24 fails, because it expects sem_wait to be restarted
if SA_RESTART is set. sem_wait is implemented with futex_wait, that
currently doesn't support being restarted. Ulrich confirms that the call
should be restartable.
Implement a restart_block method to handle the relative timeout, and allow
restarts.
Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rusty Russell [Tue, 8 May 2007 07:26:42 +0000 (00:26 -0700)]
futex: get_futex_key, get_key_refs and drop_key_refs
lguest uses the convenient futex infrastructure for inter-domain I/O, so
expose get_futex_key, get_key_refs (renamed get_futex_key_refs) and
drop_key_refs (renamed drop_futex_key_refs). Also means we need to expose the
union that these use.
No code changes.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Len Sorensen [Tue, 8 May 2007 07:26:33 +0000 (00:26 -0700)]
Subject: jsm driver fix for linuxpps support
The jsm driver doesn't currently use the uart_handle_*_change helper
functions, which are the obvious place for things like linuxpps to tie
into (which it now does of course), and as a result the jsm driver can
not be used with linuxpps and anything else that ties into the
serial_core helper functions. This patch adds calls to these helper
functions whenever the value they manage changes. The actual storage
of the state is not modified, since the jsm driver caches the current
settings (the 8250 driver reads them every time a user asks for the
state) and only updates them whenever they change.
Signed-off-by: Len Sorensen <lsorense@csclub.uwaterloo.ca> Cc: Scott H Kilau <Scott_Kilau@digi.com> Cc: Wendy Xiong <wendyx@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Len Sorensen [Tue, 8 May 2007 07:26:30 +0000 (00:26 -0700)]
Small fixes for jsm driver
The jsm driver fails when you try to use the TIOCSSERIAL ioctl. The reason
is that the driver never sets uart_port.uartclk, causing the data received
using TIOCGSERIAL to not match the internal state of the driver. This
patch fixes the problem by setting uartclk to the value used by the
serial_core (16 times the baud base).
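The relationship between the two values can be sketched in plain C. This is only an illustration of the "16 times the baud base" rule stated above; the helper names are hypothetical and are not taken from the jsm driver or serial_core.

```c
#include <assert.h>

/* Hypothetical helper: the clock value that corresponds to a baud base,
 * following the "16 times the baud base" rule from the text. */
unsigned int uartclk_from_baud_base(unsigned int baud_base)
{
    /* Refuse inputs whose product would overflow an unsigned int. */
    assert(baud_base <= 0xFFFFFFFFu / 16);
    return baud_base * 16;
}

/* Hypothetical inverse: the baud base a reader would derive from the clock. */
unsigned int baud_base_from_uartclk(unsigned int uartclk)
{
    return uartclk / 16;
}
```

If uartclk is left at zero, the derived baud base is zero as well, which is one way the reported state could disagree with what the driver actually uses.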
Signed-off-by: Len Sorensen <lsorense@csclub.uwaterloo.ca> Cc: Scott H Kilau <Scott_Kilau@digi.com> Cc: Wendy Xiong <wendyx@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Klaus Kudielka [Tue, 8 May 2007 07:26:25 +0000 (00:26 -0700)]
fix cyclades.h for x86_64 (and probably others)
At least on x86_64 the present cyclades.h is broken due to the wrong size
of uclong. This affects, of course, both the kernel and the user-level
utilities. The symptom is that cyzload refuses to load the firmware. I
also managed to freeze the machine when unloading the module.
The patch below fixes this in an architecture-independent way. I have
tested it with 2.6.19 and the driver works fine again with a Cyclades-Z on
an Athlon 64 X2.
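As a rough illustration of the size problem (not the actual cyclades.h layout, and not necessarily the types the patch chose), compare a structure built from "unsigned long" with one built from fixed-width integers. Both struct names and their fields are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical firmware header using "unsigned long": its size follows
 * the ABI (typically 8 bytes on i386, 16 bytes on x86_64). */
struct fw_header_long {
    unsigned long magic;
    unsigned long length;
};

/* The same header with fixed-width fields: 8 bytes on every architecture,
 * so kernel, user-level tools and firmware agree on the layout. */
struct fw_header_fixed {
    uint32_t magic;
    uint32_t length;
};

static_assert(sizeof(struct fw_header_fixed) == 8,
              "fixed-width layout must not depend on the ABI");

size_t long_header_size(void)  { return sizeof(struct fw_header_long); }
size_t fixed_header_size(void) { return sizeof(struct fw_header_fixed); }
```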
[akpm@linux-foundation.org: fix warnings]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kprobes doesn't scribble the kprobe.symbol_name field. It's only set by the
module when registering the probe. Modules that exercise good hygiene
using the "const" qualifier will see warnings...
warning: assignment discards qualifiers from pointer target type
Make struct kprobe.symbol_name const char *
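A minimal userspace sketch of why the qualifier matters. The struct below is a simplified stand-in for struct kprobe containing only the field under discussion, and the setter is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for struct kprobe: only the field discussed above. */
struct probe_sketch {
    const char *symbol_name;   /* read by the framework, never written through */
};

/* Hypothetical setter: accepts a const-qualified name without any cast.
 * If symbol_name were a plain "char *", this assignment would trigger
 * "assignment discards qualifiers from pointer target type". */
void probe_set_target(struct probe_sketch *p, const char *name)
{
    assert(p != NULL && name != NULL);
    p->symbol_name = name;
}
```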
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Signed-off-by: Jim Keniston <jkenisto@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alan Cox [Tue, 8 May 2007 07:26:21 +0000 (00:26 -0700)]
tty: i386/x86_64 arbitrary speed support
Adds the needed TCGETS2/TCSETS2 ioctl calls, structures, defines and the like.
Tested against the test suite and passes. Other platforms should need
roughly the same change.
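From userspace, requesting a non-standard speed with these ioctls could look roughly like the sketch below. It assumes the Linux <asm/termbits.h> header, which provides struct termios2, CBAUD and BOTHER; the helper name is hypothetical, and it only prepares the structure without touching a device.

```c
#include <assert.h>
#include <stddef.h>
#include <asm/termbits.h>   /* struct termios2, CBAUD, BOTHER */

/* Hypothetical helper: mark a termios2 as carrying an arbitrary speed.
 * The caller would normally fill *tio with ioctl(fd, TCGETS2, tio) first
 * and apply the result with ioctl(fd, TCSETS2, tio) afterwards. */
void termios2_set_speed(struct termios2 *tio, unsigned int baud)
{
    assert(tio != NULL);
    tio->c_cflag &= ~CBAUD;   /* drop the old Bxxxx speed code */
    tio->c_cflag |= BOTHER;   /* speed is given in c_ispeed/c_ospeed */
    tio->c_ispeed = baud;
    tio->c_ospeed = baud;
}
```

Other c_cflag bits, such as the character size, are left untouched, so the helper can be applied to settings read back from an open tty.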
Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Dumazet [Tue, 8 May 2007 07:26:18 +0000 (00:26 -0700)]
VFS: delay the dentry name generation on sockets and pipes
1) Introduces a new method in 'struct dentry_operations'. This method
called d_dname() might be called from d_path() to build a pathname for
special filesystems. It is called without locks.
Future patches (if we succeed in having one common dentry for all
pipes/sockets) may need to change prototype of this method, but we now
use : char *d_dname(struct dentry *dentry, char *buffer, int buflen);
2) Adds a dynamic_dname() helper function that eases d_dname() implementations
3) Defines d_dname method for sockets : No more sprintf() at socket
creation. This is delayed up to the moment someone does an access to
/proc/pid/fd/...
4) Defines d_dname method for pipes : No more sprintf() at pipe
creation. This is delayed up to the moment someone does an access to
/proc/pid/fd/...
A benchmark consisting of 1.000.000 calls to pipe()/close()/close() gives a
*nice* speedup on my Pentium(M) 1.6 GHz :
3.090 s instead of 3.450 s
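The idea of delaying name generation can be sketched in userspace C. This is not the kernel implementation: the function names, the "pipe:[%lu]" format and the struct are assumptions made for illustration. The point is that object creation stores only a number, and the string is built when somebody asks for it.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Sketch of a dynamic_dname()-style helper: format the name on demand and
 * place it at the end of the caller's buffer. Returns a pointer to the
 * name, or NULL if it does not fit. */
char *dynamic_name_sketch(char *buffer, int buflen, const char *fmt, ...)
{
    char temp[64];
    va_list args;
    int len;

    assert(buffer != NULL && buflen > 0);
    va_start(args, fmt);
    len = vsnprintf(temp, sizeof(temp), fmt, args);
    va_end(args);

    if (len < 0 || (size_t)len >= sizeof(temp) || len + 1 > buflen)
        return NULL;
    /* Copy the name plus its terminating NUL to the tail of the buffer. */
    return memcpy(buffer + buflen - (len + 1), temp, (size_t)len + 1);
}

/* Hypothetical pipe object: creation stores the inode number, no string. */
struct pipe_sketch {
    unsigned long ino;
};

/* Hypothetical d_dname-like method, called only when a path is requested. */
char *pipe_name_sketch(const struct pipe_sketch *p, char *buffer, int buflen)
{
    return dynamic_name_sketch(buffer, buflen, "pipe:[%lu]", p->ino);
}
```

Creating a pipe_sketch costs no formatting at all; the formatting cost is paid only by the caller that asks for the name.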
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Christoph Hellwig <hch@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Cc: Markus Lidel <Markus.Lidel@shadowconnect.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Tue, 8 May 2007 07:26:04 +0000 (00:26 -0700)]
proc: maps protection
The /proc/pid/ "maps", "smaps", and "numa_maps" files contain sensitive
information about the memory location and usage of processes. Issues:
- maps should not be world-readable, especially if programs expect any
kind of ASLR protection from local attackers.
- maps cannot just be 0400 because "-D_FORTIFY_SOURCE=2 -O2" makes glibc
check the maps when %n is in a *printf call, and a setuid(getuid())
process wouldn't be able to read its own maps file. (For reference
see http://lkml.org/lkml/2006/1/22/150)
- a system-wide toggle is needed to allow prior behavior in the case of
non-root applications that depend on access to the maps contents.
This change implements a check using "ptrace_may_attach" before allowing
access to read the maps contents. To control this protection, the new knob
/proc/sys/kernel/maps_protect has been added, with corresponding updates to
the procfs documentation.
[akpm@linux-foundation.org: build fixes]
[akpm@linux-foundation.org: New sysctl numbers are old hat] Signed-off-by: Kees Cook <kees@outflux.net> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton [Tue, 8 May 2007 07:26:02 +0000 (00:26 -0700)]
virtual_eisa_root_init() should be __init
WARNING: vmlinux - Section mismatch: reference to
.init.text:eisa_root_register from .text between 'virtual_eisa_root_init' (at
offset 0xc026b80f) and 'cpufreq_debug_disable_ratelimit'
Cc: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It misspelled "MODVERSIONS" preprocessor variable with "CONFIG_MODVERSIONS".
Just kill it all.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miklos Szeredi [Tue, 8 May 2007 07:25:43 +0000 (00:25 -0700)]
add filesystem subtype support
There's a slight problem with filesystem type representation in fuse
based filesystems.
From the kernel's view, there are just two filesystem types: fuse and
fuseblk. From the user's view there are lots of different filesystem
types. The user is not even much concerned if the filesystem is fuse based
or not. So there's a conflict of interest in how this should be
represented in fstab, mtab and /proc/mounts.
The current scheme is to encode the real filesystem type in the mount
source. So an sshfs mount looks like this:
This url-ish syntax works OK for sshfs and similar filesystems. However
for block device based filesystems (ntfs-3g, zfs) it doesn't work, since
the kernel expects the mount source to be a real device name.
A possibly better scheme would be to encode the real type in the type
field as "type.subtype". So fuse mounts would look like this:
Davide Libenzi [Tue, 8 May 2007 07:25:41 +0000 (00:25 -0700)]
epoll: optimizations and cleanups
Epoll is doing multiple passes over the ready set at the moment, because of
the constraints over the f_op->poll() call. Looking at the code again, I
noticed that we already hold the epoll semaphore in read, and this
(together with other locking conditions that hold while doing an
epoll_wait()) can lead to a smarter way [1] to "ship" events to userspace
(in a single pass).
This is a stress application that can be used to test the new code. It
spawns multiple threads and calls epoll_wait() and epoll_ctl() from many
threads. Stress tested on my dual Opteron 254 w/out any problems.
http://www.xmailserver.org/totalmess.c
This is not a benchmark, just something that tries to stress and exploit
possible problems with the new code.
Also, I made a stupid micro-benchmark:
http://www.xmailserver.org/epwbench.c
[1] Considering that epoll must be thread-safe, there are five ways we can
be hit during an epoll_wait() transfer loop (ep_send_events()):
1) The epoll fd going away and calling ep_free
This just can't happen, since we did an fget() in sys_epoll_wait
2) An epoll_ctl(EPOLL_CTL_DEL)
This can't happen because epoll_ctl() gets ep->sem in write, and
we're holding it in read during ep_send_events()
3) An fd stored inside the epoll fd going away
This can't happen because in eventpoll_release_file() we get
ep->sem in write, and we're holding it in read during
ep_send_events()
4) Another epoll_wait() happening on another thread
They both can be inside ep_send_events() at the same time, we get
(splice) the ready-list under the spinlock, so each one will get
its own ready list. Note that an fd cannot be at the same time
inside more than one ready list, because ep_poll_callback() will
not re-queue it if it sees it already linked:
if (ep_is_linked(&epi->rdllink))
goto is_linked;
Another case that can happen, is two concurrent epoll_wait(),
coming in with a userspace event buffer of size, say, ten.
Suppose there are 50 events ready in the list. The first
epoll_wait() will "steal" the whole list, while the second, seeing
no events, will go to sleep. But at the end of ep_send_events() in
the first epoll_wait(), we will re-inject surplus ready fds, and we
will trigger the proper wake_up to the second epoll_wait().
5) ep_poll_callback() hitting us asynchronously
This is the tricky part. As I said above, the ep_is_linked() test
done inside ep_poll_callback() guarantees that, for as long as the
item remains linked to a list, ep_poll_callback() will not try
to re-queue it (read: write data to any of its members). When
we do a list_del() in ep_send_events(), the item will still satisfy
the ep_is_linked() test (whatever data is written in prev/next,
it'll never be its own pointer), so ep_poll_callback() will still
leave us alone. It's only after the eventual smp_mb()+INIT_LIST_HEAD(&epi->rdllink)
that it'll become visible to ep_poll_callback(), but at the point
we're already past it.
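The argument in case 5 rests on how a "linked" test over a doubly linked list node behaves after deletion. The following userspace sketch models that with a minimal list; the function names are illustrative, and NULL stands in for whatever stale values a real list_del() leaves in prev/next.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* A node that points at itself is considered unlinked. */
void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Models the ep_is_linked() idea: linked means "next is not myself". */
bool is_linked(const struct list_head *n)
{
    return n->next != n;
}

void list_add_tail_sketch(struct list_head *n, struct list_head *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink the node from its neighbours but leave stale, non-self pointers
 * behind, so the node still looks linked to is_linked(). */
void list_del_sketch(struct list_head *n)
{
    assert(is_linked(n));
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = NULL;   /* stand-in for stale/poison values */
    n->prev = NULL;
}
```

After list_del_sketch() the item is gone from the ready list yet still reports itself as linked; only re-initialising the node makes it eligible for re-queueing, which is the window the text describes.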
Adrian Bunk [Tue, 8 May 2007 07:25:37 +0000 (00:25 -0700)]
the scheduled removal of OBSOLETE_OSS options
Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Brownell [Tue, 8 May 2007 07:25:29 +0000 (00:25 -0700)]
init dma masks in pnp_dev
PNP now initializes device dma masks, which prevents oopses when generic
dma calls are made using pnp device nodes.
This assumes PNP only uses ISA DMA, with 24 bit addresses; and that it's
safe to init those masks for all devices (rather than finding out which
devices have been assigned DMA channels, and handling only those).
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Cc: Adam Belay <abelay@novell.com> Cc: Jaroslav Kysela <perex@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge sys_clone()/sys_unshare() nsproxy and namespace handling
sys_clone() and sys_unshare() both make copies of nsproxy and its associated
namespaces. But they have different code paths.
This patch merges all the nsproxy and its associated namespace copy/clone
handling (as much as possible). Posted on container list earlier for
feedback.
- Create a new nsproxy and its associated namespaces and pass it back to
caller to attach it to right process.
- Changed all copy_*_ns() routines to return a new copy of namespace
instead of attaching it to task->nsproxy.
- Moved the CAP_SYS_ADMIN checks out of copy_*_ns() routines.
- Removed unnecessary !ns checks from copy_*_ns() and added BUG_ON()
just in case.
- Get rid of all individual unshare_*_ns() routines and make use of
copy_*_ns() instead.
Nick Piggin [Tue, 8 May 2007 07:25:16 +0000 (00:25 -0700)]
exec: fix remove_arg_zero
Petr Tesarik discovered a problem in remove_arg_zero(). He writes:
When a script is loaded, load_script() replaces argv[0] with the
name of the interpreter and the filename passed to the exec syscall.
However, there is no guarantee that the length of the interpreter
name plus the length of the filename is greater than the length of
the original argv[0]. If the difference happens to cross a page boundary,
setup_arg_pages() will call put_dirty_page() [aka install_arg_page()]
with an address outside the VMA.
Therefore, remove_arg_zero() must free all pages which would be unused
after the argument is removed.
So, rewrite the remove_arg_zero function without gotos, with a few comments,
and with the commonly used explicit index/offset. This fixes the problem
and makes it easier to understand as well.
[a.p.zijlstra@chello.nl: add comment] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Petr Tesarik <ptesarik@suse.cz> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Guy Streeter [Tue, 8 May 2007 07:25:12 +0000 (00:25 -0700)]
Cap shmmax at INT_MAX in compat shminfo
The value of shmmax may be larger than will fit in the struct used by
the 32bit compat version of sys_shmctl. This change mirrors what the
normal sys_shmctl does when called with the old IPC_INFO command.
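The capping itself amounts to a saturating conversion. A minimal sketch, assuming the limit is held in a 64-bit unsigned value and the compat field is a signed int (the function name is hypothetical):

```c
#include <assert.h>
#include <limits.h>

/* The source type must be wider than the destination for capping to matter. */
static_assert(sizeof(unsigned long long) > sizeof(int),
              "expected a 64-bit source and a 32-bit destination");

/* Hypothetical helper: saturate a large shmmax value into an int field. */
int compat_shmmax(unsigned long long shmmax)
{
    return shmmax > INT_MAX ? INT_MAX : (int)shmmax;
}
```

Without the cap, a plain narrowing assignment would keep only the low bits of the value, so a large limit could show up as a small or negative number in the 32-bit structure.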
Signed-off-by: Guy Streeter <streeter@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prarit Bhargava [Tue, 8 May 2007 07:25:08 +0000 (00:25 -0700)]
Use stop_machine_run in the Intel RNG driver
Replace call_smp_function with stop_machine_run in the Intel RNG driver.
CPU A has done read_lock(&lock)
CPU B has done write_lock_irq(&lock) and is waiting for A to release the lock.
A third CPU, CPU C, calls call_smp_function and issues the IPI. CPU A takes CPU
C's IPI. CPU B is waiting with interrupts disabled and does not see the
IPI. CPU C is stuck waiting for CPU B to respond to the IPI.
Deadlock.
The solution is to use stop_machine_run instead of call_smp_function
(call_smp_function should not be called in situations where the CPUs may be
suspended).
[haruo.tomita@toshiba.co.jp: fix a typo in mod_init()]
[haruo.tomita@toshiba.co.jp: fix memory leak] Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Jan Beulich <jbeulich@novell.com> Cc: "Tomita, Haruo" <haruo.tomita@toshiba.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This could help to find buggy drivers where the request_irq return value wasn't
checked. There's just no reason to ignore errors which can and do occur.
Anyone who gets a warning during compilation has to realise that it isn't
really safe code.
kconfig: centralize the selection of semaphore debugging in lib/Kconfig.debug
Remove the Kconfig selection of semaphore debugging from the ALPHA and FRV
Kconfig files, and centralize it in lib/Kconfig.debug.
There doesn't seem to be much point in letting individual architectures
independently define the same Kconfig option when it can just as easily be
put in a single Kconfig file and made dependent on a subset of
architectures. That way, at least the option shows up in the same relative
location in the menu each time.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Cc: David Howells <dhowells@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Richard Henderson <rth@twiddle.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
reiserfs: correct misspelled "REISERFS_PROC_INFO" to "CONFIG_REISERFS_PROC_INFO"
Correct the misspelling of the preprocessor check of a Kconfig option to refer
to CONFIG_REISERFS_PROC_INFO and not just the incorrect REISERFS_PROC_INFO.
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nicolas Boichat [Tue, 8 May 2007 07:24:52 +0000 (00:24 -0700)]
Apple SMC driver (hardware monitoring and control)
This driver provides support for the Apple System Management Controller, which
provides an accelerometer (Apple Sudden Motion Sensor), light sensors,
temperature sensors, keyboard backlight control and fan control. Only
Intel-based Apple computers are supported (MacBook Pro, MacBook, MacMini).
[bunk@stusta.de: make drivers/hwmon/applesmc.c:backlight_work stati]
[khali@linux-fr.org: fix temperature attribute file names] Signed-off-by: Nicolas Boichat <nicolas@boichat.ch> Cc: Jean Delvare <khali@linux-fr.org> Cc: Dmitry Torokhov <dtor@mail.ru> Signed-off-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Schmidt [Tue, 8 May 2007 07:24:49 +0000 (00:24 -0700)]
Fix compilation of drivers with -O0
It is sometimes useful to compile individual drivers with optimization
disabled for easier debugging. Currently drivers which use htonl() and
similar functions don't compile with -O0. This patch fixes it. It also
removes obsolete and misleading comments. This header is not for
userspace, so we don't have to care about strange programs these comments
mention.
(akpm: -O0 probably isn't a good idea, but this code looks pretty crufty and
unuseful)
Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sam Ravnborg [Tue, 8 May 2007 07:24:42 +0000 (00:24 -0700)]
fix section mismatch warning in lib/swiotlb.c
kbuild spits out the following warning on a
defconfig x86_64 build:
WARNING: swiotlb.o - Section mismatch: reference to .init.text:swiotlb_init from __ksymtab between '__ksymtab_swiotlb_init' (at offset 0xa0) and '__ksymtab_swiotlb_free_coherent'
This warning happens because the function swiotlb_init is marked __init and
EXPORT_SYMBOL(). A 'git grep swiotlb_init' showed no users in drivers/ so
remove the EXPORT_SYMBOL.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Cc: Andi Kleen <ak@suse.de> Cc: "Luck, Tony" <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
reiserfs: possible null pointer dereference during resize
sb_read may return NULL, so let's explicitly check it. If it does, free the
new bitmap blocks array; after this we may safely exit, as is done above
during bitmap allocation.
Alan [Tue, 8 May 2007 07:24:21 +0000 (00:24 -0700)]
tty: Clarify documentation of ->write()
The tty driver write method is different to the usual fops device write
methods as the buffer is already in kernel space. Clarify the docs since
someone writing a driver made that mistake.
Signed-off-by: Alan Cox <alan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Florin Malita [Tue, 8 May 2007 07:24:18 +0000 (00:24 -0700)]
devpts: add fsnotify create event
Currently, devpts doesn't generate an fsnotify event upon pts creation
because the regular vfs paths aren't involved. Deallocation, on the other
hand, correctly generates a nameremove event thanks to the d_delete()
invocation in devpts_pty_kill().
This patch adds the missing fsnotify_create() trigger in devpts_pty_new().
Signed-off-by: Florin Malita <fmalita@gmail.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Documentation: Ask driver writers to provide PM support
Add a paragraph in Documentation/SubmittingDrivers requesting that the
basic PM support be provided by new device drivers.
Add two new documents in Documentation/power/ giving general instructions
on debugging the suspend/resume functionality and testing the suspend and
resume support in device drivers.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Cc: David Brownell <david-b@pacbell.net> Cc: Nigel Cunningham <ncunningham@linuxmail.org> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Trent Piepho [Tue, 8 May 2007 07:24:05 +0000 (00:24 -0700)]
Fix constant folding and poor optimization in byte swapping code
Constant folding does not work for the swabXX() byte swapping functions,
and the C versions optimize poorly.
Attempting to initialize a global variable to swab16(0x1234) or put
something like "case swab32(42):" in a switch statement will not compile.
It can work, swab.h just isn't doing it correctly. This patch fixes that.
Contrary to the comment in asm-i386/byteorder.h, gcc does not recognize the
"C" version of swab16 and turn it into efficient code. gcc can do this,
just not with the current code. The simple function:
u16 foo(u16 x) { return swab16(x); }
Would compile to:
movzwl %ax, %eax
movl %eax, %edx
shrl $8, %eax
sall $8, %edx
orl %eax, %edx
With this patch, it will compile to:
rolw $8, %ax
I also attempted to document the maze of different macros/inline functions
that are used to create the final product.
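As a rough illustration of the constant-folding issue described above, the sketch below splits a 16-bit swap into a pure-expression macro and a runtime function, and dispatches between them with the GCC/Clang builtin __builtin_constant_p. The names CONST_SWAB16, runtime_swab16, my_swab16, swapped_global and classify are made up for this example and are not the identifiers used in swab.h.

```c
#include <stdint.h>

/* Pure-expression swap: valid wherever C requires a constant expression. */
#define CONST_SWAB16(x) \
	((uint16_t)((((uint16_t)(x) & 0x00ffU) << 8) | \
		    (((uint16_t)(x) & 0xff00U) >> 8)))

/* Runtime swap; an architecture could substitute a single rotate here. */
static inline uint16_t runtime_swab16(uint16_t x)
{
	return (uint16_t)((x << 8) | (x >> 8));
}

/* Dispatch (GCC/Clang builtin): constants take the foldable expression,
 * everything else calls the function. */
#define my_swab16(x) \
	(__builtin_constant_p((uint16_t)(x)) ? CONST_SWAB16(x) : runtime_swab16(x))

/* A global initializer must be a constant expression. */
static const uint16_t swapped_global = CONST_SWAB16(0x1234);

/* Case labels must be integer constant expressions as well. */
static int classify(uint16_t v)
{
	switch (v) {
	case CONST_SWAB16(0x0102):
		return 1;
	default:
		return 0;
	}
}
```

The global and the case label use the expression form directly, which keeps the example within standard C; only the dispatch macro relies on a compiler extension.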
Corey Minyard [Tue, 8 May 2007 07:23:58 +0000 (00:23 -0700)]
ipmi: add new IPMI nmi watchdog handling
Convert over to the new NMI handling for getting IPMI watchdog timeouts via an
NMI. This adds config options to know if there is the ability to receive NMIs
and if it has an NMI post processing call. Then it modifies the IPMI watchdog
to take advantage of this so that it can know if an NMI comes in.
It also adds testing that the IPMI NMI watchdog works.
Corey Minyard [Tue, 8 May 2007 07:23:54 +0000 (00:23 -0700)]
ipmi: allow shared interrupts
The IPMI driver used enable_irq and disable_irq when it got into situations
where it couldn't allocate memory; it did this to avoid having the interrupt
just lock the machine when it couldn't get memory to perform the transaction
to disable the interrupt.
This patch modifies the driver to not use disable_irq and enable_irq. It
instead sends the messages to the BMC to perform this operation. It also
makes sure interrupts are cleanly disabled when the interface is shut down and
cleans up some shutdown things that are no longer necessary.
Corey Minyard [Tue, 8 May 2007 07:23:51 +0000 (00:23 -0700)]
ipmi: add powerpc openfirmware sensing
Add support for of_platform_driver to the ipmi_si module. When loading the
module, the driver will be registered to of_platform. The driver will be
probed for all devices with the type ipmi. It supports devices with
compatible settings ipmi-kcs, ipmi-smic and ipmi-bt. Only ipmi-kcs could be
tested.
Andrew Morton [Tue, 8 May 2007 07:23:49 +0000 (00:23 -0700)]
mm: shrink parent dentries when shrinking slab
Teach the dentry slab shrinker to aggressively shrink parent dentries when
shrinking the dentry cache.
This is done to attempt to improve the situation where the dentry slab cache
gets a lot of internal fragmentation due to pages containing directory
dentries. It is expected that this change will cause some of those dentries
to be reaped earlier, and with less scanning.
Miklos Szeredi [Tue, 8 May 2007 07:23:46 +0000 (00:23 -0700)]
fix quadratic behavior of shrink_dcache_parent()
The time shrink_dcache_parent() takes grows quadratically with the depth
of the tree under 'parent'. This starts to get noticeable at about 10,000.
These kinds of depths don't occur normally, and filesystems which invoke
shrink_dcache_parent() via d_invalidate() seem to have other depth
dependent timings, so it's not even easy to expose this problem.
However with FUSE it's easy to create a deep tree and d_invalidate()
will also get called. This can make a syscall hang for a very long
time.
This is the original discovery of the problem by Russ Cox:
The following patch fixes the quadratic behavior, by optionally allowing
prune_dcache() to prune ancestors of a dentry in one go, instead of doing
it one at a time.
Common code in dput() and prune_one_dentry() is extracted into a new helper
function d_kill().
shrink_dcache_parent() as well as shrink_dcache_sb() are converted to use
the ancestry-pruner option. Only for shrink_dcache_memory() is this
behavior not desirable, so it keeps using the old algorithm.
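The difference between the two pruning strategies can be sketched with a toy model. This is not the dcache code: it assumes a tree that is a single chain, where only the deepest entry is unused at any time, and it counts node visits. All names (toy_dentry, build_chain, prune_one_at_a_time, prune_with_ancestors) are hypothetical.

```c
#include <stddef.h>

/* Toy dentry: one child per directory, so the tree is a single chain. */
struct toy_dentry {
	struct toy_dentry *parent;
	struct toy_dentry *child;	/* NULL once the child has been pruned */
};

/* Link an array of n nodes into a chain; nodes[0] is the root. */
static void build_chain(struct toy_dentry *nodes, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		nodes[i].parent = i ? &nodes[i - 1] : NULL;
		nodes[i].child = (i + 1 < n) ? &nodes[i + 1] : NULL;
	}
}

/* One at a time: every pass walks down from the root and prunes only
 * the leaf. Returns the number of node visits. */
static unsigned long prune_one_at_a_time(struct toy_dentry *root)
{
	unsigned long visits = 0;

	while (root->child) {
		struct toy_dentry *d = root;

		while (d->child) {	/* find the only unused entry */
			d = d->child;
			visits++;
		}
		d->parent->child = NULL;	/* prune it, then start over */
	}
	return visits;
}

/* Ancestors in one go: walk down once, then prune each ancestor as soon
 * as it becomes unused. Returns the number of node visits. */
static unsigned long prune_with_ancestors(struct toy_dentry *root)
{
	unsigned long visits = 0;
	struct toy_dentry *d = root;

	while (d->child) {
		d = d->child;
		visits++;
	}
	while (d != root) {
		struct toy_dentry *parent = d->parent;

		parent->child = NULL;
		d = parent;
		visits++;
	}
	return visits;
}
```

For a chain of n nodes the first function performs n(n-1)/2 visits and the second 2(n-1), which is the quadratic versus linear behavior the message describes.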
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Maneesh Soni <maneesh@in.ibm.com> Acked-by: "Paul E. McKenney" <paulmck@us.ibm.com> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Neil Brown <neilb@suse.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
William Cohen [Tue, 8 May 2007 07:23:41 +0000 (00:23 -0700)]
reduce size of task_struct on 64-bit machines
This past week I was playing around with that pahole tool
(http://oops.ghostprotocols.net:81/acme/dwarves/) and looking at the size
of various structs in the kernel. I was surprised by the size of the
task_struct on x86_64, approaching 4K. I looked through the fields in
task_struct and found that a number of them were declared as "unsigned
long" rather than "unsigned int" despite them appearing okay as 32-bit
sized fields. On x86_64 "unsigned long" ends up being 8 bytes in size and
forces 8 byte alignment. Is there a reason they are
"unsigned long"?
The patch below drops the size of the struct from 3808 bytes (60 64-byte
cachelines) to 3760 bytes (59 64-byte cachelines). A couple other fields
in the task struct take a significant amount of space:
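As a small illustration of the "unsigned long" versus "unsigned int" effect discussed above (separate from the field list this message refers to), the sketch below compares two layouts. The field names are hypothetical and do not reproduce the real task_struct.

```c
#include <stddef.h>

/* Hypothetical fields declared as unsigned long: on an LP64 target each
 * one is 8 bytes wide and forces padding after the neighbouring ints. */
struct wide_fields {
	unsigned long flags;
	int           prio;
	unsigned long policy;
	int           exit_state;
	unsigned long personality;
};

/* The same fields narrowed to unsigned int: no padding is needed. */
struct narrow_fields {
	unsigned int flags;
	int          prio;
	unsigned int policy;
	int          exit_state;
	unsigned int personality;
};

/* Bytes saved by narrowing; 0 where long and int have the same size. */
static size_t bytes_saved(void)
{
	return sizeof(struct wide_fields) - sizeof(struct narrow_fields);
}
```

On x86_64 with the usual ABI the wide layout should occupy 40 bytes against 20 for the narrow one; on a 32-bit target the two are expected to be identical.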
Cc: Andreas Dilger <adilger@dilger.ca> Cc: <linux-ext4@vger.kernel.org>
Andreas says:
This patch is now treating timestamps with the high bit set as negative
times (before Jan 1, 1970). This means we lose 1/2 of the possible range
of timestamps (lopping off 68 years before unix timestamp overflow -
now only 30 years away :-) to handle the extremely rare case of setting
timestamps into the distant past.
If we are only interested in fixing the underflow case, we could just
limit the values to 0 instead of storing negative values. At worst this
will skew the timestamp by a few hours for timezones in the far east
(files would still show Jan 1, 1970 in "ls -l" output).
That said, it seems 32-bit systems (mine at least) allow files to be set
into the past (01/01/1907 works fine) so it seems this patch is bringing
the x86_64 behaviour into sync with other kernels.
On the plus side, we have a patch that is ready to add nanosecond timestamps
to ext3 and as an added bonus adds 2 high bits to the on-disk timestamp so
this extends the maximum date to 2242.
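The trade-off Andreas describes can be sketched as follows, assuming a 32-bit on-disk seconds field. The function names are hypothetical and this is not the ext3 code; it only shows the three interpretations mentioned in the discussion.

```c
#include <stdint.h>

/* Signed interpretation: a set high bit means a time before Jan 1, 1970.
 * The uint32_t -> int32_t conversion is implementation-defined in C;
 * GCC and Clang wrap modulo 2^32, which is what this sketch assumes. */
static int64_t decode_signed(uint32_t raw)
{
	return (int64_t)(int32_t)raw;
}

/* Unsigned interpretation: no pre-1970 times, but twice the forward range. */
static int64_t decode_unsigned(uint32_t raw)
{
	return (int64_t)raw;
}

/* The alternative raised above: clamp negative times to 0 when storing.
 * Assumes the caller passes a value that otherwise fits in 32 bits. */
static uint32_t encode_clamped(int64_t seconds)
{
	if (seconds < 0)
		return 0;
	return (uint32_t)seconds;
}
```

With the signed decoding the raw value 0xFFFFFFFF reads back as -1 (one second before the epoch), while the unsigned decoding reads it as 4294967295.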
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan [Tue, 8 May 2007 07:23:35 +0000 (00:23 -0700)]
Allow access to /proc/$PID/fd after setuid()
/proc/$PID/fd has r-x------ permissions, so if a process does setuid(), it
will not be able to access /proc/*/fd/. This breaks fstatat() emulation
in glibc.
Jeff Dike [Tue, 8 May 2007 07:23:22 +0000 (00:23 -0700)]
uml: an idle system should have zero load average
The ever-vigilant users of linode.com noticed that an idle 2.6 UML has a
persistent load average of ~.4.
It turns out that because the UML timer handler processed softirqs before
actually delivering the tick, the tick was counted in the context of the idle
thread about half the time.
Signed-off-by: Jeff Dike <jdike@linux.intel.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jeff Dike [Tue, 8 May 2007 07:23:18 +0000 (00:23 -0700)]
uml: hostfs style fixes
hostfs needed some style goodness.
Signed-off-by: Jeff Dike <jdike@linux.intel.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
uml: make hostfs_setattr() support operations on unlinked open files
This patch allows hostfs_setattr() to work on unlinked open files by calling
set_attr() (the userspace part) with the inode's fd.
Without this, applications that depend on doing attribute changes to unlinked
open files will fail.
It works by using the fd versions instead of the path ones (for example
fchmod() instead of chmod(), fchown() instead of chown()) when an fd is
available.
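The idea of preferring the fd variant can be sketched in plain userspace C. This is not the hostfs code: set_mode and demo_unlinked_chmod are hypothetical helpers, and the /tmp path is an assumption about the test environment.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Change permissions, preferring the descriptor when one is available.
 * fd < 0 means "no open descriptor", so fall back to the path. */
static int set_mode(const char *path, int fd, mode_t mode)
{
	if (fd >= 0)
		return fchmod(fd, mode);
	return chmod(path, mode);
}

/* Create a temporary file, unlink it, then change its mode through the
 * fd. Returns the resulting permission bits, or -1 on error. */
static int demo_unlinked_chmod(void)
{
	char path[] = "/tmp/hostfs-demo-XXXXXX";
	struct stat st;
	int fd = mkstemp(path);
	int result = -1;

	if (fd < 0)
		return -1;
	if (unlink(path) == 0 &&
	    set_mode(path, fd, 0640) == 0 &&
	    fstat(fd, &st) == 0)
		result = (int)(st.st_mode & 0777);
	close(fd);
	return result;
}
```

Once the file is unlinked, chmod() on the old path would fail because the name no longer exists, whereas fchmod() still reaches the inode through the open descriptor.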
Signed-off-by: Alberto Bertogli <albertito@gmail.com> Signed-off-by: Jeff Dike <jdike@linux.intel.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yasunori Goto [Tue, 8 May 2007 07:23:10 +0000 (00:23 -0700)]
Add white list into modpost.c for memory hotplug code and ia64's machvec section
This patch adds a white list to modpost.c for some functions and for
ia64's .machvec section, to fix section mismatches.
sparse_index_alloc() and zone_wait_table_init() call the bootmem allocator
at boot time, and kmalloc/vmalloc at hotplug time. If memory hotplug is
configured, they (normal text) contain references to the bootmem allocator
(init text). This is the cause of the section mismatch.
Bootmem is called by many functions and it must be
used only at boot time. I think the __init annotations on the bootmem
functions should be kept for the section mismatch check. So I would like to
register sparse_index_alloc() and zone_wait_table_init() in the white list.
In addition, ia64's .machvec section is a function table for some
platform-dependent code. It is a mixture of .init.text and normal text.
These references to __init functions are valid too.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Woodhouse [Tue, 8 May 2007 07:22:59 +0000 (00:22 -0700)]
Increase slab redzone to 64bits
There are two problems with the existing redzone implementation.
Firstly, it's causing misalignment of structures which contain a 64-bit
integer, such as netfilter's 'struct ipt_entry' -- causing netfilter
modules to fail to load because of the misalignment. (In particular, the
first check in
net/ipv4/netfilter/ip_tables.c::check_entry_size_and_hooks())
On ppc32 and sparc32, amongst others, __alignof__(uint64_t) == 8.
With slab debugging, we use 32-bit redzones. And allocated slab objects
aren't sufficiently aligned to hold a structure containing a uint64_t.
By _just_ setting ARCH_KMALLOC_MINALIGN to __alignof__(u64) we'd disable
redzone checks on those architectures. By using 64-bit redzones we avoid that
loss of debugging, and also fix the other problem while we're at it.
When investigating this, I noticed that on 64-bit platforms we're using a
32-bit value of RED_ACTIVE/RED_INACTIVE in the 64-bit memory location set
aside for the redzone. Which means that the four bytes immediately before
or after the allocated object are 0x00,0x00,0x00,0x00 for LE and BE
machines, respectively. Which is probably not the most useful choice of
poison value.
One way to fix both of those at once is just to switch to 64-bit
redzones in all cases.
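Both problems can be illustrated with a little arithmetic. The sketch below assumes an 8-byte-aligned slab base with the object placed directly after the leading redzone, and it uses made-up poison patterns rather than the kernel's RED_ACTIVE/RED_INACTIVE values; all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Made-up poison patterns for illustration only. */
#define HYPOTHETICAL_POISON32 0xA5A5A5A5u
#define HYPOTHETICAL_POISON64 0xA5A5A5A5A5A5A5A5ull

/* With an aligned slab base, the object starts at offset redzone_size,
 * so it is aligned only if the redzone size is a multiple of align. */
static int object_is_aligned(size_t redzone_size, size_t align)
{
	return redzone_size % align == 0;
}

/* Count the zero bytes in a 64-bit redzone slot holding value. Zero
 * bytes next to the object would not stand out when overwritten by a
 * stray zero, so they make a weak poison. */
static int zero_bytes(uint64_t value)
{
	unsigned char bytes[sizeof(value)];
	int zeros = 0;

	memcpy(bytes, &value, sizeof(value));
	for (size_t i = 0; i < sizeof(bytes); i++)
		if (bytes[i] == 0)
			zeros++;
	return zeros;
}
```

A 4-byte redzone leaves the object at offset 4, which fails an 8-byte alignment requirement, while an 8-byte redzone satisfies it. Storing the 32-bit pattern in a 64-bit slot leaves four zero bytes, and which end of the slot they occupy depends on the byte order.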
Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@engr.sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>