"It causes an oops when auto-detecting raid arrays, and it doesn't
seem easy to fix.
The array may not be 'open' when do_md_run is called, so
bdev->bd_disk might be NULL, so bd_set_size can oops.
This whole approach of opening an md device before it has been
assembled just seems to get more and more painful. I think I'm going
to have to come up with something clever to provide both backward
comparability with usage expectation, and sane integration into the
rest of the kernel."
Paul Walmsley [Wed, 9 May 2007 16:47:16 +0000 (10:47 -0600)]
Fix hang on IBM Token Ring PCMCIA card ejection
Ejecting a PCMCIA IBM Token Ring card that has not had its dev->open()
called will reliably trigger an uninitialized spinlock oops when
spinlock debugging is enabled. The system then hangs, occasionally
softlockup oopsing. Apparently ibmtr.c:tok_interrupt() doesn't expect
to be called before tok_open(), but tok_interrupt() gets called anyway
when the card is ejected. So, set an already-existing flag which
causes tok_interrupt() to bail out early upon card ejection. Tested by
inserting and removing the PCMCIA card several times.
Signed-off-by: Paul Walmsley <paul@booyaka.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
By default, the skge driver now enables wake on magic and wake on PHY.
This is a bad default (bug), wake on PHY means machine will never shutdown
if connected to a switch.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>a Signed-off-by: Jeff Garzik <jeff@garzik.org>
* master.kernel.org:/pub/scm/linux/kernel/git/bart/ide-2.6:
ide: fix PIO setup on resume for ATAPI devices
ide: legacy PCI bus order probing fixes
ide: add ide_proc_register_port()
ide: add "initializing" argument to ide_register_hw()
ide: cable detection fixes (take 2)
ide: move IDE settings handling to ide-proc.c
ide: split off ioctl handling from IDE settings (v2)
ide: make /proc/ide/ optional
ide: add ide_tune_dma() helper
ide: rework the code for selecting the best DMA transfer mode (v3)
ide: fix UDMA/MWDMA/SWDMA masks (v3)
Linus Torvalds [Wed, 9 May 2007 22:29:58 +0000 (15:29 -0700)]
Merge git://git.linux-nfs.org/pub/linux/nfs-2.6
* git://git.linux-nfs.org/pub/linux/nfs-2.6:
NFS: Kill the obsolete NFS_PARANOIA
NFS: use __set_current_state()
sunrpc: fix crash in rpc_malloc()
NFS: Clean up NFSv4 XDR error message
NFS: NFS client underestimates how large an NFSv4 SETATTR reply can be
SUNRPC: Fix pointer arithmetic bug recently introduced in rpc_malloc/free
NFS: Remove redundant check in nfs_check_verifier()
NFS: Fix a jiffie wraparound issue
IDE PCI host drivers should register themselves with IDE core only when
IDE driver is built-in, otherwise (IDE driver is modular and thus IDE PCI
host drivers are also modular) the code has no effect and just complicates
the probing.
Fix it by adding new config option CONFIG_IDEPCI_PCIBUS (defined only when
needed and invisible to the user) and covering by #ifdef/#endif the code
in question. It turned out that "ide=reverse" was silently accepted but did
nothing in case when IDE driver was modular, this is fixed now.
* create_proc_ide_interfaces() tries to add /proc entries for every probed
and initialized IDE port, replace it by ide_proc_register_port() which does
it only for the given port (also rename destroy_proc_ide_interface() to
ide_proc_unregister_port() for consistency)
* convert {create,destroy}_proc_ide_interface[s]() users to use new functions
* pmac driver depended on proc_ide_create() to add /proc port entries, fix it
* au1xxx-ide, swarm and cs5520 drivers depended indirectly on ide-generic
driver (CONFIG_IDE_GENERIC=y) to add port /proc entries, fix them
* there is now no need to add /proc entries for IDE ports in proc_ide_create()
so don't do it
* proc_ide_create() needs now to be called before drivers are probed - fix it,
while at it make proc_ide_create() create /proc "ide" directory
ide: add "initializing" argument to ide_register_hw()
Add "initializing" argument to ide_register_hw() and use it instead of ide.c
wide variable of the same name. Update all users of ide_register_hw()
accordingly.
Tejun's recent eighty_ninty_three() fix has inspired me to do more thorough
review of the cable detection code...
* print user-friendly warning about limiting the maximum transfer speed
to UDMA33 (and the reason behind it) when 80-wire cable is not detected,
also while at it cleanup eighty_ninty_three() a bit
* use eighty_ninty_three() in ide_ata66_check(), this actually fixes 3 bugs:
- bit 14 (word 93 validity check) == 1 && bit 13 (80-wire cable test) == 1
were used as 80-wire cable present test for CONFIG_IDEDMA_IVB=n case
(please see FIXME comment in eighty_ninty_three() for more details)
- CONFIG_IDEDMA_IVB=y/n cases were interchanged
- check for SATA devices was missing
* remove private cable warnings from pdc_202xx{old,new} drivers now that core
code provides this functionality (plus, in pdc202xx_new case the test could
give false warnings for ATAPI devices because pdc202xx_new driver doesn't
even support ATAPI DMA)
ide: split off ioctl handling from IDE settings (v2)
* do write permission and min/max checks in ide_procset_t functions
* ide-disk.c: drive->id is always available so cleanup "multcount" setting
accordingly
* ide-disk.c: "address" setting was incorrectly defined as type TYPE_INTA,
fix it by using type TYPE_BYTE and updating ide_drive_t->adressing field,
the bug didn't trigger because this IDE setting uses custom ->set function
* ide.c: add set_ksettings() for handling HDIO_SET_KEEPSETTINGS ioctl
* ide.c: add set_unmaskirq() for handling HDIO_SET_UNMASKINTR ioctl
* handle ioctls directly in generic_ide_ioclt() and idedisk_ioctl()
instead of using IDE settings to deal with them
* remove no longer needed ide_find_setting_by_ioctl() and {read,write}_ioctl
fields from ide_settings_t, also remove now unused TYPE_INTA handling
v2:
* add missing EXPORT_SYMBOL_GPL(ide_setting_sem) needed now for ide-disk
ide: rework the code for selecting the best DMA transfer mode (v3)
Depends on the "ide: fix UDMA/MWDMA/SWDMA masks" patch.
* add ide_hwif_t.udma_filter hook for filtering UDMA mask
(use it in alim15x3, hpt366, siimage and serverworks drivers)
* add ide_max_dma_mode() for finding best DMA mode for the device
(loosely based on some older libata-core.c code)
* convert ide_dma_speed() users to use ide_max_dma_mode()
* make ide_rate_filter() take "ide_drive_t *drive" as an argument instead
of "u8 mode" and teach it to how to use UDMA mask to do filtering
* use ide_rate_filter() in hpt366 driver
* remove no longer needed ide_dma_speed() and *_ratemask()
* unexport eighty_ninty_three()
v2:
* rename ->filter_udma_mask to ->udma_filter
[ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
v3:
* updated for scc_pata driver (fixes XFER_UDMA_6 filtering for user-space
originated transfer mode change requests when 100MHz clock is used)
* use 0x00 instead of 0x80 to disable ->{ultra,mwdma,swdma}_mask
* add udma_mask field to ide_pci_device_t and use it to initialize
->ultra_mask in aec62xx, cmd64x, pdc202xx_{new,old} and piix drivers
* fix UDMA masks to match with chipset specific *_ratemask()
(alim15x3, hpt366, serverworks and siimage drivers need UDMA mask
filtering method - done in the next patch)
v2:
* piix: fix cable detection for 82801AA_1 and 82372FB_1
[ Noticed by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
* cmd64x: use hwif->cds->udma_mask
[ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
* aec62xx: fix newly introduced bug - check DMA status not command register
[ Noticed by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
v3:
* piix: use hwif->cds->udma_mask
[ Suggested by Sergei Shtylyov <sshtylyov@ru.mvista.com>. ]
Trond Myklebust [Wed, 9 May 2007 13:00:18 +0000 (09:00 -0400)]
NFS: Remove redundant check in nfs_check_verifier()
The check for nfs_attribute_timeout(dir) in nfs_check_verifier is
redundant: nfs_lookup_revalidate() will already call nfs_revalidate_inode()
on the parent dir when necessary.
The only case where this is not done is the case of a negative dentry. Fix
this case by moving up the revalidation code.
Trond Myklebust [Wed, 9 May 2007 13:00:17 +0000 (09:00 -0400)]
NFS: Fix a jiffie wraparound issue
dentry verifiers are always set to the parent directory's
cache_change_attribute. There is no reason to be testing for anything other
than equality when we're trying to find out if the dentry has been checked
since the last time the directory was modified.
* git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6:
[IA64] wire up pselect, ppoll
[IA64] Add TIF_RESTORE_SIGMASK
[IA64] unwind did not work for processes born with CLONE_STOPPED
[IA64] Optional method to purge the TLB on SN systems
[IA64] SPIN_LOCK_UNLOCKED macro cleanup in arch/ia64
[IA64-SN2][KJ] mmtimer.c-kzalloc
[IA64] fix stack alignment for ia32 signal handlers
[IA64] - Altix: hotplug after intr redirect can crash system
[IA64] save and restore cpus_allowed in cpu_idle_wait
[IA64] Removal of percpu TR cleanup in kexec code
[IA64] Fix some section mismatch errors
Linus Torvalds [Wed, 9 May 2007 20:10:11 +0000 (13:10 -0700)]
Merge git://git.infradead.org/mtd-2.6
* git://git.infradead.org/mtd-2.6: (21 commits)
[MTD] [CHIPS] Remove MTD_OBSOLETE_CHIPS (jedec, amd_flash, sharp)
[MTD] Delete allegedly obsolete "bank_size" field of mtd_info.
[MTD] Remove unnecessary user space check from mtd.h.
[MTD] [MAPS] Remove flash maps for no longer supported 405LP boards
[MTD] [MAPS] Fix missing printk() parameter in physmap_of.c MTD driver
[MTD] [NAND] platform NAND driver: add driver
[MTD] [NAND] platform NAND driver: update header
[JFFS2] Simplify and clean up jffs2_add_tn_to_tree() some more.
[JFFS2] Remove another bogus optimisation in jffs2_add_tn_to_tree()
[JFFS2] Remove broken insert_point optimisation in jffs2_add_tn_to_tree()
[JFFS2] Remember to calculate overlap on nodes which replace older nodes
[JFFS2] Don't advance c->wbuf_ofs to next eraseblock after wbuf flush
[MTD] [NAND] at91_nand.c: CMDLINE_PARTS support
[MTD] [NAND] Tidy up handling of page number in nand_block_bad()
[MTD] block2mtd_paramline[] mustn't be __initdata
[MTD] [NAND] Support multiple chips in CAFÉ driver
[MTD] [NAND] Rename cafe.c to cafe_nand.c and remove the multi-obj magic
[MTD] [NAND] Use rslib for CAFÉ ECC
[RSLIB] Support non-canonical GF representations
[JFFS2] Remove dead file histo_mips.h
...
It added a "mark_page_accessed(page)" at the final return path in
read_cache_page_async(). But in error cases, 'page' holds the error
code, and you can't mark it accessed.
[ and Glauber de Oliveira Costa points out that we can use a return
instead of adding more goto's ]
Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Wed, 9 May 2007 19:56:01 +0000 (12:56 -0700)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc:
[POWERPC] Further fixes for the removal of 4level-fixup hack from ppc32
[POWERPC] EEH: log all PCI-X and PCI-E AER registers
[POWERPC] EEH: capture and log pci state on error
[POWERPC] EEH: Split up long error msg
[POWERPC] EEH: log error only after driver notification.
[POWERPC] fsl_soc: Make mac_addr const in fs_enet_of_init().
[POWERPC] Don't use SLAB/SLUB for PTE pages
[POWERPC] Spufs support for 64K LS mappings on 4K kernels
[POWERPC] Add ability to 4K kernel to hash in 64K pages
[POWERPC] Introduce address space "slices"
[POWERPC] Small fixes & cleanups in segment page size demotion
[POWERPC] iSeries: Make HVC_ISERIES the default
[POWERPC] iSeries: suppress build warning in lparmap.c
[POWERPC] Mark pages that don't exist as nosave
[POWERPC] swsusp: Introduce register_nosave_region_late
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: (25 commits)
sound: convert "sound" subdirectory to UTF-8
MAINTAINERS: Add cxacru website/mailing list
include files: convert "include" subdirectory to UTF-8
general: convert "kernel" subdirectory to UTF-8
documentation: convert the Documentation directory to UTF-8
Convert the toplevel files CREDITS and MAINTAINERS to UTF-8.
remove broken URLs from net drivers' output
Magic number prefix consistency change to Documentation/magic-number.txt
trivial: s/i_sem /i_mutex/
fix file specification in comments
drivers/base/platform.c: fix small typo in doc
misc doc and kconfig typos
Remove obsolete fat_cvf help text
Fix occurrences of "the the "
Fix minor typoes in kernel/module.c
Kconfig: Remove reference to external mqueue library
Kconfig: A couple of grammatical fixes in arch/i386/Kconfig
Correct comments in genrtc.c to refer to correct /proc file.
Fix more "deprecated" spellos.
Fix "deprecated" typoes.
...
H. Peter Anvin [Wed, 9 May 2007 07:02:11 +0000 (00:02 -0700)]
i386: msr.h: be paranoid about types and parentheses
When implementing things as macros, make sure we use typecasts and
parentheses where needed. The macros as defined were vulnerable to
surreptitious promotion causing problems.
Avoid macros where practical; e.g. wrmsr() can be an inline instead.
Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
H. Peter Anvin [Wed, 9 May 2007 07:02:00 +0000 (00:02 -0700)]
i386: cpu/transmeta.c: fix definition of USER686
The definition of USER686 is supposed to be a mask of feature bits,
not an OR of feature numbers! It happened to work anyway on the only
processor affected, simply by pure coincidence. Fix.
Signed-off-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NeilBrown [Wed, 9 May 2007 09:35:39 +0000 (02:35 -0700)]
md: improve partition detection in md array
md currently uses ->media_changed to make sure rescan_partitions
is call on md array after they are assembled.
However that doesn't happen until the array is opened, which is later
than some people would like.
So use blkdev_ioctl to do the rescan immediately that the
array has been assembled.
This means we can remove all the ->change infrastructure as it was only used
to trigger a partition rescan.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NeilBrown [Wed, 9 May 2007 09:35:38 +0000 (02:35 -0700)]
md: allow reshape_position for md arrays to be set via sysfs
"reshape_position" records how much progress has been made on a "reshape"
(adding drives, changing layout or chunksize).
When it is set, the number of drives, layout and chunksize can have
two possible values, an old an a new.
So allow these different values to be visible, and allow both old and new to
be set: Set the old ones first, then the reshape_position, then the new
values.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NeilBrown [Wed, 9 May 2007 09:35:37 +0000 (02:35 -0700)]
md: remove the slash from the name of a kmem_cache used by raid5
SLUB doesn't like slashes as it wants to use the cache name as the name of a
directory (or symlink) in sysfs.
Signed-off-by: Neil Brown <neilb@suse.de> Acked-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NeilBrown [Wed, 9 May 2007 09:35:37 +0000 (02:35 -0700)]
md: stop using csum_partial for checksum calculation in md
If CONFIG_NET is not selected, csum_partial is not exported, so md.ko cannot
use it. We shouldn't really be using csum_partial anyway as it is an
internal-to-networking interface.
So replace it with C code to do the same thing. Speed is not crucial here, so
something simple and correct is best.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NeilBrown [Wed, 9 May 2007 09:35:36 +0000 (02:35 -0700)]
md: move test for whether level supports bitmap to correct place
We need to check for internal-consistency of superblock in load_super.
validate_super is for inter-device consistency.
With the test in the wrong place, a badly created array will confuse md rather
an produce sensible errors.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Martin Peschke [Wed, 9 May 2007 09:35:35 +0000 (02:35 -0700)]
md: cleanup: use seq_release_private() where appropriate
We can save some lines of code by using seq_release_private().
Signed-off-by: Martin Peschke <mp3@de.ibm.com> Acked-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drivers/md.c: Use ARRAY_SIZE macro when appropriate
Use ARRAY_SIZE macro already defined in kernel.h
Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com> Acked-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 9 May 2007 09:35:28 +0000 (02:35 -0700)]
sh: dma: use __maybe_unused
There is no such thing as labeling a variable as __attribute__((used)). Since
ts_shift is not referenced in inline assembly, we assume that we're simply
suppressing a warning here if the variable is declared but unreferenced.
Cc: Paul Mundt <lethal@linux-sh.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Wed, 9 May 2007 09:35:27 +0000 (02:35 -0700)]
compiler: introduce __used and __maybe_unused
__used is defined to be __attribute__((unused)) for all pre-3.3 gcc
compilers to suppress warnings for unused functions because perhaps they
are referenced only in inline assembly. It is defined to be
__attribute__((used)) for gcc 3.3 and later so that the code is still
emitted for such functions.
__maybe_unused is defined to be __attribute__((unused)) for both function
and variable use if it could possibly be unreferenced due to the evaluation
of preprocessor macros. Function prototypes shall be marked with
__maybe_unused if the actual definition of the function is dependant on
preprocessor macros.
No update to compiler-intel.h is necessary because ICC supports both
__attribute__((used)) and __attribute__((unused)) as specified by the gcc
manual.
__attribute_used__ is deprecated and will be removed once all current
code is converted to using __used.
Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Adrian Bunk <bunk@stusta.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Zippel [Wed, 9 May 2007 09:35:17 +0000 (02:35 -0700)]
rename thread_info to stack
This finally renames the thread_info field in task structure to stack, so that
the assumptions about this field are gone and archs have more freedom about
placing the thread_info structure.
Nonbroken archs which have a proper thread pointer can do the access to both
current thread and task structure via a single pointer.
It'll allow for a few more cleanups of the fork code, from which e.g. ia64
could benefit.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
[akpm@linux-foundation.org: build fix] Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Haavard Skinnemoen <hskinnemoen@atmel.com> Cc: Mikael Starvik <starvik@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Greg Ungerer <gerg@uclinux.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Richard Curnow <rc@rc0.org.uk> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jeff Dike <jdike@addtoit.com> Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Andi Kleen <ak@muc.de> Cc: Chris Zankel <chris@zankel.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Zippel [Wed, 9 May 2007 09:35:15 +0000 (02:35 -0700)]
Allow arch to initialize arch field of the module structure
This will later allow an arch to add module specific information via linker
generated tables instead of poking directly in the module object structure.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Gleixner [Wed, 9 May 2007 09:35:15 +0000 (02:35 -0700)]
clocksource: fix resume logic
We need to make sure that the clocksources are resumed, when timekeeping is
resumed. The current resume logic does not guarantee this.
Add a resume function pointer to the clocksource struct, so clocksource
drivers which need to reinitialize the clocksource can provide a resume
function.
Add a resume function, which calls the maybe available clocksource resume
functions and resets the watchdog function, so a stable TSC can be used
accross suspend/resume.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Cc: Andi Kleen <ak@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the slab allocators contain callbacks into the page allocator to
perform the draining of pagesets on remote nodes. This requires SLUB to have
a whole subsystem in order to be compatible with SLAB. Moving node draining
out of the slab allocators avoids a section of code in SLUB.
Move the node draining so that is is done when the vm statistics are updated.
At that point we are already touching all the cachelines with the pagesets of
a processor.
Add a expire counter there. If we have to update per zone or global vm
statistics then assume that the pageset will require subsequent draining.
The expire counter will be decremented on each vm stats update pass until it
reaches zero. Then we will drain one batch from the pageset. The draining
will cause vm counter updates which will then cause another expiration until
the pcp is empty. So we will drain a batch every 3 seconds.
Note that remote node draining is a somewhat esoteric feature that is required
on large NUMA systems because otherwise significant portions of system memory
can become trapped in pcp queues. The number of pcp is determined by the
number of processors and nodes in a system. A system with 4 processors and 2
nodes has 8 pcps which is okay. But a system with 1024 processors and 512
nodes has 512k pcps with a high potential for large amount of memory being
caught in them.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vmstat is currently using the cache reaper to periodically bring the
statistics up to date. The cache reaper does only exists in SLUB as a way to
provide compatibility with SLAB. This patch removes the vmstat calls from the
slab allocators and provides its own handling.
The advantage is also that we can use a different frequency for the updates.
Refreshing vm stats is a pretty fast job so we can run this every second and
stagger this by only one tick. This will lead to some overlap in large
systems. F.e a system running at 250 HZ with 1024 processors will have 4 vm
updates occurring at once.
However, the vm stats update only accesses per node information. It is only
necessary to stagger the vm statistics updates per processor in each node. Vm
counter updates occurring on distant nodes will not cause cacheline
contention.
We could implement an alternate approach that runs the first processor on each
node at the second and then each of the other processor on a node on a
subsequent tick. That may be useful to keep a large amount of the second free
of timer activity. Maybe the timer folks will have some feedback on this one?
[jirislaby@gmail.com: add missing break] Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
microcode: use suspend-related CPU hotplug notifications
Make the microcode driver use the suspend-related CPU hotplug notifications
to handle the CPU hotplug events occuring during system-wide suspend and
resume transitions. Remove the global variable suspend_cpu_hotplug
previously used for this purpose.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since nonboot CPUs are now disabled after tasks and devices have been
frozen and the CPU hotplug infrastructure is used for this purpose, we need
special CPU hotplug notifications that will help the CPU-hotplug-aware
subsystems distinguish normal CPU hotplug events from CPU hotplug events
related to a system-wide suspend or resume operation in progress. This
patch introduces such notifications and causes them to be used during
suspend and resume transitions. It also changes all of the
CPU-hotplug-aware subsystems to take these notifications into consideration
(for now they are handled in the same way as the corresponding "normal"
ones).
[oleg@tv-sign.ru: cleanups] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nate Diller [Wed, 9 May 2007 09:35:07 +0000 (02:35 -0700)]
fs: convert core functions to zero_user_page
It's very common for file systems to need to zero part or all of a page,
the simplist way is just to use kmap_atomic() and memset(). There's
actually a library function in include/linux/highmem.h that does exactly
that, but it's confusingly named memclear_highpage_flush(), which is
descriptive of *how* it does the work rather than what the *purpose* is.
So this patchset renames the function to zero_user_page(), and calls it
from the various places that currently open code it.
This first patch introduces the new function call, and converts all the
core kernel callsites, both the open-coded ones and the old
memclear_highpage_flush() ones. Following this patch is a series of
conversions for each file system individually, per AKPM, and finally a
patch deprecating the old call. The diffstat below shows the entire
patchset.
[akpm@linux-foundation.org: fix a few things] Signed-off-by: Nate Diller <nate.diller@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Dumazet [Wed, 9 May 2007 09:35:04 +0000 (02:35 -0700)]
FUTEX: new PRIVATE futexes
Analysis of current linux futex code :
--------------------------------------
A central hash table futex_queues[] holds all contexts (futex_q) of waiting
threads.
Each futex_wait()/futex_wait() has to obtain a spinlock on a hash slot to
perform lookups or insert/deletion of a futex_q.
When a futex_wait() is done, calling thread has to :
1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
(calling find_vma()). This validation tells us if the futex uses
an inode based store (mapped file), or mm based store (anonymous mem)
2) - compute a hash key
3) - Atomic increment of reference counter on an inode or a mm_struct
4) - lock part of futex_queues[] hash table
5) - perform the test on value of futex.
(rollback is value != expected_value, returns EWOULDBLOCK)
(various loops if test triggers mm faults)
6) queue the context into hash table, release the lock got in 4)
7) - release the read_lock on mmap_sem
<block>
8) Eventually unqueue the context (but rarely, as this part may be done
by the futex_wake())
Futexes were designed to improve scalability but current implementation has
various problems :
- Central hashtable :
This means scalability problems if many processes/threads want to use
futexes at the same time.
This means NUMA unbalance because this hashtable is located on one node.
- Using mmap_sem on every futex() syscall :
Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
ops on mmap_sem, dirtying cache line :
- lot of cache line ping pongs on SMP configurations.
mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
Highly threaded processes might suffer from mmap_sem contention.
mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
programs because of contention on the mmap_sem cache line.
- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
because of cache misses.
Most of these scalability problems come from the fact that futexes are in
one global namespace. As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem). We
chose to force all futexes be 'shared'. This has a cost.
But fact is POSIX defined PRIVATE and SHARED, allowing clear separation,
and optimal performance if carefuly implemented. Time has come for linux
to have better threading performance.
The goal is to permit new futex commands to avoid :
- Taking the mmap_sem semaphore, conflicting with other subsystems.
- Modifying a ref_count on mm or an inode, still conflicting with mm or fs.
This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter the underlying mm storage is.
If glibc wants to exploit this new infrastructure, it should use new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes. And be
prepared to fallback on old subcommands for old kernels. Using one global
variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.
PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.
Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)
Note : the same SHARED futex (mapped on a file) can be used by old binaries
*and* new binaries, because both binaries will use the old subcommands.
Note : Vast majority of futexes should be using PROCESS_PRIVATE semantic,
as this is the default semantic. Almost all applications should benefit
of this changes (new kernel and updated libc)
Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)
/* calling futex_wait(addr, value) with value != *addr */
433 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
424 cycles per futex(FUTEX_WAIT) call (using one futex)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference :
187 cycles per getppid() call
188 cycles per umask() call
181 cycles per ni_syscall() call
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Pierre Peiffer <pierre.peiffer@bull.net> Cc: "Ulrich Drepper" <drepper@gmail.com> Cc: "Nick Piggin" <nickpiggin@yahoo.com.au> Cc: "Ingo Molnar" <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pierre Peiffer [Wed, 9 May 2007 09:35:02 +0000 (02:35 -0700)]
futex_requeue_pi optimization
This patch provides the futex_requeue_pi functionality, which allows some
threads waiting on a normal futex to be requeued on the wait-queue of a
PI-futex.
This provides an optimization, already used for (normal) futexes, to be used
with the PI-futexes.
This optimization is currently used by the glibc in pthread_broadcast, when
using "normal" mutexes. With futex_requeue_pi, it can be used with
PRIO_INHERIT mutexes too.
Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pierre Peiffer [Wed, 9 May 2007 09:35:02 +0000 (02:35 -0700)]
Make futex_wait() use an hrtimer for timeout
This patch modifies futex_wait() to use an hrtimer + schedule() in place of
schedule_timeout().
schedule_timeout() is tick based, therefore the timeout granularity is the
tick (1 ms, 4 ms or 10 ms depending on HZ). By using a high resolution timer
for timeout wakeup, we can attain a much finer timeout granularity (in the
microsecond range). This parallels what is already done for futex_lock_pi().
The timeout passed to the syscall is no longer converted to jiffies and is
therefore passed to do_futex() and futex_wait() as an absolute ktime_t
therefore keeping nanosecond resolution.
Also this removes the need to pass the nanoseconds timeout part to
futex_lock_pi() in val2.
In futex_wait(), if there is no timeout then a regular schedule() is
performed. Otherwise, an hrtimer is fired before schedule() is called.
NeilBrown [Wed, 9 May 2007 09:34:57 +0000 (02:34 -0700)]
knfsd: avoid Oops if buggy userspace performs confusing filehandle->dentry mapping
When a lookup request arrives, nfsd uses information provided by userspace
(mountd) to find the right filesystem.
It then assumes that the same filehandle type as the incoming filehandle can
be used to create an outgoing filehandle.
However if mountd is buggy, or maybe just being creative, the filesystem may
not support that filesystem type, and the kernel could oops, particularly if
'ex_uuid' is NULL but a FSID_UUID* filehandle type is used.
So add some proper checking that the fsid version/type from the incoming
filehandle is actually supportable, and ignore that information if it isn't
supportable.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NeilBrown [Wed, 9 May 2007 09:34:57 +0000 (02:34 -0700)]
knfsd: various nfsd xdr cleanups
1/ decode_sattr and decode_sattr3 never return NULL, so remove
several checks for that. ditto for xdr_decode_hyper.
2/ replace some open coded XDR_QUADLEN calls with calls to
XDR_QUADLEN
3/ in decode_writeargs, simply an 'if' to use a single
calculation.
.page_len is the length of that part of the packet that did
not fit in the first page (the head).
So the length of the data part is the remainder of the
head, plus page_len.
3/ other minor cleanups.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kbuild directly interprets <modulename>-y as objects to build into a module,
no need to assign it to the old foo-objs variable.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NeilBrown [Wed, 9 May 2007 09:34:55 +0000 (02:34 -0700)]
knfsd: simplify a 'while' condition in svcsock.c
This while loop has an overly complex condition, which performs a couple of
assignments. This hurts readability.
We don't really need a loop at all. We can just return -EAGAIN and (providing
we set SK_DATA), the function will be called again.
So discard the loop, make the complex conditional become a few clear function
calls, and hopefully improve readability.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wei Yongjun [Wed, 9 May 2007 09:34:54 +0000 (02:34 -0700)]
knfsd: rpcgss: RPC_GSS_PROC_ DESTROY request will get a bad rpc
If I send a RPC_GSS_PROC_DESTROY message to NFSv4 server, it will reply with a
bad rpc reply which lacks an authentication verifier. Maybe this patch is
needed.
Frank Filz [Wed, 9 May 2007 09:34:53 +0000 (02:34 -0700)]
knfsd: fix resource leak resulting in module refcount leak for rpcsec_gss_krb5.ko
I have been investigating a module reference count leak on the server for
rpcsec_gss_krb5.ko. It turns out the problem is a reference count leak for
the security context in net/sunrpc/auth_gss/svcauth_gss.c.
The problem is that gss_write_init_verf() calls gss_svc_searchbyctx() which
does a rsc_lookup() but never releases the reference to the context. There is
another issue that rpc.svcgssd sets an "end of time" expiration for the
context
By adding a cache_put() call in gss_svc_searchbyctx(), and setting an
expiration timeout in the downcall, cache_clean() does clean up the context
and the module reference count now goes to zero after unmount.
I also verified that if the context expires and then the client makes a new
request, a new context is established.
Here is the patch to fix the kernel, I will start a separate thread to discuss
what expiration time should be set by rpc.svcgssd.
Acked-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com> Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NeilBrown [Wed, 9 May 2007 09:34:52 +0000 (02:34 -0700)]
knfsd: rpc: fix server-side wrapping of krb5i replies
It's not necessarily correct to assume that the xdr_buf used to hold the
server's reply must have page data whenever it has tail data.
And there's no need for us to deal with that case separately anyway.
Acked-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NeilBrown [Wed, 9 May 2007 09:34:52 +0000 (02:34 -0700)]
knfsd: avoid use of unitialised variables on error path when nfs exports
We need to zero various parts of 'exp' before any 'goto out', otherwise when
we go to free the contents... we die.
Signed-off-by: Neil Brown <neilb@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Akinobu Mita [Wed, 9 May 2007 09:34:51 +0000 (02:34 -0700)]
sunrpc: fix error path in module_init
register_rpc_pipefs() needs to clean up rpc_inode_cache
by kmem_cache_destroy() on register_filesystem() failure.
init_sunrpc() needs to unregister rpc_pipe_fs by unregister_rpc_pipefs()
when rpc_init_mempool() returns error.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jeff Layton [Wed, 9 May 2007 09:34:50 +0000 (02:34 -0700)]
RPC: add wrapper for svc_reserve to account for checksum
When the kernel calls svc_reserve to downsize the expected size of an RPC
reply, it fails to account for the possibility of a checksum at the end of
the packet. If a client mounts a NFSv2/3 with sec=krb5i/p, and does I/O
then you'll generally see messages similar to this in the server's ring
buffer:
RPC request reserved 164 but used 208
While I was never able to verify it, I suspect that this problem is also
the root cause of some oopses I've seen under these conditions:
This is probably also a problem for other sec= types and for NFSv4. The
large reserved size for NFSv4 compound packets seems to generally paper
over the problem, however.
This patch adds a wrapper for svc_reserve that accounts for the possibility
of a checksum. It also fixes up the appropriate callers of svc_reserve to
call the wrapper. For now, it just uses a hardcoded value that I
determined via testing. That value may need to be revised upward as things
change, or we may want to eventually add a new auth_op that attempts to
calculate this somehow.
Unfortunately, there doesn't seem to be a good way to reliably determine
the expected checksum length prior to actually calculating it, particularly
with schemes like spkm3.
Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Neil Brown <neilb@suse.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Acked-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Neil Brown <neilb@suse.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NeilBrown [Wed, 9 May 2007 09:34:48 +0000 (02:34 -0700)]
knfsd: rename sk_defer_lock to sk_lock
Now that sk_defer_lock protects two different things, make the name more
generic.
Also don't bother with disabling _bh as the lock is only ever taken from
process context.
Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Staubach [Wed, 9 May 2007 09:34:48 +0000 (02:34 -0700)]
The NFSv2/NFSv3 server does not handle zero length WRITE requests correctly
The NFSv2 and NFSv3 servers do not handle WRITE requests for 0 bytes
correctly. The specifications indicate that the server should accept the
request, but it should mostly turn into a no-op. Currently, the server
will return an XDR decode error, which it should not.
Attached is a patch which addresses this issue. It also adds some boundary
checking to ensure that the request contains as much data as was requested
to be written. It also correctly handles an NFSv3 request which requests
to write more data than the server has stated that it is prepared to
handle. Previously, there was some support which looked like it should
work, but wasn't quite right.
Signed-off-by: Peter Staubach <staubach@redhat.com> Acked-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Adrian Bunk [Wed, 9 May 2007 09:34:46 +0000 (02:34 -0700)]
remove nfs4_acl_add_ace()
nfs4_acl_add_ace() can now be removed.
Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Neil Brown <neilb@cse.unsw.edu.au> Acked-by: J. Bruce Fields <bfields@citi.umich.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 9 May 2007 09:34:46 +0000 (02:34 -0700)]
make cancel_rearming_delayed_work() reliable
Thanks to Jarek Poplawski for the ideas and for spotting the bug in the
initial draft patch.
cancel_rearming_delayed_work() currently has many limitations, because it
requires that dwork always re-arms itself via queue_delayed_work(). So it
hangs forever if dwork doesn't do this, or cancel_rearming_delayed_work/
cancel_delayed_work was already called. It uses flush_workqueue() in a
loop, so it can't be used if workqueue was freezed, and it is potentially
live- lockable on busy system if delay is small.
With this patch cancel_rearming_delayed_work() doesn't make any assumptions
about dwork, it can re-arm itself via queue_delayed_work(), or
queue_work(), or do nothing.
As a "side effect", cancel_work_sync() was changed to handle re-arming works
as well.
Disadvantages:
- this patch adds wmb() to insert_work().
- slowdowns the fast path (when del_timer() succeeds on entry) of
cancel_rearming_delayed_work(), because wait_on_work() is called
unconditionally. In that case, compared to the old version, we are
doing "unneeded" lock/unlock for each online CPU.
On the other hand, this means we don't need to use cancel_work_sync()
after cancel_rearming_delayed_work().
- complicates the code (.text grows by 130 bytes).
We are anyway kthread_stop()ping other per-cpu kernel threads after
move_task_off_dead_cpu(), so we can do it with the stop_machine_run thread
as well.
I just checked with Vatsa if there was any subtle reason why they
had put in the kthread_bind() in cpu.c. Vatsa cannot seem to recollect
any and I can't see any. So let us just remove the kthread_bind.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 9 May 2007 09:34:37 +0000 (02:34 -0700)]
change kernel threads to ignore signals instead of blocking them
Currently kernel threads use sigprocmask(SIG_BLOCK) to protect against
signals. This doesn't prevent the signal delivery, this only blocks
signal_wake_up(). Every "killall -33 kthreadd" means a "struct siginfo"
leak.
Change kthreadd_setup() to set all handlers to SIG_IGN instead of blocking
them (make a new helper ignore_signals() for that). If the kernel thread
needs some signal, it should use allow_signal() anyway, and in that case it
should not use CLONE_SIGHAND.
Note that we can't change daemonize() (should die!) in the same way,
because it can be used along with CLONE_SIGHAND. This means that
allow_signal() still should unblock the signal to work correctly with
daemonize()ed threads.
However, disallow_signal() doesn't block the signal any longer but ignores
it.
NOTE: with or without this patch the kernel threads are not protected from
handle_stop_signal(), this seems harmless, but not good.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a kernel thread calls daemonize, instead of reparenting the thread to
init reparent the thread to kthreadd next to the threads created by
kthread_create.
This is really just a stop gap until daemonize goes away, but it does
ensure no kernel threads are under init and they are all in one place that
is easy to find.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently there is a circular reference between work queue initialization
and kthread initialization. This prevents the kthread infrastructure from
initializing until after work queues have been initialized.
We want the properties of tasks created with kthread_create to be as close
as possible to the init_task and to not be contaminated by user processes.
The later we start our kthreadd that creates these tasks the harder it is
to avoid contamination from user processes and the more of a mess we have
to clean up because the defaults have changed on us.
So this patch modifies the kthread support to not use work queues but to
instead use a simple list of structures, and to have kthreadd start from
init_task immediately after our kernel thread that execs /sbin/init.
By being a true child of init_task we only have to change those process
settings that we want to have different from init_task, such as our process
name, the cpus that are allowed, blocking all signals and setting SIGCHLD
to SIG_IGN so that all of our children are reaped automatically.
By being a true child of init_task we also naturally get our ppid set to 0
and do not wind up as a child of PID == 1. Ensuring that tasks generated
by kthread_create will not slow down the functioning of the wait family of
functions.
[akpm@linux-foundation.org: use interruptible sleeps] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 9 May 2007 09:34:23 +0000 (02:34 -0700)]
____call_usermodehelper: don't flush_signals()
____call_usermodehelper() has no reason for flush_signals(). It is a fresh
forked process which is going to exec a user-space application or exit on
failure.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg Nesterov [Wed, 9 May 2007 09:34:22 +0000 (02:34 -0700)]
unify flush_work/flush_work_keventd and rename it to cancel_work_sync
flush_work(wq, work) doesn't need the first parameter, we can use cwq->wq
(this was possible from the very beginnig, I missed this). So we can unify
flush_work_keventd and flush_work.
Also, rename flush_work() to cancel_work_sync() and fix all callers.
Perhaps this is not the best name, but "flush_work" is really bad.
(akpm: this is why the earlier patches bypassed maintainers)