Milan Broz [Fri, 19 Oct 2007 21:38:58 +0000 (22:38 +0100)]
dm crypt: add post processing queue
Add post-processing queue (per crypt device) for read operations.
Current implementation uses only one queue for all operations
and this can lead to starvation caused by many requests waiting
for memory allocation. But the needed memory-releasing operation
is queued after these requests (in the same queue).
Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Jesper Juhl [Fri, 19 Oct 2007 21:38:54 +0000 (22:38 +0100)]
dm io:ctl remove vmalloc void cast
In drivers/md/dm-ioctl.c::copy_params() there's a call to vmalloc()
where we currently cast the return value, but that's pretty pointless
given that vmalloc() returns "void *".
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Kcopyd uses a semaphore as mutex. Use the mutex API instead of the (binary)
semaphore,
Signed-off-by: Matthias Kaehlcke <matthias.kaehlcke@gmail.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jun'ichi Nomura [Fri, 19 Oct 2007 21:38:43 +0000 (22:38 +0100)]
dm: fix thaw_bdev
This patch fixes a bd_mount_sem counter corruption bug in device-mapper.
thaw_bdev() should be called only when freeze_bdev() was called for the
device.
Otherwise, thaw_bdev() will up bd_mount_sem and corrupt the semaphore counter.
struct block_device with the corrupted semaphore may remain in slab cache
and be reused later.
Attached patch will fix it by calling unlock_fs() instead.
unlock_fs() will determine whether it should call thaw_bdev()
by checking the device is frozen or not.
Easy reproducer is:
#!/bin/sh
while [ 1 ]; do
dmsetup --notable create a
dmsetup --nolockfs suspend a
dmsetup remove a
done
It's not easy to see the effect of corrupted semaphore.
So I have tested with putting printk below in bdev_alloc_inode():
if (atomic_read(&ei->bdev.bd_mount_sem.count) != 1)
printk(KERN_DEBUG "Incorrect semaphore count = %d (%p)\n",
atomic_read(&ei->bdev.bd_mount_sem.count),
&ei->bdev);
Without the patch, I saw something like:
Incorrect semaphore count = 17 (f2ab91c0)
With the patch, the message didn't appear.
The bug was introduced in 2.6.16 with this bug fix:
Need to unfreeze and release bdev otherwise the bdev inode with
inconsistent state is reused later and cause problem.
and backported to 2.6.15.5.
It occurs only in free_dev(), which is called only when the dm device is
removed. The buggy code is executed only if md->suspended_bdev is
non-NULL and that can happen only when the device was suspended without
noflush.
Milan Broz [Fri, 19 Oct 2007 21:38:36 +0000 (22:38 +0100)]
dm io:ctl use constant struct size
Make size of dm_ioctl struct always 312 bytes on all supported
architectures.
This change retains compatibility with already-compiled code because
it uses an embedded offset to locate the payload that follows the
structure.
On 64-bit architectures there is no change at all; on 32-bit
we are increasing the size of dm-ioctl from 308 to 312 bytes.
Currently with 32-bit userspace / 64-bit kernel on x86_64
some ioctls (including rename, message) are incorrectly rejected
by the comparison against 'param + 1'. This breaks userspace
lvrename and multipath 'fail_if_no_path' changes, for example.
(BTW Device-mapper uses its own versioning and ignores the ioctl
size bits. Only the generic ioctl compat code on mixed arches
checks them, and that will continue to accept both sizes for now,
but we intend to list 308 as deprecated and eventually remove it.)
Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Guido Guenther <agx@sigxcpu.org> Cc: Kevin Corry <kevcorry@us.ibm.com> Cc: stable@kernel.org
Bryn M. Reeves [Fri, 19 Oct 2007 21:29:32 +0000 (22:29 +0100)]
dm mpath: rdac fix init race
Re-order the initialisation of dm-rdac to avoid registering the hw
handler before the workqueue has been initialised. Closes a race
that would potentially give an oops.
Signed-off-by: Bryn M. Reeves <breeves@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* ssh://master.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-x86: (33 commits)
x86: convert cpuinfo_x86 array to a per_cpu array
x86: introduce frame_pointer() and stack_pointer()
x86 & generic: change to __builtin_prefetch()
i386: do not BUG_ON() when MSR is unknown
x86: acpi use cpu_physical_id
x86: convert cpu_llc_id to be a per cpu variable
x86: convert cpu_to_apicid to be a per cpu variable
i386: introduce "used_vectors" bitmap which can be used to reserve vectors.
x86: use raw locks during oopses
x86: honor _PAGE_PSE bit on page walks
i386: do cpuid_device_create() in CPU_UP_PREPARE instead of CPU_ONLINE.
x86: implement missing x86_64 function smp_call_function_mask()
x86: use descriptor's functions instead of inline assembly
i386: consolidate show_regs and show_registers for i386
i386: make callgraph use dump_trace() on i386/x86_64
x86: enable iommu_merge by default
i386: i386 add AMD64 Barcelona PMU MSR definitions to msr.h
x86: Unify i386 and x86-64 early quirks
x86: enable HPET on ICH3 and ICH4
x86: force enable HPET on VT8235/8237 chipsets
...
Manually fix trivial conflict with task pid container helper changes in
arch/x86/kernel/process_32.c
Linus Torvalds [Fri, 19 Oct 2007 21:31:42 +0000 (14:31 -0700)]
Merge git://git.linux-nfs.org/pub/linux/nfs-2.6
* git://git.linux-nfs.org/pub/linux/nfs-2.6:
NFSv4: Fix an rpc_cred reference leakage in fs/nfs/delegation.c
NFSv4: Ensure that we wait for the CLOSE request to complete
NFS: Fix a race in sillyrename
NFS: Fix a writeback race...
Trond Myklebust [Mon, 15 Oct 2007 22:17:53 +0000 (18:17 -0400)]
NFS: Fix a race in sillyrename
lookup() and sillyrename() can race one another because the sillyrename()
completion cannot take the parent directory's inode->i_mutex since the
latter may be held by whoever is calling dput().
We therefore have little option but to add extra locking to ensure that
nfs_lookup() and nfs_atomic_open() do not race with the sillyrename
completion.
If somebody has looked up the sillyrenamed file in the meantime, we just
transfer the sillydelete information to the new dentry.
Please refer to the bug-report at
http://bugzilla.linux-nfs.org/show_bug.cgi?id=150
We cannot zero the user page in nfs_mark_uptodate() any more, since
a) We'd be modifying the page without holding the page lock
b) We can race with other updates of the page, most notably
because of the call to nfs_wb_page() in nfs_writepage_setup().
Instead, we do the zeroing in nfs_update_request() if we see that we're
creating a request that might potentially be marked as up to date.
Thanks to Olivier Paquet for reporting the bug and providing a test-case.
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild:
kbuild: fix first module build
kconfig: update kconfig-language text
kbuild: introduce cc-cross-prefix
kbuild: disable depmod in cross-compile kernel build
kbuild: make deb-pkg - add 'Provides:' line
kconfig: comment typo in scripts/kconfig/Makefile.
kbuild: stop docproc segfaulting when SRCTREE isn't set.
kbuild: modpost problem when symbols move from one module to another
kbuild: cscope - filter out .tmp_* in find_sources
kbuild: mailing list has moved
kbuild: check asm symlink when building a kernel
Sam Ravnborg [Fri, 19 Oct 2007 20:20:02 +0000 (22:20 +0200)]
kbuild: fix first module build
When building a specific module before doing a total kernel
build it failed because $(MORVERDIR) were missing.
Creating the MODVERDIR explicit (independent of KBUILD_MODULES)
fixed this. As a side-effect the MODVERDIR will be created
also for a non-module build - but no harm done by that.
Linus Torvalds [Fri, 19 Oct 2007 20:12:46 +0000 (13:12 -0700)]
Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (41 commits)
ACPICA: hw: Don't carry spinlock over suspend
ACPICA: hw: remove use_lock flag from acpi_hw_register_{read, write}
ACPI: cpuidle: port idle timer suspend/resume workaround to cpuidle
ACPI: clean up acpi_enter_sleep_state_prep
Hibernation: Make sure that ACPI is enabled in acpi_hibernation_finish
ACPI: suppress uninitialized var warning
cpuidle: consolidate 2.6.22 cpuidle branch into one patch
ACPI: thinkpad-acpi: skip blanks before the data when parsing sysfs
ACPI: AC: Add sysfs interface
ACPI: SBS: Add sysfs alarm
ACPI: SBS: Add ACPI_PROCFS around procfs handling code.
ACPI: SBS: Add support for power_supply class (and sysfs)
ACPI: SBS: Make SBS reads table-driven.
ACPI: SBS: Simplify data structures in SBS
ACPI: SBS: Split host controller (ACPI0001) from SBS driver (ACPI0002)
ACPI: EC: Add new query handler to list head.
ACPI: Add acpi_bus_generate_event4() function
ACPI: Battery: add sysfs alarm
ACPI: Battery: Add sysfs support
ACPI: Battery: Misc clean-ups, no functional changes
...
Fix up conflicts in drivers/misc/thinkpad_acpi.[ch] manually
Randy Dunlap [Fri, 19 Oct 2007 17:53:48 +0000 (10:53 -0700)]
kconfig: update kconfig-language text
Add kconfig-language docs for mainmenu, def_bool, and def_tristate.
Remove "requires" as a synonym of "depends on" since it was removed
from the parser in commit 247537b9a270b52cc0375adcb0fb2605a160cba5.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Linus Torvalds [Fri, 19 Oct 2007 19:01:22 +0000 (12:01 -0700)]
Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus
* 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus:
[MIPS] Delete totally outdated Documentation/mips/time.README
[MIPS] Kill duplicated setup_irq() for cp0 timer
[MIPS] Sibyte: Finish conversion to modern time APIs.
[MIPS] time: Helpers to compute clocksource/event shift and mult values.
[MIPS] SMTC: Build fix.
[MIPS] time: Delete dead code.
[MIPS] MIPSsim: Strip defconfig file to the bones.
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6: (51 commits)
[CIFS] log better errors on failed mounts
[CIFS] Return better error when server requires signing but client forbids
[CIFS] fix typo
[CIFS] acl support part 4
[CIFS] Fix minor problems noticed by scan
[CIFS] fix bad handling of EAGAIN error on kernel_recvmsg in cifs_demultiplex_thread
[CIFS] build break
[CIFS] endian fixes
[CIFS] endian fixes in new acl code
[CIFS] Fix some endianness problems in new acl code
[CIFS] missing #endif from a previous patch
[CIFS] formatting fixes
[CIFS] Break up unicode_sessetup string functions
[CIFS] parse server_GUID in SPNEGO negProt response
[CIFS]
[CIFS] Fix endian conversion problem in posix mkdir
[CIFS] fix build break when lanman not enabled
[CIFS] remove two sparse warnings
[CIFS] remove compile warnings when debug disabled
[CIFS] CIFS ACL support part 3
...
Linus Torvalds [Fri, 19 Oct 2007 18:54:39 +0000 (11:54 -0700)]
Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6
* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6:
[NET]: Fix possible dev_deactivate race condition
[INET]: Justification for local port range robustness.
[PACKET]: Kill unused pg_vec_endpage() function
[NET]: QoS/Sched as menuconfig
[NET]: Fix bug in sk_filter race cures.
[PATCH] mac80211: make ieee802_11_parse_elems return void
Begin infrastructure for kernel code samples in the samples/ directory.
Add its Kconfig and Kbuild files.
Source its Kconfig file in all arch/ Kconfigs.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The marker activation functions sits in kernel/marker.c. A hash table is used
to keep track of the registered probes and armed markers, so the markers
within a newly loaded module that should be active can be activated at module
load time.
marker_query has been removed. marker_get_first, marker_get_next and
marker_release should be used as iterators on the markers.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Acked-by: "Frank Ch. Eigler" <fche@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Mike Mason <mmlnx@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Combine instrumentation menus in kernel/Kconfig.instrumentation
Quoting Randy:
"It seems sad that this patch sources Kconfig.marker, a 7-line file,
20-something times. Yes, you (we) don't want to put those 7 lines into
20-something different files, so sourcing is the right thing.
However, what you did for avr32 seems more on the right track to me: make
_one_ Instrumentation support menu that includes PROFILING, OPROFILE, KPROBES,
and MARKERS and then use (source) that in all of the arches."
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enable "cgroup" (formerly containers) based fair group scheduling. This
will let administrator create arbitrary groups of tasks (using "cgroup"
pseudo filesystem) and control their cpu bandwidth usage.
Bernhard Walle [Fri, 19 Oct 2007 06:41:02 +0000 (23:41 -0700)]
Use extended crashkernel command line on sh
This patch removes the crashkernel parsing from arch/sh/kernel/machine_kexec.c
and calls the generic function, introduced in the generic patch, in
setup_bootmem_allocator().
This is necessary because the amount of System RAM must be known in this
function now because of the new syntax.
Signed-off-by: Bernhard Walle <bwalle@suse.de> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Vivek Goyal <vgoyal@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bernhard Walle [Fri, 19 Oct 2007 06:41:00 +0000 (23:41 -0700)]
Use extended crashkernel command line on ia64
This patch adapts IA64 to use the generic parse_crashkernel() function instead
of its own parsing for the crashkernel command line.
Because the total amount of System RAM must be known when calling this
function, efi_memmap_init() is modified to return its accumulated total_memory
variable.
Also, the crashkernel handling is moved in an own function in
arch/ia64/kernel/setup.c to make the code more readable.
Bernhard Walle [Fri, 19 Oct 2007 06:40:59 +0000 (23:40 -0700)]
Use extended crashkernel command line on x86_64
This patch removes the crashkernel parsing from
arch/x86_64/kernel/machine_kexec.c and calls the generic function, introduced
in the last patch, in setup_bootmem_allocator().
This is necessary because the amount of System RAM must be known in this
function now because of the new syntax.
Signed-off-by: Bernhard Walle <bwalle@suse.de> Cc: Andi Kleen <ak@suse.de> Cc: Vivek Goyal <vgoyal@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Bernhard Walle [Fri, 19 Oct 2007 06:40:59 +0000 (23:40 -0700)]
Use extended crashkernel command line on i386
This patch removes the crashkernel parsing from
arch/i386/kernel/machine_kexec.c and calls the generic function, introduced in
the last patch, in setup_bootmem_allocator().
This is necessary because the amount of System RAM must be known in this
function now because of the new syntax.
Signed-off-by: Bernhard Walle <bwalle@suse.de> Cc: Andi Kleen <ak@suse.de> Cc: Vivek Goyal <vgoyal@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The motivation comes from distributors that configure their crashkernel
command line automatically with some configuration tool (YaST, you know ;)).
Of course that tool knows the value of System RAM, but if the user removes
RAM, then the system becomes unbootable or at least unusable and error
handling is very difficult.
This series implements this change for i386, x86_64, ia64, ppc64 and sh. That
should be all platforms that support kdump in current mainline. I tested all
platforms except sh due to the lack of a sh processor.
This patch:
This is the generic part of the patch. It adds a parse_crashkernel() function
in kernel/kexec.c that is called by the architecture specific code that
actually reserves the memory. That function takes the whole command line and
looks itself for "crashkernel=" in it.
If there are multiple occurrences, then the last one is taken. The advantage
is that if you have a bootloader like lilo or elilo which allows you to append
a command line parameter but not to remove one (like in GRUB), then you can
add another crashkernel value for testing at the boot command line and this
one overwrites the command line in the configuration then.
Signed-off-by: Bernhard Walle <bwalle@suse.de> Cc: Andi Kleen <ak@suse.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Vivek Goyal <vgoyal@in.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pierre Peiffer [Fri, 19 Oct 2007 06:40:57 +0000 (23:40 -0700)]
IPC: fix error case when idr-cache is empty in ipcget()
With the use of idr to store the ipc, the case where the idr cache is
empty, when idr_get_new is called (this may happen even if we call
idr_pre_get() before), is not well handled: it lets
semget()/shmget()/msgget() return ENOSPC when this cache is empty, what 1.
does not reflect the facts and 2. does not conform to the man(s).
This patch fixes this by retrying the whole process of allocation in this case.
Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pierre Peiffer [Fri, 19 Oct 2007 06:40:55 +0000 (23:40 -0700)]
IPC: cleanup some code and wrong comments about semundo list managment
Some comments about sem_undo_list seem wrong.
About the comment above unlock_semundo:
"... If task2 now exits before task1 releases the lock (by calling
unlock_semundo()), then task1 will never call spin_unlock(). ..."
This is just wrong, I see no reason for which task1 will not call
spin_unlock... The rest of this comment is also wrong... Unless I
miss something (of course).
Finally, (un)lock_semundo functions are useless, so remove them
for simplification. (this avoids an useless if statement)
Signed-off-by: Pierre Peiffer <pierre.peiffer@bull.net> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nadia Derbey [Fri, 19 Oct 2007 06:40:54 +0000 (23:40 -0700)]
fix idr_find() locking
This is a patch that fixes the way idr_find() used to be called in ipc_lock():
in all the paths that don't imply an update of the ipcs idr, it was called
without the idr tree being locked.
The changes are:
. in ipc_ids, the mutex has been changed into a reader/writer semaphore.
. ipc_lock() now takes the mutex as a reader during the idr_find().
. a new routine ipc_lock_down() has been defined: it doesn't take the
mutex, assuming that it is being held by the caller. This is the routine
that is now called in all the update paths.
Nadia Derbey [Fri, 19 Oct 2007 06:40:51 +0000 (23:40 -0700)]
Storing ipcs into IDRs
This patch converts casts of struct kern_ipc_perm to
. struct msg_queue
. struct sem_array
. struct shmid_kernel
into the equivalent container_of() macro. It improves code maintenance
because the code need not change if kern_ipc_perm is no longer at the
beginning of the containing struct.
Nadia Derbey [Fri, 19 Oct 2007 06:40:51 +0000 (23:40 -0700)]
ipc: integrate ipc_checkid() into ipc_lock()
This patch introduces a new ipc_lock_check() routine interface:
. each time ipc_checkid() is called, this is done after calling ipc_lock().
ipc_checkid() is now called from inside ipc_lock_check().
Nadia Derbey [Fri, 19 Oct 2007 06:40:49 +0000 (23:40 -0700)]
ipc: unify the syscalls code
This patch introduces a change into the sys_msgget(), sys_semget() and
sys_shmget() routines: they now share a common code, which is better for
maintainability.
Nadia Derbey [Fri, 19 Oct 2007 06:40:48 +0000 (23:40 -0700)]
ipc: store ipcs into IDRs
This patch introduces ipcs storage into IDRs. The main changes are:
. This ipc_ids structure is changed: the entries array is changed into a
root idr structure.
. The grow_ary() routine is removed: it is not needed anymore when adding
an ipc structure, since we are now using the IDR facility.
. The ipc_rmid() routine interface is changed:
. there is no need for this routine to return the pointer passed in as
argument: it is now declared as a void
. since the id is now part of the kern_ipc_perm structure, no need to
have it as an argument to the routine
Gautham R Shenoy [Fri, 19 Oct 2007 06:40:47 +0000 (23:40 -0700)]
Add irq protection in the percpu-counters cpu-hotplug-callback path
Some of the per-cpu counters and thus their locks are accessed from IRQ
contexts. This can cause a deadlock if it interrupts a cpu-offline thread
which is transferring a dead-cpu's counts to the global counter.
Add appropriate IRQ protection in the cpu-hotplug callback path.
Signed-off-by: Gautham R Shenoy <ego@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cliff Wickman [Fri, 19 Oct 2007 06:40:46 +0000 (23:40 -0700)]
hotplug cpu: migrate a task within its cpuset
When a cpu is disabled, move_task_off_dead_cpu() is called for tasks that have
been running on that cpu.
Currently, such a task is migrated:
1) to any cpu on the same node as the disabled cpu, which is both online
and among that task's cpus_allowed
2) to any cpu which is both online and among that task's cpus_allowed
It is typical of a multithreaded application running on a large NUMA system to
have its tasks confined to a cpuset so as to cluster them near the memory that
they share. Furthermore, it is typical to explicitly place such a task on a
specific cpu in that cpuset. And in that case the task's cpus_allowed
includes only a single cpu.
This patch would insert a preference to migrate such a task to some cpu within
its cpuset (and set its cpus_allowed to its entire cpuset).
With this patch, migrate the task to:
1) to any cpu on the same node as the disabled cpu, which is both online
and among that task's cpus_allowed
2) to any online cpu within the task's cpuset
3) to any cpu which is both online and among that task's cpus_allowed
In order to do this, move_task_off_dead_cpu() must make a call to
cpuset_cpus_allowed_locked(), a new subset of cpuset_cpus_allowed(), that will
not block. (name change - per Oleg's suggestion)
Calls are made to cpuset_lock() and cpuset_unlock() in migration_call() to set
the cpuset mutex during the whole migrate_live_tasks() and
migrate_dead_tasks() procedure.
[akpm@linux-foundation.org: build fix]
[pj@sgi.com: Fix indentation and spacing] Signed-off-by: Cliff Wickman <cpw@sgi.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Christoph Lameter <clameter@sgi.com> Cc: Paul Jackson <pj@sgi.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Menage [Fri, 19 Oct 2007 06:40:44 +0000 (23:40 -0700)]
Control groups: Replace "cont" with "cgrp" and other misc renaming
Replace "cont" with "cgrp" and other misc renaming
This patch finishes some of the names that got missed in the great
"task containers" -> "control groups" rename. Primarily it renames
the local variable "cont" to "cgrp" in a number of places, and renames
the CONT_* enum members to CGRP_*.
This patch is not intended to have any effect on the generated code;
the output of "objdump -d kernel/cgroup.o" is unchanged.
Signed-off-by: Paul Menage <menage@google.com> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:44 +0000 (23:40 -0700)]
Use task_pid_nr() instead of pid_nr(task_pid())
There are two places that do so - the cgroups subsystem and the autofs
code.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Ian Kent <raven@themaw.net> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:43 +0000 (23:40 -0700)]
Use task_pid_nr() in ip_vs_sync.c
The sync_master_pid and sync_backup_pid are set in set_sync_pid() and are
used later for set/not-set checks and in printk. So it is safe to use the
global pid value in this case.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:43 +0000 (23:40 -0700)]
Remove unused variables from fs/proc/base.c
When removing the explicit task_struct->pid usage I found that
proc_readfd_common() and proc_pident_readdir() get this field, but do not
use it at all. So this cleanup is a cheap help with the task_struct->pid
isolation.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alexey Dobriyan [Fri, 19 Oct 2007 06:40:41 +0000 (23:40 -0700)]
Use helpers to obtain task pid in printks (arch code)
One of the easiest things to isolate is the pid printed in kernel log.
There was a patch, that made this for arch-independent code, this one makes
so for arch/xxx files.
It took some time to cross-compile it, but hopefully these are all the
printks in arch code.
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:40 +0000 (23:40 -0700)]
Use helpers to obtain task pid in printks
The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.
The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.
[akpm@linux-foundation.org: git-drm went and changed lots of stuff] Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:39 +0000 (23:40 -0700)]
Isolate the explicit usage of signal->pgrp
The pgrp field is not used widely around the kernel so it is now marked as
deprecated with appropriate comment.
The initialization of INIT_SIGNALS is trimmed because
a) they are set to 0 automatically;
b) gcc cannot properly initialize two anonymous (the second one
is the one with the session) unions. In this particular case
to make it compile we'd have to add some field initialized
right before the .pgrp.
We have to deprecate the pid, tgid, session and pgrp fields on struct
task_struct and struct signal_struct. The session and pgrp are already
deprecated. The tgid value is close to being such - the worst known usage
in in fs/locks.c and audit code. The pid field deprecation is mainly
blocked by numerous printk-s around the kernel that print the tsk->pid to
log.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Herbert Poetzl <herbert@13thfloor.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eugene Teo [Fri, 19 Oct 2007 06:40:38 +0000 (23:40 -0700)]
Fix tsk->exit_state usage
tsk->exit_state can only be 0, EXIT_ZOMBIE, or EXIT_DEAD. A non-zero test
is the same as tsk->exit_state & (EXIT_ZOMBIE | EXIT_DEAD), so just testing
tsk->exit_state is sufficient.
Signed-off-by: Eugene Teo <eugeneteo@kernel.sg> Cc: Roland McGrath <roland@redhat.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Neil Horman [Fri, 19 Oct 2007 06:40:37 +0000 (23:40 -0700)]
proc: export a processes resource limits via /proc/pid
Currently, there exists no method for a process to query the resource
limits of another process. They can be inferred via some mechanisms but
they cannot be explicitly determined. Given that this information can be
usefull to know during the debugging of an application, I've written this
patch which exports all of a processes limits via /proc/<pid>/limits.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiri Slaby [Fri, 19 Oct 2007 06:40:32 +0000 (23:40 -0700)]
get rid of input BIT* duplicate defines
get rid of input BIT* duplicate defines
use newly global defined macros for input layer. Also remove includes of
input.h from non-input sources only for BIT macro definiton. Define the
macro temporarily in local manner, all those local definitons will be
removed further in this patchset (to not break bisecting).
BIT macro will be globally defined (1<<x)
Jiri Slaby [Fri, 19 Oct 2007 06:40:31 +0000 (23:40 -0700)]
define first set of BIT* macros
define first set of BIT* macros
- move BITOP_MASK and BITOP_WORD from asm-generic/bitops/atomic.h to
include/linux/bitops.h and rename it to BIT_MASK and BIT_WORD
- move BITS_TO_LONGS and BITS_PER_BYTE to bitops.h too and allow easily
define another BITS_TO_something (e.g. in event.c) by BITS_TO_TYPE macro
Remaining (and common) BIT macro will be defined after all occurences and
conflicts will be sorted out in the patches.
Jiri Slaby [Fri, 19 Oct 2007 06:40:26 +0000 (23:40 -0700)]
forbid asm/bitops.h direct inclusion
forbid asm/bitops.h direct inclusion
Because of compile errors that may occur after bit changes if asm/bitops.h is
included directly without e.g. linux/kernel.h which includes linux/bitops.h,
forbid direct inclusion of asm/bitops.h. Thanks to Adrian Bunk.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Cc: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiri Slaby [Fri, 19 Oct 2007 06:40:25 +0000 (23:40 -0700)]
remove asm/bitops.h includes
remove asm/bitops.h includes
including asm/bitops directly may cause compile errors. don't include it
and include linux/bitops instead. next patch will deny including asm header
directly.
Cc: Adrian Bunk <bunk@kernel.org> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jiri Slaby [Fri, 19 Oct 2007 06:40:24 +0000 (23:40 -0700)]
Misc: phantom, improved data passing
This new version guarantees amb_bit switch in small enough intervals, so that
the device won't stop working in the middle of a movement anymore. However it
preserves old (openhaptics) functionality.
Paul Menage [Fri, 19 Oct 2007 06:40:22 +0000 (23:40 -0700)]
Fix cpusets update_cpumask
Cause writes to cpuset "cpus" file to update cpus_allowed for member tasks:
- collect batches of tasks under tasklist_lock and then call
set_cpus_allowed() on them outside the lock (since this can sleep).
- add a simple generic priority heap type to allow efficient collection
of batches of tasks to be processed without duplicating or missing any
tasks in subsequent batches.
- make "cpus" file update a no-op if the mask hasn't changed
- fix race between update_cpumask() and sched_setaffinity() by making
sched_setaffinity() post-check that it's not running on any cpus outside
cpuset_cpus_allowed().
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Paul Menage <menage@google.com> Cc: Paul Jackson <pj@sgi.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Serge Hallyn <serue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Jackson [Fri, 19 Oct 2007 06:40:21 +0000 (23:40 -0700)]
cpusets: decrustify cpuset mask update code
Decrustify the kernel/cpuset.c 'cpus' and 'mems' updating code.
Other than subtle improvements in the consistency of identifying
white space at the beginning and end of passed in masks, this
doesn't make any visible difference in behaviour. But it's
one or two hundred kernel text bytes smaller, and easier to
understand.
[akpm@linux-foundation.org: coding-style fix] Signed-off-by: Paul Jackson <pj@sgi.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paul Jackson [Fri, 19 Oct 2007 06:40:20 +0000 (23:40 -0700)]
cpuset sched_load_balance flag
Add a new per-cpuset flag called 'sched_load_balance'.
When enabled in a cpuset (the default value) it tells the kernel scheduler
that the scheduler should provide the normal load balancing on the CPUs in
that cpuset, sometimes moving tasks from one CPU to a second CPU if the
second CPU is less loaded and if that task is allowed to run there.
When disabled (write "0" to the file) then it tells the kernel scheduler
that load balancing is not required for the CPUs in that cpuset.
Now even if this flag is disabled for some cpuset, the kernel may still
have to load balance some or all the CPUs in that cpuset, if some
overlapping cpuset has its sched_load_balance flag enabled.
If there are some CPUs that are not in any cpuset whose sched_load_balance
flag is enabled, the kernel scheduler will not load balance tasks to those
CPUs.
Moreover the kernel will partition the 'sched domains' (non-overlapping
sets of CPUs over which load balancing is attempted) into the finest
granularity partition that it can find, while still keeping any two CPUs
that are in the same shed_load_balance enabled cpuset in the same element
of the partition.
This serves two purposes:
1) It provides a mechanism for real time isolation of some CPUs, and
2) it can be used to improve performance on systems with many CPUs
by supporting configurations in which load balancing is not done
across all CPUs at once, but rather only done in several smaller
disjoint sets of CPUs.
This mechanism replaces the earlier overloading of the per-cpuset
flag 'cpu_exclusive', which overloading was removed in an earlier
patch: cpuset-remove-sched-domain-hooks-from-cpusets
See further the Documentation and comments in the code itself.
[akpm@linux-foundation.org: don't be weird] Signed-off-by: Paul Jackson <pj@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:19 +0000 (23:40 -0700)]
Uninline the task_xid_nr_ns() calls
Since these are expanded into call to pid_nr_ns() anyway, it's OK to move
the whole routine out-of-line. This is a cheap way to save ~100 bytes from
vmlinux. Together with the previous two patches, it saves half-a-kilo from
the vmlinux.
Un-inline other (currently inlined) functions must be done with additional
performance testing.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:19 +0000 (23:40 -0700)]
Uninline find_pid etc set of functions
The find_pid/_vpid/_pid_ns functions are used to find the struct pid by its
id, depending on whic id - global or virtual - is used.
The find_vpid() is a macro that pushes the current->nsproxy->pid_ns on the
stack to call another function - find_pid_ns(). It turned out, that this
dereference together with the push itself cause the kernel text size to
grow too much.
Move all these out-of-line. Together with the previous patch this saves a
bit less that 400 bytes from .text section.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:18 +0000 (23:40 -0700)]
Isolate some explicit usage of task->tgid
With pid namespaces this field is now dangerous to use explicitly, so hide
it behind the helpers.
Also the pid and pgrp fields o task_struct and signal_struct are to be
deprecated. Unfortunately this patch cannot be sent right now as this
leads to tons of warnings, so start isolating them, and deprecate later.
Actually the p->tgid == pid has to be changed to has_group_leader_pid(),
but Oleg pointed out that in case of posix cpu timers this is the same, and
thread_group_leader() is more preferable.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Emelyanov [Fri, 19 Oct 2007 06:40:16 +0000 (23:40 -0700)]
Uninline find_task_by_xxx set of functions
The find_task_by_something is a set of macros are used to find task by pid
depending on what kind of pid is proposed - global or virtual one. All of
them are wrappers above the most generic one - find_task_by_pid_type_ns() -
and just substitute some args for it.
It turned out, that dereferencing the current->nsproxy->pid_ns construction
and pushing one more argument on the stack inline cause kernel text size to
grow.
This patch moves all this stuff out-of-line into kernel/pid.c. Together
with the next patch it saves a bit less than 400 bytes from the .text
section.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Paul Menage <menage@google.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>