Chuck Ebbert [Fri, 23 Jun 2006 09:04:31 +0000 (02:04 -0700)]
[PATCH] i386: extra checks in show_registers()
Sometimes thread_info and task_struct get out-of-sync with each other.
Printing task.thread_info in show_registers() can help spot this. And when
task_struct is corrupt then task.comm can contain garbage, so only print as
many characters as it can hold.
Chuck Ebbert [Fri, 23 Jun 2006 09:04:23 +0000 (02:04 -0700)]
[PATCH] i386: let usermode execute the "enter" instruction
The i386 page fault handler does not allow enough slack when checking for
userspace access below the current stack pointer. This prevents use of the
enter instruction by user code. Fix this by allowing enough slack for
"enter $65535,$31" to execute.
Problem reported by Tomasz Malesinski <tmal@mimuw.edu.pl>
Tested using this program, based on the original from Tomasz:
Zhang Yanmin [Fri, 23 Jun 2006 09:04:22 +0000 (02:04 -0700)]
[PATCH] x86: kernel irq balance doesn't work
On i386, kernel irq balance doesn't work.
1) In function do_irq_balance, after kernel finds the min_loaded cpu but
before calling set_pending_irq to really pin the selected_irq to the
target cpu, kernel does a cpus_and with irq_affinity[selected_irq].
Later on, when the irq is acked, kernel would calls
move_native_irq=>desc->handler->set_affinity to change the irq affinity.
However, every function pointed by
hw_interrupt_type->set_affinity(unsigned int irq, cpumask_t cpumask)
always changes irq_affinity[irq] to cpumask. Next time when recalling
do_irq_balance, it has to do cpu_ands again with
irq_affinity[selected_irq], but irq_affinity[selected_irq] already
becomes one cpu selected by the first irq balance.
2) Function balance_irq in file arch/i386/kernel/io_apic.c has the same
issue.
Due to "end_of_stack_stop_unwind_function" recursing back to itself in the
EBP stackframe-walker. So avoid this type of recursion when walking the
stack .
Jan Beulich [Fri, 23 Jun 2006 09:04:19 +0000 (02:04 -0700)]
[PATCH] fix x86 microcode driver handling of multiple matching revisions
When multiple updates matching a given CPU are found in the update file, the
action taken by the microcode update driver was inappropriate:
- when lower revision microcode was found before matching or higher revision
one, the driver would needlessly complain that it would not downgrade the
CPU
- when microcode matching the currently installed revision was found before
newer revision code, no update would actually take place
To change this behavior, the driver now concludes about possibly updates and
issues messages only when the entire input was parsed.
Additionally, this adds back (in different places, and conditionalized upon
a new module option) some messages removed by a previous patch.
Andreas Mohr [Fri, 23 Jun 2006 09:04:17 +0000 (02:04 -0700)]
[PATCH] i386 apm.c optimization
- avoid expensive modulo (integer division) which happened
since APM_MAX_EVENTS is 20 (non-power-of-2)
- kill compiler warnings by initializing two variables
- add __read_mostly to some important static variables that are read often
(by idle loop etc.)
- constify several structures
Signed-off-by: Andreas Mohr <andi@lisas.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
L3 cache miss reduction of __copy_from_user_ll
samples %
37408 65.1412 vmlinux __copy_from_user_ll
23 0.1103 vmlinux __copy_user_zeroing_intel_nocache
23/37408=0.061% (99.94% reduction)
Top 5 of 2.6.12.4.nt
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % app name symbol name
128392 8.0274 vmlinux __copy_user_zeroing_intel_nocache
64206 4.0143 vmlinux journal_add_journal_head
59746 3.7355 vmlinux do_get_write_access
47674 2.9807 vmlinux journal_put_journal_head
46021 2.8774 vmlinux journal_dirty_metadata
pattern9-0-cpu4-0-09011728/summary.out
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x3f (multiple flags) count 3000
samples % app name symbol name
69755 4.2861 vmlinux __copy_user_zeroing_intel_nocache
55685 3.4215 vmlinux journal_add_journal_head
52371 3.2179 vmlinux __find_get_block
45504 2.7960 vmlinux journal_put_journal_head
36005 2.2123 vmlinux journal_stop
pattern9-0-cpu4-0-09011744/summary.out
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples % app name symbol name
1147 5.4994 vmlinux journal_add_journal_head
881 4.2240 vmlinux journal_dirty_data
872 4.1809 vmlinux blk_rq_map_sg
734 3.5192 vmlinux journal_commit_transaction
617 2.9582 vmlinux radix_tree_delete
pattern9-0-cpu4-0-09011731/summary.out
iozone results are
original 2.6.12.4 CPU time = 207.768 sec
cache aware CPU time = 184.783 sec
(three times run)
184.783/207.768=88.94% (11.06% reduction)
original:
pattern9-0-cpu4-0-08191720/iozone.out: CPU Utilization: Wall time 45.997 CPU time 64.527 CPU utilization 140.28 %
pattern9-0-cpu4-0-08191741/iozone.out: CPU Utilization: Wall time 46.878 CPU time 71.933 CPU utilization 153.45 %
pattern9-0-cpu4-0-08191743/iozone.out: CPU Utilization: Wall time 45.152 CPU time 71.308 CPU utilization 157.93 %
cache awre:
pattern9-0-cpu4-0-09011728/iozone.out: CPU Utilization: Wall time 44.842 CPU time 62.465 CPU utilization 139.30 %
pattern9-0-cpu4-0-09011731/iozone.out: CPU Utilization: Wall time 44.718 CPU time 59.273 CPU utilization 132.55 %
pattern9-0-cpu4-0-09011744/iozone.out: CPU Utilization: Wall time 44.367 CPU time 63.045 CPU utilization 142.10 %
Sergei Shtylyov [Fri, 23 Jun 2006 09:04:13 +0000 (02:04 -0700)]
[PATCH] Au1550/1200: add missing PSC #define's, make OSS driver use the proper ones
Add missing PSC #define's required for the drivers using PSC on DBAu1550
board (also fixing Au1550 PSC3 address) and all Au1200-based boards as
well. Make the OSS driver use the correct PSC definitions fo each board.
Signed-off-by: Sergei Shtylyov <sshtylyov@ru.mvista.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
David Quigley [Fri, 23 Jun 2006 09:04:01 +0000 (02:04 -0700)]
[PATCH] SELinux: add task_movememory hook
This patch adds new security hook, task_movememory, to be called when memory
owened by a task is to be moved (e.g. when migrating pages to a this hook is
identical to the setscheduler implementation, but a separate hook introduced
to allow this check to be specialized in the future if necessary.
Since the last posting, the hook has been renamed following feedback from
Christoph Lameter.
Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
David Quigley [Fri, 23 Jun 2006 09:04:00 +0000 (02:04 -0700)]
[PATCH] SELinux: add security hook call to mediate attach_task (kernel/cpuset.c)
Add a security hook call to enable security modules to control the ability
to attach a task to a cpuset. While limited control over this operation is
possible via permission checks on the pseudo fs interface, those checks are
not sufficient to control access to the target task, which is looked up in
this function. The existing task_setscheduler hook is re-used for this
operation since this falls under the same class of operations.
Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
David Quigley [Fri, 23 Jun 2006 09:03:59 +0000 (02:03 -0700)]
[PATCH] SELinux: add security hooks to {get,set}affinity
This patch adds LSM hooks into the setaffinity and getaffinity functions to
enable security modules to control these operations between tasks with
task_setscheduler and task_getscheduler LSM hooks.
Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
James Morris [Fri, 23 Jun 2006 09:03:58 +0000 (02:03 -0700)]
[PATCH] lsm: add task_setioprio hook
Implement an LSM hook for setting a task's IO priority, similar to the hook
for setting a tasks's nice value.
A previous version of this LSM hook was included in an older version of
multiadm by Jan Engelhardt, although I don't recall it being submitted
upstream.
Also included is the corresponding SELinux hook, which re-uses the setsched
permission in the proccess class.
Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Cc: Jan Engelhardt <jengelh@linux01.gwdg.de> Cc: Chris Wright <chrisw@sous-sol.org> Cc: Jens Axboe <axboe@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] move_pages: fix 32 -> 64 bit compat function
The definition of the third parameter is a pointer to an array of virtual
addresses which give us some trouble. The existing code calculated the
wrong address in the array since I used void to avoid having to specify a
type.
I now use the correct type "compat_uptr_t __user *" in the definition of
the function in kernel/compat.c.
However, I used __u32 in syscalls.h. Would have to include compat.h there
in order to provide the same definition which would generate an ugly
include situation.
On both ia64 and x86_64 compat_uptr_t is u32. So this works although
parameter declarations differ.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] sys_move_pages: 32bit support (i386, x86_64)
sys_move_pages() support for 32bit (i386 plus x86_64 compat layer)
Add support for move_pages() on i386 and also add the compat functions
necessary to run 32 bit binaries on x86_64.
Add compat_sys_move_pages to the x86_64 32bit binary layer. Note that it is
not up to date so I added the missing pieces. Not sure if this is done the
right way.
[akpm@osdl.org: compile fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] page migration: sys_move_pages(): support moving of individual pages
move_pages() is used to move individual pages of a process. The function can
be used to determine the location of pages and to move them onto the desired
node. move_pages() returns status information for each page.
long move_pages(pid, number_of_pages_to_move,
addresses_of_pages[],
nodes[] or NULL,
status[],
flags);
The addresses of pages is an array of void * pointing to the
pages to be moved.
The nodes array contains the node numbers that the pages should be moved
to. If a NULL is passed instead of an array then no pages are moved but
the status array is updated. The status request may be used to determine
the page state before issuing another move_pages() to move pages.
The status array will contain the state of all individual page migration
attempts when the function terminates. The status array is only valid if
move_pages() completed successfullly.
Possible page states in status[]:
0..MAX_NUMNODES The page is now on the indicated node.
-ENOENT Page is not present
-EACCES Page is mapped by multiple processes and can only
be moved if MPOL_MF_MOVE_ALL is specified.
-EPERM The page has been mlocked by a process/driver and
cannot be moved.
-EBUSY Page is busy and cannot be moved. Try again later.
-EFAULT Invalid address (no VMA or zero page).
-ENOMEM Unable to allocate memory on target node.
-EIO Unable to write back page. The page must be written
back in order to move it since the page is dirty and the
filesystem does not provide a migration function that
would allow the moving of dirty pages.
-EINVAL A dirty page cannot be moved. The filesystem does not provide
a migration function and has no ability to write back pages.
The flags parameter indicates what types of pages to move:
MPOL_MF_MOVE Move pages that are only mapped by the process.
MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
Requires sufficient capabilities.
Possible return codes from move_pages()
-ENOENT No pages found that would require moving. All pages
are either already on the target node, not present, had an
invalid address or could not be moved because they were
mapped by multiple processes.
-EINVAL Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
to migrate pages in a kernel thread.
-EPERM MPOL_MF_MOVE_ALL specified without sufficient priviledges.
or an attempt to move a process belonging to another user.
-EACCES One of the target nodes is not allowed by the current cpuset.
-ENODEV One of the target nodes is not online.
-ESRCH Process does not exist.
-E2BIG Too many pages to move.
-ENOMEM Not enough memory to allocate control array.
-EFAULT Parameters could not be accessed.
A test program for move_pages() may be found with the patches
on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3
From: Christoph Lameter <clameter@sgi.com>
Detailed results for sys_move_pages()
Pass a pointer to an integer to get_new_page() that may be used to
indicate where the completion status of a migration operation should be
placed. This allows sys_move_pags() to report back exactly what happened to
each page.
Wish there would be a better way to do this. Looks a bit hacky.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Jes Sorensen <jes@trained-monkey.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Andi Kleen <ak@muc.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] page migration: use allocator function for migrate_pages()
Instead of passing a list of new pages, pass a function to allocate a new
page. This allows the correct placement of MPOL_INTERLEAVE pages during page
migration. It also further simplifies the callers of migrate pages.
migrate_pages() becomes similar to migrate_pages_to() so drop
migrate_pages_to(). The batching of new page allocations becomes unnecessary.
[PATCH] page migration: handle freeing of pages in migrate_pages()
Do not leave pages on the lists passed to migrate_pages(). Seems that we will
not need any postprocessing of pages. This will simplify the handling of
pages by the callers of migrate_pages().
Currently migrate_pages() is mess with lots of goto. Extract two functions
from migrate_pages() and get rid of the gotos.
Plus we can just unconditionally set the locked bit on the new page since we
are the only one holding a reference. Locking is to stop others from
accessing the page once we establish references to the new page.
Remove the list_del from move_to_lru in order to have finer control over list
processing.
Kirill Korotaev [Fri, 23 Jun 2006 09:03:50 +0000 (02:03 -0700)]
[PATCH] printk() should not be called under zone->lock
This patch fixes printk() under zone->lock in show_free_areas(). It can be
unsafe to call printk() under this lock, since caller can try to
allocate/free some memory and selfdeadlock on this lock. I found
allocations/freeing mem both in netconsole and serial console.
This issue was faced in reallity when meminfo was periodically printed for
debug purposes and netconsole was used.
Andrew Morton [Fri, 23 Jun 2006 09:03:47 +0000 (02:03 -0700)]
[PATCH] initialise total_memory() earlier
Initialise total_memory earlier in boot. Because if for some reason we run
page reclaim early in boot, we don't want total_memory to be zero when we use
it as a divisor.
And rename total_memory to vm_total_pages to avoid naming clashes with
architectures.
Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Martin Bligh <mbligh@google.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Ingo Molnar [Fri, 23 Jun 2006 09:03:46 +0000 (02:03 -0700)]
[PATCH] mm/slab.c: fix early init assumption
The SLAB bootstrap code assumes that the first two kmalloc caches created
(the INDEX_AC and INDEX_L3 kmalloc caches) wont be off-slab. But due to AC
and L3 structure size increase in lockdep, one of them ended up being
off-slab, and subsequently crashing with:
The fix is to introduce a bootstrap flag and to use it to prevent off-slab
caches being created so early during bootup.
(The calculation for off-slab caches is quite complex so i didnt want to
complicate things with introducing yet another INDEX_ calculation, the flag
approach is simpler and smaller.)
Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hugh Dickins [Fri, 23 Jun 2006 09:03:45 +0000 (02:03 -0700)]
[PATCH] fix update_mmu_cache in fremap.c
There are two calls to update_mmu_cache in fremap.c, both defective.
The one in install_page needs to be accompanied by lazy_mmu_prot_update
(some other cleanup time, move that into ia64 update_mmu_cache itself); and
the one in install_file_pte should be removed since the pte is not present.
Hugh Dickins [Fri, 23 Jun 2006 09:03:44 +0000 (02:03 -0700)]
[PATCH] swapoff: use atomic_inc_not_zero() on mm_users
Now that we have atomic_inc_not_zero, it's more elegant for try_to_unuse to
use that on mm_users: doesn't actually matter at present, but safer to be
sure that once mm_users has gone to 0, nothing raises it for an instant.
David Howells [Fri, 23 Jun 2006 09:03:43 +0000 (02:03 -0700)]
[PATCH] add page_mkwrite() vm_operations method
Add a new VMA operation to notify a filesystem or other driver about the
MMU generating a fault because userspace attempted to write to a page
mapped through a read-only PTE.
This facility permits the filesystem or driver to:
(*) Implement storage allocation/reservation on attempted write, and so to
deal with problems such as ENOSPC more gracefully (perhaps by generating
SIGBUS).
(*) Delay making the page writable until the contents have been written to a
backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
It permits the filesystem to have some guarantee about the state of the
cache.
(*) Account and limit number of dirty pages. This is one piece of the puzzle
needed to make shared writable mapping work safely in FUSE.
Needed by cachefs (Or is it cachefiles? Or fscache? <head spins>).
At least four other groups have stated an interest in it or a desire to use
the functionality it provides: FUSE, OCFS2, NTFS and JFFS2. Also, things like
EXT3 really ought to use it to deal with the case of shared-writable mmap
encountering ENOSPC before we permit the page to be dirtied.
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
get_user_pages(.write=1, .force=1) can generate COW hits on read-only
shared mappings, this patch traps those as mkpage_write candidates and fails
to handle them the old way.
Signed-off-by: David Howells <dhowells@redhat.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Joel Becker <Joel.Becker@oracle.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pekka Enberg [Fri, 23 Jun 2006 09:03:40 +0000 (02:03 -0700)]
[PATCH] slab: verify pointers before free
Passing an invalid pointer to kfree() and kmem_cache_free() is likely to
cause bad memory corruption or even take down the whole system because the
bad pointer is likely reused immediately due to the per-CPU caches. Until
now, we don't do any verification for this if CONFIG_DEBUG_SLAB is
disabled.
As suggested by Linus, add PageSlab check to page_to_cache() and
page_to_slab() to verify pointers passed to kfree(). Also, move the
stronger check from cache_free_debugcheck() to kmem_cache_free() to ensure
the passed pointer actually belongs to the cache we're about to free the
object.
For page_to_cache() and page_to_slab(), the assertions should have
virtually no extra cost (two instructions, no data cache pressure) and for
kmem_cache_free() the overhead should be minimal.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Linus Torvalds <torvalds@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] More page migration: use migration entries for file pages
This implements the use of migration entries to preserve ptes of file backed
pages during migration. Processes can therefore be migrated back and forth
without loosing their connection to pagecache pages.
Note that we implement the migration entries only for linear mappings.
Nonlinear mappings still require the unmapping of the ptes for migration.
And another writepage() ugliness shows up. writepage() can drop the page
lock. Therefore we have to remove migration ptes before calling writepages()
in order to avoid having migration entries point to unlocked pages.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] More page migration: do not inc/dec rss counters
If we install a migration entry then the rss not really decreases since the
page is just moved somewhere else. We can save ourselves the work of
decrementing and later incrementing which will just eventually cause cacheline
bouncing.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This modifies the migration code to use the new migration entries. It now
becomes possible to migrate anonymous pages without having to add a swap
entry.
We add a couple of new functions to replace migration entries with the proper
ptes.
We cannot take the tree_lock for migrating anonymous pages anymore. However,
we know that we hold the only remaining reference to the page when the page
count reaches 1.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We take the upper two swapfiles for the two types of migration ptes and define
a series of macros in swapops.h.
The VM is modified to handle the migration entries. migration entries can
only be encountered when the page they are pointing to is locked. This limits
the number of places one has to fix. We also check in copy_pte_range and in
mprotect_pte_range() for migration ptes.
We check for migration ptes in do_swap_cache and call a function that will
then wait on the page lock. This allows us to effectively stop all accesses
to apge.
Migration entries are created by try_to_unmap if called for migration and
removed by local functions in migrate.c
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration (I've no NUMA, just
hacking it up to migrate recklessly while running load), I've hit the
BUG_ON(!PageLocked(p)) in migration_entry_to_page.
This comes from an orphaned migration entry, unrelated to the current
correctly locked migration, but hit by remove_anon_migration_ptes as it
checks an address in each vma of the anon_vma list.
Such an orphan may be left behind if an earlier migration raced with fork:
copy_one_pte can duplicate a migration entry from parent to child, after
remove_anon_migration_ptes has checked the child vma, but before it has
removed it from the parent vma. (If the process were later to fault on this
orphaned entry, it would hit the same BUG from migration_entry_wait.)
This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
not. There's no such problem with file pages, because vma_prio_tree_add
adds child vma after parent vma, and the page table locking at each end is
enough to serialize. Follow that example with anon_vma: add new vmas to the
tail instead of the head.
(There's no corresponding problem when inserting migration entries,
because a missed pte will leave the page count and mapcount high, which is
allowed for. And there's no corresponding problem when migrating via swap,
because a leftover swap entry will be correctly faulted. But the swapless
method has no refcounting of its entries.)
From: Ingo Molnar <mingo@elte.hu>
pte_unmap_unlock() takes the pte pointer as an argument.
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration, gcc has tried to exec
a pointer instead of a string: smells like COW mappings are not being
properly write-protected on fork.
The protection in copy_one_pte looks very convincing, until at last you
realize that the second arg to make_migration_entry is a boolean "write",
and SWP_MIGRATION_READ is 30.
Anyway, it's better done like in change_pte_range, using
is_write_migration_entry and make_migration_entry_read.
From: Hugh Dickins <hugh@veritas.com>
Remove unnecessary obfuscation from sys_swapon's range check on swap type,
which blew up causing memory corruption once swapless migration made
MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
[PATCH] page migration cleanup: move fallback handling into special function
Move the fallback code into a new fallback function and make the function
behave like any other migration function. This requires retaking the lock if
pageout() drops it.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] page migration cleanup: pass "mapping" to migration functions
Change handling of address spaces.
Pass a pointer to the address space in which the page is migrated to all
migration function. This avoids repeatedly having to retrieve the address
space pointer from the page and checking it for validity. The old page
mapping will change once migration has gone to a certain step, so it is less
confusing to have the pointer always available.
Move the setting of the mapping and index for the new page into
migrate_pages().
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] page migration cleanup: extract try_to_unmap from migration functions
Extract try_to_unmap and rename remove_references -> move_mapping
try_to_unmap() may significantly change the page state by for example setting
the dirty bit. It is therefore best to unmap in migrate_pages() before
calling any migration functions.
migrate_page_remove_references() will then only move the new page in place of
the old page in the mapping. Rename the function to
migrate_page_move_mapping().
This allows us to get rid of the special unmapping for the fallback path.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[PATCH] page migration cleanup: drop nr_refs in remove_references()
Drop nr_refs parameter from migrate_page_remove_references()
The nr_refs parameter is not really useful since the number of remaining
references is always
1 for anonymous pages without a mapping
2 for pages with a mapping
3 for pages with a mapping and PagePrivate set.
Remove the early check for the number of references since we are checking
page_mapcount() earlier. Ultimately only the refcount matters after the
tree_lock has been obtained.
Signed-off-by: Christoph Lameter <clameter@sgi.coim> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Remove the export for migrate_page_remove_references() and migrate_page_copy()
that are unlikely to be used directly by filesystems implementing migration.
The export was useful when buffer_migrate_page() lived in fs/buffer.c but it
has now been moved to migrate.c in the migration reorg.
Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
OGAWA Hirofumi [Fri, 23 Jun 2006 09:03:26 +0000 (02:03 -0700)]
[PATCH] writeback: fix range handling
When a writeback_control's `start' and `end' fields are used to
indicate a one-byte-range starting at file offset zero, the required
values of .start=0,.end=0 mean that the ->writepages() implementation
has no way of telling that it is being asked to perform a range
request. Because we're currently overloading (start == 0 && end == 0)
to mean "this is not a write-a-range request".
To make all this sane, the patch changes range of writeback_control.
So caller does: If it is calling ->writepages() to write pages, it
sets range (range_start/end or range_cyclic) always.
And if range_cyclic is true, ->writepages() thinks the range is
cyclic, otherwise it just uses range_start and range_end.
This patch does,
- Add LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h
-1 is usually ok for range_end (type is long long). But, if someone did,
range_end += val; range_end is "val - 1"
u64val = range_end >> bits; u64val is "~(0ULL)"
or something, they are wrong. So, this adds LLONG_MAX to avoid nasty
things, and uses LLONG_MAX for range_end.
- All callers of ->writepages() sets range_start/end or range_cyclic.
- Fix updates of ->writeback_index. It seems already bit strange.
If it starts at 0 and ended by check of nr_to_write, this last
index may reduce chance to scan end of file. So, this updates
->writeback_index only if range_cyclic is true or whole-file is
scanned.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Nathan Scott <nathans@sgi.com> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Steven French <sfrench@us.ibm.com> Cc: "Vladimir V. Saveliev" <vs@namesys.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Peter Zijlstra [Fri, 23 Jun 2006 09:03:25 +0000 (02:03 -0700)]
[PATCH] buglet in radix_tree_tag_set
The comment states: 'Setting a tag on a not-present item is a BUG.' Hence
if 'index' is larger than the maxindex; the item _cannot_ be presen; it
should also be a BUG.
Also, this allows the following statement (assume a fresh tree):
radix_tree_tag_set(root, 16, 1);
to fail silently, but when preceded by:
radix_tree_insert(root, 32, item);
it would BUG, because the height has been extended by the insert.
In neither case was 16 present.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Pekka Enberg [Fri, 23 Jun 2006 09:03:24 +0000 (02:03 -0700)]
[PATCH] slab: redzone double-free detection
At present our slab debugging tells us that it detected a double-free or
corruption - it does not distinguish between them. Sometimes it's useful
to be able to differentiate between these two types of information.
Add double-free detection to redzone verification when freeing an object.
As explained by Manfred, when we are freeing an object, both redzones
should be RED_ACTIVE. However, if both are RED_INACTIVE, we are trying to
free an object that was already free'd.
Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Fri, 23 Jun 2006 09:03:22 +0000 (02:03 -0700)]
[PATCH] radix-tree: small
Reduce radix tree node memory usage by about a factor of 4 for small files
(< 64K). There are pointer traversal and memory usage costs for large
files with dense pagecache.
Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nick Piggin [Fri, 23 Jun 2006 09:03:22 +0000 (02:03 -0700)]
[PATCH] radix-tree: direct data
The ability to have height 0 radix trees (a direct pointer to the data item
rather than going through a full node->slot) quietly disappeared with
old-2.6-bkcvs commit ffee171812d51652f9ba284302d9e5c5cc14bdfd. On 64-bit
machines this causes nearly 600 bytes to be used for every <= 4K file in
pagecache.
Re-introduce this feature, root tags stored in spare ->gfp_mask bits.
Simplify radix_tree_delete's complex tag clearing arrangement (which would
become even more complex) by just falling back to tag clearing functions
(the pagecache radix-tree never uses this path anyway, so the icache
savings will mean it's actually a speedup).
On my 4GB G5, this saves 8MB RAM per kernel kernel source+object tree in
pagecache.
Pagecache lookup, insertion, and removal speed for small files will also be
improved.
This makes RCU radix tree harder, but it's worth it.
Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Dean Nelson [Fri, 23 Jun 2006 09:03:21 +0000 (02:03 -0700)]
[PATCH] change gen_pool allocator to not touch managed memory
Modify the gen_pool allocator (lib/genalloc.c) to utilize a bitmap scheme
instead of the buddy scheme. The purpose of this change is to eliminate
the touching of the actual memory being allocated.
Since the change modifies the interface, a change to the uncached allocator
(arch/ia64/kernel/uncached.c) is also required.
Both Andrey Volkov and Jes Sorenson have expressed a desire that the
gen_pool allocator not write to the memory being managed. See the
following:
Rework the swsusp's memory shrinker in the following way:
- Simplify balance_pgdat() by removing all of the swsusp-related code
from it.
- Make shrink_all_memory() use shrink_slab() and a new function
shrink_all_zones() which calls shrink_active_list() and
shrink_inactive_list() directly for each zone in a way that's optimized
for suspend.
In shrink_all_memory() we try to free exactly as many pages as the caller
asks for, preferably in one shot, starting from easier targets. If slab
caches are huge, they are most likely to have enough pages to reclaim.
The inactive lists are next (the zones with more inactive pages go first)
etc.
Each time shrink_all_memory() attempts to shrink the active and inactive
lists for each zone in 5 passes. In the first pass, only the inactive
lists are taken into consideration. In the next two passes the active
lists are also shrunk, but mapped pages are not reclaimed. In the last
two passes the active and inactive lists are shrunk and mapped pages are
reclaimed as well. The aim of this is to alter the reclaim logic to choose
the best pages to keep on resume and improve the responsiveness of the
resumed system.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The last ifdef addition hit the ugliness treshold on this functions, so:
- rename the variable i to nr_pages so it's somewhat descriptive
- remove the addr variable and do the page_address call at the very end
- instead of ifdef'ing the whole alloc_pages_node call just make the
__GFP_COMP addition to flags conditional
- rewrite the __GFP_COMP comment to make sense
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Chen, Kenneth W [Fri, 23 Jun 2006 09:03:15 +0000 (02:03 -0700)]
[PATCH] tightening hugetlb strict accounting
Current hugetlb strict accounting for shared mapping always assume mapping
starts at zero file offset and reserves pages between zero and size of the
file. This assumption often reserves (or lock down) a lot more pages then
necessary if application maps at none zero file offset. libhugetlbfs is
one example that requires proper reservation on shared mapping starts at
none zero offset.
This patch extends the reservation and hugetlb strict accounting to support
any arbitrary pair of (offset, len), resulting a much more robust and
accurate scheme. More importantly, it won't lock down any hugetlb pages
outside file mapping.
Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Andreas Dilger [Fri, 23 Jun 2006 09:03:14 +0000 (02:03 -0700)]
[PATCH] reserve space for swap label
Reserve space in the swap disk header for a LABEL and UUID to be specified.
This has been possible with util-linux-2.12b (via e2fsprogs 1.36
libblkid), and is used by at least FC3 and later. The kernel doesn't
really care about this, but the space shouldn't accidentally be used by
something else either.
Also make the on-disk structures be fixed-size types, instead of "int",
though I don't know of any architecture in use where an "int" isn't the
same size as a "__u32" (all current kernel arches have it as "unsigned
int").
Signed-off-by: Andreas Dilger <adilger@shaw.ca> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When sysctl vm.panic_on_oom = 1, the kernel panics intead of killing rogue
processes. And if vm.panic_on_oom is 0 the kernel will do oom_kill() in
the same way as it does today. Of course, the default value is 0 and only
root can modifies it.
In general, oom_killer works well and kill rogue processes. So the whole
system can survive. But there are environments where panic is preferable
rather than kill some processes.
Andy Whitcroft [Fri, 23 Jun 2006 09:03:12 +0000 (02:03 -0700)]
[PATCH] squash duplicate page_to_pfn and pfn_to_page
We have architectures where the size of page_to_pfn and pfn_to_page are
significant enough to overall image size that they wish to push them out of
line. However, in the process we have grown a second copy of the
implementation of each of these routines for each memory model. Share the
implmentation exposing it either inline or out-of-line as required.
Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Yasunori Goto [Fri, 23 Jun 2006 09:03:11 +0000 (02:03 -0700)]
[PATCH] wait_table and zonelist initializing for memory hotadd: update zonelists
In current code, zonelist is considered to be build once, no modification.
But MemoryHotplug can add new zone/pgdat. It must be updated.
This patch modifies build_all_zonelists(). By this, build_all_zonelist() can
reconfig pgdat's zonelists.
To update them safety, this patch use stop_machine_run(). Other cpus don't
touch among updating them by using it.
In old version (V2 of node hotadd), kernel updated them after zone
initialization. But present_page of its new zone is still 0, because
online_page() is not called yet at this time. Build_zonelists() checks
present_pages to find present zone. It was too early. So, I changed it after
online_pages().
Yasunori Goto [Fri, 23 Jun 2006 09:03:10 +0000 (02:03 -0700)]
[PATCH] wait_table and zonelist initializing for memory hotadd: wait_table initialization
Wait_table is initialized according to zone size at boot time. But, we cannot
know the maixmum zone size when memory hotplug is enabled. It can be
changed.... And resizing of wait_table is hard.
So kernel allocate and initialzie wait_table as its maximum size.
Yasunori Goto [Fri, 23 Jun 2006 09:03:10 +0000 (02:03 -0700)]
[PATCH] wait_table and zonelist initializing for memory hotadd: add return code for init_current_empty_zone
When add_zone() is called against empty zone (not populated zone), we have to
initialize the zone which didn't initialize at boot time. But,
init_currently_empty_zone() may fail due to allocation of wait table. So,
this patch is to catch its error code.
Pekka Enberg [Fri, 23 Jun 2006 09:03:07 +0000 (02:03 -0700)]
[PATCH] slab: page mapping cleanup
Clean up slab allocator page mapping a bit. The memory allocated for a
slab is physically contiguous so it is okay to assume struct pages are too
so kill the long-standing comment. Furthermore, rename set_slab_attr to
slab_map_pages and add a comment explaining why its needed.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>