git.karo-electronics.de Git - mv-sheeva.git/log

hugetlb: update hugetlb documentation for NUMA controls

Update the kernel huge tlb documentation to describe the numa memory
policy based huge page management. Additionaly, the patch includes a fair
amount of rework to improve consistency, eliminate duplication and set the
context for documenting the memory policy interaction.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

hugetlb: add per node hstate attributes

Add the per huge page size control/query attributes to the per node
sysdevs:

/sys/devices/system/node/node<ID>/hugepages/hugepages-<size>/
nr_hugepages       - r/w
free_huge_pages    - r/o
surplus_huge_pages - r/o

The patch attempts to re-use/share as much of the existing global hstate
attribute initialization and handling, and the "nodes_allowed" constraint
processing as possible.

Calling set_max_huge_pages() with no node indicates a change to global
hstate parameters.  In this case, any non-default task mempolicy will be
used to generate the nodes_allowed mask.  A valid node id indicates an
update to that node's hstate parameters, and the count argument specifies
the target count for the specified node.  From this info, we compute the
target global count for the hstate and construct a nodes_allowed node mask
contain only the specified node.

Setting the node specific nr_hugepages via the per node attribute
effectively ignores any task mempolicy or cpuset constraints.

With this patch:

(me):ls /sys/devices/system/node/node0/hugepages/hugepages-2048kB
./  ../  free_hugepages  nr_hugepages  surplus_hugepages

Starting from:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     0
Node 2 HugePages_Free:      0
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 0

Allocate 16 persistent huge pages on node 2:
(me):echo 16 >/sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages

[Note that this is equivalent to:
numactl -m 2 hugeadmin --pool-pages-min 2M:+16
]

Yields:
Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:    16
Node 2 HugePages_Free:     16
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0
vm.nr_hugepages = 16

Global controls work as expected--reduce pool to 8 persistent huge pages:
(me):echo 8 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

Node 0 HugePages_Total:     0
Node 0 HugePages_Free:      0
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
Node 2 HugePages_Total:     8
Node 2 HugePages_Free:      8
Node 2 HugePages_Surp:      0
Node 3 HugePages_Total:     0
Node 3 HugePages_Free:      0
Node 3 HugePages_Surp:      0

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

hugetlb: add generic definition of NUMA_NO_NODE

Move definition of NUMA_NO_NODE from ia64 and x86_64 arch specific headers
to generic header 'linux/numa.h' for use in generic code. NUMA_NO_NODE
replaces bare '-1' where it's used in this series to indicate "no node id
specified". Ultimately, it can be used to replace the -1 elsewhere where
it is used similarly.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

hugetlb: derive huge pages nodes allowed from task mempolicy

This patch derives a "nodes_allowed" node mask from the numa mempolicy of
the task modifying the number of persistent huge pages to control the
allocation, freeing and adjusting of surplus huge pages when the pool page
count is modified via the new sysctl or sysfs attribute
"nr_hugepages_mempolicy".  The nodes_allowed mask is derived as follows:

* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
  is produced.  This will cause the hugetlb subsystem to use
  node_online_map as the "nodes_allowed".  This preserves the
  behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
  a nodemask with the single preferred node will be produced.
  "local" policy will NOT track any internode migrations of the
  task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
  will be used.
* Other than to inform the construction of the nodes_allowed node
  mask, the actual mempolicy mode is ignored.  That is, all modes
  behave like interleave over the resulting nodes_allowed mask
  with no "fallback".

See the updated documentation [next patch] for more information
about the implications of this patch.

Examples:

Starting with:

Node 0 HugePages_Total:     0
Node 1 HugePages_Total:     0
Node 2 HugePages_Total:     0
Node 3 HugePages_Total:     0

Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:

sysctl vm.nr_hugepages[_mempolicy]=32

yields:

Node 0 HugePages_Total:     8
Node 1 HugePages_Total:     8
Node 2 HugePages_Total:     8
Node 3 HugePages_Total:     8

Of course, we only have nr_hugepages_mempolicy with the patch,
but with default mempolicy, nr_hugepages_mempolicy behaves the
same as nr_hugepages.

Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes.  So, starting from the
condition above, with 8 huge pages per node, add 8 more to
node 2 using:

numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40

This yields:

Node 0 HugePages_Total:     8
Node 1 HugePages_Total:     8
Node 2 HugePages_Total:    16
Node 3 HugePages_Total:     8

The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.

Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:

numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32

yields:

Node 0 HugePages_Total:     4
Node 1 HugePages_Total:     4
Node 2 HugePages_Total:    16
Node 3 HugePages_Total:     8

The 8 huge pages freed were balanced over nodes 0 and 1.

[rientjes@google.com: accomodate reworked NODEMASK_ALLOC]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

hugetlb: factor init_nodemask_of_node()

Factor init_nodemask_of_node() out of the nodemask_of_node() macro.

This will be used to populate the huge pages "nodes_allowed" nodemask for
a single node when basing nodes_allowed on a preferred/local mempolicy or
when a persistent huge page pool page count is modified via a per node
sysfs attribute.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

hugetlb: add nodemask arg to huge page alloc, free and surplus adjust functions

In preparation for constraining huge page allocation and freeing by the
controlling task's numa mempolicy, add a "nodes_allowed" nodemask pointer
to the allocate, free and surplus adjustment functions.  For now, pass
NULL to indicate default behavior--i.e., use node_online_map.  A
subsqeuent patch will derive a non-default mask from the controlling
task's numa mempolicy.

Note that this method of updating the global hstate nr_hugepages under the
constraint of a nodemask simplifies keeping the global state
consistent--especially the number of persistent and surplus pages relative
to reservations and overcommit limits.  There are undoubtedly other ways
to do this, but this works for both interfaces: mempolicy and per node
attributes.

[rientjes@google.com: fix HIGHMEM compile error]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

hugetlb: rework hstate_next_node_* functions

Modify the hstate_next_node* functions to allow them to be called to
obtain the "start_nid". Then, whereas prior to this patch we
unconditionally called hstate_next_node_to_{alloc|free}(), whether or not
we successfully allocated/freed a huge page on the node, now we only call
these functions on failure to alloc/free to advance to next allowed node.

Factor out the next_node_allowed() function to handle wrap at end of
node_online_map. In this version, the allowed nodes include all of the
online nodes.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

nodemask: make NODEMASK_ALLOC more general

This is a series of patches to provide control over the location of the
allocation and freeing of persistent huge pages on a NUMA platform.
Please consider for merging into mmotm.

This series uses two mechanisms to constrain the nodes from which
persistent huge pages are allocated: 1) the task NUMA mempolicy of the
task modifying a new sysctl "nr_hugepages_mempolicy", based on a
suggestion by Mel Gorman; and 2) a subset of the hugepages hstate sysfs
attributes have been added [in V4] to each node system device under:

/sys/devices/node/node[0-9]*/hugepages

The per node attibutes allow direct assignment of a huge page count on a
specific node, regardless of the task's mempolicy or cpuset constraints.

This patch:

NODEMASK_ALLOC(x, m) assumes x is a type of struct, which is unnecessary.
It's perfectly reasonable to use this macro to allocate a nodemask_t,
which is anonymous, either dynamically or on the stack depending on
NODES_SHIFT.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mm: move inc_zone_page_state(NR_ISOLATED) to just isolated place

Christoph pointed out inc_zone_page_state(NR_ISOLATED) should be placed
in right after isolate_page().

This patch does it.

Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

/dev/mem: remove redundant parameter from do_write_kmem()

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

/dev/mem: remove the "written" variable in write_kmem()

Also rename "len" to "sz". No behavior change.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

/dev/mem: make size_inside_page() logic straight

Also convert more size_inside_page() users.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Avi Kivity <avi@qumranet.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

/dev/mem: cleanup unxlate_dev_mem_ptr() calls

No behaviour change.

[akpm@linux-foundation.org: cleanuplets]
[akpm@linux-foundation.org: remove unused `ret']
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

/dev/mem: introduce size_inside_page()

Introduce size_inside_page() to replace duplicate /dev/mem code.

Also apply it to /dev/kmem, whose alignment logic was buggy.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

/dev/mem: remove redundant test on len

The len test in write_kmem() is always true, so can be reduced.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

mmap: don't return ENOMEM when mapcount is temporarily exceeded in munmap()

On ia64, the following test program exit abnormally, because glibc thread
library called abort().

========================================================
(gdb) bt
#0  0xa000000000010620 in __kernel_syscall_via_break ()
#1  0x20000000003208e0 in raise () from /lib/libc.so.6.1
#2  0x2000000000324090 in abort () from /lib/libc.so.6.1
#3  0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
#4  0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
#5  0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
========================================================

The fact is, glibc call munmap() when thread exitng time for freeing
stack, and it assume munlock() never fail.  However, munmap() often make
vma splitting and it with many mapcount make -ENOMEM.

Oh well, that's crazy, because stack unmapping never increase mapcount.
The maxcount exceeding is only temporary.  internal temporary exceeding
shouldn't make ENOMEM.

This patch does it.

test_max_mapcount.c
==================================================================
  #include<stdio.h>
  #include<stdlib.h>
  #include<string.h>
  #include<pthread.h>
  #include<errno.h>
  #include<unistd.h>

  #define THREAD_NUM 30000
  #define MAL_SIZE (8*1024*1024)

void *wait_thread(void *args)
{
void *addr;

addr = malloc(MAL_SIZE);
sleep(10);

return NULL;
}

void *wait_thread2(void *args)
{
sleep(60);

return NULL;
}

int main(int argc, char *argv[])
{
int i;
pthread_t thread[THREAD_NUM], th;
int ret, count = 0;
pthread_attr_t attr;

ret = pthread_attr_init(&attr);
if(ret) {
perror("pthread_attr_init");
}

ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
if(ret) {
perror("pthread_attr_setdetachstate");
}

for (i = 0; i < THREAD_NUM; i++) {
ret = pthread_create(&th, &attr, wait_thread, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;

ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
if(ret) {
fprintf(stderr, "[%d] ", count);
perror("pthread_create");
} else {
printf("[%d] create OK.\n", count);
}
count++;
}

sleep(3600);
return 0;
}
==================================================================

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

page-types: exit early when invoked with -d|--describe

On a system with large amount of memory (256GB), invoking page-types can
take quite a long time, which is unreasonable considering the user only
wants a description of the flags:

# time ./page-types -d 0x10
0x0000000000000010 ____D_____________________________ dirty

real 0m34.285s
user 0m1.966s
sys 0m32.313s

This is because we still walk the entire address range.

Exiting early seems like a reasonble solution:

# time ./page-types -d 0x10
0x0000000000000010 ____D_____________________________ dirty

real 0m0.007s
user 0m0.001s
sys 0m0.005s

Signed-off-by: Alex Chiang <achiang@hp.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Haicheng Li <haicheng.li@intel.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

page-types: whitespace alignment

Align the output when page-type -h is invoked.

Signed-off-by: Alex Chiang <achiang@hp.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

page-types: learn to describe flags directly from command line

Teach page-types to describe page flags directly from the command line.

Why is this useful? For instance, if you're using memory hotplug and see
this in /var/log/messages:

kernel: removing from LRU failed 3836dd0/1/1e00000000000010

It would be nice to decode those page flags without staring at the source.

Example usage and output:

# Documentation/vm/page-types -d 0x10
0x0000000000000010 ____D_____________________________ dirty

# Documentation/vm/page-types -d anon
0x0000000000001000 ____________a_____________________ anonymous

# Documentation/vm/page-types -d anon,0x10
0x0000000000001010 ____D_______a_____________________ dirty,anonymous

[achiang@hp.com: documentation]
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Haicheng Li <haicheng.li@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

page-types: unsigned cannot be less than 0 in add_page()

If not signed, testing of the read() return value in this function
will not work.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

page-types: constify read only arrays

Signed-off-by: Tommi Rantala <tt.rantala@gmail.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

oom: dump stack and VM state when oom killer panics

The oom killer header, including information such as the allocation order
and gfp mask, current's cpuset and memory controller, call trace, and VM
state information is currently only shown when the oom killer has selected
a task to kill.

This information is omitted, however, when the oom killer panics either
because of panic_on_oom sysctl settings or when no killable task was
found. It is still relevant to know crucial pieces of information such as
the allocation order and VM state when diagnosing such issues, especially
at boot.

This patch displays the oom killer header whenever it panics so that bug
reports can include pertinent information to debug the issue, if possible.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

MAINTAINERS: new kbuild maintainer

Sam was fine with handing over kbuild maintainership to me. The git
trees are already in linux-next, a merge request will follow shortly.

Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Michal Marek <mmarek@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

hfs: fix a potential buffer overflow

A specially-crafted Hierarchical File System (HFS) filesystem could cause
a buffer overflow to occur in a process's kernel stack during a memcpy()
call within the hfs_bnode_read() function (at fs/hfs/bnode.c:24). The
attacker can provide the source buffer and length, and the destination
buffer is a local variable of a fixed length. This local variable (passed
as "&entry" from fs/hfs/dir.c:112 and allocated on line 60) is stored in
the stack frame of hfs_bnode_read()'s caller, which is hfs_readdir().
Because the hfs_readdir() function executes upon any attempt to read a
directory on the filesystem, it gets called whenever a user attempts to
inspect any filesystem contents.

[amwang@redhat.com: modify this patch and fix coding style problems]
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Eugene Teo <eteo@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bsdacct: fix uid/gid misreporting

commit d8e180dcd5bbbab9cd3ff2e779efcf70692ef541 "bsdacct: switch
credentials for writing to the accounting file" introduced credential
switching during final acct data collecting. However, uid/gid pair
continued to be collected from current which became credentials of who
created acct file, not who exits.

Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14676

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reported-by: Juho K. Juopperi <jkj@kapsi.fi>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Reviewed-by: Michal Schmidt <mschmidt@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'i2c-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging

* 'i2c-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jdelvare/staging:
  i2c-core: i2c bus should support PM entries in struct dev_pm_ops
  i2c: Get rid of I2C_CLIENT_MODULE_PARM
  i2c: Drop I2C_CLIENT_INSMOD_2 to 8
  i2c: Drop I2C_CLIENT_INSMOD_1
  i2c: Get rid of struct i2c_client_address_data
  i2c: Drop the kind parameter from detect callbacks

Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
  udf: Avoid IO in udf_clear_inode
  udf: Try harder when looking for VAT inode
  udf: Fix compilation with UDFFS_DEBUG enabled

udf: Avoid IO in udf_clear_inode

It is not very good to do IO in udf_clear_inode. First, VFS does not really
expect inode to become dirty there and thus we have to write it ourselves,
second, memory reclaim gets blocked waiting for IO when it does not really
expect it, third, the IO pattern (e.g. on umount) resulting from writes in
udf_clear_inode is bad and it slows down writing a lot.

The reason why UDF needed to do IO in udf_clear_inode is that UDF standard
mandates extent length to exactly match inode size. But when we allocate
extents to a file or directory, we don't really know what exactly the final
file size will be and thus temporarily set it to block boundary and later
truncate it to exact length in udf_clear_inode. Now, this is changed to
truncate to final file size in udf_release_file for regular files. For
directories and symlinks, we do the truncation at the moment when learn
what the final file size will be.

Signed-off-by: Jan Kara <jack@suse.cz>

udf: Try harder when looking for VAT inode

Some disks do not contain VAT inode in the last recorded block as required
by the standard but a few blocks earlier (or the number of recorded blocks
is wrong). So look for the VAT inode a bit before the end of the media.

Signed-off-by: Jan Kara <jack@suse.cz>

udf: Fix compilation with UDFFS_DEBUG enabled

Signed-off-by: Jan Kara <jack@suse.cz>

Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, mce: Clean up thermal init by introducing intel_thermal_supported()
  x86, mce: Thermal monitoring depends on APIC being enabled
  x86: Gart: fix breakage due to IOMMU initialization cleanup
  x86: Move swiotlb initialization before dma32_free_bootmem
  x86: Fix build warning in arch/x86/mm/mmio-mod.c
  x86: Remove usedac in feature-removal-schedule.txt
  x86: Fix duplicated UV BAU interrupt vector
  nvram: Fix write beyond end condition; prove to gcc copy is safe
  mm: Adjust do_pages_stat() so gcc can see copy_from_user() is safe
  x86: Limit the number of processor bootup messages
  x86: Remove enabling x2apic message for every CPU
  doc: Add documentation for bootloader_{type,version}
  x86, msr: Add support for non-contiguous cpumasks
  x86: Use find_e820() instead of hard coded trampoline address
  x86, AMD: Fix stale cpuid4_info shared_map data in shared_cpu_map cpumasks

Trivial percpu-naming-introduced conflicts in arch/x86/kernel/cpu/intel_cacheinfo.c

Merge git://git.kernel.org/pub/scm/linux/kernel/git/brodo/pcmcia-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/brodo/pcmcia-2.6:
pcmcia: CodingStyle fixes
pcmcia: remove unused IRQ_FIRST_SHARED

i2c-core: i2c bus should support PM entries in struct dev_pm_ops

Struct dev_pm_ops is not configured in current i2c bus type. i2c drivers
only depends on suspend/resume entries in struct dev_pm_ops are not
informed of PM suspend and resume events by i2c framework.

Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
Signed-off-by: Jean Delvare <khali@linux-fr.org>

i2c: Get rid of I2C_CLIENT_MODULE_PARM

There is no user left of I2C_CLIENT_MODULE_PARM, so we can finally
get rid of this ugly macro.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Tested-by: Wolfram Sang <w.sang@pengutronix.de>

i2c: Drop I2C_CLIENT_INSMOD_2 to 8

These macros simply declare an enum, so drivers might as well declare
it themselves. This puts an end to the arbitrary limit of 8 chip types
per i2c driver.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Tested-by: Wolfram Sang <w.sang@pengutronix.de>

i2c: Drop I2C_CLIENT_INSMOD_1

This macro simply declares an enum, so drivers might as well declare
it themselves.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Tested-by: Wolfram Sang <w.sang@pengutronix.de>

i2c: Get rid of struct i2c_client_address_data

Struct i2c_client_address_data only contains one field at this point,
which makes its usefulness questionable. Get rid of it and pass simple
address lists around instead.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Tested-by: Wolfram Sang <w.sang@pengutronix.de>

i2c: Drop the kind parameter from detect callbacks

The "kind" parameter always has value -1, and nobody is using it any
longer, so we can remove it.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Tested-by: Wolfram Sang <w.sang@pengutronix.de>

Merge branch 'next-spi' of git://git.secretlab.ca/git/linux-2.6

* 'next-spi' of git://git.secretlab.ca/git/linux-2.6: (23 commits)
  spi: fix probe/remove section markings
  Add OMAP spi100k driver
  spi-imx: don't access struct device directly but use dev_get_platdata
  spi-imx: Add mx25 support
  spi-imx: use positive logic to distinguish cpu variants
  spi-imx: correct check for platform_get_irq failing
  ARM: NUC900: Add spi driver support for nuc900
  spi: SuperH MSIOF SPI Master driver V2
  spi: fix spidev compilation failure when VERBOSE is defined
  spi/au1550_spi: fix setupxfer not to override cfg with zeros
  spi/mpc8xxx: don't use __exit_p to wrap plat_mpc8xxx_spi_remove
  spi/i.MX: fix broken error handling for gpio_request
  spi/i.mx: drain MXC SPI transfer buffer when probing device
  MAINTAINERS: add SPI co-maintainer.
  spi/xilinx_spi: fix incorrect casting
  spi/mpc52xx-spi: minor cleanups
  xilinx_spi: add a platform driver using the xilinx_spi common module.
  xilinx_spi: add support for the DS570 IP.
  xilinx_spi: Switch to iomem functions and support little endian.
  xilinx_spi: Split into of driver and generic part.
  ...

Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  perf sched: Fix build failure on sparc
  perf bench: Add "all" pseudo subsystem and "all" pseudo suite
  perf tools: Introduce perf_session class
  perf symbols: Ditch dso->find_symbol
  perf symbols: Allow lookups by symbol name too
  perf symbols: Add missing "Variables" entry to map_type__name
  perf symbols: Add support for 'variable' symtabs
  perf symbols: Introduce ELF counterparts to symbol_type__is_a
  perf symbols: Introduce symbol_type__is_a
  perf symbols: Rename kthreads to kmaps, using another abstraction for it
  perf tools: Allow building for ARM
  hw-breakpoints: Handle bad modify_user_hw_breakpoint off-case return value
  perf tools: Allow cross compiling
  tracing, slab: Fix no callsite ifndef CONFIG_KMEMTRACE
  tracing, slab: Define kmem_cache_alloc_notrace ifdef CONFIG_TRACING

Trivial conflict due to different fixes to modify_user_hw_breakpoint()
in include/linux/hw_breakpoint.h

PCI: Global variable decls must match the defs in section attributes

Global variable declarations must match the definitions in section attributes
as the compiler is at liberty to vary the method it uses to access a variable,
depending on the section it is in.

When building the FRV arch, I now see:

  drivers/built-in.o: In function `pci_apply_final_quirks':
  drivers/pci/quirks.c:2606: relocation truncated to fit: R_FRV_GPREL12 against symbol `pci_dfl_cache_line_size' defined in .devinit.data section in drivers/built-in.o
  drivers/pci/quirks.c:2623: relocation truncated to fit: R_FRV_GPREL12 against symbol `pci_dfl_cache_line_size' defined in .devinit.data section in drivers/built-in.o
  drivers/pci/quirks.c:2630: relocation truncated to fit: R_FRV_GPREL12 against symbol `pci_dfl_cache_line_size' defined in .devinit.data section in drivers/built-in.o

because the declaration of pci_dfl_cache_line_size in linux/pci.h does not
match the definition in drivers/pci/pci.c.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

FRV: Fix no-hardware-breakpoint case

If there is no hardware breakpoint support, modify_user_hw_breakpoint()
tries to return a NULL pointer through as an 'int' return value:

  In file included from kernel/exit.c:53:
  include/linux/hw_breakpoint.h: In function 'modify_user_hw_breakpoint':
  include/linux/hw_breakpoint.h:96: warning: return makes integer from pointer without a cast

Return 0 instead.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'for-linus' of git://git.monstr.eu/linux-2.6-microblaze

* 'for-linus' of git://git.monstr.eu/linux-2.6-microblaze: (46 commits)
  microblaze: Remove rt_sigsuspend wrapper
  microblaze: nommu: Don't clobber R11 on syscalls
  microblaze: Remove show_tmem function
  microblaze: Support for WB cache
  microblaze: Add PVR for Microblaze v7.30.a
  microblaze: Remove ancient and fake microblaze version from cpu_ver table
  microblaze: Remove panic_timeout init value
  microblaze: Do not count system calls in default
  microblaze: Enable DTC compilation
  microblaze: Core oprofile configs and hooks
  microblaze: Fix level interrupt ACKing
  microblaze: Enable futimesat syscall
  microblaze: Checking DTS against PVR for write-back cache
  microblaze: Remove duplicity from pgalloc.h
  microblaze: Futex support
  microblaze: Adding dev_arch_data functions
  microblaze: Fix the heartbeat gpio to be more robust
  microblaze: Simple __copy_tofrom_user for noMMU
  microblaze: Export memory_start for modules
  microblaze: Use lowest-common-denominator default CPU settings
  ...

Merge branch 'for-linus' of git://neil.brown.name/md

* 'for-linus' of git://neil.brown.name/md: (27 commits)
  md: add 'recovery_start' per-device sysfs attribute
  md: rcu_read_lock() walk of mddev->disks in md_do_sync()
  md: integrate spares into array at earliest opportunity.
  md: move compat_ioctl handling into md.c
  md: revise Kconfig help for MD_MULTIPATH
  md: add MODULE_DESCRIPTION for all md related modules.
  raid: improve MD/raid10 handling of correctable read errors.
  md/raid10: print more useful messages on device failure.
  md/bitmap: update dirty flag when bitmap bits are explicitly set.
  md: Support write-intent bitmaps with externally managed metadata.
  md/bitmap: move setting of daemon_lastrun out of bitmap_read_sb
  md: support updating bitmap parameters via sysfs.
  md: factor out parsing of fixed-point numbers
  md: support bitmap offset appropriate for external-metadata arrays.
  md: remove needless setting of thread->timeout in raid10_quiesce
  md: change daemon_sleep to be in 'jiffies' rather than 'seconds'.
  md: move offset, daemon_sleep and chunksize out of bitmap structure
  md: collect bitmap-specific fields into one structure.
  md/raid1: add takeover support for raid5->raid1
  md: add honouring of suspend_{lo,hi} to raid1.
  ...

Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6: (58 commits)
  mfd: Add twl6030 regulator subdevices
  regulator: Add support for twl6030 regulators
  rtc: Add twl6030 RTC support
  mfd: Add support for twl6030 irq framework
  mfd: Rename twl4030_ routines in twl-regulator.c
  mfd: Rename twl4030_ routines in rtc-twl.c
  mfd: Rename all twl4030_i2c*
  mfd: Rename twl4030* driver files to enable re-use
  mfd: Clarify twl4030 return value for read and write
  mfd: Add all twl4030 regulators to the twl4030 mfd driver
  mfd: Don't set mc13783 ADREFMODE for touch conversions
  mfd: Remove ezx-pcap defines for custom led gpio encoding
  mfd: Near complete mc13783 rewrite
  mfd: Remove build time warning for WM835x register default tables
  mfd: Force I2C to be built in when building WM831x
  mfd: Don't allow wm831x to be built as a module
  mfd: Fix incorrect error check for wm8350-core
  mfd: Fix twl4030 warning
  gpiolib: Implement gpio_to_irq() for wm831x
  mfd: Remove default selection of AB4500
  ...

Merge branch 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm

* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm:
  ARM: fix lh7a40x build
  ARM: fix sa1100 build
  ARM: fix clps711x, footbridge, integrator, ixp2000, ixp2300 and s3c build bug
  ARM: VFP: fix vfp thread init bug and document vfp notifier entry conditions
  ARM: pxa: fix now incorrect reference of skt->irq by using skt->socket.pci_irq
  [ARM] pxa/zeus: default configuration for Arcom Zeus SBC.
  [ARM] pxa/zeus: make Viper pcmcia support more generic to support Zeus
  [ARM] pxa/zeus: basic support for Arcom Zeus SBC
  [ARM] pxa/em-x270: fix usb hub power up/reset sequence
  PCMCIA: fix pxa2xx_lubbock modular build error
  ARM: RealView: Fix typo in the RealView/PBX Kconfig entry
  ARM: Do not allow the probing of the local timer
  ARM: Add an earlyprintk debug console

Merge git://git.linux-nfs.org/projects/trondmy/nfs-2.6

* git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (75 commits)
  NFS: Fix nfs_migrate_page()
  rpc: remove unneeded function parameter in gss_add_msg()
  nfs41: Invoke RECLAIM_COMPLETE on all new client ids
  SUNRPC: IS_ERR/PTR_ERR confusion
  NFSv41: Fix a potential state leakage when restarting nfs4_close_prepare
  nfs41: Handle NFSv4.1 session errors in the delegation recall code
  nfs41: Retry delegation return if it failed with session error
  nfs41: Handle session errors during delegation return
  nfs41: Mark stateids in need of reclaim if state manager gets stale clientid
  NFS: Fix up the declaration of nfs4_restart_rpc when NFSv4 not configured
  nfs41: Don't clear DRAINING flag on NFS4ERR_STALE_CLIENTID
  nfs41: nfs41_setup_state_renewal
  NFSv41: More cleanups
  NFSv41: Fix up some bugs in the NFS4CLNT_SESSION_DRAINING code
  NFSv41: Clean up slot table management
  NFSv41: Fix nfs4_proc_create_session
  nfs41: Invoke RECLAIM_COMPLETE
  nfs41: RECLAIM_COMPLETE functionality
  nfs41: RECLAIM_COMPLETE XDR functionality
  Cleanup some NFSv4 XDR decode comments
  ...

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (34 commits)
  m68k: rename global variable vmalloc_end to m68k_vmalloc_end
  percpu: add missing per_cpu_ptr_to_phys() definition for UP
  percpu: Fix kdump failure if booted with percpu_alloc=page
  percpu: make misc percpu symbols unique
  percpu: make percpu symbols in ia64 unique
  percpu: make percpu symbols in powerpc unique
  percpu: make percpu symbols in x86 unique
  percpu: make percpu symbols in xen unique
  percpu: make percpu symbols in cpufreq unique
  percpu: make percpu symbols in oprofile unique
  percpu: make percpu symbols in tracer unique
  percpu: make percpu symbols under kernel/ and mm/ unique
  percpu: remove some sparse warnings
  percpu: make alloc_percpu() handle array types
  vmalloc: fix use of non-existent percpu variable in put_cpu_var()
  this_cpu: Use this_cpu_xx in trace_functions_graph.c
  this_cpu: Use this_cpu_xx for ftrace
  this_cpu: Use this_cpu_xx in nmi handling
  this_cpu: Use this_cpu operations in RCU
  this_cpu: Use this_cpu ops for VM statistics
  ...

Fix up trivial (famous last words) global per-cpu naming conflicts in
arch/x86/kvm/svm.c
mm/slab.c

Documentation: rw_lock lessons learned

In recent months, two different network projects erroneously
strayed down the rw_lock path.  Update the Documentation
based upon comments by Eric Dumazet and Paul E. McKenney in
those threads.

Further updates await somebody else with more expertise.

Changes:
  - Merged with extensive content by Stephen Hemminger.
  - Fix one of the comments by Linus Torvalds.

Signed-off-by: William.Allen.Simpson@gmail.com
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

x86, mce: Clean up thermal init by introducing intel_thermal_supported()

It looks better to have a common function. No change in functionality.

Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
LKML-Reference: <4B25FDDC.407@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>

x86, mce: Thermal monitoring depends on APIC being enabled

Add check if APIC is not disabled since thermal
monitoring depends on it. As only apic gets disabled
we should not try to install "thermal monitor" vector,
print out that thermal monitoring is enabled and etc...

Note that "Intel Correct Machine Check Interrupts" already
has such a check.

Also I decided to not add cpu_has_apic check into
mcheck_intel_therm_init since even if it'll call apic_read on
disabled apic -- it's safe here and allow us to save a few code
bytes.

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
LKML-Reference: <4B25FDC2.3020401@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

perf sched: Fix build failure on sparc

Here, tvec->tv_usec is "unsigned int" not "unsigned long".

Since the type is different on every platform, it's probably
best to just use long printf formats and cast.

Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20091213.235622.53363059.davem@davemloft.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: Gart: fix breakage due to IOMMU initialization cleanup

This fixes the following breakage of the commit
75f1cdf1dda92cae037ec848ae63690d91913eac:

- GART systems that don't AGP with broken BIOS and more than 4GB
  memory are forced to use swiotlb. They can allocate aperture by
  hand and use GART.

- GART systems without GAP must disable GART on shutdown.

- swiotlb usage is forced by the boot option,
  gart_iommu_hole_init() is not called, so we disable GART
  early_gart_iommu_check().

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
LKML-Reference: <1260759135-6450-3-git-send-email-fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: Move swiotlb initialization before dma32_free_bootmem

The commit 75f1cdf1dda92cae037ec848ae63690d91913eac introduced a
bug that we initialize SWIOTLB right after dma32_free_bootmem so
we wrongly steal memory area allocated for GART with broken BIOS
earlier.

This moves swiotlb initialization before dma32_free_bootmem().

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: yinghai@kernel.org
LKML-Reference: <1260759135-6450-2-git-send-email-fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: Fix build warning in arch/x86/mm/mmio-mod.c

Stephen Rothwell reported these warnings:

arch/x86/mm/mmio-mod.c: In function 'print_pte':
arch/x86/mm/mmio-mod.c:100: warning: too many arguments for format
arch/x86/mm/mmio-mod.c:106: warning: too many arguments for format

The 'fmt' was left out accidentally.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus <torvalds@linux-foundation.org>
LKML-Reference: <1260775443.18538.16.camel@Joe-Laptop.home>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

x86: Remove usedac in feature-removal-schedule.txt

The reason of removal, "replaced by allowdac and no dac
combination" is incorrect. There is no way to do the same thing
with "allowdac" and "nodac" combination.

The usedac option enables us to stop via_no_dac() setting
forbid_dac to 1. That is, someone who uses VIA bridges can use
DAC with this option even if some of VIA bridges seem to be
broken about DAC.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: WANG Cong <amwang@redhat.com>
Cc: gcosta@redhat.com
LKML-Reference: <20091214104423X.fujita.tomonori@lab.ntt.co.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

perf bench: Add "all" pseudo subsystem and "all" pseudo suite

This patch adds a new "all" pseudo subsystem and an "all" pseudo
suite. These are for testing all subsystem and its all suite, or
all suite of one subsystem.

(This patch also contains a few trivial comment fixes for
bench/* and output style fixes. I judged that there are no
necessity to make them into individual patch.)

Example of use:

| % ./perf bench sched all                      # Test all suites of sched subsystem
| # Running sched/messaging benchmark...
| # 20 sender and receiver processes per group
| # 10 groups == 400 processes run
|
|      Total time: 0.414 [sec]
|
| # Running sched/pipe benchmark...
| # Extecuted 1000000 pipe operations between two tasks
|
|      Total time: 10.999 [sec]
|
|       10.999317 usecs/op
|           90914 ops/sec
|
| % ./perf bench all                            # Test all suites of all subsystems
| # Running sched/messaging benchmark...
| # 20 sender and receiver processes per group
| # 10 groups == 400 processes run
|
|      Total time: 0.420 [sec]
|
| # Running sched/pipe benchmark...
| # Extecuted 1000000 pipe operations between two tasks
|
|      Total time: 11.741 [sec]
|
|       11.741346 usecs/op
|           85169 ops/sec
|
| # Running mem/memcpy benchmark...
| # Copying 1MB Bytes from 0x7ff33e920010 to 0x7ff3401ae010 ...
|
|      808.407437 MB/Sec

Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <1260691319-4683-1-git-send-email-mitake@dcl.info.waseda.ac.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

microblaze: Remove rt_sigsuspend wrapper

Generic rt_sigsuspend syscalls doesn't need any asm wrapper.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: nommu: Don't clobber R11 on syscalls

The noMMU syscall trap has a bug that causes R11 to be zero on return to
userland. Remove the extra "save" of R11 responsible for the bug.

Remove reloading of mode indicator because r11 already contains it.

Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Remove show_tmem function

show_tmem function do nothing that's why I removed it.
There is also cleaning of commented ancient code.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Support for WB cache

Microblaze version 7.20.d is the first MB version which can be run
on MMU linux. Please do not used previous version because they contain
HW bug.
Based on WB support was necessary to redesign whole cache design.
Microblaze versions from 7.20.a don't need to disable IRQ and cache
before working with them that's why there are special structures for it.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Add PVR for Microblaze v7.30.a

Microblaze v7.30.a will have 0x10 version string.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Remove ancient and fake microblaze version from cpu_ver table

We need to continue with next microblaze PVR version that's why
I have to remove that ancient version. These version strings not match
any versions. From Microblaze v5.00.a is possible to use this style.
I believe that none use ancients versions. If yes they will be just
labeled as unknown version.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Remove panic_timeout init value

panic_timeout is in BSS section and it is cleared with BSS section.
This means that value is setup to 0.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Do not count system calls in default

There is not necessary to count system calls that's why
I added DEBUG macro

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Enable DTC compilation

For simpleImage format we need to compile DTC. There is still possibility
to compile only Linux kernel without DTB compiled-in.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Core oprofile configs and hooks

Microblaze uses timer interrupt mode. Microblaze don't have
any performance counter that's why we use just simple implementation.

Signed-off-by: John Williams <john.williams@petalogix.com>
Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Fix level interrupt ACKing

Level interrupts need to be ack'd in the unmask handler, as in powerpc.
Among other issues, this bug causes the system clock to appear to run at
double-speed.

Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Enable futimesat syscall

Futimesat was disabled. LTP testing shows that MB has no
problem with this syscall.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Checking DTS against PVR for write-back cache

WB cache has special flag in PVR. There is added checking mechanism
for PVR and DTS.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Remove duplicity from pgalloc.h

just file cleanup

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Futex support

Microblaze v7.20 provides new lwx, swx instructions which bring
possibility to implement lock rutines.

There are some tests in open posix thread LTP part but current
toolchain not support it.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Adding dev_arch_data functions

The functions, dev_arch_data_set_node and get_node are missing
and are needed by some device drivers such as I2C.

Signed-off-by: John Linn <john.linn@xilinx.com>
Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Fix the heartbeat gpio to be more robust

The device tree handling for the gpio in the heart beat was not handling
the system when there was no gpio and it wasn't working with a newer version
of the gpio core which does not have the is-bidir property.

Signed-off-by: John Linn <john.linn@xilinx.com>
Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Simple __copy_tofrom_user for noMMU

This is first patch which clear part of uaccess.h.
uaccess.h will be clear later.

Signed-off-by: John Williams <john.williams@petalogix.com>
Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Export memory_start for modules

memory_start symbol is needed by kernel modules.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Use lowest-common-denominator default CPU settings

This will ensure that kernels built with no custom CPU settings will still boot
OK on hardware that has additional CPU hardware instructions etc.

Signed-off-by: John Williams <john.williams@petalogix.com>
Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Update default generic DTS

It is generated with longer compatible list

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Enable asm optimization only for HW with barrel-shifter

Asm code uses barrel-shifter instruction that's why we have
to protect cases when HW don't have it.

Reported-by: John Linn <john.linn@xilinx.com>
Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Remove the buggy ALLOW_EDIT_AUTO config option

This was intended to allow manual override of CPU settings copied automatically
to Kconfig.auto, however it's problematic for several reasons, but mostly:

  * If the defconfig doesn't have ALLOW_EDIT_AUTO=y, then it's impossible for
    that defconfig to iverride the values in the kernel source tree.  This leads
    to very strange errors where the kernel is compiled with the wrong CPUFLAGS.

Next patch in the series will back out the default in Kconfig.auto to baseline
settings, so a kernel built with no default values will at least boot on any
hardware, just not make use of additional CPU features.

Signed-off-by: John Williams <john.williams@petalogix.com>
Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Move cache macro from cache.h to cacheflush.h

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: support U-BOOT image format

Two version are generated.
linux.bin.ub which is created from linux.bin file
and
simpleImage.<dts>.ub which is created from stripped simpleImage.<dts> file

Load address and entry point is for microblaze first instruction
which is CONFIG_KERNEL_BASE_ADDR variable.

There is possible for simpleImage format parse _start symbol too.

simpleImage.<dts> is still stripped elf file

I cleared simpleImage.<dts>.unstrip file because there are so big.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Ptrace notifying from signal code

After the signal frame is set up on the userspace stack, ptrace() should
be given an opportunity to single-step into the signal handler

FRV, Blackfin, mn10300 and UM. Worth to look at that patches.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Extend cpuinfo for support write-back caches

There is missing checking agains PVR but this is not important
for now. There are some missing checking too.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Fix cache_line_lenght

We used cache_line as cache_line_lenght. For this reason
we did cache flushing 4 times longer than was necessary.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Detect new 7.20.d version

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Support both levels for reset

Till this patch reset always perform writen to 1.
Now we can use negative logic and perform reset write to 0.

It is opposite level than is currently read from that pin

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Fix announce message for reset gpio

I had to change message for gpio-reset because I always
not to see it. Prefix RESET is big and visible.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Remove saving and restoring before calling signal code

Saving is done in SAVE_STATE macros that's why another save discard
previous saved value.

This change has no effect to normal programs because they ends in any exception
and they are killed. On the other side has effect on debugging.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Fix pfn_valid() for noMMU

Configuring DEBUG_SLAB causes a noMMU kernel to die during initialization
with an invalid virtual address panic in kfree_debugcheck().
The panic is due to an improper definition of pfn_valid().

Signed-off-by: Steven J. Magnani <steve@digidescorp.com>
Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: ftrace: Add dynamic function graph tracer

This patch add support for dynamic function graph tracer.

There is one my expactation that I can do flush_icache after
all code modification. On microblaze is this safer than do
flush for every entry. For icache is used name flush but
correct should be invalidation - this will be fix in upcomming
new cache implementaion and WB support.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: ftrace: add function graph support

For more information look at Documentation/trace folder.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: ftrace: Add dynamic trace support

With dynamic function tracer, by default, _mcount is defined as an
"empty" function, it returns directly without any more action. When
enabling it in user-space, it will jump to a real tracing
function(ftrace_caller), and do the real job for us.

Differ from the static function tracer, dynamic function tracer provides
two functions ftrace_make_call()/ftrace_make_nop() to enable/disable the
tracing of some indicated kernel functions(set_ftrace_filter).

In the kernel version, there is only one "_mcount" string for every
kernel function, so, we just need to match this one in mcount_regex of
scripts/recordmcount.pl.

For more information please look at code and Documentation/trace folder.

Steven ACK that scripts/recordmcount.pl part.

Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: ftrace: enable HAVE_FUNCTION_TRACE_MCOUNT_TEST

Implement MCOUNT_TEST in asm code - it is faster than use
generic code

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: ftrace: add static function tracer

If -pg of gcc is enabled with CONFIG_FUNCTION_TRACER=y. a calling to
_mcount will be inserted into each kernel function. so, there is a
possibility to trace the kernel functions in _mcount.

This patch add the specific _mcount support for static function
tracing. by default, ftrace_trace_function is initialized as
ftrace_stub(an empty function), so, the default _mcount will introduce
very little overhead. after enabling ftrace in user-space, it will jump
to a real tracing function and do static function tracing for us.

Commit message from Wu Zhangjin <wuzhangjin@gmail.com>

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Add TRACE_IRQFLAGS_SUPPORT

There are just two major changes
Renamed local_irq functions to raw_local_irq in irq.c.
Added TRACE_IRQFLAGS_SUPPORT to Kconfig.debug.

Look at Documentation/irqflags-tracing.txt

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: preliminary enabling for LATENCYTOP support in Kconfig

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Lockdep support

Microblaze needs to do lock_init very soon because MMU init calls lock functions.

Here is the explanation from Peter Zijlstra why we have to enable
__ARCH_WANTS_INTERRUPTS_ON_CTSW.

"So we schedule while holding rq->lock (for obvious reasons), but since
lockdep tracks held locks per tasks, we need to transfer the held state
from the prev to the next task. We do this by explicity calling
spin_release(&rq->lock) in context_switch() right before switch_to(),
and calling spin_acquire(&rq->lock) in
finish_task_switch()->finish_lock_switch().

Now, for some reason lockdep thinks that interrupts got enabled over the
context switch (git grep __ARCH_WANTS_INTERRUPTS_ON_CTSW arch/microblaze
doesn't seem to turn up anything).

Clearly trying to acquire the rq->lock with interrupts enabled is a bad
idea and lockdep warns you about this."

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Register timecounter/cyclecounter

It is the same counter as we use as free running one.
I would like to use it for ftrace.

Signed-off-by: Michal Simek <monstr@monstr.eu>

microblaze: Stack trace support

This is working implemetation but the problem is that
Microblaze misses frame pointer that's why is there
big loop which trace and show all addresses which are in text.
It shows addresses which are in registers, etc.

This is problem and this is the reason why all Microblaze
traces are wrong. There is an option to do hacks and trace
the kernel code but this is too complicated.

Signed-off-by: Michal Simek <monstr@monstr.eu>