git.karo-electronics.de Git - karo-tx-linux.git/log
11 years ago mm: rename page struct field helpers
Mel Gorman [Thu, 7 Feb 2013 01:26:59 +0000 (12:26 +1100)]
mm: rename page struct field helpers

The function names page_xchg_last_nid(), page_last_nid() and
reset_page_last_nid() were judged to be inconsistent so rename them to a
struct_field_op style pattern.  As it looked jarring to have
reset_page_mapcount() and page_nid_reset_last() beside each other in
memmap_init_zone(), this patch also renames reset_page_mapcount() to
page_mapcount_reset().  There are others like init_page_count() but as it
is used throughout the arch code a rename would likely cause more
conflicts than it is worth.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago mm: memmap_init_zone() performance improvement
Mike Yoknis [Thu, 7 Feb 2013 01:26:59 +0000 (12:26 +1100)]
mm: memmap_init_zone() performance improvement

We have what we call an "architectural simulator".  It is a computer
program that pretends that it is a computer system.  We use it to test the
firmware before real hardware is available.  We have booted Linux on our
simulator.  As you would expect it takes longer to boot on the simulator
than it does on real hardware.

With my patch - boot time 41 minutes
Without patch - boot time 94 minutes

These numbers do not scale linearly to real hardware, but they indicate to
me a place where Linux can be improved.

memmap_init_zone() loops through every Page Frame Number (pfn), including
pfn values that are within the gaps between existing memory sections.  The
unneeded looping will become a boot performance issue when machines
configure larger memory ranges that will contain larger and more numerous
gaps.

The code will skip across invalid pfn values to reduce the number of loops
executed.
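
A rough userspace sketch of the idea, not the kernel implementation.  It
assumes memory is split into fixed-size sections and uses a hypothetical
section_present() helper and a made-up presence map; the real section size
and validity test differ.  When a pfn falls in a missing section, the loop
jumps to the end of that section instead of testing every pfn in the gap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical layout: 8 pfns per section (real sections are far larger). */
#define PFNS_PER_SECTION 8UL

/* Hypothetical presence map: sections 0 and 3 exist, 1 and 2 form a gap. */
static const bool section_map[] = { true, false, false, true };
#define NR_SECTIONS (sizeof(section_map) / sizeof(section_map[0]))

static bool section_present(unsigned long pfn)
{
    unsigned long nr = pfn / PFNS_PER_SECTION;

    return nr < NR_SECTIONS && section_map[nr];
}

/*
 * Walk [start, end) and return how many valid pfns would be initialised.
 * *iterations receives the number of loop trips actually executed.
 */
static unsigned long init_range(unsigned long start, unsigned long end,
                                unsigned long *iterations)
{
    unsigned long pfn, initialised = 0;

    assert(start <= end);
    *iterations = 0;
    for (pfn = start; pfn < end; pfn++) {
        (*iterations)++;
        if (!section_present(pfn)) {
            /* Move to the last pfn of this section; pfn++ steps past it. */
            pfn = (pfn / PFNS_PER_SECTION + 1) * PFNS_PER_SECTION - 1;
            continue;
        }
        initialised++;  /* stand-in for the per-page initialisation */
    }
    return initialised;
}
```

With this map, walking pfns 0..31 initialises the 16 valid pfns and spends
only one loop trip on each missing section.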

Signed-off-by: Mike Yoknis <mike.yoknis@hp.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago memcg: avoid dangling reference count in creation failure.
Glauber Costa [Thu, 7 Feb 2013 01:26:58 +0000 (12:26 +1100)]
memcg: avoid dangling reference count in creation failure.

When use_hierarchy is enabled, we acquire an extra reference count in our
parent during cgroup creation.  We don't release it, though, if any
failure occurs in the creation process.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reported-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago memcg: increment static branch right after limit set
Glauber Costa [Thu, 7 Feb 2013 01:26:58 +0000 (12:26 +1100)]
memcg: increment static branch right after limit set

We were deferring the kmemcg static branch increment to a later time, due
to a nasty dependency between the cpu_hotplug lock, taken by the jump
label update, and the cgroup_lock.

Now we no longer take the cgroup lock, and we can save ourselves the
trouble.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago memcg: replace cgroup_lock with memcg specific memcg_lock
Glauber Costa [Thu, 7 Feb 2013 01:26:58 +0000 (12:26 +1100)]
memcg: replace cgroup_lock with memcg specific memcg_lock

After the preparation work done in earlier patches, the cgroup_lock can be
trivially replaced with a memcg-specific lock.  This is an automatic
translation at every site where the values involved were queried.

The sites where values are written, however, used to be naturally called
under cgroup_lock.  This is the case for instance in the css_online
callback.  For those, we now need to explicitly add the memcg lock.

With this, all the calls to cgroup_lock outside cgroup core are gone.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago memcg-fast-hierarchy-aware-child-test-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:57 +0000 (12:26 +1100)]
memcg-fast-hierarchy-aware-child-test-fix

tweak comments

Cc: Glauber Costa <glommer@parallels.com>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago memcg: fast hierarchy-aware child test
Glauber Costa [Thu, 7 Feb 2013 01:26:57 +0000 (12:26 +1100)]
memcg: fast hierarchy-aware child test

Currently, we use cgroups' provided list of children to verify if it is
safe to proceed with any value change that is dependent on the cgroup
being empty.

This is less than ideal, because it enforces a dependency on cgroup core
that we would be better off without.  The solution proposed here is to
iterate over the child cgroups and if any is found that is already online,
we bounce and return: we don't really care how many children we have, only
if we have any.

This is also made to be hierarchy aware.  IOW, cgroups with hierarchy
disabled, while they still exist, will be considered for the purpose of
this interface as having no children.
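
A minimal sketch of such a test, assuming a much-simplified, hypothetical
struct group in place of the real memcg and cgroup structures (the field
and function names are illustrative only).  It returns as soon as one
online child is found, and treats a group with hierarchy disabled as
childless.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much-simplified stand-in for a memory cgroup. */
struct group {
    bool use_hierarchy;
    bool online;
    struct group *first_child;
    struct group *next_sibling;
};

/* True if the group has at least one online child that counts. */
static bool group_has_children(const struct group *g)
{
    const struct group *child;

    assert(g != NULL);
    if (!g->use_hierarchy)
        return false;       /* hierarchy disabled: treated as childless */

    for (child = g->first_child; child; child = child->next_sibling)
        if (child->online)
            return true;    /* one is enough, no need to count them */
    return false;
}
```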

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago memcg: split part of memcg creation to css_online
Glauber Costa [Thu, 7 Feb 2013 01:26:57 +0000 (12:26 +1100)]
memcg: split part of memcg creation to css_online

This patch is preparatory work for the later locking rework to get rid of
the big cgroup lock from the memory controller code.

The memory controller uses some tunables to adjust its operation.  Those
tunables are inherited from parent to children upon children
initialization.  For most of them, the value cannot be changed after the
parent has a new child.

cgroup core splits initialization in two phases: css_alloc and css_online.
After css_alloc, the memory allocation and basic initialization are done.
But the new group is not yet visible anywhere, not even for cgroup core
code.  It is only somewhere between css_alloc and css_online that it is
inserted into the internal children lists.  Copying tunable values in
css_alloc will lead to inconsistent values: the children will copy the old
parent values, which can change between the copy and the moment in which
the group is linked to any data structure that can indicate the presence
of children.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago memcg: prevent changes to move_charge_at_immigrate during task attach
Glauber Costa [Thu, 7 Feb 2013 01:26:56 +0000 (12:26 +1100)]
memcg: prevent changes to move_charge_at_immigrate during task attach

In memcg, we use the cgroup_lock basically to synchronize against
attaching new children to a cgroup.  We do this because we rely on cgroup
core to provide us with this information.

We need to guarantee that upon child creation, our tunables are
consistent.  For those, the calls to cgroup_lock() all live in handlers
like mem_cgroup_hierarchy_write(), where we change a tunable in the group
that is hierarchy-related.  For instance, the use_hierarchy flag cannot be
changed if the cgroup already has children.

Furthermore, those values are propagated from the parent to the child when
a new child is created.  So if we don't lock like this, we can end up with
the following situation:

 A                                       B
 memcg_css_alloc()                       mem_cgroup_hierarchy_write()
 copy use hierarchy from parent          change use hierarchy in parent
 finish creation.

This is mainly because during create, we are still not fully connected to
the css tree.  So all iterators and the like that we could use will fail
to show that the group has children.

My observation is that all of creation can proceed in parallel with those
tasks, except value assignment.  So what this patch series does is to first
move all value assignment that is dependent on parent values from
css_alloc to css_online, where the iterators all work, and then we lock
only the value assignment.  This will guarantee that parent and children
always have consistent values.  Together with an online test, that can be
derived from the observation that the refcount of an online memcg can be
made to be always positive, we should be able to synchronize our side
without the cgroup lock.

This patch:

Currently, we rely on the cgroup_lock() to prevent changes to
move_charge_at_immigrate during task migration.  However, this is only
needed because the current strategy keeps checking this value throughout
the whole process.  Since all we need is serialization, one needs only to
guarantee that whatever decision we made in the beginning of a specific
migration is respected throughout the process.

We can achieve this by just saving it in mc. By doing this, no kind of
locking is needed.
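
A minimal sketch of the snapshot approach, assuming a hypothetical
struct move_ctx as a stand-in for mc and an illustrative bit 0 meaning
"move anonymous pages"; none of these names are the kernel's.  The tunable
is read once when migration starts, and every later step consults the
saved copy.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tunable that a writer may change at any time. */
static atomic_ulong move_charge_setting;

/* Hypothetical per-migration context, loosely modelled on "mc". */
struct move_ctx {
    unsigned long flags;    /* snapshot of the tunable, taken once */
};

static void migration_begin(struct move_ctx *ctx)
{
    assert(ctx != NULL);
    ctx->flags = atomic_load(&move_charge_setting);  /* decide up front */
}

/* Later steps look at the snapshot, never at the live tunable. */
static bool migration_moves_anon(const struct move_ctx *ctx)
{
    return (ctx->flags & 1UL) != 0;
}
```

A write to the tunable in the middle of a migration then only affects the
next migration, which is all the serialization that is required here.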

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago memcg-reduce-the-size-of-struct-memcg-244-fold-fix-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:56 +0000 (12:26 +1100)]
memcg-reduce-the-size-of-struct-memcg-244-fold-fix-fix

oops

Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Glauber Costa <glommer@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago memcg-reduce-the-size-of-struct-memcg-244-fold-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:56 +0000 (12:26 +1100)]
memcg-reduce-the-size-of-struct-memcg-244-fold-fix

add check for invalid nid, remove inline

Cc: Glauber Costa <glommer@parallels.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago memcg: reduce the size of struct memcg 244-fold.
Glauber Costa [Thu, 7 Feb 2013 01:26:55 +0000 (12:26 +1100)]
memcg: reduce the size of struct memcg 244-fold.

In order to maintain all the memcg bookkeeping, we need per-node
descriptors, which will in turn contain a per-zone descriptor.

Because we want to statically allocate those, this array ends up being
very big.  Part of the reason is that we allocate something large enough
to hold MAX_NUMNODES, the compile time constant that holds the maximum
number of nodes we would ever consider.

However, we can do better in some cases if the firmware helps us.  This is
true for modern x86 machines; coincidentally one of the architectures in
which MAX_NUMNODES tends to be very big.

By using the firmware-provided maximum number of nodes instead of
MAX_NUMNODES, we can reduce the memory footprint of struct memcg
considerably.  In the extreme case in which we have only one node, this
reduces the size of the structure from ~64k to ~2k.  This is particularly
important because it means that we will no longer resort to the vmalloc
area for the struct memcg on defconfigs.  We also have enough room for an
extra node and still be outside vmalloc.
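
The sizing change can be illustrated with a small userspace sketch.  The
structures, the MAX_NODES value and the payload below are hypothetical and
only show the technique: the per-node array becomes the last member and is
sized from the node count known at run time rather than from the
compile-time maximum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_NODES 1024  /* hypothetical compile-time maximum */

struct per_node_info { long counters[4]; };  /* hypothetical payload */

/* Before: the array is always sized for the worst case. */
struct group_static {
    long limit;
    struct per_node_info nodeinfo[MAX_NODES];
};

/* After: the array is the last member and is sized at allocation time. */
struct group_dynamic {
    long limit;
    struct per_node_info nodeinfo[];
};

static size_t group_size(size_t nr_nodes)
{
    return sizeof(struct group_dynamic) +
           nr_nodes * sizeof(struct per_node_info);
}

/* Returns zeroed storage for nr_nodes nodes, or NULL on failure. */
static struct group_dynamic *group_alloc(size_t nr_nodes)
{
    assert(nr_nodes >= 1 && nr_nodes <= MAX_NODES);
    return calloc(1, group_size(nr_nodes));
}
```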

One also has to keep in mind that with the industry's ability to fit more
processors in a die as fast as the FED prints money, a nodes = 2
configuration is already respectably big.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ying Han <yinghan@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago mm: init: report on last-nid information stored in page->flags
Mel Gorman [Thu, 7 Feb 2013 01:26:55 +0000 (12:26 +1100)]
mm: init: report on last-nid information stored in page->flags

Answering the question "how much space remains in the page->flags" is
time-consuming.  mminit_loglevel can help answer the question but it does
not take last_nid information into account.  This patch corrects it and
while there it corrects the messages related to page flag usage, pgshifts
and node/zone id.  When applied the relevant output looks something like
this but will depend on the kernel configuration.

[    0.000000] mminit::pageflags_layout_widths Section 0 Node 9 Zone 2 Lastnid 9 Flags 25
[    0.000000] mminit::pageflags_layout_shifts Section 19 Node 9 Zone 2 Lastnid 9
[    0.000000] mminit::pageflags_layout_pgshifts Section 0 Node 55 Zone 53 Lastnid 44
[    0.000000] mminit::pageflags_layout_nodezoneid Node/Zone ID: 64 -> 53
[    0.000000] mminit::pageflags_layout_usage location: 64 -> 44 layout 44 -> 25 unused 25 -> 0 page-flags

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago mm: uninline page_xchg_last_nid()
Mel Gorman [Thu, 7 Feb 2013 01:26:55 +0000 (12:26 +1100)]
mm: uninline page_xchg_last_nid()

Andrew Morton pointed out that page_xchg_last_nid() and
reset_page_last_nid() were "getting nuttily large" and asked that it be
investigated.

reset_page_last_nid() is on the page free path and it would be unfortunate
to make that path more expensive than it needs to be.  Due to the internal
use of page_xchg_last_nid() it is already too expensive but fortunately,
it should also be impossible for the page->flags to be updated in parallel
when we call reset_page_last_nid().  Instead of uninlining the function, it
uses a simpler implementation that assumes no parallel updates and should
now be sufficiently short for inlining.

page_xchg_last_nid() is called in paths that are already quite expensive
(splitting huge page, fault handling, migration) and it is reasonable to
uninline.  There was not really a good place to place the function but
mm/mmzone.c was the closest fit IMO.
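
The difference between the two helpers can be sketched in userspace C11.
This is an illustration only: the bit layout (node id in bits 8..15), the
"all ones means no node" convention and the simplified struct page are
assumptions, and C11 atomics stand in for the kernel's own primitives.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical layout: node id stored in bits 8..15 of the flags word. */
#define LAST_NID_SHIFT 8
#define LAST_NID_MASK  0xffUL

struct page { _Atomic unsigned long flags; };

/* Safe against concurrent flag updates: retry with compare-and-swap. */
static int page_xchg_last_nid(struct page *page, int nid)
{
    unsigned long old = atomic_load(&page->flags), new;

    assert(nid >= 0 && (unsigned long)nid <= LAST_NID_MASK);
    do {
        new = old & ~(LAST_NID_MASK << LAST_NID_SHIFT);
        new |= ((unsigned long)nid & LAST_NID_MASK) << LAST_NID_SHIFT;
    } while (!atomic_compare_exchange_weak(&page->flags, &old, new));

    /* On success, old still holds the value that was replaced. */
    return (int)((old >> LAST_NID_SHIFT) & LAST_NID_MASK);
}

/* Cheap variant: only valid if nobody else can touch the flags now. */
static void reset_page_last_nid(struct page *page)
{
    unsigned long flags =
        atomic_load_explicit(&page->flags, memory_order_relaxed);

    flags |= LAST_NID_MASK << LAST_NID_SHIFT;  /* all ones: "no node" */
    atomic_store_explicit(&page->flags, flags, memory_order_relaxed);
}
```

The first helper needs a retry loop and is a reasonable candidate for
moving out of line; the second is a plain read-modify-write and stays
small enough to inline.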

This patch saved 128 bytes of text in the vmlinux file for the kernel
configuration I used for testing automatic NUMA balancing.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago memcg: clean up swap accounting initialization code
Michal Hocko [Thu, 7 Feb 2013 01:26:54 +0000 (12:26 +1100)]
memcg: clean up swap accounting initialization code

Memcg swap accounting is currently enabled by enable_swap_cgroup when the
root cgroup is created.  mem_cgroup_init acts as a memcg subsystem
initializer which sounds like a much better place for enable_swap_cgroup
as well.  We already register memsw files from there so it makes a lot of
sense to merge those two into a single enable_swap_cgroup function.

This patch doesn't introduce any semantic changes.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Zhouping Liu <zliu@redhat.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: CAI Qian <caiqian@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago memcg: do not create memsw files if swap accounting is disabled
Michal Hocko [Thu, 7 Feb 2013 01:26:54 +0000 (12:26 +1100)]
memcg: do not create memsw files if swap accounting is disabled

Zhouping Liu has reported that memsw files are exported even though swap
accounting is runtime disabled if CONFIG_MEMCG_SWAP is enabled.  This
behavior has been introduced by af36f906 (memcg: always create memsw files
if CONFIG_CGROUP_MEM_RES_CTLR_SWAP) and it causes any attempt to open the
file to return EOPNOTSUPP.  Although EOPNOTSUPP should make it clear that
memsw operations are not supported in the given configuration it is fair
to say that this behavior could be quite confusing.

Let's tear memsw files out of default cgroup files and add them only if
the swap accounting is really enabled (either by CONFIG_MEMCG_SWAP_ENABLED
or swapaccount=1 boot parameter).  We can hook into mem_cgroup_init which
is called when the memcg subsystem is initialized and which happens after
boot command line is processed.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Reported-by: Zhouping Liu <zliu@redhat.com>
Tested-by: Zhouping Liu <zliu@redhat.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: CAI Qian <caiqian@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago page-writebackc-subtract-min_free_kbytes-from-dirtyable-memory-fix-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:54 +0000 (12:26 +1100)]
page-writebackc-subtract-min_free_kbytes-from-dirtyable-memory-fix-fix

fix min() warning

Cc: Paul Szabo <psz@maths.usyd.edu.au>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago page-writebackc-subtract-min_free_kbytes-from-dirtyable-memory-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:53 +0000 (12:26 +1100)]
page-writebackc-subtract-min_free_kbytes-from-dirtyable-memory-fix

fix up min_free_kbytes extern declarations

Cc: Paul Szabo <psz@maths.usyd.edu.au>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago page-writeback.c: subtract min_free_kbytes from dirtyable memory
Paul Szabo [Thu, 7 Feb 2013 01:26:53 +0000 (12:26 +1100)]
page-writeback.c: subtract min_free_kbytes from dirtyable memory

When calculating the amount of dirtyable memory, min_free_kbytes should be
subtracted because it is not intended for dirty pages.
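
The arithmetic can be sketched as follows.  This is not the code from
page-writeback.c: the 4 KiB page size and the function names are
assumptions, and clamping the reserve so the result cannot underflow is a
choice made for this sketch.

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* assumes 4 KiB pages */

static_assert(PAGE_SHIFT >= 10, "pages must be at least 1 KiB");

/* Convert a reserve expressed in KiB to a number of pages. */
static unsigned long kbytes_to_pages(unsigned long kbytes)
{
    return kbytes >> (PAGE_SHIFT - 10);
}

/*
 * Pages that may be dirtied: the candidate pages minus the free-memory
 * reserve, clamped so the result never goes below zero.
 */
static unsigned long dirtyable_pages(unsigned long candidate_pages,
                                     unsigned long min_free_kbytes)
{
    unsigned long reserve = kbytes_to_pages(min_free_kbytes);

    if (reserve > candidate_pages)
        reserve = candidate_pages;
    return candidate_pages - reserve;
}
```

For example, with 1000 candidate pages and a 400 KiB reserve (100 pages of
4 KiB), 900 pages remain dirtyable.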

Addresses http://bugs.debian.org/695182

Signed-off-by: Paul Szabo <psz@maths.usyd.edu.au>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago mm/rmap: rename anon_vma_unlock() => anon_vma_unlock_write()
Konstantin Khlebnikov [Thu, 7 Feb 2013 01:26:53 +0000 (12:26 +1100)]
mm/rmap: rename anon_vma_unlock() => anon_vma_unlock_write()

The comment in 4fc3f1d66b1ef0d ("mm/rmap, migration: Make rmap_walk_anon() and
try_to_unmap_anon() more scalable") says:

| Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
| to make it clearer that it's an exclusive write-lock in
| that case - suggested by Rik van Riel.

But that commit renames only anon_vma_lock().

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago mm: Get rid of lockdep whinge on sys_swapon
Minchan Kim [Thu, 7 Feb 2013 01:26:52 +0000 (12:26 +1100)]
mm: Get rid of lockdep whinge on sys_swapon

[1] forgot to initialize the spin_lock, so lockdep is whingeing
about it.  This patch fixes it.

[1] 0f181e0e4, swap: add per-partition lock for swapfile

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago swap-add-per-partition-lock-for-swapfile-fix-fix-fix-fix
Shaohua Li [Thu, 7 Feb 2013 01:26:52 +0000 (12:26 +1100)]
swap-add-per-partition-lock-for-swapfile-fix-fix-fix-fix

Fix build errors like:
> arch/sparc/mm/init_32.c: In function 'show_mem':
> arch/sparc/mm/init_32.c:60:23: error: invalid operands to binary << (have 'atomic_long_t' and 'int')

Signed-off-by: Shaohua Li <shli@fusionio.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago swap: add per-partition lock for swapfile fix
Hugh Dickins [Thu, 7 Feb 2013 01:26:52 +0000 (12:26 +1100)]
swap: add per-partition lock for swapfile fix

I had all cpus spinning in swap_info_get(), for the lock on an area
being swapped off: probably because get_swap_page() forgot to unlock.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago swap: fix "add per-partition lock for swapfile" for nommu
Arnd Bergmann [Thu, 7 Feb 2013 01:26:52 +0000 (12:26 +1100)]
swap: fix "add per-partition lock for swapfile" for nommu

The patch "swap: add per-partition lock for swapfile" made the
nr_swap_pages variable inaccessible but forgot to change the
mm/nommu.c file that uses it.  This does the trivial conversion
to let us build nommu kernels again.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago swap-add-per-partition-lock-for-swapfile-fix-fix
Shaohua Li [Thu, 7 Feb 2013 01:26:51 +0000 (12:26 +1100)]
swap-add-per-partition-lock-for-swapfile-fix-fix

> arch/sparc/mm/init_32.c: In function 'show_mem':
> arch/sparc/mm/init_32.c:60:23: error: invalid operands to binary << (have 'atomic_long_t' and 'int')
>

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago swap: add per-partition lock for swapfile
Shaohua Li [Thu, 7 Feb 2013 01:26:51 +0000 (12:26 +1100)]
swap: add per-partition lock for swapfile

swap_lock is heavily contended when I test swap to 3 fast SSDs (it is even
slightly slower than swap to 2 such SSDs).  The main contention comes from
swap_info_get().  This patch tries to close the gap by adding a new
per-partition lock.

Global data like nr_swapfiles, total_swap_pages, least_priority and
swap_list are still protected by swap_lock.

nr_swap_pages is now an atomic, so it can be changed without swap_lock.  In
theory, it's possible that get_swap_page() finds no swap pages when there
actually are free swap pages, but that does not sound like a big problem.

Accessing partition specific data (like scan_swap_map and so on) is only
protected by swap_info_struct.lock.

Changing swap_info_struct.flags requires holding both swap_lock and
swap_info_struct.lock, because scan_swap_map() will check it.  Reading the
flags is OK with either lock held.

If both swap_lock and swap_info_struct.lock must be held, we always take
the former first to avoid deadlock.

swap_entry_free() can change swap_list.  To delete that code, we add a new
highest_priority_index.  Whenever get_swap_page() is called, we check it.
If it's valid, we use it.

It's a pity get_swap_page() still holds swap_lock().  But in practice,
swap_lock() isn't heavily contended in my test with this patch (or I can
say there are other, much heavier bottlenecks like TLB flush).  And
BTW, it looks like get_swap_page() doesn't really need the lock.  We never
free swap_info[] and we check the SWAP_WRITEOK flag.  The only risk without
the lock is that we could swap out to some low priority swap, but we can
quickly recover after several rounds of swap, so it does not sound like a
big deal to me.  But I'd prefer to fix this if it's a real problem.

"swap: make each swap partition have one address_space" improved the
swapout speed from 1.7G/s to 2G/s.  This patch further improves the speed
to 2.3G/s, so around 15% improvement.  It's a multi-process test, so TLB
flush isn't the biggest bottleneck before the patches.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago swap-make-each-swap-partition-have-one-address_space-fix-fix
Shaohua Li [Thu, 7 Feb 2013 01:26:51 +0000 (12:26 +1100)]
swap-make-each-swap-partition-have-one-address_space-fix-fix

Sasha reported:
Commit "swap: make each swap partition have one address_space" is triggering
a series of warnings on boot:

[    3.446071] ------------[ cut here ]------------
[    3.446664] WARNING: at lib/debugobjects.c:261 debug_print_object+0x8e/0xb0()
[    3.447715] ODEBUG: init active (active state 0) object type: percpu_counter hint:           (null)
[    3.450360] Modules linked in:
[    3.451593] Pid: 1, comm: swapper/0 Tainted: G        W    3.8.0-rc4-next-20130124-sasha-00004-g838a1b4 #266
[    3.454508] Call Trace:
[    3.455248]  [<ffffffff8110d1bc>] warn_slowpath_common+0x8c/0xc0
[    3.455248]  [<ffffffff8110d291>] warn_slowpath_fmt+0x41/0x50
[    3.455248]  [<ffffffff81a2bb5e>] debug_print_object+0x8e/0xb0
[    3.455248]  [<ffffffff81a2c26b>] __debug_object_init+0x20b/0x290
[    3.455248]  [<ffffffff81a2c305>] debug_object_init+0x15/0x20
[    3.455248]  [<ffffffff81a3fbed>] __percpu_counter_init+0x6d/0xe0
[    3.455248]  [<ffffffff81231bdc>] bdi_init+0x1ac/0x270
[    3.455248]  [<ffffffff8618f20b>] swap_setup+0x3b/0x87
[    3.455248]  [<ffffffff8618f257>] ? swap_setup+0x87/0x87
[    3.455248]  [<ffffffff8618f268>] kswapd_init+0x11/0x7c
[    3.455248]  [<ffffffff810020ca>] do_one_initcall+0x8a/0x180
[    3.455248]  [<ffffffff86168cfd>] do_basic_setup+0x96/0xb4
[    3.455248]  [<ffffffff861685ae>] ? loglevel+0x31/0x31
[    3.455248]  [<ffffffff861885cd>] ? sched_init_smp+0x150/0x157
[    3.455248]  [<ffffffff86168ded>] kernel_init_freeable+0xd2/0x14c
[    3.455248]  [<ffffffff83cade10>] ? rest_init+0x140/0x140
[    3.455248]  [<ffffffff83cade19>] kernel_init+0x9/0xf0
[    3.455248]  [<ffffffff83d5727c>] ret_from_fork+0x7c/0xb0
[    3.455248]  [<ffffffff83cade10>] ? rest_init+0x140/0x140
[    3.455248] ---[ end trace 0b176d5c0f21bffb ]---

Initialize swap space backing_dev_info once to avoid the warning.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago swap-make-each-swap-partition-have-one-address_space-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:50 +0000 (12:26 +1100)]
swap-make-each-swap-partition-have-one-address_space-fix

revert unneeded change to  __add_to_swap_cache

Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago swap: make each swap partition have one address_space
Shaohua Li [Thu, 7 Feb 2013 01:26:50 +0000 (12:26 +1100)]
swap: make each swap partition have one address_space

When I use several fast SSDs for swap, swapper_space.tree_lock is heavily
contended.  This makes each swap partition have one address_space to
reduce the lock contention.  There is an array of address_space for swap.
The swap entry type is the index to the array.
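
A simplified sketch of the lookup, for illustration only.  The entry
layout below (5 type bits at the top of a 64-bit value) and the reduced
struct address_space are assumptions; the identifiers echo the patch's
wording, but this is not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SWAPFILES  32   /* hypothetical: 5 bits of type */
#define SWP_TYPE_SHIFT 59   /* hypothetical: type lives in the top 5 bits */

typedef struct { uint64_t val; } swp_entry_t;

/* Reduced stand-in; the real structure carries its own lock and tree. */
struct address_space { unsigned long nrpages; };

/* One address_space per swap partition instead of a single global one. */
static struct address_space swapper_spaces[MAX_SWAPFILES];

static swp_entry_t swp_entry(unsigned int type, uint64_t offset)
{
    swp_entry_t e;

    assert(type < MAX_SWAPFILES);
    assert(offset < (UINT64_C(1) << SWP_TYPE_SHIFT));
    e.val = ((uint64_t)type << SWP_TYPE_SHIFT) | offset;
    return e;
}

static unsigned int swp_type(swp_entry_t entry)
{
    return (unsigned int)(entry.val >> SWP_TYPE_SHIFT);
}

/* The swap entry type selects the address_space, and with it the lock. */
static struct address_space *swap_address_space(swp_entry_t entry)
{
    return &swapper_spaces[swp_type(entry)];
}
```

Entries on different partitions resolve to different array slots, so
operations on them no longer serialize on one shared lock.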

In my test with 3 SSDs, this increases the swapout throughput by 20%.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago mm: don't inline page_mapping()
Shaohua Li [Thu, 7 Feb 2013 01:26:50 +0000 (12:26 +1100)]
mm: don't inline page_mapping()

According to akpm, this saves 1/2k text and makes things simple for the
next patch.

Numbers from Minchan:

add/remove: 1/0 grow/shrink: 6/22 up/down: 92/-516 (-424)
function                                     old     new   delta
page_mapping                                   -      48     +48
do_task_stat                                2292    2308     +16
page_remove_rmap                             240     248      +8
load_elf_binary                             4500    4508      +8
update_queue                                 532     536      +4
scsi_probe_and_add_lun                      2892    2896      +4
lookup_fast                                  644     648      +4
vcs_read                                    1040    1036      -4
__ip_route_output_key                       1904    1900      -4
ip_route_input_noref                        2508    2500      -8
shmem_file_aio_read                          784     772     -12
__isolate_lru_page                           272     256     -16
shmem_replace_page                           708     688     -20
mark_buffer_dirty                            228     208     -20
__set_page_dirty_buffers                     240     220     -20
__remove_mapping                             276     256     -20
update_mmu_cache                             500     476     -24
set_page_dirty_balance                        92      68     -24
set_page_dirty                               172     148     -24
page_evictable                                88      64     -24
page_cache_pipe_buf_steal                    248     224     -24
clear_page_dirty_for_io                      340     316     -24
test_set_page_writeback                      400     372     -28
test_clear_page_writeback                    516     488     -28
invalidate_inode_page                        156     128     -28
page_mkclean                                 432     400     -32
flush_dcache_page                            360     328     -32
__set_page_dirty_nobuffers                   324     280     -44
shrink_page_list                            2412    2356     -56

Signed-off-by: Shaohua Li <shli@fusionio.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years ago mm: numa: Cleanup flow of transhuge page migration
Hugh Dickins [Thu, 7 Feb 2013 01:26:49 +0000 (12:26 +1100)]
mm: numa: Cleanup flow of transhuge page migration

When correcting commit 04fa5d6a ("mm: migrate: check page_count of THP
before migrating") Hugh Dickins noted that the control flow for transhuge
migration was difficult to follow.  Unconditionally calling put_page() in
numamigrate_isolate_page() made the failure paths of both
migrate_misplaced_transhuge_page() and migrate_misplaced_page() more
complex than they should be.  Further, he was extremely wary that an
unlock_page() should ever happen after a put_page() even if the put_page()
should never be the final put_page.

Hugh implemented the following cleanup to simplify the path by calling
putback_lru_page() inside numamigrate_isolate_page() if it failed to
isolate and always calling unlock_page() within
migrate_misplaced_transhuge_page().  There is no functional change after
this patch is applied but the code is easier to follow and unlock_page()
always happens before put_page().

[mgorman@suse.de: changelog only]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: Fold page->_last_nid into page->flags where possible
Peter Zijlstra [Thu, 7 Feb 2013 01:26:49 +0000 (12:26 +1100)]
mm: Fold page->_last_nid into page->flags where possible

page->_last_nid fits into page->flags on 64-bit.  The unlikely 32-bit NUMA
configuration with NUMA Balancing will still need an extra page field.  As
Peter notes "Completely dropping 32bit support for CONFIG_NUMA_BALANCING
would simplify things, but it would also remove the warning if we grow
enough 64bit only page-flags to push the last-cpu out."
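
As a rough illustration of what folding a node id into a flags word means,
the sketch below packs a small integer into the upper bits of a 64-bit
value.  The DEMO_* names, the 6-bit width and the bit position are
assumptions made for the example, not the layout used by this patch.

```c
#include <stdint.h>

/* Illustrative layout only: width and position are assumptions. */
#define DEMO_NID_WIDTH  6                       /* room for 64 node ids */
#define DEMO_NID_SHIFT  (64 - DEMO_NID_WIDTH)   /* top bits of the word */
#define DEMO_NID_MASK   ((UINT64_C(1) << DEMO_NID_WIDTH) - 1)

/* Read the node id stored in the upper bits of flags. */
static inline int demo_page_nid_last(uint64_t flags)
{
    return (int)((flags >> DEMO_NID_SHIFT) & DEMO_NID_MASK);
}

/* Return flags with the node id replaced and all other bits preserved. */
static inline uint64_t demo_page_nid_set_last(uint64_t flags, int nid)
{
    flags &= ~(DEMO_NID_MASK << DEMO_NID_SHIFT);
    flags |= ((uint64_t)nid & DEMO_NID_MASK) << DEMO_NID_SHIFT;
    return flags;
}
```

On a 32-bit word there may not be enough spare bits for such a field,
which is why a separate page field can still be needed there.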

[mgorman@suse.de: minor modifications]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: move page flags layout to separate header
Peter Zijlstra [Thu, 7 Feb 2013 01:26:49 +0000 (12:26 +1100)]
mm: move page flags layout to separate header

This is a preparation patch for moving page->_last_nid into page->flags
that moves page flag layout information to a separate header.  This patch
is necessary because otherwise there would be a circular dependency
between mm_types.h and mm.h.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: numa: handle side-effects in count_vm_numa_events() for !CONFIG_NUMA_BALANCING
Mel Gorman [Thu, 7 Feb 2013 01:26:48 +0000 (12:26 +1100)]
mm: numa: handle side-effects in count_vm_numa_events() for !CONFIG_NUMA_BALANCING

The current definition of count_vm_numa_events() is wrong for
!CONFIG_NUMA_BALANCING as the following would miss the side-effect.

count_vm_numa_events(NUMA_FOO, bar++);

There are no such users of count_vm_numa_events() but this patch fixes it
as it is a potential pitfall.  Ideally both would be converted to static
inline but NUMA_PTE_UPDATES is not defined if !CONFIG_NUMA_BALANCING and
creating dummy constants just to have a static inline would be similarly
clumsy.
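
The pitfall can be sketched with two stand-in macros.  Both names below
are hypothetical and neither is the kernel's definition; they only show
how a stub macro that discards its arguments also discards their
side-effects, and how a (void) cast keeps them.

```c
/* Broken stub: the arguments vanish, so "bar++" is never executed. */
#define demo_count_events_dropping(x, y)    do { } while (0)

/* Safer stub: still counts nothing, but evaluates y exactly once. */
#define demo_count_events_evaluating(x, y)  do { (void)(y); } while (0)
```

With the first form, demo_count_events_dropping(NUMA_FOO, bar++) leaves
bar unchanged; with the second form bar is incremented as the caller
expects.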

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: numa: take THP into account when migrating pages for NUMA balancing
Mel Gorman [Thu, 7 Feb 2013 01:26:48 +0000 (12:26 +1100)]
mm: numa: take THP into account when migrating pages for NUMA balancing

Wanpeng Li pointed out that numamigrate_isolate_page() assumes that only
one base page is being migrated when in fact it can also be checking THP.
The consequences are that a migration will be attempted when a target node
is nearly full and fail later.  It's unlikely to be user-visible but it
should be fixed.  While we are there, migrate_balanced_pgdat() should
treat nr_migrate_pages as an unsigned long as it is treated as a
watermark.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Suggested-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: numa: fix minor typo in numa_next_scan
Mel Gorman [Thu, 7 Feb 2013 01:26:48 +0000 (12:26 +1100)]
mm: numa: fix minor typo in numa_next_scan

s/me/be/ and clarify the comment a bit when we're changing it anyway.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Suggested-by: Simon Jeons <simon.jeons@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: remove unused memclear_highpage_flush()
Kirill A. Shutemov [Thu, 7 Feb 2013 01:26:47 +0000 (12:26 +1100)]
mm: remove unused memclear_highpage_flush()

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agousb: forbid memory allocation with I/O during bus reset
Ming Lei [Thu, 7 Feb 2013 01:26:47 +0000 (12:26 +1100)]
usb: forbid memory allocation with I/O during bus reset

If a storage interface or USB network interface (the iSCSI case) exists in
the current configuration, memory allocation with GFP_KERNEL during
usb_device_reset() might trigger an I/O transfer on the storage interface
itself and cause a deadlock: 'us->dev_mutex' is held in .pre_reset(), so
the storage interface can't do I/O transfer when the reset is triggered by
another interface, and the error handling can't be completed if the reset
is triggered by the storage interface itself (error handling path).

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Reviewed-by: Jiri Kosina <jkosina@suse.cz>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agopm / runtime: force memory allocation with no I/O during Runtime PM callback
Ming Lei [Thu, 7 Feb 2013 01:26:47 +0000 (12:26 +1100)]
pm / runtime: force memory allocation with no I/O during Runtime PM callback

Apply the introduced memalloc_noio_save() and memalloc_noio_restore() to
force memory allocation with no I/O during the runtime_resume and
runtime_suspend callbacks on devices with the 'memalloc_noio' flag set.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agonet/core: apply pm_runtime_set_memalloc_noio on network devices
Ming Lei [Thu, 7 Feb 2013 01:26:46 +0000 (12:26 +1100)]
net/core: apply pm_runtime_set_memalloc_noio on network devices

A deadlock might be caused by allocating memory with GFP_KERNEL in the
runtime_resume and runtime_suspend callbacks of network devices in the
iSCSI case, so mark each network device and its ancestors as
'memalloc_noio' with the introduced pm_runtime_set_memalloc_noio().

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoblock/genhd.c: apply pm_runtime_set_memalloc_noio on block devices
Ming Lei [Thu, 7 Feb 2013 01:26:46 +0000 (12:26 +1100)]
block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices

Apply the introduced pm_runtime_set_memalloc_noio on block devices so that
the PM core will teach mm not to allocate memory with GFP_IOFS when calling
the runtime_resume and runtime_suspend callbacks for block devices and
their ancestors.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agopm / runtime: introduce pm_runtime_set_memalloc_noio()
Ming Lei [Thu, 7 Feb 2013 01:26:46 +0000 (12:26 +1100)]
pm / runtime: introduce pm_runtime_set_memalloc_noio()

Introduce the flag memalloc_noio in 'struct dev_pm_info' to help the PM
core teach mm not to allocate memory with the GFP_KERNEL flag, in order to
avoid a probable deadlock.

As explained in the comment, any GFP_KERNEL allocation inside
runtime_resume() or runtime_suspend() on any device in the path from a
block or network device to the root device in the device tree may cause a
deadlock.  The introduced pm_runtime_set_memalloc_noio() sets or clears
the flag on the devices in that path recursively.
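
A minimal sketch of the "set" half of that walk, assuming a simplified
device node with only a parent pointer.  struct demo_device and
demo_set_memalloc_noio() are hypothetical names for illustration and are
not the kernel's struct device or the function added by this patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified device node; not the kernel's struct device. */
struct demo_device {
    struct demo_device *parent;   /* NULL for the root device */
    bool memalloc_noio;
};

/* Set the flag on dev and on every ancestor up to the root. */
static void demo_set_memalloc_noio(struct demo_device *dev)
{
    for (; dev != NULL; dev = dev->parent)
        dev->memalloc_noio = true;
}
```

Clearing is not the mirror image of this loop: an ancestor shared with
another flagged device has to keep its flag, so a clear operation would
need to inspect the other children before moving further up.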

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: teach mm by current context info to not do I/O during memory allocation
Ming Lei [Thu, 7 Feb 2013 01:26:45 +0000 (12:26 +1100)]
mm: teach mm by current context info to not do I/O during memory allocation

This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
'struct task_struct'), so that the flag can be set by one task to avoid
doing I/O inside memory allocation in the task's context.

The patch tries to solve one deadlock problem caused by block device, and
the problem may happen at least in the below situations:

- during block device runtime resume, if memory allocation with
  GFP_KERNEL is called inside runtime resume callback of any one of its
  ancestors(or the block device itself), the deadlock may be triggered
  inside the memory allocation since it might not complete until the block
  device becomes active and the involved page I/O finishes.  The situation
  is pointed out first by Alan Stern.  It is not a good approach to
  convert all GFP_KERNEL[1] in the path into GFP_NOIO because several
  subsystems may be involved(for example, PCI, USB and SCSI may be
  involved for usb mass storage device, network devices involved too in
  the iSCSI case)

- during block device runtime suspend, because runtime resume needs to
  wait for completion of concurrent runtime suspend.

- during error handling of usb mass storage device, USB bus reset will
  be put on the device, so there shouldn't be any memory allocation with
  GFP_KERNEL during USB bus reset, otherwise a deadlock similar to the
  above may be triggered.  Unfortunately, any usb device may include one
  mass storage interface in theory, so it requires all usb interface
  drivers to handle the situation.  In fact, most usb drivers don't know
  how to handle bus reset on the device and don't provide .pre_reset() and
  .post_reset() callback at all, so USB core has to unbind and bind driver
  for these devices.  So it is still not practical to resort to GFP_NOIO
  for solving the problem.

Also the introduced solution can be used by block subsystem or block
drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
actual I/O transfer.

It is not a good idea to convert all these GFP_KERNEL in the affected path
into GFP_NOIO because these functions doing that may be implemented as
library and will be called in many other contexts.

In fact, memalloc_noio_flags() allows some of the current static GFP_NOIO
allocations to be converted back into GFP_KERNEL in other non-affected
contexts.  At least almost all GFP_NOIO allocations in the USB subsystem
can be converted into GFP_KERNEL after applying this approach, so that
allocation with GFP_NOIO generally only happens in runtime resume, bus
reset and block I/O transfer contexts.
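
The idea can be modelled in user space as below.  This is a sketch under
stated assumptions: the DEMO_* bit values are invented, demo_task_flags
stands in for the 'flags' field of the current task, and the demo_*
helpers only imitate the save/restore/flags trio named in this series;
they are not the kernel implementation.  The model clears only the I/O
bit.

```c
/* Illustrative bit values; the kernel's real constants differ. */
#define DEMO_GFP_IO            0x1u
#define DEMO_GFP_FS            0x2u
#define DEMO_GFP_KERNEL        (DEMO_GFP_IO | DEMO_GFP_FS)
#define DEMO_PF_MEMALLOC_NOIO  0x100u

/* Stand-in for the 'flags' field of the current task. */
static unsigned int demo_task_flags;

/* Strip the I/O bit from an allocation mask while the task flag is set. */
static unsigned int demo_memalloc_noio_flags(unsigned int gfp)
{
    if (demo_task_flags & DEMO_PF_MEMALLOC_NOIO)
        gfp &= ~DEMO_GFP_IO;
    return gfp;
}

/* Set the task flag and return its previous state so calls can nest. */
static unsigned int demo_memalloc_noio_save(void)
{
    unsigned int old = demo_task_flags & DEMO_PF_MEMALLOC_NOIO;

    demo_task_flags |= DEMO_PF_MEMALLOC_NOIO;
    return old;
}

/* Put the task flag back to the state returned by the matching save. */
static void demo_memalloc_noio_restore(unsigned int old)
{
    demo_task_flags = (demo_task_flags & ~DEMO_PF_MEMALLOC_NOIO) | old;
}
```

Because the restriction follows the task rather than the call site,
library code can keep passing its usual mask and is only restricted while
a caller higher up has done a save.  Returning the old state from save is
what keeps nested save/restore pairs from clearing the flag too early.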

[1], several GFP_KERNEL allocation examples in runtime resume path

- pci subsystem
acpi_os_allocate
<-acpi_ut_allocate
<-ACPI_ALLOCATE_ZEROED
<-acpi_evaluate_object
<-__acpi_bus_set_power
<-acpi_bus_set_power
<-acpi_pci_set_power_state
<-platform_pci_set_power_state
<-pci_platform_power_transition
<-__pci_complete_power_transition
<-pci_set_power_state
<-pci_restore_standard_config
<-pci_pm_runtime_resume
- usb subsystem
usb_get_status
<-finish_port_resume
<-usb_port_resume
<-generic_resume
<-usb_resume_device
<-usb_resume_both
<-usb_runtime_resume

- some individual usb drivers
usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....

That is just what I have found.  Unfortunately, such allocations can
currently only be found by manual inspection, and there are probably many
that have not been found, since any function in the resume path (call
tree) may allocate memory with GFP_KERNEL.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: don't wait on congested zones in balance_pgdat()
Zlatko Calusic [Thu, 7 Feb 2013 01:26:45 +0000 (12:26 +1100)]
mm: don't wait on congested zones in balance_pgdat()

Commit 92df3a72 (mm: vmscan: throttle reclaim if encountering too many
dirty pages under writeback) introduced waiting on congested zones
based on a sane algorithm in shrink_inactive_list(). What this means
is that there's no more need for throttling and additional heuristics
in balance_pgdat(). So, let's remove it and tidy up the code.

Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm-memory-failurec-fix-wrong-num_poisoned_pages-in-handling-memory-error-on-thp-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:45 +0000 (12:26 +1100)]
mm-memory-failurec-fix-wrong-num_poisoned_pages-in-handling-memory-error-on-thp-fix

tweak comment

Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/memory-failure.c: fix wrong num_poisoned_pages in handling memory error on thp
Naoya Horiguchi [Thu, 7 Feb 2013 01:26:44 +0000 (12:26 +1100)]
mm/memory-failure.c: fix wrong num_poisoned_pages in handling memory error on thp

num_poisoned_pages counts up the number of pages isolated by memory
errors.  But for thp, only one subpage is isolated because memory error
handler splits it, so it's wrong to add (1 << compound_trans_order).

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/memory-failure.c: clean up soft_offline_page()
Naoya Horiguchi [Thu, 7 Feb 2013 01:26:44 +0000 (12:26 +1100)]
mm/memory-failure.c: clean up soft_offline_page()

Currently soft_offline_page() is hard to maintain because it has many
return points and goto statements.  All of this mess comes from
get_any_page().  This function should only get page refcount as the name
implies, but it does some page isolating actions like SetPageHWPoison()
and dequeuing hugepage.  This patch corrects it and introduces some
internal subroutines to make soft offlining code more readable and
maintainable.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-failure-use-num_poisoned_pages-instead-of-mce_bad_pages-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:44 +0000 (12:26 +1100)]
memory-failure-use-num_poisoned_pages-instead-of-mce_bad_pages-fix

fix mm/sparse.c

Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-failure: use num_poisoned_pages instead of mce_bad_pages
Xishi Qiu [Thu, 7 Feb 2013 01:26:43 +0000 (12:26 +1100)]
memory-failure: use num_poisoned_pages instead of mce_bad_pages

Since MCE is an x86 concept, and this code is in mm/, it would be better
to use the name num_poisoned_pages instead of mce_bad_pages.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-failure: do code refactor of soft_offline_page()
Xishi Qiu [Thu, 7 Feb 2013 01:26:43 +0000 (12:26 +1100)]
memory-failure: do code refactor of soft_offline_page()

There are too many return points randomly intermingled with some "goto
done" return points.  So adjust the function structure, one for the
success path, the other for the failure path.  Use atomic_long_inc instead
of atomic_long_add.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-failure: fix an error of mce_bad_pages statistics
Xishi Qiu [Thu, 7 Feb 2013 01:26:43 +0000 (12:26 +1100)]
memory-failure: fix an error of mce_bad_pages statistics

Running $ echo paddr > /sys/devices/system/memory/soft_offline_page to
offline a *free* page increments the value of mce_bad_pages and sets the
HWPoison flag on the page, but the page is still managed by the page buddy
allocator.

$ cat /proc/meminfo | grep HardwareCorrupted shows the value.

If we offline the same page again, the value of mce_bad_pages is
incremented *again*, which means the value is now incorrect.  Assume the
page is still free during this short time.

soft_offline_page()
  get_any_page()
    "else if (is_free_buddy_page(p))" branch return 0
      "goto done";
        "atomic_long_add(1, &mce_bad_pages);"

This patch:

Move the poisoned page check to the beginning of the function in order to
fix the error.
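
The effect of checking first can be sketched as follows.  struct
demo_page, demo_num_bad_pages and the -1 return value are hypothetical
simplifications and do not reproduce soft_offline_page() itself.

```c
#include <stdbool.h>

/* Hypothetical page model; the real code works on struct page. */
struct demo_page {
    bool hwpoison;
};

static long demo_num_bad_pages;

/* Returns 0 if the page was newly offlined, -1 if it was already poisoned. */
static int demo_soft_offline(struct demo_page *p)
{
    /* Check first, so a repeated request cannot bump the counter again. */
    if (p->hwpoison)
        return -1;

    p->hwpoison = true;
    demo_num_bad_pages++;
    return 0;
}
```

Offlining the same page twice in this model leaves the counter at 1,
whereas counting before the check would leave it at 2.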

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: remove MIGRATE_ISOLATE check in hotpath
Minchan Kim [Thu, 7 Feb 2013 01:26:42 +0000 (12:26 +1100)]
mm: remove MIGRATE_ISOLATE check in hotpath

Several functions test MIGRATE_ISOLATE, and some of them are in hot paths,
but MIGRATE_ISOLATE is used only if CONFIG_MEMORY_ISOLATION is enabled
(i.e., by CMA, memory-hotplug and memory-failure), which is not a common
config option.  So let's not add unnecessary overhead and code when
CONFIG_MEMORY_ISOLATION is not enabled.
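
One common way to get that effect is a helper that becomes a constant when
the option is off.  The sketch below assumes a made-up
DEMO_CONFIG_MEMORY_ISOLATION switch and demo_* names; it shows the
pattern, not the exact code of this patch.

```c
#include <stdbool.h>

enum demo_migratetype { DEMO_MIGRATE_MOVABLE, DEMO_MIGRATE_ISOLATE };

#ifdef DEMO_CONFIG_MEMORY_ISOLATION
/* Option enabled: do the real comparison. */
static inline bool demo_is_migrate_isolate(int migratetype)
{
    return migratetype == DEMO_MIGRATE_ISOLATE;
}
#else
/* Option disabled: constant false lets the compiler drop guarded code. */
static inline bool demo_is_migrate_isolate(int migratetype)
{
    (void)migratetype;
    return false;
}
#endif
```

Callers can then write "if (demo_is_migrate_isolate(mt))" without an
#ifdef of their own, and the branch costs nothing when the option is
disabled.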

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: increase totalram_pages when free pages allocated by bootmem allocator
Jiang Liu [Thu, 7 Feb 2013 01:26:42 +0000 (12:26 +1100)]
mm: increase totalram_pages when free pages allocated by bootmem allocator

Function put_page_bootmem() is used to free pages allocated by bootmem
allocator, so it should increase totalram_pages when freeing pages into
the buddy system.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: set zone->present_pages to number of existing pages in the zone
Jiang Liu [Thu, 7 Feb 2013 01:26:42 +0000 (12:26 +1100)]
mm: set zone->present_pages to number of existing pages in the zone

Now all users of "number of pages managed by the buddy system" have been
converted to use zone->managed_pages, so set zone->present_pages to what
it should be:

present_pages = spanned_pages - absent_pages;
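
As a small worked example of that formula, with a hypothetical helper and
made-up page counts:

```c
/* Hypothetical helper mirroring the formula above. */
static unsigned long demo_present_pages(unsigned long spanned_pages,
                                        unsigned long absent_pages)
{
    /* Pages that physically exist = pages spanned minus pages in holes. */
    return spanned_pages - absent_pages;
}
```

A zone spanning 1000 pages with 24 pages of holes would have 976 present
pages in this model.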

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm: use zone->present_pages instead of zone->managed_pages where appropriate
Jiang Liu [Thu, 7 Feb 2013 01:26:41 +0000 (12:26 +1100)]
mm: use zone->present_pages instead of zone->managed_pages where appropriate

Now we have zone->managed_pages for "pages managed by the buddy system in
the zone", so replace zone->present_pages with zone->managed_pages if what
the user really wants is number of allocatable pages.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblo...
Tang Chen [Thu, 7 Feb 2013 01:26:41 +0000 (12:26 +1100)]
mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region().

The definition of struct movablecore_map is protected by
CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region() is
not.  So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of
movablecore_map in memblock_overlaps_region().

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoacpi-memory-hotplug-support-getting-hotplug-info-from-srat-fix-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:41 +0000 (12:26 +1100)]
acpi-memory-hotplug-support-getting-hotplug-info-from-srat-fix-fix

use strcmp()

Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoacpi-memory-hotplug-support-getting-hotplug-info-from-srat-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:40 +0000 (12:26 +1100)]
acpi-memory-hotplug-support-getting-hotplug-info-from-srat-fix

mm/page_alloc.c: In function 'cmdline_parse_movablemem_map':
mm/page_alloc.c:5372: warning: comparison of distinct pointer types lacks a cast

not the right fix, but I'm tired of the warning

Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoacpi, memory-hotplug: support getting hotplug info from SRAT
Tang Chen [Thu, 7 Feb 2013 01:26:40 +0000 (12:26 +1100)]
acpi, memory-hotplug: support getting hotplug info from SRAT

We now provide an option for users who don't want to specify physical
memory address in kernel commandline.

        /*
         * For movablemem_map=acpi:
         *
         * SRAT:                |_____| |_____| |_________| |_________| ......
         * node id:                0       1         1           2
         * hotpluggable:           n       y         y           n
         * movablemem_map:              |_____| |_________|
         *
         * Using movablemem_map, we can prevent memblock from allocating memory
         * on ZONE_MOVABLE at boot time.
         */

So the user just specifies movablemem_map=acpi, and the kernel will use
hotpluggable info in SRAT to determine which memory ranges should be set
as ZONE_MOVABLE.

NOTE: Using this option will degrade NUMA performance because the whole
      node will be set as ZONE_MOVABLE, and the kernel cannot use memory on
      it.  If users don't want to lose NUMA performance, just don't use it.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoacpi-memory-hotplug-extend-movablemem_map-ranges-to-the-end-of-node-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:40 +0000 (12:26 +1100)]
acpi-memory-hotplug-extend-movablemem_map-ranges-to-the-end-of-node-fix

clean up code, fix build warning

Cc: "Brown, Len" <len.brown@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoacpi, memory-hotplug: extend movablemem_map ranges to the end of node
Tang Chen [Thu, 7 Feb 2013 01:26:39 +0000 (12:26 +1100)]
acpi, memory-hotplug: extend movablemem_map ranges to the end of node

When implementing movablemem_map boot option, we introduced an array
movablemem_map.map[] to store the memory ranges to be set as ZONE_MOVABLE.

Since ZONE_MOVABLE is the last zone of a node, if user didn't specify the
whole node memory range, we need to extend it to the node end so that we
can use it to prevent memblock from allocating memory in the ranges user
didn't specify.

We now implement movablemem_map boot option like this:
        /*
         * For movablemem_map=nn[KMG]@ss[KMG]:
         *
         * SRAT:                |_____| |_____| |_________| |_________| ......
         * node id:                0       1         1           2
         * user specified:                |__|                 |___|
         * movablemem_map:                |___| |_________|    |______| ......
         *
         * Using movablemem_map, we can prevent memblock from allocating memory
         * on ZONE_MOVABLE at boot time.
         *
         * NOTE: In this case, SRAT info will be ignored.
         */
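
The extension step for a single range can be sketched as below.  struct
demo_range and demo_extend_to_node_end() are hypothetical names, ranges
are half-open [start, end), and only the simple case of one user range
against one node range is handled.

```c
/* Half-open range [start, end), e.g. in page frame numbers. */
struct demo_range {
    unsigned long start;
    unsigned long end;
};

/* Extend a user range that begins inside a node so it ends at the node end. */
static struct demo_range demo_extend_to_node_end(struct demo_range user,
                                                 struct demo_range node)
{
    if (user.start >= node.start && user.start < node.end &&
        user.end < node.end)
        user.end = node.end;
    return user;
}
```

A user range that starts outside the node, or that already reaches the
node end, is returned unchanged.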

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoacpi, memory-hotplug: parse SRAT before memblock is ready fix
Michal Hocko [Thu, 7 Feb 2013 01:26:39 +0000 (12:26 +1100)]
acpi, memory-hotplug: parse SRAT before memblock is ready fix

allnoconfig complains:
arch/x86/kernel/setup.c: In function `setup_arch':
arch/x86/kernel/setup.c:917: error: implicit declaration of function `early_parse_srat'

because early_parse_srat is not declared for !CONFIG_ACPI. Moreover it
is defined only for CONFIG_ACPI_NUMA.

I am not sure what is the correct way to fix this but I guess that
providing an empty definition for !CONFIG_ACPI_NUMA is OK.
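
The usual shape of such a fix is a no-op stub in the header.  In the
sketch below the DEMO_CONFIG_ACPI_NUMA switch is invented for the example
and the void(void) signature of early_parse_srat() is an assumption.

```c
#ifdef DEMO_CONFIG_ACPI_NUMA
/* Option enabled: the real implementation is provided elsewhere. */
void early_parse_srat(void);
#else
/* Option disabled: empty stub so callers build without an #ifdef. */
static inline void early_parse_srat(void)
{
}
#endif
```

With the stub in place, the caller can invoke early_parse_srat()
unconditionally, and the call compiles away when the option is off.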

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoacpi, memory-hotplug: parse SRAT before memblock is ready
Tang Chen [Thu, 7 Feb 2013 01:26:39 +0000 (12:26 +1100)]
acpi, memory-hotplug: parse SRAT before memblock is ready

On Linux, pages used by the kernel cannot be migrated.  As a result, if a
memory range is used by the kernel, it cannot be hot-removed.  So if we
want to hot-remove memory, we should prevent the kernel from using it.

The current way to prevent this is to specify a memory range with the
movablemem_map boot option and set it as ZONE_MOVABLE.

But while the system is booting, memblock allocates memory and reserves
it for the kernel.  memblock is already working before we parse SRAT and
learn the node memory ranges, so it may allocate memory in ranges that
are to be set as ZONE_MOVABLE.  That memory can be used by the kernel and
never be freed.

So, let's parse SRAT before memblock is first called, which is early
enough.

The first call of memblock_find_in_range_node() is in:
setup_arch()
 |-->setup_real_mode()

so this patch adds a function early_parse_srat() to parse SRAT, and calls
it before setup_real_mode() is called.

NOTE:

1) Do not clear numa_nodes_parsed in numa_init() because SRAT was
   parsed earlier.

2) I don't know why the count of memory affinities parsed from SRAT is
   used as the return value of the original acpi_numa_init().  So I add a
   static variable srat_mem_cnt to remember this count and use it as the
   return value of the new acpi_numa_init().

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agopage_alloc: bootmem limit with movablecore_map
Tang Chen [Thu, 7 Feb 2013 01:26:38 +0000 (12:26 +1100)]
page_alloc: bootmem limit with movablecore_map

Ensure that bootmem does not allocate memory from areas that may become
ZONE_MOVABLE.  The map info comes from the movablecore_map boot option.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agopage_alloc: make movablemem_map have higher priority
Tang Chen [Thu, 7 Feb 2013 01:26:38 +0000 (12:26 +1100)]
page_alloc: make movablemem_map have higher priority

If kernelcore or movablecore is specified together with movablemem_map,
movablemem_map takes priority.  This patch makes
find_zone_movable_pfns_for_nodes() calculate zone_movable_pfn[] with the
limit from zone_movable_limit[].

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoBug fix: Remove the unused sanitize_zone_movable_limit() definition.
Tang Chen [Thu, 7 Feb 2013 01:26:38 +0000 (12:26 +1100)]
Bug fix: Remove the unused sanitize_zone_movable_limit() definition.

When CONFIG_HAVE_MEMBLOCK_NODE_MAP is not defined, sanitize_zone_movable_limit()
is also not used. So remove it.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Liu Jiang <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agopage_alloc: introduce zone_movable_limit[] to keep movable limit for nodes
Tang Chen [Thu, 7 Feb 2013 01:26:37 +0000 (12:26 +1100)]
page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes

Introduce a new array zone_movable_limit[] to store the ZONE_MOVABLE limit
from the movablemem_map boot option for all nodes.  The function
sanitize_zone_movable_limit() finds out which node each range in
movable_map.map[] belongs to, and calculates the low boundary of
ZONE_MOVABLE for each node.
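
The per-node low boundary described above could be computed roughly as
below.  This is a sketch under assumptions, not the code from the patch:
struct movable_range, node_movable_limit() and the convention that 0 means
"no limit" are hypothetical names and choices made for illustration.

```c
#include <stddef.h>

/* Hypothetical [start_pfn, end_pfn) range, modeled on the text above. */
struct movable_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/*
 * Return the lowest pfn at which ZONE_MOVABLE may start on a node spanning
 * [node_start, node_end), or 0 if no range in the sorted, non-overlapping
 * map touches the node (0 is used here to mean "no limit").
 */
static unsigned long node_movable_limit(const struct movable_range *map,
					size_t nr,
					unsigned long node_start,
					unsigned long node_end)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (map[i].end_pfn <= node_start)
			continue;	/* range lies entirely below the node */
		if (map[i].start_pfn >= node_end)
			break;		/* sorted map: nothing further overlaps */
		/* First overlapping range: clamp its start to the node. */
		return map[i].start_pfn > node_start ?
		       map[i].start_pfn : node_start;
	}
	return 0;
}
```

Because the map is assumed to be sorted by start_pfn, the first overlapping
range is enough to determine the boundary for a node.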

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Liu Jiang <jiang.liu@huawei.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoRename movablecore_map to movablemem_map.
Tang Chen [Thu, 7 Feb 2013 01:26:37 +0000 (12:26 +1100)]
Rename movablecore_map to movablemem_map.

Since "core" could be confused with cpu cores, while here it refers to
memory, rename the boot option movablecore_map to movablemem_map.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agopage_alloc-add-movable_memmap-kernel-parameter-fix-fix-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:37 +0000 (12:26 +1100)]
page_alloc-add-movable_memmap-kernel-parameter-fix-fix-fix

remove unneeded parens

Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agopage_alloc-add-movable_memmap-kernel-parameter-fix-fix-checkpatch-fixes
Andrew Morton [Thu, 7 Feb 2013 01:26:36 +0000 (12:26 +1100)]
page_alloc-add-movable_memmap-kernel-parameter-fix-fix-checkpatch-fixes

Cc: Tang Chen <tangchen@cn.fujitsu.com>
WARNING: please, no space before tabs
#48: FILE: mm/page_alloc.c:5171:
+ * ^Imovablecore_map=nn[KMG]@ss[KMG]$

total: 0 errors, 1 warnings, 39 lines checked

./patches/page_alloc-add-movable_memmap-kernel-parameter-fix-fix.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoBug fix: Fix the doc format.
Tang Chen [Thu, 7 Feb 2013 01:26:36 +0000 (12:26 +1100)]
Bug fix: Fix the doc format.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agopage_alloc-add-movable_memmap-kernel-parameter-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:36 +0000 (12:26 +1100)]
page_alloc-add-movable_memmap-kernel-parameter-fix

improve comment

Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agopage_alloc: add movable_memmap kernel parameter
Tang Chen [Thu, 7 Feb 2013 01:26:35 +0000 (12:26 +1100)]
page_alloc: add movable_memmap kernel parameter

Add functions to parse the movablecore_map boot option.  Since the option
could be specified more than once, all the maps will be stored in the
global variable movablecore_map.map array.

We also keep the array in monotonically increasing order by start_pfn,
and merge all overlapping ranges.
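
A sorted insert with merging, as described above, might look like the
following sketch.  It is not the code from the patch: struct
range_map_sketch, RANGE_MAP_MAX and range_map_insert() are hypothetical
names, and the choice to also merge ranges that merely touch is an
assumption of this example.

```c
#include <stddef.h>
#include <string.h>

#define RANGE_MAP_MAX 8		/* hypothetical capacity */

struct pfn_range_sketch {
	unsigned long start_pfn;	/* inclusive */
	unsigned long end_pfn;		/* exclusive */
};

struct range_map_sketch {
	size_t nr;
	struct pfn_range_sketch map[RANGE_MAP_MAX];
};

/*
 * Insert [start, end), keeping the map sorted by start_pfn and merging
 * overlapping (or touching) entries.  Returns 0 on success, -1 if the
 * range is empty or the array is full.
 */
static int range_map_insert(struct range_map_sketch *m,
			    unsigned long start, unsigned long end)
{
	size_t lo, hi;

	if (start >= end)
		return -1;

	/* lo: first entry that is not entirely below the new range. */
	for (lo = 0; lo < m->nr && m->map[lo].end_pfn < start; lo++)
		;
	/* hi: first entry entirely above it; absorb everything in between. */
	for (hi = lo; hi < m->nr && m->map[hi].start_pfn <= end; hi++) {
		if (m->map[hi].start_pfn < start)
			start = m->map[hi].start_pfn;
		if (m->map[hi].end_pfn > end)
			end = m->map[hi].end_pfn;
	}

	if (lo == hi) {
		/* No overlap: open a slot at position lo. */
		if (m->nr == RANGE_MAP_MAX)
			return -1;
		memmove(&m->map[lo + 1], &m->map[lo],
			(m->nr - lo) * sizeof(m->map[0]));
		m->nr++;
	} else {
		/* Entries lo..hi-1 collapse into one: close the gap. */
		memmove(&m->map[lo + 1], &m->map[hi],
			(m->nr - hi) * sizeof(m->map[0]));
		m->nr -= hi - lo - 1;
	}
	m->map[lo].start_pfn = start;
	m->map[lo].end_pfn = end;
	return 0;
}
```

Keeping the array sorted and merged at insert time means later consumers
can scan it once without handling overlaps themselves.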

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agox86: get pg_data_t's memory from other node
Yasuaki Ishimatsu [Thu, 7 Feb 2013 01:26:35 +0000 (12:26 +1100)]
x86: get pg_data_t's memory from other node

During the implementation of SRAT support, we met a problem.
In setup_arch(), we have the following call series:

1) memblock is ready;
2) some functions use memblock to allocate memory;
3) parse ACPI tables, such as SRAT.

Before 3), we don't know which memory is hotpluggable, and as a result, we
cannot prevent memblock from allocating hotpluggable memory.  So, in 2),
there could be some hotpluggable memory allocated by memblock.

Now, we are trying to parse SRAT earlier, before memblock is ready.  But I
think we need more investigation on this topic.  So in this v5, I dropped
all the SRAT support, and v5 is just the same as v3, and it is based on
3.8-rc3.

As planned, we will eventually support getting info from SRAT without
users' participation, and we will post another patch-set to do so.

For now, I think we can add this boot option as the first step of
supporting movable nodes.  Since Linux cannot migrate direct mapped pages,
the only way for now is to limit a whole node to contain only movable
memory.

Using SRAT is one way.  But even if we can use SRAT, users still need an
interface to enable/disable this functionality if they don't want to lose
their NUMA performance.  So I think a user interface is always needed.

For now, users can disable this functionality by not specifying the boot
option.  Later, we will post SRAT support, and add another option value
"movablecore_map=acpi" to use SRAT.

This patch:

If the system creates a movable node, in which all memory of the node is
allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
the node's pg_data_t.  So, use memblock_alloc_try_nid() instead of
memblock_alloc_nid() to retry when the first allocation fails.
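
The retry-on-failure idea can be illustrated with a toy model.  This does
not use the memblock API: node_free[], alloc_on_node() and
alloc_try_node() are hypothetical stand-ins, and real memblock allocators
return physical addresses rather than node ids.

```c
#define SKETCH_NR_NODES	2
#define SKETCH_NODE_ANY	(-1)

/* Toy model: bytes the kernel may still allocate on each node. */
static unsigned long node_free[SKETCH_NR_NODES];

/* Strict variant: returns the node used, or -1 if @nid cannot satisfy it. */
static int alloc_on_node(unsigned long size, int nid)
{
	int n;

	if (nid != SKETCH_NODE_ANY) {
		if (node_free[nid] < size)
			return -1;
		node_free[nid] -= size;
		return nid;
	}
	/* No preference: take the first node with enough room. */
	for (n = 0; n < SKETCH_NR_NODES; n++) {
		if (node_free[n] >= size) {
			node_free[n] -= size;
			return n;
		}
	}
	return -1;
}

/* "try" variant: prefer @nid, but fall back to any node on failure. */
static int alloc_try_node(unsigned long size, int nid)
{
	int got = alloc_on_node(size, nid);

	if (got < 0)
		got = alloc_on_node(size, SKETCH_NODE_ANY);
	return got;
}
```

In this model a node with no kernel-usable memory plays the role of a
movable node: the strict call fails, while the "try" call places the
structure on another node.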

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agosched: do not use cpu_to_node() to find an offlined cpu's node.
Tang Chen [Thu, 7 Feb 2013 01:26:35 +0000 (12:26 +1100)]
sched: do not use cpu_to_node() to find an offlined cpu's node.

If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu) will
return -1.  As a result, cpumask_of_node(nid) will return NULL.  In this
case, find_next_bit() in for_each_cpu will get a NULL pointer and cause a
panic.

Here is a call trace:
[  609.824017] Call Trace:
[  609.824017]  <IRQ>
[  609.824017]  [<ffffffff810b0721>] select_fallback_rq+0x71/0x190
[  609.824017]  [<ffffffff810b086e>] ? try_to_wake_up+0x2e/0x2f0
[  609.824017]  [<ffffffff810b0b0b>] try_to_wake_up+0x2cb/0x2f0
[  609.824017]  [<ffffffff8109da08>] ? __run_hrtimer+0x78/0x320
[  609.824017]  [<ffffffff810b0b85>] wake_up_process+0x15/0x20
[  609.824017]  [<ffffffff8109ce62>] hrtimer_wakeup+0x22/0x30
[  609.824017]  [<ffffffff8109da13>] __run_hrtimer+0x83/0x320
[  609.824017]  [<ffffffff8109ce40>] ? update_rmtp+0x80/0x80
[  609.824017]  [<ffffffff8109df56>] hrtimer_interrupt+0x106/0x280
[  609.824017]  [<ffffffff810a72c8>] ? sd_free_ctl_entry+0x68/0x70
[  609.824017]  [<ffffffff8167cf39>] smp_apic_timer_interrupt+0x69/0x99
[  609.824017]  [<ffffffff8167be2f>] apic_timer_interrupt+0x6f/0x80

There is a hrtimer process sleeping, whose cpu has already been offlined.
When it is woken up, it tries to find another cpu to run on, and gets a -1
nid.  As a result, cpumask_of_node(-1) returns NULL, and causes a kernel
panic.

This patch fixes this problem by checking whether the nid is -1.  If nid
is not -1, a cpu on the same node will be picked.  Otherwise, an online
cpu on another node will be picked.
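
The selection rule above can be sketched in plain C.  This is not the
scheduler code: struct cpu_info_sketch and pick_fallback_cpu() are
hypothetical, and the sketch also falls through to any online cpu when the
node has no other online cpu, which is an assumption of this example.

```c
#include <stdbool.h>

#define SKETCH_NO_NODE (-1)

struct cpu_info_sketch {
	int nid;	/* SKETCH_NO_NODE once the mapping has been cleared */
	bool online;
};

/* Pick another cpu for a task that last ran on @cpu; -1 if none is online. */
static int pick_fallback_cpu(const struct cpu_info_sketch *cpus, int ncpus,
			     int cpu)
{
	int nid = cpus[cpu].nid;
	int i;

	/* Only look at the node's cpus when the nid is valid. */
	if (nid != SKETCH_NO_NODE) {
		for (i = 0; i < ncpus; i++)
			if (i != cpu && cpus[i].online && cpus[i].nid == nid)
				return i;
	}

	/* nid was -1, or the node had no candidate: any online cpu will do. */
	for (i = 0; i < ncpus; i++)
		if (i != cpu && cpus[i].online)
			return i;
	return -1;
}
```

The key point is that the node-local search is skipped entirely for a -1
nid, so no per-node mask is ever looked up for an invalid node.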

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agocpu-hotplugmemory-hotplug-clear-cpu_to_node-when-offlining-the-node-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:34 +0000 (12:26 +1100)]
cpu-hotplugmemory-hotplug-clear-cpu_to_node-when-offlining-the-node-fix

numa_clear_node() and numa_set_node() can no longer be __cpuinit.

WARNING: vmlinux.o(.text+0x222702): Section mismatch in reference from the function check_and_unmap_cpu_on_node() to the function .cpuinit.text:numa_clear_node()
The function check_and_unmap_cpu_on_node() references
the function __cpuinit numa_clear_node().
This is often because check_and_unmap_cpu_on_node lacks a __cpuinit
annotation or the annotation of numa_clear_node is wrong.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agocpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node
Wen Congyang [Thu, 7 Feb 2013 01:26:34 +0000 (12:26 +1100)]
cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node

When the node is offlined, there is no memory/cpu on the node.  If a
sleeping task last ran on a cpu of this node, it will be migrated to a cpu
on another node.  So we can clear the cpu-to-node mapping.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agocpu-hotplug, memory-hotplug: try offlining the node when hotremoving a cpu
Wen Congyang [Thu, 7 Feb 2013 01:26:34 +0000 (12:26 +1100)]
cpu-hotplug, memory-hotplug: try offlining the node when hotremoving a cpu

The node will be offlined when all memory/cpu on the node is hotremoved.
So we should try to offline the node when hotremoving a cpu on the node.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-hotplug: export the function try_offline_node() fix
David Rientjes [Thu, 7 Feb 2013 01:26:33 +0000 (12:26 +1100)]
memory-hotplug: export the function try_offline_node() fix

"memory-hotplug: export the function try_offline_node()" declares
try_offline_node() for CONFIG_MEMORY_HOTPLUG, but this function is only
defined for CONFIG_MEMORY_HOTREMOVE:

ERROR: "try_offline_node" [drivers/acpi/processor.ko] undefined!

Fix the build by defining it appropriately.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-hotplug: export the function try_offline_node()
Wen Congyang [Thu, 7 Feb 2013 01:26:33 +0000 (12:26 +1100)]
memory-hotplug: export the function try_offline_node()

try_offline_node() will be needed in the tristate
drivers/acpi/processor_driver.c.

The node will be offlined when all memory/cpu on the node have been
hotremoved.  So we need the function try_offline_node() in cpu-hotplug
path.

If memory-hotplug is disabled and cpu-hotplug is enabled:
1. no memory on the node
   we don't online the node, and cpu's node is the nearest node.
2. the node contains some memory
   the node has been onlined, and cpu's node is still needed
   to migrate the sleeping task on the cpu to the same node.

So we do nothing in try_offline_node() in this case.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agocpu_hotplug-clear-apicid-to-node-when-the-cpu-is-hotremoved-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:33 +0000 (12:26 +1100)]
cpu_hotplug-clear-apicid-to-node-when-the-cpu-is-hotremoved-fix

fix section error

__apicid_to_node can no longer be __cpuinit as it is referred to from
acpi_unmap_lsapic().

>> WARNING: vmlinux.o(.text+0x43773): Section mismatch in reference from the function acpi_unmap_lsapic() to the variable .cpuinit.data:__apicid_to_node
   The function acpi_unmap_lsapic() references
   the variable __cpuinitdata __apicid_to_node.
   This is often because acpi_unmap_lsapic lacks a __cpuinitdata
   annotation or the annotation of __apicid_to_node is wrong.

Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agocpu_hotplug: clear apicid to node when the cpu is hotremoved
Wen Congyang [Thu, 7 Feb 2013 01:26:33 +0000 (12:26 +1100)]
cpu_hotplug: clear apicid to node when the cpu is hotremoved

When a cpu is hotplugged, we call acpi_map_cpu2node() in _acpi_map_lsapic()
to store the cpu's node and apicid's node.  But we don't clear the cpu's
node in acpi_unmap_lsapic() when this cpu is hotremoved.  If the node is
also hotremoved, we will get the following messages:

[ 1646.771485] kernel BUG at include/linux/gfp.h:329!
[ 1646.828729] invalid opcode: 0000 [#1] SMP
[ 1646.877872] Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core sg acpi_memhotplug igb dca sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
[ 1647.588773] Pid: 3126, comm: init Not tainted 3.6.0-rc3-tangchen-hostbridge+ #13 FUJITSU-SV PRIMEQUEST 1800E/SB
[ 1647.711545] RIP: 0010:[<ffffffff811bc3fd>]  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
[ 1647.810492] RSP: 0018:ffff88078a049cf8  EFLAGS: 00010246
[ 1647.874028] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[ 1647.959339] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000246
[ 1648.044659] RBP: ffff88078a049d38 R08: 00000000000040d0 R09: 0000000000000001
[ 1648.129953] R10: 0000000000000000 R11: 0000000000000b5f R12: 00000000000052d0
[ 1648.215259] R13: ffff8807c1417300 R14: 0000000000030038 R15: 0000000000000003
[ 1648.300572] FS:  00007fa9b1b44700(0000) GS:ffff8807c3800000(0000) knlGS:0000000000000000
[ 1648.397272] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1648.465985] CR2: 00007fa9b09acca0 CR3: 000000078b855000 CR4: 00000000000007e0
[ 1648.551265] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1648.636565] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1648.721838] Process init (pid: 3126, threadinfo ffff88078a048000, task ffff8807bb6f2650)
[ 1648.818534] Stack:
[ 1648.842548]  ffff8807c39d7fa0 ffffffff000040d0 00000000000000bb 00000000000080d0
[ 1648.931469]  ffff8807c1417300 ffff8807c39d7fa0 ffff8807c1417300 0000000000000001
[ 1649.020410]  ffff88078a049d88 ffffffff811bc4a0 ffff8807c1410c80 0000000000000000
[ 1649.109464] Call Trace:
[ 1649.138713]  [<ffffffff811bc4a0>] new_slab+0x30/0x1b0
[ 1649.199075]  [<ffffffff811bc978>] __slab_alloc+0x358/0x4c0
[ 1649.264683]  [<ffffffff810b71c0>] ? alloc_fair_sched_group+0xd0/0x1b0
[ 1649.341695]  [<ffffffff811be7d4>] kmem_cache_alloc_node_trace+0xb4/0x1e0
[ 1649.421824]  [<ffffffff8109d188>] ? hrtimer_init+0x48/0x100
[ 1649.488414]  [<ffffffff810b71c0>] ? alloc_fair_sched_group+0xd0/0x1b0
[ 1649.565402]  [<ffffffff810b71c0>] alloc_fair_sched_group+0xd0/0x1b0
[ 1649.640297]  [<ffffffff810a8bce>] sched_create_group+0x3e/0x110
[ 1649.711040]  [<ffffffff810bdbcd>] sched_autogroup_create_attach+0x4d/0x180
[ 1649.793260]  [<ffffffff81089614>] sys_setsid+0xd4/0xf0
[ 1649.854694]  [<ffffffff8167a029>] system_call_fastpath+0x16/0x1b
[ 1649.926483] Code: 89 c4 e9 73 fe ff ff 31 c0 89 de 48 c7 c7 45 de 9e 81 44 89 45 c8 e8 22 05 4b 00 85 db 44 8b 45 c8 0f 89 4f ff ff ff 0f 0b eb fe <0f> 0b 90 eb fd 0f 0b eb fe 89 de 48 c7 c7 45 de 9e 81 31 c0 44
[ 1650.161454] RIP  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
[ 1650.232348]  RSP <ffff88078a049cf8>
[ 1650.274029] ---[ end trace adf84c90f3fea3e5 ]---

The reason is that the cpu's node is not NUMA_NO_NODE, so we call
alloc_pages_exact_node() to allocate memory on the node, but the node is
offlined.

If the node is onlined, we still need the cpu's node.  For example: a task
on the cpu is sleeping when the cpu is hotremoved.  We will choose another
cpu to run this task when it is woken up.  If we know the cpu's node, we
will choose a cpu on the same node first.  So we should clear the
cpu-to-node mapping only when the node is offlined.

This patch only clears apicid-to-node mapping when the cpu is hotremoved.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomempolicy: fix is_valid_nodemask()
Lai Jiangshan [Thu, 7 Feb 2013 01:26:32 +0000 (12:26 +1100)]
mempolicy: fix is_valid_nodemask()

is_valid_nodemask() was introduced by 19770b32 ("mm: filter based on a
nodemask as well as a gfp_mask"), but it does not match its comments,
because it does not check zones above policy_zone.

Also, b377fd ("Apply memory policies to top two highest zones when
highest zone is ZONE_MOVABLE") tells us that if the highest zone is
ZONE_MOVABLE, we should also apply memory policies to it.  So ZONE_MOVABLE
should be a valid zone for policies, and is_valid_nodemask() needs to be
changed to match.

Fix: check all zones, even those whose zoneid > policy_zone.  Use
nodes_intersects() instead of open code to check it.
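
The difference between an open-coded walk and an intersection helper can
be shown with a plain 64-bit mask.  This sketch does not use the kernel's
nodemask_t; nodemask_bits and both function names are hypothetical and
only illustrate the idea.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask type: bit n set means node n is in the mask. */
typedef uint64_t nodemask_bits;

/* Open-coded check: walk every node id and test both masks. */
static bool masks_intersect_open_coded(nodemask_bits a, nodemask_bits b)
{
	int nid;

	for (nid = 0; nid < 64; nid++)
		if (((a >> nid) & 1) && ((b >> nid) & 1))
			return true;
	return false;
}

/* Helper-style check: a single AND answers the same question. */
static bool masks_intersect(nodemask_bits a, nodemask_bits b)
{
	return (a & b) != 0;
}
```

Both functions give the same answer; the helper form is shorter and leaves
no per-zone loop in which a zone could be skipped by mistake.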

Reported-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-hotplug: consider compound pages when free memmap
Wen Congyang [Thu, 7 Feb 2013 01:26:32 +0000 (12:26 +1100)]
memory-hotplug: consider compound pages when free memmap

usemap could also be allocated as compound pages, so compound pages should
also be considered when freeing memmap.

If we don't fix it, there could be problems when we free vmemmap
pagetables which are stored in compound pages.  The old pagetables will
not be freed properly, and when we add the memory again, no new pagetable
will be created.  The old pagetable entry is then used, and the kernel
will panic.

The call trace is like the following:

[  691.175487] BUG: unable to handle kernel paging request at ffffea0040000000
[  691.258872] IP: [<ffffffff816a483f>] sparse_add_one_section+0xef/0x166
[  691.336971] PGD 7ff7d4067 PUD 78e035067 PMD 78e11d067 PTE 0
[  691.403952] Oops: 0002 [#1] SMP
[  691.442695] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg lpc_ich mfd_core i2c_i801 i2c_core i7core_edac edac_core ioatdma e1000e igb dca ptp pps_core sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
[  692.042726] CPU 0
[  692.064641] Pid: 4, comm: kworker/0:0 Tainted: G        W 3.8.0-rc3-phy-hot-remove+ #3 FUJITSU-SV PRIMEQUEST 1800E/SB
[  692.196723] RIP: 0010:[<ffffffff816a483f>]  [<ffffffff816a483f>] sparse_add_one_section+0xef/0x166
[  692.303885] RSP: 0018:ffff8807bdcb35d8  EFLAGS: 00010006
[  692.367331] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000200000
[  692.452578] RDX: ffff88078df01148 RSI: 0000000000000282 RDI: ffffea0040000000
[  692.537822] RBP: ffff8807bdcb3618 R08: 4cf05005b019467a R09: 0cd98fa09631467a
[  692.623071] R10: 0000000000000000 R11: 0000000000030e20 R12: 0000000000008000
[  692.708319] R13: ffffea0040000000 R14: ffff88078df66248 R15: ffff88078ea13b10
[  692.793562] FS:  0000000000000000(0000) GS:ffff8807c1a00000(0000) knlGS:0000000000000000
[  692.890233] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  692.958870] CR2: ffffea0040000000 CR3: 0000000001c0c000 CR4: 00000000000007f0
[  693.044119] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  693.129367] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  693.214617] Process kworker/0:0 (pid: 4, threadinfo ffff8807bdcb2000, task ffff8807bde18000)
[  693.315437] Stack:
[  693.339421]  0000000000000000 0000000000000282 0000000000000000 ffff88078df40f00
[  693.428208]  0000000000000001 0000000000000200 00000000000002ff 0000000000000200
[  693.516981]  ffff8807bdcb3668 ffffffff816940e5 0000000000000000 0000000001000000
[  693.605761] Call Trace:
[  693.634949]  [<ffffffff816940e5>] __add_pages+0x85/0x120
[  693.698398]  [<ffffffff8104f1d1>] arch_add_memory+0x71/0xf0
[  693.764960]  [<ffffffff81079bff>] ? request_resource_conflict+0x8f/0xa0
[  693.843982]  [<ffffffff81694796>] add_memory+0xd6/0x1f0
[  693.906393]  [<ffffffff814044df>] acpi_memory_device_add+0x170/0x20c
[  693.982302]  [<ffffffff813c1de2>] acpi_device_probe+0x50/0x18a
[  694.051977]  [<ffffffff8125a9d3>] ? sysfs_create_link+0x13/0x20
[  694.122691]  [<ffffffff8146c31c>] really_probe+0x6c/0x320
[  694.187170]  [<ffffffff8146c617>] driver_probe_device+0x47/0xa0
[  694.257885]  [<ffffffff8146c720>] ? __driver_attach+0xb0/0xb0
[  694.326521]  [<ffffffff8146c720>] ? __driver_attach+0xb0/0xb0
[  694.395157]  [<ffffffff8146c773>] __device_attach+0x53/0x60
[  694.461719]  [<ffffffff8146a34c>] bus_for_each_drv+0x6c/0xa0
[  694.529316]  [<ffffffff8146c298>] device_attach+0xa8/0xc0
[  694.593799]  [<ffffffff8146af70>] bus_probe_device+0xb0/0xe0
[  694.661398]  [<ffffffff814699c1>] device_add+0x301/0x570
[  694.724842]  [<ffffffff81469c4e>] device_register+0x1e/0x30
[  694.791403]  [<ffffffff813c354a>] acpi_device_register+0x1d8/0x27c
[  694.865230]  [<ffffffff813c37cd>] acpi_add_single_object+0x1df/0x2b9
[  694.941140]  [<ffffffff813fa078>] ? acpi_ut_release_mutex+0xac/0xb5
[  695.016009]  [<ffffffff813c39b9>] acpi_bus_check_add+0x112/0x18f
[  695.087764]  [<ffffffff810df61d>] ? trace_hardirqs_on+0xd/0x10
[  695.157445]  [<ffffffff810a1b0f>] ? up+0x2f/0x50
[  695.212585]  [<ffffffff813bdddb>] ? acpi_os_signal_semaphore+0x6b/0x74
[  695.290573]  [<ffffffff813ec519>] acpi_ns_walk_namespace+0x105/0x255
[  695.366478]  [<ffffffff813c38a7>] ? acpi_add_single_object+0x2b9/0x2b9
[  695.444459]  [<ffffffff813c38a7>] ? acpi_add_single_object+0x2b9/0x2b9
[  695.522439]  [<ffffffff813ecb6c>] acpi_walk_namespace+0xcf/0x118
[  695.594190]  [<ffffffff813c3a91>] acpi_bus_scan+0x5b/0x7c
[  695.658676]  [<ffffffff813c3b1e>] acpi_bus_add+0x2a/0x2c
[  695.722121]  [<ffffffff81402905>] container_notify_cb+0x112/0x1a9
[  695.794914]  [<ffffffff813d5859>] acpi_ev_notify_dispatch+0x46/0x61
[  695.869781]  [<ffffffff813be072>] acpi_os_execute_deferred+0x27/0x34
[  695.945687]  [<ffffffff81091c6e>] process_one_work+0x20e/0x5c0
[  696.015361]  [<ffffffff81091bff>] ? process_one_work+0x19f/0x5c0
[  696.087113]  [<ffffffff813be04b>] ? acpi_os_wait_events_complete+0x23/0x23
[  696.169248]  [<ffffffff81093d0e>] worker_thread+0x12e/0x370
[  696.235807]  [<ffffffff81093be0>] ? manage_workers+0x180/0x180
[  696.305485]  [<ffffffff81099e4e>] kthread+0xee/0x100
[  696.364773]  [<ffffffff810e1179>] ? __lock_release+0x129/0x190
[  696.434450]  [<ffffffff81099d60>] ? __init_kthread_worker+0x70/0x70
[  696.509317]  [<ffffffff816b34ac>] ret_from_fork+0x7c/0xb0
[  696.573799]  [<ffffffff81099d60>] ? __init_kthread_worker+0x70/0x70
[  696.648662] Code: 00 00 48 89 df 48 89 45 c8 e8 3e 71 b1 ff 48 89 c2 48 8b 75 c8 b8 ef ff ff ff f6 02 01 75 4b 49 63 cc 31 c0 4c 89 ef 48 c1 e1 06 <f3> aa 48 8b 02 48 83 c8 01 48 85 d2 48 89 02 74 29 a8 01 74 25
[  696.880997] RIP  [<ffffffff816a483f>] sparse_add_one_section+0xef/0x166
[  696.960128]  RSP <ffff8807bdcb35d8>
[  697.001768] CR2: ffffea0040000000
[  697.041336] ---[ end trace e7f94e3a34c442d4 ]---
[  697.096474] Kernel panic - not syncing: Fatal exception

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-hotplug-do-not-allocate-pdgat-if-it-was-not-freed-when-offline-fix-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:32 +0000 (12:26 +1100)]
memory-hotplug-do-not-allocate-pdgat-if-it-was-not-freed-when-offline-fix-fix

fix the warning again again

Tested-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-hotplug-do-not-allocate-pdgat-if-it-was-not-freed-when-offline-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:31 +0000 (12:26 +1100)]
memory-hotplug-do-not-allocate-pdgat-if-it-was-not-freed-when-offline-fix

fix warning when CONFIG_NEED_MULTIPLE_NODES=n

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-hotplug: do not allocate pgdat if it was not freed when offline.
Tang Chen [Thu, 7 Feb 2013 01:26:31 +0000 (12:26 +1100)]
memory-hotplug: do not allocate pgdat if it was not freed when offline.

Since there is no way to guarantee that the address of pgdat/zone is not on
the stack of any kernel thread or used by other kernel objects without
reference counting or another synchronizing method, we cannot reset
node_data and free pgdat when offlining a node.  Just reset pgdat to 0 and
reuse the memory when the node is onlined again.
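
The zero-and-reuse idea can be sketched in userspace C as follows.  This is
only an illustration under assumptions: struct node_data_sketch, node_table,
node_online_sketch() and node_offline_sketch() are hypothetical stand-ins and
are not the kernel's pg_data_t or hotplug functions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NODES_SKETCH 4	/* hypothetical node count for the sketch */

/* Hypothetical stand-in for the per-node data (pgdat). */
struct node_data_sketch {
	int node_id;
	unsigned long start_pfn;
	unsigned long spanned_pages;
};

static struct node_data_sketch *node_table[MAX_NODES_SKETCH];

/* Online a node: allocate only if no structure survives from earlier. */
static struct node_data_sketch *node_online_sketch(int nid,
						   unsigned long start_pfn,
						   unsigned long pages)
{
	assert(nid >= 0 && nid < MAX_NODES_SKETCH);

	struct node_data_sketch *nd = node_table[nid];
	if (!nd) {
		nd = calloc(1, sizeof(*nd));
		if (!nd)
			return NULL;
		node_table[nid] = nd;
	}
	nd->node_id = nid;
	nd->start_pfn = start_pfn;
	nd->spanned_pages = pages;
	return nd;
}

/*
 * Offline a node: zero the structure but keep both the allocation and the
 * table entry, so a stale pointer held elsewhere never points at freed
 * memory.
 */
static void node_offline_sketch(int nid)
{
	assert(nid >= 0 && nid < MAX_NODES_SKETCH);

	if (node_table[nid])
		memset(node_table[nid], 0, sizeof(*node_table[nid]));
}
```

Onlining the same node id a second time returns the same pointer, which is
the property the commit relies on.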

The problem was pointed out by Kamezawa Hiroyuki.  The idea is from Wen
Congyang.

NOTE: If we don't reset pgdat to 0, the WARN_ON in free_area_init_node()
      will be triggered.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-hotplug: free node_data when a node is offlined
Wen Congyang [Thu, 7 Feb 2013 01:26:31 +0000 (12:26 +1100)]
memory-hotplug: free node_data when a node is offlined

We call hotadd_new_pgdat() to allocate memory to store node_data.  So we
should free it when removing a node.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-hotplug: remove sysfs file of node
Tang Chen [Thu, 7 Feb 2013 01:26:30 +0000 (12:26 +1100)]
memory-hotplug: remove sysfs file of node

Introduce a new function try_offline_node() to remove the sysfs file of a
node when all memory sections of the node have been removed.  If some
memory sections of the node have not been removed, the function does nothing.
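
The check described above might look roughly like the following userspace
sketch.  The names struct mem_section_sketch and try_offline_node_sketch()
and the node_registered flag are assumptions made for illustration; they are
not the kernel's data structures or the real try_offline_node() signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one memory section and the node that owns it. */
struct mem_section_sketch {
	int nid;
	bool present;
};

/*
 * Clear the node's "registered in sysfs" flag only when no section of the
 * node is still present.  Returns true if the node was taken offline and
 * false if it was left untouched.
 */
static bool try_offline_node_sketch(const struct mem_section_sketch *secs,
				    size_t nr_secs, int nid,
				    bool *node_registered)
{
	assert(secs != NULL && node_registered != NULL);

	for (size_t i = 0; i < nr_secs; i++) {
		if (secs[i].present && secs[i].nid == nid)
			return false;	/* some memory remains: do nothing */
	}

	*node_registered = false;	/* stands in for removing the sysfs file */
	return true;
}
```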

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory_hotplug: clear zone when removing the memory
Yasuaki Ishimatsu [Thu, 7 Feb 2013 01:26:30 +0000 (12:26 +1100)]
memory_hotplug: clear zone when removing the memory

When memory is added, we update zone's and pgdat's start_pfn and
spanned_pages in __add_zone().  So we should revert them when the memory
is removed.

The patch adds a new function __remove_zone() to do this.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-hotplug: integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP.
Tang Chen [Thu, 7 Feb 2013 01:26:30 +0000 (12:26 +1100)]
memory-hotplug: integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP.

Currently __remove_section for SPARSEMEM_VMEMMAP does nothing.  But even
if we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-hotplug-remove-memmap-of-sparse-vmemmap-fix
Michal Hocko [Thu, 7 Feb 2013 01:26:29 +0000 (12:26 +1100)]
memory-hotplug-remove-memmap-of-sparse-vmemmap-fix

Defconfig for x86_64 complains:
arch/x86/mm/init_64.c: In function `vmemmap_free':
arch/x86/mm/init_64.c:1317: error: implicit declaration of function `remove_pagetable'

vmemmap_free is only used for CONFIG_MEMORY_HOTPLUG so let's move it
inside ifdef

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-hotplug: remove memmap of sparse-vmemmap
Tang Chen [Thu, 7 Feb 2013 01:26:29 +0000 (12:26 +1100)]
memory-hotplug: remove memmap of sparse-vmemmap

Introduce a new API vmemmap_free() to free and remove vmemmap pagetables.
Since pagetable implementations differ, each architecture has to provide
its own version of vmemmap_free(), just like vmemmap_populate().

Note:  vmemmap_free() is not implemented for ia64, ppc, s390, and sparc.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-hotplug-remove-page-table-of-x86_64-architecture-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:29 +0000 (12:26 +1100)]
memory-hotplug-remove-page-table-of-x86_64-architecture-fix

make kernel_physical_mapping_remove() static

Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-hotplug: remove page table of x86_64 architecture
Tang Chen [Thu, 7 Feb 2013 01:26:28 +0000 (12:26 +1100)]
memory-hotplug: remove page table of x86_64 architecture

Search the page tables for entries that map the removed memory, and clear
them, for the x86_64 architecture.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix-fix...
Andrew Morton [Thu, 7 Feb 2013 01:26:28 +0000 (12:26 +1100)]
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix-fix-fix-fix

fix used-uninitialised bug

Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agomemory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix-fix-fix
Andrew Morton [Thu, 7 Feb 2013 01:26:28 +0000 (12:26 +1100)]
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix-fix-fix

Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoBug fix: Do not split pages when freeing pagetable pages.
Tang Chen [Thu, 7 Feb 2013 01:26:27 +0000 (12:26 +1100)]
Bug fix: Do not split pages when freeing pagetable pages.

The old way of freeing pagetable pages was wrong.

The old way was:
When we got a hugepage, we split it into smaller pages.  Sometimes we
only needed to free some of the smaller pages because the others were
still in use.  The whole larger page was freed once none of the smaller
pages were in use.  All of this was done in remove_pmd/pud_table().

But there is no way to know if the larger page has been split. As a result,
there is no way to decide when to split.

Actually, there is no need to split larger pages into smaller ones.

We do it in the following new way:
1) For direct mapped pages, all the pages were freed when they were offlined.
   And since memory offline is done section by section, all the memory ranges
   being removed are aligned to PAGE_SIZE.  So we only need to deal with
   unaligned pages when freeing vmemmap pages.

2) For vmemmap pages used to store struct page, if part of the larger
   page is still in use, just fill the unused part with 0xFD.  When the
   whole page has been filled with 0xFD, free the larger page.
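
The 0xFD marking in step 2 can be sketched in userspace C as below.  This is
a minimal illustration, not the patch itself: BLOCK_SIZE_SKETCH, UNUSED_MARK
and mark_unused_sketch() are hypothetical names, and a plain byte loop stands
in for whatever scanning helper the kernel code uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE_SKETCH 4096u	/* hypothetical size of the larger page */
#define UNUSED_MARK 0xFD	/* marker byte named in the text above */

/*
 * Mark [off, off + len) of the block as unused by filling it with the
 * marker byte.  Returns true once every byte of the block carries the
 * marker, meaning the whole block may now be freed in one go.
 */
static bool mark_unused_sketch(unsigned char *block, size_t off, size_t len)
{
	assert(block != NULL);

	if (off > BLOCK_SIZE_SKETCH || len > BLOCK_SIZE_SKETCH - off)
		return false;	/* range does not fit inside the block */

	memset(block + off, UNUSED_MARK, len);

	for (size_t i = 0; i < BLOCK_SIZE_SKETCH; i++) {
		if (block[i] != UNUSED_MARK)
			return false;	/* part of the block is still in use */
	}
	return true;
}
```

The caller frees the block only when the function returns true, so no
record of earlier splits is needed.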

This problem is caused by the following related patches:
memory-hotplug-common-apis-to-support-page-tables-hot-remove.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix.patch

This patch will fix the problem based on the above patches.

Reported-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoBug fix: Do not free page split from hugepage one by one.
Tang Chen [Thu, 7 Feb 2013 01:26:27 +0000 (12:26 +1100)]
Bug fix: Do not free page split from hugepage one by one.

When we split a larger page into smaller ones, we should not free them one
by one because only the _count of the first page struct makes sense.
Otherwise, the kernel will panic.

So fill the unused small pages with 0xFD, and when the whole larger
page has been filled with 0xFD, free the whole larger page.

The call trace is like the following:

[ 1052.819430] ------------[ cut here ]------------
[ 1052.874575] kernel BUG at include/linux/mm.h:278!
[ 1052.930754] invalid opcode: 0000 [#1] SMP
[ 1052.979888] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core igb dca ptp pps_core sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
[ 1053.580111] CPU 0
[ 1053.602026] Pid: 4, comm: kworker/0:0 Tainted: G        W    3.8.0-rc2-memory-hotremove+ #3 FUJITSU-SV PRIMEQUEST 1800E/SB
[ 1053.736188] RIP: 0010:[<ffffffff81175bd7>]  [<ffffffff81175bd7>] __free_pages+0x37/0x50
[ 1053.831952] RSP: 0018:ffff8807bdcb37f8  EFLAGS: 00010246
[ 1053.895403] RAX: 0000000000000000 RBX: ffff88077c401000 RCX: 000000000000002c
[ 1053.980660] RDX: ffff8807fffd7000 RSI: 0000000000000000 RDI: ffffea001df10040
[ 1054.065917] RBP: ffff8807bdcb37f8 R08: 0000000000000000 R09: 0000000000000000
[ 1054.151178] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 1054.236433] R13: ffffea0040008000 R14: ffffea0040002000 R15: 00003ffffffff000
[ 1054.321691] FS:  0000000000000000(0000) GS:ffff8807c1a00000(0000) knlGS:0000000000000000
[ 1054.418372] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1054.487015] CR2: 00007fbc137ad000 CR3: 0000000001c0c000 CR4: 00000000000007f0
[ 1054.572272] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1054.657533] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1054.742797] Process kworker/0:0 (pid: 4, threadinfo ffff8807bdcb2000, task ffff8807bde18000)
[ 1054.843628] Stack:
[ 1054.867622]  ffff8807bdcb3818 ffffffff81175ccf 000000077c401000 ffff880796349008
[ 1054.956421]  ffff8807bdcb3848 ffffffff816a16e2 ffffea0040001000 ffff880796349008
[ 1055.045227]  ffffea0040008000 ffffea0040002000 ffff8807bdcb38b8 ffffffff816a181c
[ 1055.134033] Call Trace:
[ 1055.163230]  [<ffffffff81175ccf>] free_pages+0x5f/0x70
[ 1055.224611]  [<ffffffff816a16e2>] free_pagetable+0x7f/0xee
[ 1055.290147]  [<ffffffff816a181c>] remove_pte_table+0xcb/0x1cd
[ 1055.358806]  [<ffffffff81055bd0>] ? leave_mm+0x50/0x50
[ 1055.420187]  [<ffffffff810df55d>] ? trace_hardirqs_on+0xd/0x10
[ 1055.489882]  [<ffffffff816a1f8b>] remove_pmd_table+0x191/0x253
[ 1055.559576]  [<ffffffff816a261e>] remove_pud_table+0x194/0x24d
[ 1055.629270]  [<ffffffff811bbc3f>] ? sparse_remove_one_section+0x2f/0x150
[ 1055.709348]  [<ffffffff816a278c>] remove_pagetable+0xb5/0x17c
[ 1055.778002]  [<ffffffff81692f28>] vmemmap_free+0x18/0x20
[ 1055.841465]  [<ffffffff811bbd15>] sparse_remove_one_section+0x105/0x150
[ 1055.920508]  [<ffffffff811c953c>] __remove_pages+0xec/0x110
[ 1055.987087]  [<ffffffff81692fa7>] arch_remove_memory+0x77/0xc0
[ 1056.056781]  [<ffffffff81694138>] remove_memory+0xb8/0xf0
[ 1056.121284]  [<ffffffff814040aa>] acpi_memory_device_remove+0x76/0xbc
[ 1056.198244]  [<ffffffff813c1e50>] acpi_device_remove+0x90/0xb2
[ 1056.267941]  [<ffffffff8146bf3c>] __device_release_driver+0x7c/0xf0
[ 1056.342824]  [<ffffffff8146c0bf>] device_release_driver+0x2f/0x50
[ 1056.415635]  [<ffffffff813c3142>] acpi_bus_remove+0x32/0x6d
[ 1056.482215]  [<ffffffff813c320e>] acpi_bus_trim+0x91/0x102
[ 1056.547755]  [<ffffffff813c3307>] acpi_bus_hot_remove_device+0x88/0x180
[ 1056.626794]  [<ffffffff813be152>] acpi_os_execute_deferred+0x27/0x34
[ 1056.702717]  [<ffffffff81091c5e>] process_one_work+0x20e/0x5c0
[ 1056.772411]  [<ffffffff81091bef>] ? process_one_work+0x19f/0x5c0
[ 1056.844184]  [<ffffffff813be12b>] ? acpi_os_wait_events_complete+0x23/0x23
[ 1056.926337]  [<ffffffff81093cfe>] worker_thread+0x12e/0x370
[ 1056.992908]  [<ffffffff81093bd0>] ? manage_workers+0x180/0x180
[ 1057.062602]  [<ffffffff81099e3e>] kthread+0xee/0x100
[ 1057.121913]  [<ffffffff810e10b9>] ? __lock_release+0x129/0x190
[ 1057.191609]  [<ffffffff81099d50>] ? __init_kthread_worker+0x70/0x70
[ 1057.266494]  [<ffffffff816b33ac>] ret_from_fork+0x7c/0xb0
[ 1057.330992]  [<ffffffff81099d50>] ? __init_kthread_worker+0x70/0x70
[ 1057.405863] Code: 85 c0 74 27 f0 ff 4f 1c 0f 94 c0 84 c0 74 0a 85 f6 74 11 90 e8 bb e3 ff ff c9 c3 66 0f 1f 84 00 00 00 00 00 e8 1b fe ff ff c9 c3 <0f> 0b 0f 1f 80 00 00 00 00 eb f7 66 66 66 66 66 2e 0f 1f 84 00
[ 1057.638271] RIP  [<ffffffff81175bd7>] __free_pages+0x37/0x50
[ 1057.705994]  RSP <ffff8807bdcb37f8>
[ 1057.747882] ---[ end trace 0f90e1e054d174f9 ]---
[ 1057.803158] Kernel panic - not syncing: Fatal exception

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
11 years agoBug fix: Do not free direct mapping pages twice.
Tang Chen [Thu, 7 Feb 2013 01:26:27 +0000 (12:26 +1100)]
Bug fix: Do not free direct mapping pages twice.

Direct mapped pages were either freed when they were offlined or never
allocated.  So we only need to free vmemmap pages; there is no need to
free direct mapped pages.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>