git.karo-electronics.de Git - karo-tx-linux.git/log

swap-add-per-partition-lock-for-swapfile-fix-fix-fix-fix

Fix building errors like:
> arch/sparc/mm/init_32.c: In function 'show_mem':
> arch/sparc/mm/init_32.c:60:23: error: invalid operands to binary << (have 'atomic_long_t' and 'int')

Signed-off-by: Shaohua Li <shli@fusionio.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

swap: add per-partition lock for swapfile fix

I had all cpus spinning in swap_info_get(), for the lock on an area
being swapped off: probably because get_swap_page() forgot to unlock.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

swap: fix "add per-partition lock for swapfile" for nommu

The patch "swap: add per-partition lock for swapfile" made the
nr_swap_pages variable inaccessible but forgot to change the
mm/nommu.c file that uses it. This does the trivial conversion
to let us build nommu kernels again.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

swap-add-per-partition-lock-for-swapfile-fix-fix

> arch/sparc/mm/init_32.c: In function 'show_mem':
> arch/sparc/mm/init_32.c:60:23: error: invalid operands to binary << (have 'atomic_long_t' and 'int')
>

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

swap: add per-partition lock for swapfile

swap_lock is heavily contended when I test swap to 3 fast SSDs (it is even
slightly slower than swap to 2 such SSDs).  The main contention comes from
swap_info_get().  This patch tries to close the gap by adding a new
per-partition lock.

Global data like nr_swapfiles, total_swap_pages, least_priority and
swap_list are still protected by swap_lock.

nr_swap_pages is now an atomic, so it can be changed without swap_lock.  In
theory, get_swap_page() may find no swap pages even though free swap pages
actually exist, but that does not look like a big problem.

Accessing partition specific data (like scan_swap_map and so on) is only
protected by swap_info_struct.lock.

Changing swap_info_struct.flags requires holding both swap_lock and
swap_info_struct.lock, because scan_swap_map() will check it.  Reading the
flags is OK with either lock held.

If both swap_lock and swap_info_struct.lock must be held, we always take
the former first to avoid deadlock.
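
The locking rules above can be sketched in user space as follows.  This
is only an illustration under stated assumptions, not the patch itself:
pthread mutexes stand in for the kernel spinlocks, struct swap_area is a
simplified stand-in for swap_info_struct, and the helpers
swap_area_init(), area_alloc_page() and area_set_flags() are
hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for the global swap_lock (protects global swap state). */
static pthread_mutex_t swap_lock = PTHREAD_MUTEX_INITIALIZER;

/* Free-page counter; atomic, so it can change without swap_lock. */
static atomic_long nr_swap_pages;

/* Simplified stand-in for struct swap_info_struct. */
struct swap_area {
	pthread_mutex_t lock;	/* per-partition lock */
	unsigned long flags;	/* written only with both locks held */
	long free_pages;	/* partition-specific data */
};

/* Hypothetical helper: set up one area with a number of free pages. */
static void swap_area_init(struct swap_area *si, long pages)
{
	assert(si != NULL);
	pthread_mutex_init(&si->lock, NULL);
	si->flags = 0;
	si->free_pages = pages;
	atomic_fetch_add(&nr_swap_pages, pages);
}

/* Hypothetical helper: per-partition work takes only the area lock. */
static int area_alloc_page(struct swap_area *si)
{
	int ok = 0;

	pthread_mutex_lock(&si->lock);
	if (si->free_pages > 0) {
		si->free_pages--;
		atomic_fetch_sub(&nr_swap_pages, 1);
		ok = 1;
	}
	pthread_mutex_unlock(&si->lock);
	return ok;
}

/* Hypothetical helper: flag changes take swap_lock first, then si->lock. */
static void area_set_flags(struct swap_area *si, unsigned long flags)
{
	pthread_mutex_lock(&swap_lock);
	pthread_mutex_lock(&si->lock);
	si->flags = flags;
	pthread_mutex_unlock(&si->lock);
	pthread_mutex_unlock(&swap_lock);
}
```

Because every path that needs both locks takes them in the same order,
two threads cannot each hold one lock while waiting for the other.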

swap_entry_free() can change swap_list.  To delete that code, we add a new
highest_priority_index.  Whenever get_swap_page() is called, we check it.
If it's valid, we use it.

It's a pity get_swap_page() still holds swap_lock.  But in practice,
swap_lock isn't heavily contended in my test with this patch (or I can
say there are other, much heavier bottlenecks like TLB flush).  And
BTW, it looks like get_swap_page() doesn't really need the lock.  We never
free swap_info[] and we check the SWP_WRITEOK flag.  The only risk without
the lock is that we could swap out to some low priority swap, but we can
quickly recover after several rounds of swap, so it doesn't sound like a
big deal to me.  But I'd prefer to fix this if it's a real problem.

"swap: make each swap partition have one address_space" improved the
swapout speed from 1.7G/s to 2G/s.  This patch further improves the speed
to 2.3G/s, so around 15% improvement.  It's a multi-process test, so TLB
flush isn't the biggest bottleneck before the patches.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

swap-make-each-swap-partition-have-one-address_space-fix-fix

Sasha reported:
Commit "swap: make each swap partition have one address_space" is triggering
a series of warnings on boot:

[    3.446071] ------------[ cut here ]------------
[    3.446664] WARNING: at lib/debugobjects.c:261 debug_print_object+0x8e/0xb0()
[    3.447715] ODEBUG: init active (active state 0) object type: percpu_counter hint:           (null)
[    3.450360] Modules linked in:
[    3.451593] Pid: 1, comm: swapper/0 Tainted: G        W    3.8.0-rc4-next-20130124-sasha-00004-g838a1b4 #266
[    3.454508] Call Trace:
[    3.455248]  [<ffffffff8110d1bc>] warn_slowpath_common+0x8c/0xc0
[    3.455248]  [<ffffffff8110d291>] warn_slowpath_fmt+0x41/0x50
[    3.455248]  [<ffffffff81a2bb5e>] debug_print_object+0x8e/0xb0
[    3.455248]  [<ffffffff81a2c26b>] __debug_object_init+0x20b/0x290
[    3.455248]  [<ffffffff81a2c305>] debug_object_init+0x15/0x20
[    3.455248]  [<ffffffff81a3fbed>] __percpu_counter_init+0x6d/0xe0
[    3.455248]  [<ffffffff81231bdc>] bdi_init+0x1ac/0x270
[    3.455248]  [<ffffffff8618f20b>] swap_setup+0x3b/0x87
[    3.455248]  [<ffffffff8618f257>] ? swap_setup+0x87/0x87
[    3.455248]  [<ffffffff8618f268>] kswapd_init+0x11/0x7c
[    3.455248]  [<ffffffff810020ca>] do_one_initcall+0x8a/0x180
[    3.455248]  [<ffffffff86168cfd>] do_basic_setup+0x96/0xb4
[    3.455248]  [<ffffffff861685ae>] ? loglevel+0x31/0x31
[    3.455248]  [<ffffffff861885cd>] ? sched_init_smp+0x150/0x157
[    3.455248]  [<ffffffff86168ded>] kernel_init_freeable+0xd2/0x14c
[    3.455248]  [<ffffffff83cade10>] ? rest_init+0x140/0x140
[    3.455248]  [<ffffffff83cade19>] kernel_init+0x9/0xf0
[    3.455248]  [<ffffffff83d5727c>] ret_from_fork+0x7c/0xb0
[    3.455248]  [<ffffffff83cade10>] ? rest_init+0x140/0x140
[    3.455248] ---[ end trace 0b176d5c0f21bffb ]---

Initialize swap space backing_dev_info once to avoid the warning.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

swap-make-each-swap-partition-have-one-address_space-fix

revert unneeded change to __add_to_swap_cache

Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@fusionio.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

swap: make each swap partition have one address_space

When I use several fast SSDs for swap, swapper_space.tree_lock is heavily
contended. This makes each swap partition have one address_space to
reduce the lock contention. There is an array of address_space for swap.
The swap entry type is the index to the array.
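
A minimal user-space sketch of that lookup, assuming a simplified entry
layout: MAX_SWAPFILES and SWP_TYPE_SHIFT below are illustrative values,
struct address_space is reduced to placeholders, and the helper names
are reused only to mirror the description above.

```c
#include <assert.h>

#define MAX_SWAPFILES	4	/* illustrative, not the kernel's value */
#define SWP_TYPE_SHIFT	24	/* illustrative split between type and offset */

typedef struct { unsigned long val; } swp_entry_t;

/* Simplified stand-in for struct address_space. */
struct address_space {
	int tree_lock;		/* placeholder for the per-space lock */
	unsigned long nrpages;
};

/* One address_space per swap partition instead of a single global one. */
static struct address_space swapper_spaces[MAX_SWAPFILES];

static swp_entry_t swp_entry(unsigned long type, unsigned long offset)
{
	swp_entry_t e;

	assert(type < MAX_SWAPFILES);
	assert(offset < (1UL << SWP_TYPE_SHIFT));
	e.val = (type << SWP_TYPE_SHIFT) | offset;
	return e;
}

static unsigned long swp_type(swp_entry_t e)
{
	return e.val >> SWP_TYPE_SHIFT;
}

static unsigned long swp_offset(swp_entry_t e)
{
	return e.val & ((1UL << SWP_TYPE_SHIFT) - 1);
}

/* The swap entry type selects which address_space (and lock) is used. */
static struct address_space *swap_address_space(swp_entry_t e)
{
	return &swapper_spaces[swp_type(e)];
}
```

Entries on different partitions resolve to different array slots, so
they no longer compete for one tree_lock.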

In my test with 3 SSDs, this increases the swapout throughput by 20%.

Signed-off-by: Shaohua Li <shli@fusionio.com>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: don't inline page_mapping()

According to akpm, this saves 1/2k text and makes things simple for the
next patch.

Numbers from Minchan:

add/remove: 1/0 grow/shrink: 6/22 up/down: 92/-516 (-424)
function                                     old     new   delta
page_mapping                                   -      48     +48
do_task_stat                                2292    2308     +16
page_remove_rmap                             240     248      +8
load_elf_binary                             4500    4508      +8
update_queue                                 532     536      +4
scsi_probe_and_add_lun                      2892    2896      +4
lookup_fast                                  644     648      +4
vcs_read                                    1040    1036      -4
__ip_route_output_key                       1904    1900      -4
ip_route_input_noref                        2508    2500      -8
shmem_file_aio_read                          784     772     -12
__isolate_lru_page                           272     256     -16
shmem_replace_page                           708     688     -20
mark_buffer_dirty                            228     208     -20
__set_page_dirty_buffers                     240     220     -20
__remove_mapping                             276     256     -20
update_mmu_cache                             500     476     -24
set_page_dirty_balance                        92      68     -24
set_page_dirty                               172     148     -24
page_evictable                                88      64     -24
page_cache_pipe_buf_steal                    248     224     -24
clear_page_dirty_for_io                      340     316     -24
test_set_page_writeback                      400     372     -28
test_clear_page_writeback                    516     488     -28
invalidate_inode_page                        156     128     -28
page_mkclean                                 432     400     -32
flush_dcache_page                            360     328     -32
__set_page_dirty_nobuffers                   324     280     -44
shrink_page_list                            2412    2356     -56

Signed-off-by: Shaohua Li <shli@fusionio.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: numa: Cleanup flow of transhuge page migration

When correcting commit 04fa5d6a ("mm: migrate: check page_count of THP
before migrating") Hugh Dickins noted that the control flow for transhuge
migration was difficult to follow.  Unconditionally calling put_page() in
numamigrate_isolate_page() made the failure paths of both
migrate_misplaced_transhuge_page() and migrate_misplaced_page() more
complex than they should be.  Further, he was extremely wary that an
unlock_page() should ever happen after a put_page() even if the put_page()
should never be the final put_page.

Hugh implemented the following cleanup to simplify the path by calling
putback_lru_page() inside numamigrate_isolate_page() if it failed to
isolate and always calling unlock_page() within
migrate_misplaced_transhuge_page().  There is no functional change after
this patch is applied but the code is easier to follow and unlock_page()
always happens before put_page().

[mgorman@suse.de: changelog only]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: Fold page->_last_nid into page->flags where possible

page->_last_nid fits into page->flags on 64-bit. The unlikely 32-bit NUMA
configuration with NUMA Balancing will still need an extra page field. As
Peter notes "Completely dropping 32bit support for CONFIG_NUMA_BALANCING
would simplify things, but it would also remove the warning if we grow
enough 64bit only page-flags to push the last-cpu out."
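
As a rough illustration of folding a node id into a 64-bit flags word,
the sketch below packs and unpacks a small field.  The shift and width
are made-up values for the example, not the layout the patch computes,
and struct page is reduced to the one field that matters here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field placement inside a 64-bit flags word. */
#define LAST_NID_WIDTH	10
#define LAST_NID_SHIFT	40
#define LAST_NID_MASK	((UINT64_C(1) << LAST_NID_WIDTH) - 1)

/* Simplified stand-in for struct page. */
struct page {
	uint64_t flags;
};

/* Read the node id stored in the upper part of flags. */
static int page_last_nid(const struct page *page)
{
	return (int)((page->flags >> LAST_NID_SHIFT) & LAST_NID_MASK);
}

/* Replace the stored node id without disturbing the other flag bits. */
static void page_set_last_nid(struct page *page, int nid)
{
	assert(nid >= 0 && (uint64_t)nid <= LAST_NID_MASK);
	page->flags &= ~(LAST_NID_MASK << LAST_NID_SHIFT);
	page->flags |= ((uint64_t)nid & LAST_NID_MASK) << LAST_NID_SHIFT;
}
```

On a configuration where the flags word has no spare bits of that
width, the value would have to stay in a separate field, which is the
32-bit case described above.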

[mgorman@suse.de: minor modifications]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: move page flags layout to separate header

This is a preparation patch for moving page->_last_nid into page->flags
that moves page flag layout information to a separate header. This patch
is necessary because otherwise there would be a circular dependency
between mm_types.h and mm.h.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: numa: handle side-effects in count_vm_numa_events() for !CONFIG_NUMA_BALANCING

The current definition of count_vm_numa_events() is wrong for
!CONFIG_NUMA_BALANCING as the following would miss the side-effect.

count_vm_numa_events(NUMA_FOO, bar++);

There are no such users of count_vm_numa_events() but this patch fixes it
as it is a potential pitfall. Ideally both would be converted to static
inline but NUMA_PTE_UPDATES is not defined if !CONFIG_NUMA_BALANCING and
creating dummy constants just to have a static inline would be similarly
clumsy.
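
The pitfall can be reproduced outside the kernel with two stub macros.
The macro names below are hypothetical; the second one shows one common
way to keep a no-op stub from swallowing side effects, and is not
necessarily the exact form used by this patch.

```c
#include <assert.h>

/* Stub that discards its arguments: "val" is never evaluated. */
#define count_events_dropping(item, val)	do { } while (0)

/* Stub that still evaluates "val" once, preserving side effects. */
#define count_events_evaluating(item, val)	do { (void)(val); } while (0)

/* Returns bar after the call; the increment is lost with this stub. */
static int demo_dropping(void)
{
	int bar = 0;

	count_events_dropping(NUMA_FOO, bar++);
	return bar;
}

/* Returns bar after the call; the increment happens with this stub. */
static int demo_evaluating(void)
{
	int bar = 0;

	count_events_evaluating(NUMA_FOO, bar++);
	return bar;
}
```

NUMA_FOO never needs a definition here because neither stub expands its
first argument, which is also why a static inline would require dummy
constants.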

Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: numa: take THP into account when migrating pages for NUMA balancing

Wanpeng Li pointed out that numamigrate_isolate_page() assumes that only
one base page is being migrated when in fact it can also be checking THP.
The consequences are that a migration will be attempted when a target node
is nearly full and fail later. It's unlikely to be user-visible but it
should be fixed. While we are there, migrate_balanced_pgdat() should
treat nr_migrate_pages as an unsigned long as it is treated as a
watermark.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Suggested-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Simon Jeons <simon.jeons@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: numa: fix minor typo in numa_next_scan

s/me/be/ and clarify the comment a bit when we're changing it anyway.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Suggested-by: Simon Jeons <simon.jeons@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: remove unused memclear_highpage_flush()

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

usb: forbid memory allocation with I/O during bus reset

If a storage interface or a USB network interface (the iSCSI case) exists
in the current configuration, memory allocation with GFP_KERNEL during
usb_device_reset() might trigger I/O transfer on the storage interface
itself and cause a deadlock: 'us->dev_mutex' is held in .pre_reset(), so
the storage interface can't do I/O transfer when the reset is triggered by
another interface, and the error handling can't complete if the reset is
triggered by the storage interface itself (error handling path).

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Reviewed-by: Jiri Kosina <jkosina@suse.cz>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

pm / runtime: force memory allocation with no I/O during Runtime PM callback

Apply the newly introduced memalloc_noio_save() and memalloc_noio_restore()
to force memory allocation with no I/O during the runtime_resume and
runtime_suspend callbacks of devices that have the 'memalloc_noio' flag set.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

net/core: apply pm_runtime_set_memalloc_noio on network devices

Deadlock might be caused by allocating memory with GFP_KERNEL in the
runtime_resume and runtime_suspend callbacks of network devices in the
iSCSI situation, so mark network devices and their ancestors as
'memalloc_noio' with the newly introduced pm_runtime_set_memalloc_noio().

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices

Apply the introduced pm_runtime_set_memalloc_noio() to block devices so
that the PM core will teach mm not to allocate memory with GFP_IOFS when
calling the runtime_resume and runtime_suspend callbacks for block devices
and their ancestors.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

pm / runtime: introduce pm_runtime_set_memalloc_noio()

Introduce the flag memalloc_noio in 'struct dev_pm_info' to help the PM
core teach mm not to allocate memory with the GFP_KERNEL flag, avoiding a
probable deadlock.

As explained in the comment, any GFP_KERNEL allocation inside
runtime_resume() or runtime_suspend() on any device in the path from a
block or network device to the root device in the device tree may cause
deadlock.  The introduced pm_runtime_set_memalloc_noio() sets or clears
the flag on the devices in the path recursively.
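
The "set" half of that walk can be sketched as below.  This is a
simplified illustration, not the kernel function: struct device is
reduced to a parent pointer and the flag, set_memalloc_noio_path() is a
hypothetical name, and clearing is left out because an ancestor shared
with another flagged device would presumably have to keep its flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct device plus its PM info. */
struct device {
	struct device *parent;	/* NULL for the root device */
	bool memalloc_noio;
};

/* Mark dev and every ancestor up to the root device. */
static void set_memalloc_noio_path(struct device *dev)
{
	assert(dev != NULL);
	for (; dev != NULL; dev = dev->parent)
		dev->memalloc_noio = true;
}
```

Devices outside the path, such as a sibling that hangs off the same
root, are left untouched.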

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: teach mm by current context info to not do I/O during memory allocation

This patch introduces PF_MEMALLOC_NOIO as a process flag (in the 'flags'
field of 'struct task_struct'), so that the flag can be set by one task to
avoid doing I/O inside memory allocation in the task's context.

The patch tries to solve a deadlock problem caused by block devices, and
the problem may happen at least in the situations below:

- during block device runtime resume, if memory allocation with
  GFP_KERNEL is called inside the runtime resume callback of any one of
  its ancestors (or the block device itself), the deadlock may be
  triggered inside the memory allocation since it might not complete
  until the block device becomes active and the involved page I/O
  finishes.  The situation was first pointed out by Alan Stern.  It is
  not a good approach to convert all GFP_KERNEL[1] in the path into
  GFP_NOIO because several subsystems may be involved (for example, PCI,
  USB and SCSI may be involved for a usb mass storage device, and network
  devices are involved too in the iSCSI case)

- during block device runtime suspend, because runtime resume needs to
  wait for completion of concurrent runtime suspend.

- during error handling of a usb mass storage device, USB bus reset will
  be put on the device, so there shouldn't be any memory allocation with
  GFP_KERNEL during USB bus reset, otherwise a deadlock similar to the
  above may be triggered.  Unfortunately, any usb device may include one
  mass storage interface in theory, so it requires all usb interface
  drivers to handle the situation.  In fact, most usb drivers don't know
  how to handle bus reset on the device and don't provide .pre_reset()
  and .post_reset() callbacks at all, so USB core has to unbind and bind
  the driver for these devices.  So it is still not practical to resort
  to GFP_NOIO for solving the problem.

Also the introduced solution can be used by block subsystem or block
drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
actual I/O transfer.

It is not a good idea to convert all these GFP_KERNEL in the affected path
into GFP_NOIO because the functions doing the allocation may be implemented
as library code and will be called in many other contexts.

In fact, memalloc_noio_flags() allows some of the current static GFP_NOIO
allocations to be converted back into GFP_KERNEL in other non-affected
contexts.  At least almost all GFP_NOIO in the USB subsystem can be
converted into GFP_KERNEL after applying the approach, so that allocation
with GFP_NOIO generally only happens in runtime resume/bus reset/block I/O
transfer contexts.
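
A user-space sketch of how the three helpers could fit together is shown
below.  It rests on several assumptions: the bit values are placeholders
rather than the kernel's constants, task_flags stands in for the 'flags'
field of the current task, and only the I/O bit is masked.

```c
#include <assert.h>

/* Placeholder bit values; the kernel's real constants differ. */
#define PF_MEMALLOC_NOIO	0x1u
#define GFP_IO_BIT		0x1u
#define GFP_FS_BIT		0x2u
#define GFP_KERNEL_SKETCH	(GFP_IO_BIT | GFP_FS_BIT)

/* Stand-in for the 'flags' field of the current task. */
static unsigned int task_flags;

/* Enter a no-I/O section; returns the previous state of the flag. */
static unsigned int memalloc_noio_save(void)
{
	unsigned int old = task_flags & PF_MEMALLOC_NOIO;

	task_flags |= PF_MEMALLOC_NOIO;
	return old;
}

/* Leave a no-I/O section, putting back whatever state was saved. */
static void memalloc_noio_restore(unsigned int old)
{
	assert((old & ~PF_MEMALLOC_NOIO) == 0);
	task_flags = (task_flags & ~PF_MEMALLOC_NOIO) | old;
}

/* Called on the allocation path: drop the I/O bit while the flag is set. */
static unsigned int memalloc_noio_flags(unsigned int gfp)
{
	if (task_flags & PF_MEMALLOC_NOIO)
		gfp &= ~GFP_IO_BIT;
	return gfp;
}
```

Returning the old state from the save helper lets sections nest: an
inner restore leaves the flag set if an outer section is still active,
so callers can keep requesting GFP_KERNEL and have it narrowed only
where needed.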

[1], several GFP_KERNEL allocation examples in runtime resume path

- pci subsystem
acpi_os_allocate
<-acpi_ut_allocate
<-ACPI_ALLOCATE_ZEROED
<-acpi_evaluate_object
<-__acpi_bus_set_power
<-acpi_bus_set_power
<-acpi_pci_set_power_state
<-platform_pci_set_power_state
<-pci_platform_power_transition
<-__pci_complete_power_transition
<-pci_set_power_state
<-pci_restore_standard_config
<-pci_pm_runtime_resume
- usb subsystem
usb_get_status
<-finish_port_resume
<-usb_port_resume
<-generic_resume
<-usb_resume_device
<-usb_resume_both
<-usb_runtime_resume

- some individual usb drivers
usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....

That is just what I have found.  Unfortunately, such allocations can
currently only be found by manual inspection, and there are likely many
more not yet found, since any function in the resume path (call tree) may
allocate memory with GFP_KERNEL.

Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Oliver Neukum <oneukum@suse.de>
Cc: Jiri Kosina <jiri.kosina@suse.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Greg KH <greg@kroah.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Decotigny <david.decotigny@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: don't wait on congested zones in balance_pgdat()

Commit 92df3a72 (mm: vmscan: throttle reclaim if encountering too many
dirty pages under writeback) introduced waiting on congested zones
based on a sane algorithm in shrink_inactive_list(). What this means
is that there's no more need for throttling and additional heuristics
in balance_pgdat(). So, let's remove it and tidy up the code.

Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm-memory-failurec-fix-wrong-num_poisoned_pages-in-handling-memory-error-on-thp-fix

tweak comment

Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/memory-failure.c: fix wrong num_poisoned_pages in handling memory error on thp

num_poisoned_pages counts up the number of pages isolated by memory
errors. But for thp, only one subpage is isolated because memory error
handler splits it, so it's wrong to add (1 << compound_trans_order).

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/memory-failure.c: clean up soft_offline_page()

Currently soft_offline_page() is hard to maintain because it has many
return points and goto statements.  All of this mess comes from
get_any_page().  This function should only get page refcount as the name
implies, but it does some page isolating actions like SetPageHWPoison()
and dequeuing hugepage.  This patch corrects it and introduces some
internal subroutines to make soft offlining code more readable and
maintainable.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-failure-use-num_poisoned_pages-instead-of-mce_bad_pages-fix

fix mm/sparse.c

Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-failure: use num_poisoned_pages instead of mce_bad_pages

Since MCE is an x86 concept, and this code is in mm/, it would be better
to use the name num_poisoned_pages instead of mce_bad_pages.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-failure: do code refactor of soft_offline_page()

There are too many return points randomly intermingled with some "goto
done" return points.  So restructure the function to have two exit paths,
one for the success path and the other for the failure path.  Use
atomic_long_inc instead of atomic_long_add.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-failure: fix an error of mce_bad_pages statistics

When we run "echo paddr > /sys/devices/system/memory/soft_offline_page" to
offline a *free* page, the value of mce_bad_pages is incremented and the
page has the HWPoison flag set, but it is still managed by the page buddy
allocator.

"cat /proc/meminfo | grep HardwareCorrupted" shows the value.

If we offline the same page again, the value of mce_bad_pages is
incremented *again*, which means the value is now incorrect.  This assumes
the page is still free during this short time.

soft_offline_page()
  get_any_page()
    "else if (is_free_buddy_page(p))" branch return 0
  "goto done";
  "atomic_long_add(1, &mce_bad_pages);"

This patch:

Move the poisoned page check to the beginning of the function in order to
fix the error.
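
The effect of checking first can be shown with a small model.  It is a
sketch under assumptions: struct page is reduced to one flag,
soft_offline_sketch() and num_bad_pages are hypothetical names, and the
-EBUSY return value for an already poisoned page is an assumption of
this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct page. */
struct page {
	bool hwpoison;
};

/* Stand-in for the bad-page counter shown in /proc/meminfo. */
static long num_bad_pages;

/* Offline a page, counting it at most once. */
static int soft_offline_sketch(struct page *p)
{
	assert(p != NULL);

	/* Check up front: an already poisoned page must not be recounted. */
	if (p->hwpoison)
		return -EBUSY;

	p->hwpoison = true;
	num_bad_pages++;
	return 0;
}
```

Offlining the same free page twice then bumps the counter only on the
first call.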

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Tested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: remove MIGRATE_ISOLATE check in hotpath

Several functions test MIGRATE_ISOLATE and some of those are in hot paths,
but MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION
(i.e., CMA, memory-hotplug and memory-failure), which are not common
config options.  So let's not add unnecessary overhead and code when we
don't enable CONFIG_MEMORY_ISOLATION.
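
One way to keep such a test out of the hot path is a helper that
collapses to a constant when the option is off, roughly as sketched
below.  The reduced migrate type list and the helper name are
illustrative and may not match what the patch adds.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced migrate type list; the isolate type exists only when enabled. */
enum {
	MIGRATE_UNMOVABLE,
	MIGRATE_MOVABLE,
#ifdef CONFIG_MEMORY_ISOLATION
	MIGRATE_ISOLATE,
#endif
	MIGRATE_TYPES
};

#ifdef CONFIG_MEMORY_ISOLATION
/* Real comparison when memory isolation is built in. */
static inline bool is_migrate_isolate(int migratetype)
{
	return migratetype == MIGRATE_ISOLATE;
}
#else
/* Constant false: the compiler can drop the test and the guarded code. */
static inline bool is_migrate_isolate(int migratetype)
{
	(void)migratetype;
	return false;
}
#endif
```

Built without CONFIG_MEMORY_ISOLATION defined, every call site reduces
to "if (false)", so the branch costs nothing at run time.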

Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: increase totalram_pages when free pages allocated by bootmem allocator

Function put_page_bootmem() is used to free pages allocated by bootmem
allocator, so it should increase totalram_pages when freeing pages into
the buddy system.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: set zone->present_pages to number of existing pages in the zone

Now all users of "number of pages managed by the buddy system" have been
converted to use zone->managed_pages, so set zone->present_pages to what
it should be:

present_pages = spanned_pages - absent_pages;

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: use zone->present_pages instead of zone->managed_pages where appropriate

Now we have zone->managed_pages for "pages managed by the buddy system in
the zone", so replace zone->present_pages with zone->managed_pages if what
the user really wants is number of allocatable pages.

Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Maciej Rutecki <maciej.rutecki@gmail.com>
Cc: Chris Clayton <chris2553@googlemail.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region().

The definition of struct movablecore_map is protected by
CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region() is
not. So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of
movablecore_map in memblock_overlaps_region().

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

acpi, movablemem_map: Set numa_nodes_hotplug nodemask when using SRAT info.

We should also set movablemem_map.numa_nodes_hotplug nodemask when we
insert a hot-pluggable range in SRAT into movablemem_map.map[].

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

acpi-memory-hotplug-support-getting-hotplug-info-from-srat-fix-fix

use strcmp()

Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

acpi-memory-hotplug-support-getting-hotplug-info-from-srat-fix

mm/page_alloc.c: In function 'cmdline_parse_movablemem_map':
mm/page_alloc.c:5372: warning: comparison of distinct pointer types lacks a cast

not the right fix, but I'm tired of the warning

Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

acpi, memory-hotplug: support getting hotplug info from SRAT

We now provide an option for users who don't want to specify physical
memory addresses on the kernel command line.

        /*
         * For movablemem_map=acpi:
         *
         * SRAT:                |_____| |_____| |_________| |_________| ......
         * node id:                0       1         1           2
         * hotpluggable:           n       y         y           n
         * movablemem_map:              |_____| |_________|
         *
         * Using movablemem_map, we can prevent memblock from allocating memory
         * on ZONE_MOVABLE at boot time.
         */

So users just specify movablemem_map=acpi, and the kernel will use the
hotpluggable info in SRAT to determine which memory ranges should be set
as ZONE_MOVABLE.

NOTE: Using this option will degrade NUMA performance because the whole node
      will be set as ZONE_MOVABLE, and the kernel cannot use memory on it.
      If users don't want to lose NUMA performance, just don't use it.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

acpi-memory-hotplug-extend-movablemem_map-ranges-to-the-end-of-node-fix

clean up code, fix build warning

Cc: "Brown, Len" <len.brown@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

acpi, memory-hotplug: extend movablemem_map ranges to the end of node

When implementing movablemem_map boot option, we introduced an array
movablemem_map.map[] to store the memory ranges to be set as ZONE_MOVABLE.

Since ZONE_MOVABLE is the last zone of a node, if the user didn't specify
the whole node memory range, we need to extend it to the node end so that
we can use it to prevent memblock from allocating memory in the ranges the
user didn't specify.

We now implement movablemem_map boot option like this:
        /*
         * For movablemem_map=nn[KMG]@ss[KMG]:
         *
         * SRAT:                |_____| |_____| |_________| |_________| ......
         * node id:                0       1         1           2
         * user specified:                |__|                 |___|
         * movablemem_map:                |___| |_________|    |______| ......
         *
         * Using movablemem_map, we can prevent memblock from allocating memory
         * on ZONE_MOVABLE at boot time.
         *
         * NOTE: In this case, SRAT info will be ignored.
         */
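
A minimal sketch of the "extend to the end of node" step, assuming one
contiguous pfn span per node (the diagram above allows several ranges per
node, which this toy ignores). struct node_span, struct pfn_range and
extend_to_node_end() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open pfn range [start_pfn, end_pfn). */
struct pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/* Hypothetical description of the memory owned by one node. */
struct node_span {
	unsigned long start_pfn;
	unsigned long end_pfn;
	int nid;
};

/*
 * Extend *r so that it ends at the end of the node containing
 * r->start_pfn.  Returns that node's id, or -1 if no node contains
 * the start of the range (in which case *r is left untouched).
 */
static int extend_to_node_end(struct pfn_range *r,
			      const struct node_span *nodes, size_t nr_nodes)
{
	size_t i;

	for (i = 0; i < nr_nodes; i++) {
		if (r->start_pfn >= nodes[i].start_pfn &&
		    r->start_pfn < nodes[i].end_pfn) {
			if (r->end_pfn < nodes[i].end_pfn)
				r->end_pfn = nodes[i].end_pfn;
			return nodes[i].nid;
		}
	}
	return -1;
}
```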

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

acpi, movablemem_map: Do not zero numa_meminfo in numa_init().

early_parse_srat() is called before numa_init(), and has initialized
numa_meminfo. So do not zero numa_meminfo in numa_init(), otherwise
we will lose memory numa info.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reported-by: Li Shaohua <shli@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

acpi, memory-hotplug: parse SRAT before memblock is ready fix

allnoconfig complains:
arch/x86/kernel/setup.c: In function `setup_arch':
arch/x86/kernel/setup.c:917: error: implicit declaration of function `early_parse_srat'

because early_parse_srat is not declared for !CONFIG_ACPI. Moreover it
is defined only for CONFIG_ACPI_NUMA.

I am not sure what is the correct way to fix this but I guess that
providing an empty definition for !CONFIG_ACPI_NUMA is OK.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

acpi, memory-hotplug: parse SRAT before memblock is ready

On Linux, pages used by the kernel cannot be migrated.  As a result, if
a memory range is used by the kernel, it cannot be hot-removed.  So if we
want to hot-remove memory, we should prevent the kernel from using it.

The current way to prevent this is to specify a memory range with the
movablemem_map boot option and set it as ZONE_MOVABLE.

But when the system is booting, memblock allocates memory and reserves it
for the kernel.  memblock is already working before we parse SRAT and know
the node memory ranges, so it may allocate memory in ranges that are to be
set as ZONE_MOVABLE.  This memory can be used by the kernel and is never
freed.

So, let's parse SRAT before memblock is first called, which is early enough.

The first call of memblock_find_in_range_node() is in:
setup_arch()
|-->setup_real_mode()

so, this patch adds a function early_parse_srat() to parse SRAT, and calls
it before setup_real_mode() is called.

NOTE:

1) Do not clear numa_nodes_parsed in numa_init() because SRAT was
   parsed earlier.

2) I don't know why the original acpi_numa_init() uses the count of
   memory affinities parsed from SRAT as its return value.  So I add a
   static variable srat_mem_cnt to remember this count and use it as the
   return value of the new acpi_numa_init()

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

page_alloc: bootmem limit with movablecore_map

Ensure the bootmem will not allocate memory from areas that may be
ZONE_MOVABLE. The map info is from movablecore_map boot option.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

page_alloc: make movablemem_map have higher priority

If kernelcore or movablecore is specified at the same time with
movablemem_map, movablemem_map will have higher priority to be
satisfied. This patch will make find_zone_movable_pfns_for_nodes()
calculate zone_movable_pfn[] with the limit from zone_movable_limit[].

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bug fix: Remove the unused sanitize_zone_movable_limit() definition.

When CONFIG_HAVE_MEMBLOCK_NODE_MAP is not defined, sanitize_zone_movable_limit()
is also not used. So remove it.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Liu Jiang <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes

Introduce a new array zone_movable_limit[] to store the ZONE_MOVABLE limit
from movablemem_map boot option for all nodes. The function
sanitize_zone_movable_limit() will find out to which node the ranges in
movable_map.map[] belong, and calculate the low boundary of ZONE_MOVABLE
for each node.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Liu Jiang <jiang.liu@huawei.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Rename movablecore_map to movablemem_map.

Since "core" could be confused with cpu cores, while here it means memory,
rename the boot option movablecore_map to movablemem_map.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Lin Feng <linfeng@cn.fujitsu.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

page_alloc-add-movable_memmap-kernel-parameter-fix-fix-fix

remove unneeded parens

Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

page_alloc-add-movable_memmap-kernel-parameter-fix-fix-checkpatch-fixes

Cc: Tang Chen <tangchen@cn.fujitsu.com>
WARNING: please, no space before tabs
#48: FILE: mm/page_alloc.c:5171:
+ * ^Imovablecore_map=nn[KMG]@ss[KMG]$

total: 0 errors, 1 warnings, 39 lines checked

./patches/page_alloc-add-movable_memmap-kernel-parameter-fix-fix.patch has style problems, please review.

If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.

Please run checkpatch prior to sending patches

Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bug fix: Fix the doc format.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

page_alloc-add-movable_memmap-kernel-parameter-fix

improve comment

Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

page_alloc: add movable_memmap kernel parameter

Add functions to parse movablecore_map boot option. Since the option
could be specified more than once, all the maps will be stored in the
global variable movablecore_map.map array.

We also keep the array in monotonically increasing order by start_pfn,
and merge all overlapping ranges.
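
The sorted insert with merging described above can be sketched as
follows. This is a user-space illustration under assumptions, not the code
from the patch: MAP_MAX, struct range_map and range_map_insert() are
hypothetical names, and ranges that merely touch are merged as well.

```c
#include <assert.h>
#include <stddef.h>

#define MAP_MAX 8	/* hypothetical capacity of the map array */

/* Half-open pfn range [start_pfn, end_pfn). */
struct pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

struct range_map {
	size_t nr;
	struct pfn_range map[MAP_MAX];
};

/*
 * Insert [start, end) while keeping the array sorted by start_pfn and
 * merging every entry that overlaps or touches the new range.
 * Returns 0 on success, -1 if the range is empty or the array is full.
 */
static int range_map_insert(struct range_map *m,
			    unsigned long start, unsigned long end)
{
	size_t lo, hi, i;

	if (start >= end)
		return -1;

	/* Skip entries that end strictly before the new range starts. */
	for (lo = 0; lo < m->nr && m->map[lo].end_pfn < start; lo++)
		;

	/* Absorb entries lo..hi-1 that overlap or touch the new range. */
	for (hi = lo; hi < m->nr && m->map[hi].start_pfn <= end; hi++) {
		if (m->map[hi].start_pfn < start)
			start = m->map[hi].start_pfn;
		if (m->map[hi].end_pfn > end)
			end = m->map[hi].end_pfn;
	}

	if (hi == lo) {
		/* Nothing to merge: open a slot at index lo. */
		if (m->nr == MAP_MAX)
			return -1;
		for (i = m->nr; i > lo; i--)
			m->map[i] = m->map[i - 1];
		m->nr++;
	} else {
		/* Entries lo..hi-1 collapse into one: close the gap. */
		size_t removed = hi - lo - 1;

		for (i = hi; i < m->nr; i++)
			m->map[i - removed] = m->map[i];
		m->nr -= removed;
	}

	m->map[lo].start_pfn = start;
	m->map[lo].end_pfn = end;
	return 0;
}
```

Keeping the array sorted and disjoint means a later lookup only has to
walk the entries once to find whether a pfn falls into a movable range.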

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

x86: get pg_data_t's memory from other node

During the implementation of SRAT support, we met a problem.
In setup_arch(), we have the following call series:

1) memblock is ready;
2) some functions use memblock to allocate memory;
3) parse ACPI tables, such as SRAT.

Before 3), we don't know which memory is hotpluggable, and as a result, we
cannot prevent memblock from allocating hotpluggable memory.  So, in 2),
there could be some hotpluggable memory allocated by memblock.

Now, we are trying to parse SRAT earlier, before memblock is ready.  But I
think we need more investigation on this topic.  So in this v5, I dropped
all the SRAT support, and v5 is just the same as v3, and it is based on
3.8-rc3.

As planned, we will eventually support getting info from SRAT without
users' participation.  And we will post another patch-set to do so.

And also, I think for now, we can add this boot option as the first step of
supporting movable node. Since Linux cannot migrate the direct mapped pages,
the only way for now is to limit the whole node containing only movable memory.

Using SRAT is one way.  But even if we can use SRAT, users still need an
interface to enable/disable this functionality if they don't want to lose
their NUMA performance.  So I think, a user interface is always needed.

For now, users can disable this functionality by not specifying the boot
option.  Later, we will post SRAT support, and add another option value
"movablecore_map=acpi" to use SRAT.

This patch:

If the system can create a movable node, in which all memory of the node
is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
the node's pg_data_t.  So, use memblock_alloc_try_nid() instead of
memblock_alloc_nid() to retry when the first allocation fails.
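
The "try the preferred node, then fall back" idea can be sketched with a
toy allocator. Nothing here is the memblock API: struct toy_memblock,
toy_alloc_nid() and toy_alloc_try_nid() are hypothetical, and a movable
node is modelled simply as a node with no kernel-usable memory.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES	4
#define ALLOC_FAILED	(-1)

/* Hypothetical toy: bytes still usable by the kernel on each node. */
struct toy_memblock {
	size_t kernel_free[MAX_NODES];
};

/* Allocate on exactly one node; returns the node id or ALLOC_FAILED. */
static int toy_alloc_nid(struct toy_memblock *mb, size_t size, int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);
	if (mb->kernel_free[nid] < size)
		return ALLOC_FAILED;
	mb->kernel_free[nid] -= size;
	return nid;
}

/* Try the preferred node first, then retry on every other node. */
static int toy_alloc_try_nid(struct toy_memblock *mb, size_t size, int nid)
{
	int got = toy_alloc_nid(mb, size, nid);
	int n;

	for (n = 0; got == ALLOC_FAILED && n < MAX_NODES; n++)
		got = toy_alloc_nid(mb, size, n);
	return got;
}
```

In this model, the per-node descriptor of a movable-only node ends up on
a different node instead of failing outright.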

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

sched: do not use cpu_to_node() to find an offlined cpu's node.

If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu) will
return -1.  As a result, cpumask_of_node(nid) will return NULL.  In this
case, find_next_bit() in for_each_cpu will get a NULL pointer and cause
panic.

Here is a call trace:
[  609.824017] Call Trace:
[  609.824017]  <IRQ>
[  609.824017]  [<ffffffff810b0721>] select_fallback_rq+0x71/0x190
[  609.824017]  [<ffffffff810b086e>] ? try_to_wake_up+0x2e/0x2f0
[  609.824017]  [<ffffffff810b0b0b>] try_to_wake_up+0x2cb/0x2f0
[  609.824017]  [<ffffffff8109da08>] ? __run_hrtimer+0x78/0x320
[  609.824017]  [<ffffffff810b0b85>] wake_up_process+0x15/0x20
[  609.824017]  [<ffffffff8109ce62>] hrtimer_wakeup+0x22/0x30
[  609.824017]  [<ffffffff8109da13>] __run_hrtimer+0x83/0x320
[  609.824017]  [<ffffffff8109ce40>] ? update_rmtp+0x80/0x80
[  609.824017]  [<ffffffff8109df56>] hrtimer_interrupt+0x106/0x280
[  609.824017]  [<ffffffff810a72c8>] ? sd_free_ctl_entry+0x68/0x70
[  609.824017]  [<ffffffff8167cf39>] smp_apic_timer_interrupt+0x69/0x99
[  609.824017]  [<ffffffff8167be2f>] apic_timer_interrupt+0x6f/0x80

There is a hrtimer process sleeping, whose cpu has already been offlined.
When it is woken up, it tries to find another cpu to run on, and gets a -1
nid.  As a result, cpumask_of_node(-1) returns NULL, and causes a kernel
panic.

This patch fixes this problem by checking whether the nid is -1.  If nid
is not -1, a cpu on the same node will be picked.  Otherwise, an online cpu
on another node will be picked.
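
The selection rule above can be sketched in user-space C. This is a toy
model, not select_fallback_rq(): struct cpu_state, NR_CPUS and
pick_fallback_cpu() are hypothetical, and scheduler details such as
affinity masks are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS		4
#define NUMA_NO_NODE	(-1)

/* Hypothetical per-cpu state: node id and online flag. */
struct cpu_state {
	int nid;
	bool online;
};

/*
 * Pick a cpu for a task whose previous cpu 'prev' has gone offline.
 * The node-local search runs only when the node id is valid, so a
 * cleared (-1) node id is never used as a lookup key.
 * Returns the chosen cpu, or -1 if no cpu is online.
 */
static int pick_fallback_cpu(const struct cpu_state cpus[NR_CPUS], int prev)
{
	int nid = cpus[prev].nid;
	int cpu;

	if (nid != NUMA_NO_NODE) {
		/* Prefer an online cpu on the same node. */
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (cpus[cpu].online && cpus[cpu].nid == nid)
				return cpu;
	}

	/* Otherwise take any online cpu, whatever its node. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpus[cpu].online)
			return cpu;

	return -1;
}
```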

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cpu-hotplugmemory-hotplug-clear-cpu_to_node-when-offlining-the-node-fix

numa_clear_node() and numa_set_node() can no longer be __cpuinit.

WARNING: vmlinux.o(.text+0x222702): Section mismatch in reference from the function check_and_unmap_cpu_on_node() to the function .cpuinit.text:numa_clear_node()
The function check_and_unmap_cpu_on_node() references
the function __cpuinit numa_clear_node().
This is often because check_and_unmap_cpu_on_node lacks a __cpuinit
annotation or the annotation of numa_clear_node is wrong.

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node

When the node is offlined, there is no memory/cpu on the node. If a
sleeping task ran on a cpu of this node, it will be migrated to a cpu on
another node. So we can clear the cpu-to-node mapping.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cpu-hotplug, memory-hotplug: try offlining the node when hotremoving a cpu

The node will be offlined when all memory/cpu on the node is hotremoved.
So we should try to offline the node when hotremoving a cpu on the node.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: export the function try_offline_node() fix

"memory-hotplug: export the function try_offline_node()" declares
try_offline_node() for CONFIG_MEMORY_HOTPLUG, but this function is only
defined for CONFIG_MEMORY_HOTREMOVE:

ERROR: "try_offline_node" [drivers/acpi/processor.ko] undefined!

Fix the build by defining it appropriately.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: export the function try_offline_node()

try_offline_node() will be needed in the tristate
drivers/acpi/processor_driver.c.

The node will be offlined when all memory/cpu on the node have been
hotremoved.  So we need the function try_offline_node() in cpu-hotplug
path.

If memory-hotplug is disabled and cpu-hotplug is enabled:
1. no memory on the node
   we don't online the node, and cpu's node is the nearest node.
2. the node contains some memory
   the node has been onlined, and cpu's node is still needed
   to migrate the sleeping task on the cpu to the same node.

So we do nothing in try_offline_node() in this case.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Len Brown <lenb@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cpu_hotplug-clear-apicid-to-node-when-the-cpu-is-hotremoved-fix

fix section error

__apicid_to_node can no longer be __cpuinit as it is referred to from
acpi_unmap_lsapic().

>> WARNING: vmlinux.o(.text+0x43773): Section mismatch in reference from the function acpi_unmap_lsapic() to the variable .cpuinit.data:__apicid_to_node
   The function acpi_unmap_lsapic() references
   the variable __cpuinitdata __apicid_to_node.
   This is often because acpi_unmap_lsapic lacks a __cpuinitdata
   annotation or the annotation of __apicid_to_node is wrong.

Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

cpu_hotplug: clear apicid to node when the cpu is hotremoved

When a cpu is hotplugged, we call acpi_map_cpu2node() in _acpi_map_lsapic()
to store the cpu's node and apicid's node.  But we don't clear the cpu's
node in acpi_unmap_lsapic() when this cpu is hotremoved.  If the node is
also hotremoved, we will get the following messages:

[ 1646.771485] kernel BUG at include/linux/gfp.h:329!
[ 1646.828729] invalid opcode: 0000 [#1] SMP
[ 1646.877872] Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core sg acpi_memhotplug igb dca sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
[ 1647.588773] Pid: 3126, comm: init Not tainted 3.6.0-rc3-tangchen-hostbridge+ #13 FUJITSU-SV PRIMEQUEST 1800E/SB
[ 1647.711545] RIP: 0010:[<ffffffff811bc3fd>]  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
[ 1647.810492] RSP: 0018:ffff88078a049cf8  EFLAGS: 00010246
[ 1647.874028] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
[ 1647.959339] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000246
[ 1648.044659] RBP: ffff88078a049d38 R08: 00000000000040d0 R09: 0000000000000001
[ 1648.129953] R10: 0000000000000000 R11: 0000000000000b5f R12: 00000000000052d0
[ 1648.215259] R13: ffff8807c1417300 R14: 0000000000030038 R15: 0000000000000003
[ 1648.300572] FS:  00007fa9b1b44700(0000) GS:ffff8807c3800000(0000) knlGS:0000000000000000
[ 1648.397272] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1648.465985] CR2: 00007fa9b09acca0 CR3: 000000078b855000 CR4: 00000000000007e0
[ 1648.551265] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1648.636565] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1648.721838] Process init (pid: 3126, threadinfo ffff88078a048000, task ffff8807bb6f2650)
[ 1648.818534] Stack:
[ 1648.842548]  ffff8807c39d7fa0 ffffffff000040d0 00000000000000bb 00000000000080d0
[ 1648.931469]  ffff8807c1417300 ffff8807c39d7fa0 ffff8807c1417300 0000000000000001
[ 1649.020410]  ffff88078a049d88 ffffffff811bc4a0 ffff8807c1410c80 0000000000000000
[ 1649.109464] Call Trace:
[ 1649.138713]  [<ffffffff811bc4a0>] new_slab+0x30/0x1b0
[ 1649.199075]  [<ffffffff811bc978>] __slab_alloc+0x358/0x4c0
[ 1649.264683]  [<ffffffff810b71c0>] ? alloc_fair_sched_group+0xd0/0x1b0
[ 1649.341695]  [<ffffffff811be7d4>] kmem_cache_alloc_node_trace+0xb4/0x1e0
[ 1649.421824]  [<ffffffff8109d188>] ? hrtimer_init+0x48/0x100
[ 1649.488414]  [<ffffffff810b71c0>] ? alloc_fair_sched_group+0xd0/0x1b0
[ 1649.565402]  [<ffffffff810b71c0>] alloc_fair_sched_group+0xd0/0x1b0
[ 1649.640297]  [<ffffffff810a8bce>] sched_create_group+0x3e/0x110
[ 1649.711040]  [<ffffffff810bdbcd>] sched_autogroup_create_attach+0x4d/0x180
[ 1649.793260]  [<ffffffff81089614>] sys_setsid+0xd4/0xf0
[ 1649.854694]  [<ffffffff8167a029>] system_call_fastpath+0x16/0x1b
[ 1649.926483] Code: 89 c4 e9 73 fe ff ff 31 c0 89 de 48 c7 c7 45 de 9e 81 44 89 45 c8 e8 22 05 4b 00 85 db 44 8b 45 c8 0f 89 4f ff ff ff 0f 0b eb fe <0f> 0b 90 eb fd 0f 0b eb fe 89 de 48 c7 c7 45 de 9e 81 31 c0 44
[ 1650.161454] RIP  [<ffffffff811bc3fd>] allocate_slab+0x28d/0x300
[ 1650.232348]  RSP <ffff88078a049cf8>
[ 1650.274029] ---[ end trace adf84c90f3fea3e5 ]---

The reason is that the cpu's node is not NUMA_NO_NODE, so we call
alloc_pages_exact_node() to allocate memory on the node, but the node is
offlined.

If the node is onlined, we still need the cpu's node.  For example: a task
on the cpu is sleeping when the cpu is hotremoved.  We will choose another
cpu to run this task when it is woken up.  If we know the cpu's node, we
will choose a cpu on the same node first.  So we should clear the
cpu-to-node mapping when the node is offlined.

This patch only clears apicid-to-node mapping when the cpu is hotremoved.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mempolicy: fix is_valid_nodemask()

is_valid_nodemask() was introduced by 19770b32 ("mm: filter based on a
nodemask as well as a gfp_mask"), but it does not match its comments,
because it does not check zones whose index is greater than policy_zone.

Also, b377fd ("Apply memory policies to top two highest zones when
highest zone is ZONE_MOVABLE") tells us that if the highest zone is
ZONE_MOVABLE, we should also apply memory policies to it, so ZONE_MOVABLE
should be a valid zone for policies.  is_valid_nodemask() needs to be
changed to match.

Fix: check all zones, even those whose zoneid > policy_zone.  Use
nodes_intersects() instead of open-coding the check.

Reported-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: consider compound pages when free memmap

usemap could also be allocated as compound pages, so we should also
consider compound pages when freeing memmap.

If we don't fix it, there could be problems when we free vmemmap
pagetables which are stored in compound pages.  The old pagetables will
not be freed properly, and when we add the memory again, no new pagetable
will be created.  The old pagetable entry is then used, and the kernel
will panic.

The call trace is like the following:

[  691.175487] BUG: unable to handle kernel paging request at ffffea0040000000
[  691.258872] IP: [<ffffffff816a483f>] sparse_add_one_section+0xef/0x166
[  691.336971] PGD 7ff7d4067 PUD 78e035067 PMD 78e11d067 PTE 0
[  691.403952] Oops: 0002 [#1] SMP
[  691.442695] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg lpc_ich mfd_core i2c_i801 i2c_core i7core_edac edac_core ioatdma e1000e igb dca ptp pps_core sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
[  692.042726] CPU 0
[  692.064641] Pid: 4, comm: kworker/0:0 Tainted: G        W 3.8.0-rc3-phy-hot-remove+ #3 FUJITSU-SV PRIMEQUEST 1800E/SB
[  692.196723] RIP: 0010:[<ffffffff816a483f>]  [<ffffffff816a483f>] sparse_add_one_section+0xef/0x166
[  692.303885] RSP: 0018:ffff8807bdcb35d8  EFLAGS: 00010006
[  692.367331] RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000200000
[  692.452578] RDX: ffff88078df01148 RSI: 0000000000000282 RDI: ffffea0040000000
[  692.537822] RBP: ffff8807bdcb3618 R08: 4cf05005b019467a R09: 0cd98fa09631467a
[  692.623071] R10: 0000000000000000 R11: 0000000000030e20 R12: 0000000000008000
[  692.708319] R13: ffffea0040000000 R14: ffff88078df66248 R15: ffff88078ea13b10
[  692.793562] FS:  0000000000000000(0000) GS:ffff8807c1a00000(0000) knlGS:0000000000000000
[  692.890233] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  692.958870] CR2: ffffea0040000000 CR3: 0000000001c0c000 CR4: 00000000000007f0
[  693.044119] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  693.129367] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  693.214617] Process kworker/0:0 (pid: 4, threadinfo ffff8807bdcb2000, task ffff8807bde18000)
[  693.315437] Stack:
[  693.339421]  0000000000000000 0000000000000282 0000000000000000 ffff88078df40f00
[  693.428208]  0000000000000001 0000000000000200 00000000000002ff 0000000000000200
[  693.516981]  ffff8807bdcb3668 ffffffff816940e5 0000000000000000 0000000001000000
[  693.605761] Call Trace:
[  693.634949]  [<ffffffff816940e5>] __add_pages+0x85/0x120
[  693.698398]  [<ffffffff8104f1d1>] arch_add_memory+0x71/0xf0
[  693.764960]  [<ffffffff81079bff>] ? request_resource_conflict+0x8f/0xa0
[  693.843982]  [<ffffffff81694796>] add_memory+0xd6/0x1f0
[  693.906393]  [<ffffffff814044df>] acpi_memory_device_add+0x170/0x20c
[  693.982302]  [<ffffffff813c1de2>] acpi_device_probe+0x50/0x18a
[  694.051977]  [<ffffffff8125a9d3>] ? sysfs_create_link+0x13/0x20
[  694.122691]  [<ffffffff8146c31c>] really_probe+0x6c/0x320
[  694.187170]  [<ffffffff8146c617>] driver_probe_device+0x47/0xa0
[  694.257885]  [<ffffffff8146c720>] ? __driver_attach+0xb0/0xb0
[  694.326521]  [<ffffffff8146c720>] ? __driver_attach+0xb0/0xb0
[  694.395157]  [<ffffffff8146c773>] __device_attach+0x53/0x60
[  694.461719]  [<ffffffff8146a34c>] bus_for_each_drv+0x6c/0xa0
[  694.529316]  [<ffffffff8146c298>] device_attach+0xa8/0xc0
[  694.593799]  [<ffffffff8146af70>] bus_probe_device+0xb0/0xe0
[  694.661398]  [<ffffffff814699c1>] device_add+0x301/0x570
[  694.724842]  [<ffffffff81469c4e>] device_register+0x1e/0x30
[  694.791403]  [<ffffffff813c354a>] acpi_device_register+0x1d8/0x27c
[  694.865230]  [<ffffffff813c37cd>] acpi_add_single_object+0x1df/0x2b9
[  694.941140]  [<ffffffff813fa078>] ? acpi_ut_release_mutex+0xac/0xb5
[  695.016009]  [<ffffffff813c39b9>] acpi_bus_check_add+0x112/0x18f
[  695.087764]  [<ffffffff810df61d>] ? trace_hardirqs_on+0xd/0x10
[  695.157445]  [<ffffffff810a1b0f>] ? up+0x2f/0x50
[  695.212585]  [<ffffffff813bdddb>] ? acpi_os_signal_semaphore+0x6b/0x74
[  695.290573]  [<ffffffff813ec519>] acpi_ns_walk_namespace+0x105/0x255
[  695.366478]  [<ffffffff813c38a7>] ? acpi_add_single_object+0x2b9/0x2b9
[  695.444459]  [<ffffffff813c38a7>] ? acpi_add_single_object+0x2b9/0x2b9
[  695.522439]  [<ffffffff813ecb6c>] acpi_walk_namespace+0xcf/0x118
[  695.594190]  [<ffffffff813c3a91>] acpi_bus_scan+0x5b/0x7c
[  695.658676]  [<ffffffff813c3b1e>] acpi_bus_add+0x2a/0x2c
[  695.722121]  [<ffffffff81402905>] container_notify_cb+0x112/0x1a9
[  695.794914]  [<ffffffff813d5859>] acpi_ev_notify_dispatch+0x46/0x61
[  695.869781]  [<ffffffff813be072>] acpi_os_execute_deferred+0x27/0x34
[  695.945687]  [<ffffffff81091c6e>] process_one_work+0x20e/0x5c0
[  696.015361]  [<ffffffff81091bff>] ? process_one_work+0x19f/0x5c0
[  696.087113]  [<ffffffff813be04b>] ? acpi_os_wait_events_complete+0x23/0x23
[  696.169248]  [<ffffffff81093d0e>] worker_thread+0x12e/0x370
[  696.235807]  [<ffffffff81093be0>] ? manage_workers+0x180/0x180
[  696.305485]  [<ffffffff81099e4e>] kthread+0xee/0x100
[  696.364773]  [<ffffffff810e1179>] ? __lock_release+0x129/0x190
[  696.434450]  [<ffffffff81099d60>] ? __init_kthread_worker+0x70/0x70
[  696.509317]  [<ffffffff816b34ac>] ret_from_fork+0x7c/0xb0
[  696.573799]  [<ffffffff81099d60>] ? __init_kthread_worker+0x70/0x70
[  696.648662] Code: 00 00 48 89 df 48 89 45 c8 e8 3e 71 b1 ff 48 89 c2 48 8b 75 c8 b8 ef ff ff ff f6 02 01 75 4b 49 63 cc 31 c0 4c 89 ef 48 c1 e1 06 <f3> aa 48 8b 02 48 83 c8 01 48 85 d2 48 89 02 74 29 a8 01 74 25
[  696.880997] RIP  [<ffffffff816a483f>] sparse_add_one_section+0xef/0x166
[  696.960128]  RSP <ffff8807bdcb35d8>
[  697.001768] CR2: ffffea0040000000
[  697.041336] ---[ end trace e7f94e3a34c442d4 ]---
[  697.096474] Kernel panic - not syncing: Fatal exception

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug-do-not-allocate-pdgat-if-it-was-not-freed-when-offline-fix-fix

fix the warning again again

Tested-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug-do-not-allocate-pdgat-if-it-was-not-freed-when-offline-fix

fix warning when CONFIG_NEED_MULTIPLE_NODES=n

Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: do not allocate pgdat if it was not freed when offline.

Since there is no way to guarantee that the address of pgdat/zone is not on
the stack of any kernel thread or used by other kernel objects without
reference counting or another synchronizing method, we cannot reset
node_data and free pgdat when offlining a node.  Just reset pgdat to 0 and
reuse the memory when the node is online again.

The problem was pointed out by Kamezawa Hiroyuki.  The idea is from Wen
Congyang.

NOTE: If we don't reset pgdat to 0, the WARN_ON in free_area_init_node()
      will be triggered.
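
The reset-and-reuse idea can be sketched in user space as follows.  This is
only an illustration under stated assumptions: struct sketch_pgdat, the
sketch_node_data[] array and both helpers are hypothetical names, and the
real pgdat handling in the kernel is considerably more involved.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_NODES 4          /* hypothetical node limit */

/* Hypothetical, heavily reduced stand-in for a per-node descriptor. */
struct sketch_pgdat {
    int node_id;
    unsigned long start_pfn;
    unsigned long spanned_pages;
};

static struct sketch_pgdat *sketch_node_data[SKETCH_MAX_NODES];

/* Online a node: allocate the descriptor only if none was kept around. */
static struct sketch_pgdat *sketch_online_node(int nid, unsigned long start_pfn)
{
    assert(nid >= 0 && nid < SKETCH_MAX_NODES);

    struct sketch_pgdat *pgdat = sketch_node_data[nid];

    if (!pgdat) {
        pgdat = calloc(1, sizeof(*pgdat));
        if (!pgdat)
            return NULL;
        sketch_node_data[nid] = pgdat;
    }
    pgdat->node_id = nid;
    pgdat->start_pfn = start_pfn;
    return pgdat;
}

/* Offline a node: zero the descriptor but keep the storage and the pointer,
 * since other code may still hold the address. */
static void sketch_offline_node(int nid)
{
    assert(nid >= 0 && nid < SKETCH_MAX_NODES);

    if (sketch_node_data[nid])
        memset(sketch_node_data[nid], 0, sizeof(*sketch_node_data[nid]));
}
```

Onlining the same node twice with an offline in between returns the same
address both times, which is the property the commit relies on.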

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: free node_data when a node is offlined

We call hotadd_new_pgdat() to allocate memory to store node_data. So we
should free it when removing a node.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: remove sysfs file of node

Introduce a new function try_offline_node() to remove sysfs file of node
when all memory sections of this node are removed. If some memory
sections of this node are not removed, this function does nothing.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory_hotplug: clear zone when removing the memory

When memory is added, we update zone's and pgdat's start_pfn and
spanned_pages in __add_zone(). So we should revert them when the memory
is removed.

The patch adds a new function __remove_zone() to do this.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP.

Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even
if we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug-remove-memmap-of-sparse-vmemmap-fix

Defconfig for x86_64 complains:
arch/x86/mm/init_64.c: In function `vmemmap_free':
arch/x86/mm/init_64.c:1317: error: implicit declaration of function `remove_pagetable'

vmemmap_free is only used for CONFIG_MEMORY_HOTPLUG so let's move it
inside ifdef

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: remove memmap of sparse-vmemmap

Introduce a new API vmemmap_free() to free and remove vmemmap pagetables.
Since pagetable implementations differ, each architecture has to provide
its own version of vmemmap_free(), just like vmemmap_populate().

Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug-remove-page-table-of-x86_64-architecture-fix

make kernel_physical_mapping_remove() static

Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: remove page table of x86_64 architecture

Search the page tables for the removed memory and clear the corresponding
entries, for the x86_64 architecture.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix-fix-fix-fix

fix used-uninitialised bug

Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix-fix-fix

Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bug fix: Do not split pages when freeing pagetable pages.

The old way we free pagetable pages was wrong.

The old way is:
When we got a hugepage, we split it into smaller pages. And sometimes,
we only need to free some of the smaller pages because the others are
still in use. And the whole larger page will be freed if all the smaller
pages are not in use. All these were done in remove_pmd/pud_table().

But there is no way to know if the larger page has been split. As a result,
there is no way to decide when to split.

Actually, there is no need to split larger pages into smaller ones.

We do it in the following new way:
1) For direct mapped pages, all the pages were freed when they were offlined.
   And since memory offline is done section by section, all the memory ranges
   being removed are aligned to PAGE_SIZE. So we only need to deal with
   unaligned pages when freeing vmemmap pages.

2) For vmemmap pages being used to store struct page, if part of the larger
   page is still in use, just fill the unused part with 0xFD. And when the
   whole page is filled with 0xFD, then free the larger page.
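
The marker scheme in 2) can be sketched in user space like this.  It is a
minimal illustration, not the kernel code: the SKETCH_* constants and both
helper names are hypothetical, the page size is assumed to be 4096 bytes,
and the sketch ignores the possibility that live data happens to consist
entirely of 0xFD bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE  4096u  /* assumed size of the backing page */
#define SKETCH_PAGE_INUSE 0xFD   /* marker byte named in the commit message */

/* Mark [offset, offset + len) of a backing page as no longer used. */
static void sketch_mark_unused(unsigned char *page, size_t offset, size_t len)
{
    assert(offset <= SKETCH_PAGE_SIZE && len <= SKETCH_PAGE_SIZE - offset);
    memset(page + offset, SKETCH_PAGE_INUSE, len);
}

/* True once every byte of the backing page carries the marker, i.e. the
 * whole page can be freed in one go instead of piece by piece. */
static bool sketch_page_fully_unused(const unsigned char *page)
{
    for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++)
        if (page[i] != SKETCH_PAGE_INUSE)
            return false;
    return true;
}
```

A caller would mark each removed range and free the backing page only when
sketch_page_fully_unused() reports true, so the larger page is never split.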

This problem is caused by the following related patches:
memory-hotplug-common-apis-to-support-page-tables-hot-remove.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix.patch

This patch will fix the problem based on the above patches.

Reported-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bug fix: Do not free page split from hugepage one by one.

When we split a larger page into smaller ones, we should not free them one
by one because only the _count of the first page struct makes sense.
Otherwise, the kernel will panic.

So fill the unused small pages with 0xFD, and when the whole larger
page is filled with 0xFD, free the whole larger page.

The call trace is like the following:

[ 1052.819430] ------------[ cut here ]------------
[ 1052.874575] kernel BUG at include/linux/mm.h:278!
[ 1052.930754] invalid opcode: 0000 [#1] SMP
[ 1052.979888] Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core igb dca ptp pps_core sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
[ 1053.580111] CPU 0
[ 1053.602026] Pid: 4, comm: kworker/0:0 Tainted: G        W    3.8.0-rc2-memory-hotremove+ #3 FUJITSU-SV PRIMEQUEST 1800E/SB
[ 1053.736188] RIP: 0010:[<ffffffff81175bd7>]  [<ffffffff81175bd7>] __free_pages+0x37/0x50
[ 1053.831952] RSP: 0018:ffff8807bdcb37f8  EFLAGS: 00010246
[ 1053.895403] RAX: 0000000000000000 RBX: ffff88077c401000 RCX: 000000000000002c
[ 1053.980660] RDX: ffff8807fffd7000 RSI: 0000000000000000 RDI: ffffea001df10040
[ 1054.065917] RBP: ffff8807bdcb37f8 R08: 0000000000000000 R09: 0000000000000000
[ 1054.151178] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 1054.236433] R13: ffffea0040008000 R14: ffffea0040002000 R15: 00003ffffffff000
[ 1054.321691] FS:  0000000000000000(0000) GS:ffff8807c1a00000(0000) knlGS:0000000000000000
[ 1054.418372] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1054.487015] CR2: 00007fbc137ad000 CR3: 0000000001c0c000 CR4: 00000000000007f0
[ 1054.572272] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1054.657533] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1054.742797] Process kworker/0:0 (pid: 4, threadinfo ffff8807bdcb2000, task ffff8807bde18000)
[ 1054.843628] Stack:
[ 1054.867622]  ffff8807bdcb3818 ffffffff81175ccf 000000077c401000 ffff880796349008
[ 1054.956421]  ffff8807bdcb3848 ffffffff816a16e2 ffffea0040001000 ffff880796349008
[ 1055.045227]  ffffea0040008000 ffffea0040002000 ffff8807bdcb38b8 ffffffff816a181c
[ 1055.134033] Call Trace:
[ 1055.163230]  [<ffffffff81175ccf>] free_pages+0x5f/0x70
[ 1055.224611]  [<ffffffff816a16e2>] free_pagetable+0x7f/0xee
[ 1055.290147]  [<ffffffff816a181c>] remove_pte_table+0xcb/0x1cd
[ 1055.358806]  [<ffffffff81055bd0>] ? leave_mm+0x50/0x50
[ 1055.420187]  [<ffffffff810df55d>] ? trace_hardirqs_on+0xd/0x10
[ 1055.489882]  [<ffffffff816a1f8b>] remove_pmd_table+0x191/0x253
[ 1055.559576]  [<ffffffff816a261e>] remove_pud_table+0x194/0x24d
[ 1055.629270]  [<ffffffff811bbc3f>] ? sparse_remove_one_section+0x2f/0x150
[ 1055.709348]  [<ffffffff816a278c>] remove_pagetable+0xb5/0x17c
[ 1055.778002]  [<ffffffff81692f28>] vmemmap_free+0x18/0x20
[ 1055.841465]  [<ffffffff811bbd15>] sparse_remove_one_section+0x105/0x150
[ 1055.920508]  [<ffffffff811c953c>] __remove_pages+0xec/0x110
[ 1055.987087]  [<ffffffff81692fa7>] arch_remove_memory+0x77/0xc0
[ 1056.056781]  [<ffffffff81694138>] remove_memory+0xb8/0xf0
[ 1056.121284]  [<ffffffff814040aa>] acpi_memory_device_remove+0x76/0xbc
[ 1056.198244]  [<ffffffff813c1e50>] acpi_device_remove+0x90/0xb2
[ 1056.267941]  [<ffffffff8146bf3c>] __device_release_driver+0x7c/0xf0
[ 1056.342824]  [<ffffffff8146c0bf>] device_release_driver+0x2f/0x50
[ 1056.415635]  [<ffffffff813c3142>] acpi_bus_remove+0x32/0x6d
[ 1056.482215]  [<ffffffff813c320e>] acpi_bus_trim+0x91/0x102
[ 1056.547755]  [<ffffffff813c3307>] acpi_bus_hot_remove_device+0x88/0x180
[ 1056.626794]  [<ffffffff813be152>] acpi_os_execute_deferred+0x27/0x34
[ 1056.702717]  [<ffffffff81091c5e>] process_one_work+0x20e/0x5c0
[ 1056.772411]  [<ffffffff81091bef>] ? process_one_work+0x19f/0x5c0
[ 1056.844184]  [<ffffffff813be12b>] ? acpi_os_wait_events_complete+0x23/0x23
[ 1056.926337]  [<ffffffff81093cfe>] worker_thread+0x12e/0x370
[ 1056.992908]  [<ffffffff81093bd0>] ? manage_workers+0x180/0x180
[ 1057.062602]  [<ffffffff81099e3e>] kthread+0xee/0x100
[ 1057.121913]  [<ffffffff810e10b9>] ? __lock_release+0x129/0x190
[ 1057.191609]  [<ffffffff81099d50>] ? __init_kthread_worker+0x70/0x70
[ 1057.266494]  [<ffffffff816b33ac>] ret_from_fork+0x7c/0xb0
[ 1057.330992]  [<ffffffff81099d50>] ? __init_kthread_worker+0x70/0x70
[ 1057.405863] Code: 85 c0 74 27 f0 ff 4f 1c 0f 94 c0 84 c0 74 0a 85 f6 74 11 90 e8 bb e3 ff ff c9 c3 66 0f 1f 84 00 00 00 00 00 e8 1b fe ff ff c9 c3 <0f> 0b 0f 1f 80 00 00 00 00 eb f7 66 66 66 66 66 2e 0f 1f 84 00
[ 1057.638271] RIP  [<ffffffff81175bd7>] __free_pages+0x37/0x50
[ 1057.705994]  RSP <ffff8807bdcb37f8>
[ 1057.747882] ---[ end trace 0f90e1e054d174f9 ]---
[ 1057.803158] Kernel panic - not syncing: Fatal exception

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bug fix: Do not free direct mapping pages twice.

Direct mapped pages were freed when they were offlined, or they were not
allocated. So we only need to free vmemmap pages, no need to free direct
mapped pages.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bug fix: Do not calculate direct mapping pages when freeing vmemmap pagetables.

We only need to update direct_pages_count[level] when freeing direct mapped
pagetables.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix

fix typo in comment

Cc: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: common APIs to support page tables hot-remove

When memory is removed, the corresponding pagetables should also be
removed.  This patch introduces some common APIs to support removing
vmemmap pagetables and x86_64 architecture pagetables.

Not all pages backing the virtual mapping of the removed memory can be
freed if a page used as PGD/PUD covers not only the removed memory but also
other memory.  So the patch checks whether a page can be freed as follows.

1. When removing memory, the page structs of the removed memory are filled
    with 0xFD.
2. When all page structs on a PT/PMD are filled with 0xFD, the PT/PMD can be
    cleared.  In this case, the page used as PT/PMD can be freed.

Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section()

In __remove_section(), we locked pgdat_resize_lock when calling
sparse_remove_one_section().  This lock will disable irq.  But we don't
need to lock the whole function.  If we do some work to free pagetables in
free_section_usemap(), we need to call flush_tlb_all(), which needs irqs
enabled.  Otherwise the WARN_ON_ONCE() in smp_call_function_many() will be
triggered.

If we lock the whole sparse_remove_one_section(), then we come to this call trace:

[  454.796248] ------------[ cut here ]------------
[  454.851408] WARNING: at kernel/smp.c:461 smp_call_function_many+0xbd/0x260()
[  454.935620] Hardware name: PRIMEQUEST 1800E
......
[  455.652201] Call Trace:
[  455.681391]  [<ffffffff8106e73f>] warn_slowpath_common+0x7f/0xc0
[  455.753151]  [<ffffffff810560a0>] ? leave_mm+0x50/0x50
[  455.814527]  [<ffffffff8106e79a>] warn_slowpath_null+0x1a/0x20
[  455.884208]  [<ffffffff810e7a9d>] smp_call_function_many+0xbd/0x260
[  455.959082]  [<ffffffff810e7ecb>] smp_call_function+0x3b/0x50
[  456.027722]  [<ffffffff810560a0>] ? leave_mm+0x50/0x50
[  456.089098]  [<ffffffff810e7f4b>] on_each_cpu+0x3b/0xc0
[  456.151512]  [<ffffffff81055f0c>] flush_tlb_all+0x1c/0x20
[  456.216004]  [<ffffffff8104f8de>] remove_pagetable+0x14e/0x1d0
[  456.285683]  [<ffffffff8104f978>] vmemmap_free+0x18/0x20
[  456.349139]  [<ffffffff811b8797>] sparse_remove_one_section+0xf7/0x100
[  456.427126]  [<ffffffff811c5fc2>] __remove_section+0xa2/0xb0
[  456.494726]  [<ffffffff811c6070>] __remove_pages+0xa0/0xd0
[  456.560258]  [<ffffffff81669c7b>] arch_remove_memory+0x6b/0xc0
[  456.629937]  [<ffffffff8166ad28>] remove_memory+0xb8/0xf0
[  456.694431]  [<ffffffff813e686f>] acpi_memory_device_remove+0x53/0x96
[  456.771379]  [<ffffffff813b33c4>] acpi_device_remove+0x90/0xb2
[  456.841059]  [<ffffffff8144b02c>] __device_release_driver+0x7c/0xf0
[  456.915928]  [<ffffffff8144b1af>] device_release_driver+0x2f/0x50
[  456.988719]  [<ffffffff813b4476>] acpi_bus_remove+0x32/0x6d
[  457.055285]  [<ffffffff813b4542>] acpi_bus_trim+0x91/0x102
[  457.120814]  [<ffffffff813b463b>] acpi_bus_hot_remove_device+0x88/0x16b
[  457.199840]  [<ffffffff813afda7>] acpi_os_execute_deferred+0x27/0x34
[  457.275756]  [<ffffffff81091ece>] process_one_work+0x20e/0x5c0
[  457.345434]  [<ffffffff81091e5f>] ? process_one_work+0x19f/0x5c0
[  457.417190]  [<ffffffff813afd80>] ? acpi_os_wait_events_complete+0x23/0x23
[  457.499332]  [<ffffffff81093f6e>] worker_thread+0x12e/0x370
[  457.565896]  [<ffffffff81093e40>] ? manage_workers+0x180/0x180
[  457.635574]  [<ffffffff8109a09e>] kthread+0xee/0x100
[  457.694871]  [<ffffffff810dfaf9>] ? __lock_release+0x129/0x190
[  457.764552]  [<ffffffff81099fb0>] ? __init_kthread_worker+0x70/0x70
[  457.839427]  [<ffffffff81690aac>] ret_from_fork+0x7c/0xb0
[  457.903914]  [<ffffffff81099fb0>] ? __init_kthread_worker+0x70/0x70
[  457.978784] ---[ end trace 25e85300f542aa01 ]---

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: mm/Kconfig: move auto selects from MEMORY_HOTPLUG to MEMORY_HOTREMOVE as needed

Since we have 2 config options, MEMORY_HOTPLUG and MEMORY_HOTREMOVE, used
for memory hot-add and hot-remove separately, and the code in function
register_page_bootmem_info_node() is only used for collecting information
for hot-remove (commit 04753278), move it to MEMORY_HOTREMOVE.

page_isolation.c, selected by MEMORY_ISOLATION under MEMORY_HOTPLUG, is a
similar case, so move it too.

Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: cleanup: removing the arch specific functions without any implementation

After introducing CONFIG_HAVE_BOOTMEM_INFO_NODE Kconfig option, the
related arch specific functions become confusing, remove them.

Guys who want to implement memory-hotplug feature on such archs for this
part should look into register_page_bootmem_info_node() and flesh out from
top to end.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node() when platform not support

It's implemented by adding a new Kconfig option named
CONFIG_HAVE_BOOTMEM_INFO_NODE, which will be automatically selected by
archs that fully support the memory-hotplug feature (currently only x86_64).

Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Reported-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug-implement-register_page_bootmem_info_section-of-sparse-vmemmap-fix

put register_page_bootmem_memmap inside CONFIG_MEMORY_HOTPLUG_SPARSE

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap

To remove a memmap region of sparse-vmemmap that was allocated from bootmem,
the memmap region needs to be registered by get_page_bootmem(). So the
patch searches the pages of the virtual mapping and registers them with
get_page_bootmem().

Note: register_page_bootmem_memmap() is not implemented for ia64, ppc, s390,
and sparc.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Wu Jianguo <wujianguo@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: introduce new arch_remove_memory() for removing page table

For removing memory, we need to remove page tables.  But it depends on
architecture.  So the patch introduces arch_remove_memory() for removing
page table.  Now it only calls __remove_pages().

Note: __remove_pages() is not implemented for some architectures
      (I don't know how to implement it for s390).

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bug fix: Fix the doc format in drivers/firmware/memmap.c

Make the comments in drivers/firmware/memmap.c kernel-doc compliant.

Reported-by: Julian Calaby <julian.calaby@gmail.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bug fix: Fix section mismatch problem of release_firmware_map_entry().

The function release_firmware_map_entry() references the function
__meminit firmware_map_find_entry_in_list(). So it should also have
__meminit.

And since the firmware_map_entry->kobj is initialized with memmap_ktype,
the memmap_ktype should also be prefixed by __refdata.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bug fix: Reuse the storage of /sys/firmware/memmap/X/ allocated by bootmem.

Now we don't free firmware_map_entry which is allocated by bootmem because
there is no way to do so when the system is up. But we can at least remember
the address of that memory and reuse the storage when the memory is added
next time.

This patch introduces a new list map_entries_bootmem to link the map entries
allocated by bootmem when they are removed, and a lock to protect it.
And these entries will be reused when the memory is hot-added again.
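
As a rough userspace illustration of the reuse list described above (not
the kernel code): struct map_entry, park_entry() and take_entry() are
hypothetical names, the atomic_flag spinlock stands in for the kernel
spinlock, and matching a parked entry by its exact range is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct firmware_map_entry. */
struct map_entry {
	unsigned long long start;
	unsigned long long end;
	struct map_entry *next;	/* link on the reuse list */
};

/* Entries that cannot be freed are parked here; the lock protects the list. */
static struct map_entry *reuse_list;
static atomic_flag reuse_lock = ATOMIC_FLAG_INIT;

static void reuse_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&reuse_lock,
						 memory_order_acquire))
		;	/* spin until the flag is clear */
}

static void reuse_lock_release(void)
{
	atomic_flag_clear_explicit(&reuse_lock, memory_order_release);
}

/* On removal: remember the storage instead of freeing it. */
static void park_entry(struct map_entry *e)
{
	assert(e != NULL);
	reuse_lock_acquire();
	e->next = reuse_list;
	reuse_list = e;
	reuse_lock_release();
}

/* On hot-add: hand back a parked entry for the same range, or NULL. */
static struct map_entry *take_entry(unsigned long long start,
				    unsigned long long end)
{
	struct map_entry **pp, *e = NULL;

	reuse_lock_acquire();
	for (pp = &reuse_list; *pp; pp = &(*pp)->next) {
		if ((*pp)->start == start && (*pp)->end == end) {
			e = *pp;
			*pp = e->next;	/* unlink from the reuse list */
			e->next = NULL;
			break;
		}
	}
	reuse_lock_release();
	return e;
}
```

A caller that gets NULL back from take_entry() would fall back to a
normal allocation.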

The idea was suggested by Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bug fix: Fix the wrong comments of map_entries.

Now we have a map_entries_lock to protect map_entries list.
So we need to update the comments.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Bug fix: Hold spinlock across find|remove /sys/firmware/memmap/X operation.

It is unsafe to return an entry pointer and release the map_entries_lock.
So we should not hold the map_entries_lock separately in
firmware_map_find_entry() and firmware_map_remove_entry(). Hold the
map_entries_lock across find and remove /sys/firmware/memmap/X operation.

And also, users of these two functions need to be careful to hold the lock
when using these two functions.
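
The locking rule above might look like this in a small userspace model
(a sketch only, not the patch itself): struct entry, find_entry_locked()
and remove_range() are hypothetical names, and an atomic_flag spinlock
stands in for map_entries_lock.  The find helper never takes the lock
itself, so the pointer it returns stays valid until the caller unlocks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct entry {
	unsigned long long start;
	unsigned long long end;
	struct entry *next;
};

static struct entry *entries;
static atomic_flag entries_lock = ATOMIC_FLAG_INIT;

static void entries_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&entries_lock,
						 memory_order_acquire))
		;	/* spin until the flag is clear */
}

static void entries_lock_release(void)
{
	atomic_flag_clear_explicit(&entries_lock, memory_order_release);
}

static void add_entry(struct entry *e)
{
	assert(e != NULL);
	entries_lock_acquire();
	e->next = entries;
	entries = e;
	entries_lock_release();
}

/*
 * Caller must hold entries_lock.  Returns the link that points at the
 * matching entry; the link holds NULL if there is no match.
 */
static struct entry **find_entry_locked(unsigned long long start,
					unsigned long long end)
{
	struct entry **pp;

	for (pp = &entries; *pp; pp = &(*pp)->next)
		if ((*pp)->start == start && (*pp)->end == end)
			break;
	return pp;
}

/* Find and unlink inside a single critical section. */
static struct entry *remove_range(unsigned long long start,
				  unsigned long long end)
{
	struct entry **pp, *e;

	entries_lock_acquire();
	pp = find_entry_locked(start, end);
	e = *pp;
	if (e) {
		*pp = e->next;
		e->next = NULL;
	}
	entries_lock_release();
	return e;
}
```

Had find_entry_locked() taken and dropped the lock on its own, another
thread could unlink the entry between the find and the remove.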

The suggestion is from Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: remove /sys/firmware/memmap/X sysfs

When (hot)adding memory to the system, the /sys/firmware/memmap/X/{end,
start, type} sysfs files are created.  But there is no code to remove
these files.  The patch implements the function to remove them.

The code does not free firmware_map_entry which is allocated by bootmem,
which is a slight memory leak.  But that memory is reused when the sysfs
file is recreated, so the leak is bounded.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: remove redundant codes

Offlining memory blocks and checking whether memory blocks are offlined
are very similar.  This patch introduces a new function to remove the
redundant code.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: check whether all memory blocks are offlined or not when removing memory

We remove the memory like this:
1. lock memory hotplug
2. offline a memory block
3. unlock memory hotplug
4. repeat 1-3 to offline all memory blocks
5. lock memory hotplug
6. remove memory(TODO)
7. unlock memory hotplug

All memory blocks must be offlined before removing memory.  But we don't
hold the lock across the whole operation.  So we should check whether all
memory blocks are offlined before step 6.  Otherwise, the kernel may
panic.

Offlining a memory block and removing a memory device can be two different
operations.  Users can just offline some memory blocks without removing
the memory device.  For this purpose, the kernel already takes
lock_memory_hotplug() in __offline_pages().  To reuse that code for memory
hot-remove, we repeat steps 1-3 to offline all the memory blocks, locking
and unlocking memory hotplug each time, rather than holding the memory
hotplug lock across the whole operation.
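
The sequence above can be modelled in userspace roughly as follows.  This
is a sketch under assumptions, not the kernel implementation: struct
mem_block, offline_block() and remove_device() are hypothetical names,
and a plain flag models the memory hotplug lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mem_block {
	bool online;
};

/* Models the memory hotplug lock; non-zero while held. */
static int hotplug_locked;

static void lock_hotplug(void)
{
	assert(!hotplug_locked);
	hotplug_locked = 1;
}

static void unlock_hotplug(void)
{
	assert(hotplug_locked);
	hotplug_locked = 0;
}

/* Steps 1-3: offline one block, taking and dropping the lock. */
static void offline_block(struct mem_block *b)
{
	lock_hotplug();
	b->online = false;
	unlock_hotplug();
}

static bool all_offline(const struct mem_block *blocks, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (blocks[i].online)
			return false;
	return true;
}

/*
 * Steps 5-7: the lock was dropped between blocks, so re-check under the
 * lock before removing.  Returns 0 on success, -1 if a block is online.
 */
static int remove_device(struct mem_block *blocks, size_t n)
{
	int ret = 0;

	lock_hotplug();
	if (!all_offline(blocks, n))
		ret = -1;
	/* else: the actual removal (step 6) would happen here */
	unlock_hotplug();
	return ret;
}
```

If another user onlines a block after step 4 but before step 5,
remove_device() reports an error instead of removing memory that is
still in use.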

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memory-hotplug: try to offline the memory twice to avoid dependence

Memory can't be offlined when CONFIG_MEMCG is selected.  For example,
suppose there is a memory device on node 1 whose address range is
[1G, 1.5G).  You will find 4 new directories memory8, memory9, memory10,
and memory11 under the directory /sys/devices/system/memory/.

If CONFIG_MEMCG is selected, we allocate memory to store the page cgroup
when we online pages.  When we online memory8, the memory that stores its
page cgroup is not provided by this memory device.  But when we online
memory9, the memory that stores its page cgroup may be provided by
memory8.  So we can't offline memory8 now.  We should offline the memory
in the reverse order.

When the memory device is hot-removed, we automatically offline the memory
provided by this memory device.  But we don't know which memory was
onlined first, so offlining memory may fail.  In such a case, iterate
twice to offline the memory.  1st iteration: offline every non-primary
memory block.  2nd iteration: offline the primary (i.e.  first added)
memory block.
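
A toy model of the two-pass approach, assuming each block records which
block holds its page cgroup storage (struct block, meta_on, try_offline()
and offline_all() are hypothetical names, not kernel interfaces):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct block {
	bool online;
	int meta_on;	/* index of the block holding our metadata, or -1 */
};

/* Fails with -1 while another online block keeps its metadata on block i. */
static int try_offline(struct block *b, size_t n, size_t i)
{
	if (!b[i].online)
		return 0;
	for (size_t j = 0; j < n; j++)
		if (j != i && b[j].online && b[j].meta_on == (int)i)
			return -1;
	b[i].online = false;
	return 0;
}

static int offline_all(struct block *b, size_t n)
{
	size_t i;

	/* Pass 1: offline whatever can be offlined, ignoring failures. */
	for (i = 0; i < n; i++)
		(void)try_offline(b, n, i);

	/* Pass 2: retry the rest; a failure now is reported to the caller. */
	for (i = 0; i < n; i++)
		if (try_offline(b, n, i))
			return -1;
	return 0;
}
```

In this model two passes are enough when every dependency points at one
primary block.  A longer chain, where block 2 depends on block 1 and
block 1 on block 0, still fails in the second pass.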

This idea is suggested by KOSAKI Motohiro.

Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>