Ming Lei [Wed, 20 Feb 2013 02:14:36 +0000 (13:14 +1100)]
usb: forbid memory allocation with I/O during bus reset
If a storage interface or a usb network interface (the iSCSI case) exists in
the current configuration, a memory allocation with GFP_KERNEL during
usb_device_reset() might trigger an I/O transfer on the storage interface
itself and cause a deadlock, because 'us->dev_mutex' is held in
.pre_reset() and the storage interface can't do I/O transfers while the
reset is triggered by another interface; nor can error handling complete
if the reset is triggered by the storage interface itself (error handling
path).
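A minimal sketch of the pattern this patch applies, assuming the helpers from the mm patch in this series (the wrapper function below is illustrative, not the actual hub.c change):

static int usb_reset_device_noio(struct usb_device *udev)
{
        unsigned int noio_flag;
        int ret;

        /*
         * A GFP_KERNEL allocation anywhere in the reset call tree could
         * otherwise block on page I/O to the very storage interface
         * being reset.
         */
        noio_flag = memalloc_noio_save();
        ret = usb_reset_and_verify_device(udev);
        memalloc_noio_restore(noio_flag);
        return ret;
}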
Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ming Lei [Wed, 20 Feb 2013 02:14:35 +0000 (13:14 +1100)]
pm / runtime: force memory allocation with no I/O during Runtime PM callback
Apply the newly introduced memalloc_noio_save() and memalloc_noio_restore() to
force memory allocation with no I/O during the runtime_resume/runtime_suspend
callbacks of devices that have the 'memalloc_noio' flag set.
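Roughly, the PM core change looks like this (a sketch of rpm_callback() in drivers/base/power/runtime.c; __rpm_callback() is the existing helper it wraps):

static int rpm_callback(int (*cb)(struct device *), struct device *dev)
{
        int retval;

        if (dev->power.memalloc_noio) {
                unsigned int noio_flag;

                /*
                 * Block and network devices (and their ancestors) must
                 * not trigger I/O from allocations while suspending or
                 * resuming, so clamp allocations in this task context.
                 */
                noio_flag = memalloc_noio_save();
                retval = __rpm_callback(cb, dev);
                memalloc_noio_restore(noio_flag);
        } else {
                retval = __rpm_callback(cb, dev);
        }

        dev->power.runtime_error = retval;
        return retval != -EACCES ? retval : -EIO;
}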
Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ming Lei [Wed, 20 Feb 2013 02:14:35 +0000 (13:14 +1100)]
net/core: apply pm_runtime_set_memalloc_noio on network devices
Deadlock might be caused by allocating memory with GFP_KERNEL in the
runtime_resume and runtime_suspend callbacks of network devices in the iSCSI
case, so mark network devices and their ancestors as 'memalloc_noio'
with the newly introduced pm_runtime_set_memalloc_noio().
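A sketch of where the marking happens; the wrapper name below is hypothetical, but the call is the new API applied to the netdev's embedded struct device during registration:

#include <linux/netdevice.h>
#include <linux/pm_runtime.h>

static void netdev_mark_memalloc_noio(struct net_device *ndev, bool enable)
{
        /*
         * Marks the netdev and every ancestor up to the device-tree
         * root (USB interface, host controller, PCI bridge, ...), so
         * all runtime PM callbacks on the path avoid I/O in allocation.
         */
        pm_runtime_set_memalloc_noio(&ndev->dev, enable);
}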
Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ming Lei [Wed, 20 Feb 2013 02:14:35 +0000 (13:14 +1100)]
block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices
Apply the newly introduced pm_runtime_set_memalloc_noio() to block devices so
that the PM core will teach mm not to allocate memory with GFP_IOFS when
calling the runtime_resume and runtime_suspend callbacks for block devices
and their ancestors.
Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce the flag memalloc_noio in 'struct dev_pm_info' to help the PM core
teach mm not to allocate memory with the GFP_KERNEL flag, avoiding a
probable deadlock.
As explained in the comment, any GFP_KERNEL allocation inside
runtime_resume() or runtime_suspend() on any device in the path
from a block or network device to the root of the device tree may
cause a deadlock. The newly introduced pm_runtime_set_memalloc_noio() sets or
clears the flag on every device in that path recursively.
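Roughly how the helper walks the path (a sketch; consult the patch itself for the exact locking):

static int dev_memalloc_noio(struct device *dev, void *data)
{
        return dev->power.memalloc_noio;
}

void pm_runtime_set_memalloc_noio(struct device *dev, bool enable)
{
        static DEFINE_MUTEX(dev_hotplug_mutex);

        mutex_lock(&dev_hotplug_mutex);
        for (;;) {
                bool enabled;

                /* the bitfield is not SMP-safe, take the power lock */
                spin_lock_irq(&dev->power.lock);
                enabled = dev->power.memalloc_noio;
                dev->power.memalloc_noio = enable;
                spin_unlock_irq(&dev->power.lock);

                /* stop early if an ancestor already has the flag set */
                if (enabled && enable)
                        break;

                dev = dev->parent;

                /* keep an ancestor's flag while any child still needs it */
                if (!dev || (!enable &&
                             device_for_each_child(dev, NULL,
                                                   dev_memalloc_noio)))
                        break;
        }
        mutex_unlock(&dev_hotplug_mutex);
}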
Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ming Lei [Wed, 20 Feb 2013 02:14:34 +0000 (13:14 +1100)]
mm: teach mm by current context info to not do I/O during memory allocation
This patch introduces the PF_MEMALLOC_NOIO process flag (in the 'flags' field
of 'struct task_struct'), so that a task can set the flag to avoid
doing I/O inside memory allocations made in its context.
The patch tries to solve a deadlock problem caused by block devices;
the problem may happen at least in the following situations:
- during block device runtime resume, if a memory allocation with
GFP_KERNEL is made inside the runtime resume callback of any one of the
device's ancestors (or of the block device itself), a deadlock may be
triggered inside the memory allocation, since it might not complete until
the block device becomes active and the involved page I/O finishes. This
situation was first pointed out by Alan Stern. It is not a good approach
to convert all GFP_KERNEL allocations[1] in the path into GFP_NOIO, because
several subsystems may be involved (for example, PCI, USB and SCSI for a
usb mass storage device, plus network devices in the iSCSI case).
- during block device runtime suspend, because runtime resume needs to
wait for the completion of a concurrent runtime suspend.
- during error handling of a usb mass storage device: a USB bus reset will
be issued on the device, so there must not be any memory allocation with
GFP_KERNEL during the bus reset, otherwise a deadlock similar to the above
may be triggered. Unfortunately, any usb device may in theory include a
mass storage interface, so this would require all usb interface drivers
to handle the situation. In fact, most usb drivers don't know how to
handle a bus reset on their device and don't provide .pre_reset() and
.post_reset() callbacks at all, so the USB core has to unbind and rebind
the driver for such devices. So it is still not practical to resort to
GFP_NOIO to solve the problem.
The introduced solution can also be used by the block subsystem or block
drivers, for example by setting the PF_MEMALLOC_NOIO flag before doing an
actual I/O transfer.
It is not a good idea to convert all of the GFP_KERNEL allocations in the
affected path into GFP_NOIO, because the functions doing the allocation may
be implemented as library code that is called from many other contexts.
In fact, memalloc_noio_flags() allows some of the current static GFP_NOIO
allocations to be converted back into GFP_KERNEL in non-affected contexts;
at least almost all GFP_NOIO in the USB subsystem can be converted into
GFP_KERNEL after applying this approach, so that GFP_NOIO allocations
generally happen only in the runtime resume, bus reset and block I/O
transfer contexts.
[1] several GFP_KERNEL allocation examples in the runtime resume path:
- some individual usb drivers:
usblp, uvc, gspca, most of the dvb-usb-v2 media drivers, cpia2, az6007, ...
That is just what I have found; unfortunately, such allocations can
currently only be found by manual inspection, and there are probably many
more not yet found, since any function in the resume path (call tree) may
allocate memory with GFP_KERNEL.
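The helpers themselves are small; a sketch of the definitions this patch adds to <linux/sched.h> (the page allocator then filters its gfp_mask through memalloc_noio_flags() on the slow path):

static inline gfp_t memalloc_noio_flags(gfp_t flags)
{
        /* strip the flags that can trigger fs and io activity */
        if (unlikely(current->flags & PF_MEMALLOC_NOIO))
                flags &= ~(__GFP_IO | __GFP_FS);
        return flags;
}

static inline unsigned int memalloc_noio_save(void)
{
        unsigned int flags = current->flags & PF_MEMALLOC_NOIO;

        current->flags |= PF_MEMALLOC_NOIO;
        return flags;   /* previous state, so calls can nest */
}

static inline void memalloc_noio_restore(unsigned int flags)
{
        current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
}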
Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zlatko Calusic [Wed, 20 Feb 2013 02:14:34 +0000 (13:14 +1100)]
mm: don't wait on congested zones in balance_pgdat()
Commit 92df3a72 (mm: vmscan: throttle reclaim if encountering too many
dirty pages under writeback) introduced waiting on congested zones
based on a sane algorithm in shrink_inactive_list(). What this means
is that there's no more need for throttling and additional heuristics
in balance_pgdat(). So, let's remove it and tidy up the code.
Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Wed, 20 Feb 2013 02:14:33 +0000 (13:14 +1100)]
mm/memory-failure.c: fix wrong num_poisoned_pages in handling memory error on thp
num_poisoned_pages counts the number of pages isolated by memory
errors. But for thp, only one subpage is isolated, because the memory
error handler splits the thp, so it is wrong to add (1 << compound_trans_order).
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Wed, 20 Feb 2013 02:14:33 +0000 (13:14 +1100)]
mm/memory-failure.c: clean up soft_offline_page()
Currently soft_offline_page() is hard to maintain because it has many
return points and goto statements. All of this mess comes from
get_any_page(). This function should only get the page refcount, as the
name implies, but it also does page-isolating actions like SetPageHWPoison()
and dequeuing hugepages. This patch corrects that and introduces some
internal subroutines to make the soft-offlining code more readable and
maintainable.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Jiang Liu <jiang.liu@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xishi Qiu [Wed, 20 Feb 2013 02:14:32 +0000 (13:14 +1100)]
memory-failure: do code refactor of soft_offline_page()
There are too many return points randomly intermingled with some "goto
done" return points. So adjust the function structure into two paths:
one for the success path, the other for the failure path. Use
atomic_long_inc() instead of atomic_long_add().
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xishi Qiu [Wed, 20 Feb 2013 02:14:32 +0000 (13:14 +1100)]
memory-failure: fix an error of mce_bad_pages statistics
Running '$ echo paddr > /sys/devices/system/memory/soft_offline_page' to
offline a *free* page increments mce_bad_pages and sets the HWPoison flag
on the page, but the page is still managed by the buddy allocator.
'$ cat /proc/meminfo | grep HardwareCorrupted' shows the value.
If we offline the same page again, mce_bad_pages is incremented *again*,
which means the value is now incorrect. (Assume the page stays free during
this short window.)
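One way to avoid the double count is a guard at the top of soft_offline_page(); the helper below is a sketch, not the literal fix:

static bool page_already_poisoned(struct page *page, unsigned long pfn)
{
        /*
         * A free page that was already soft-offlined still carries the
         * HWPoison flag; bail out instead of counting it a second time.
         */
        if (PageHWPoison(page)) {
                pr_info("soft offline: %#lx: page already poisoned\n", pfn);
                return true;
        }
        return false;
}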
Minchan Kim [Wed, 20 Feb 2013 02:14:31 +0000 (13:14 +1100)]
mm: remove MIGRATE_ISOLATE check in hotpath
Several functions test MIGRATE_ISOLATE, and some of them are on hot paths,
but MIGRATE_ISOLATE is only used when CONFIG_MEMORY_ISOLATION is enabled
(i.e. by CMA, memory-hotplug and memory-failure), which are not common
config options. So let's not add unnecessary overhead and code when
CONFIG_MEMORY_ISOLATION is disabled.
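The usual pattern for compiling such checks away, sketched as helpers in <linux/page-isolation.h>:

#ifdef CONFIG_MEMORY_ISOLATION
static inline bool is_migrate_isolate_page(struct page *page)
{
        return get_pageblock_migratetype(page) == MIGRATE_ISOLATE;
}
static inline bool is_migrate_isolate(int migratetype)
{
        return migratetype == MIGRATE_ISOLATE;
}
#else
/* without CONFIG_MEMORY_ISOLATION the tests constant-fold to false */
static inline bool is_migrate_isolate_page(struct page *page)
{
        return false;
}
static inline bool is_migrate_isolate(int migratetype)
{
        return false;
}
#endif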
Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Wed, 20 Feb 2013 02:14:31 +0000 (13:14 +1100)]
mm: increase totalram_pages when free pages allocated by bootmem allocator
Function put_page_bootmem() is used to free pages allocated by the bootmem
allocator, so it should increase totalram_pages when freeing pages into
the buddy system.
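A sketch of the fixed function, based on the existing put_page_bootmem() in mm/memory_hotplug.c:

void put_page_bootmem(struct page *page)
{
        unsigned long type = (unsigned long) page->lru.next;

        BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
               type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);

        if (atomic_dec_return(&page->_count) == 1) {
                ClearPagePrivate(page);
                set_page_private(page, 0);
                INIT_LIST_HEAD(&page->lru);
                __free_pages_bootmem(page, 0);
                totalram_pages++;       /* the page joins the buddy system */
        }
}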
Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Maciej Rutecki <maciej.rutecki@gmail.com> Cc: Chris Clayton <chris2553@googlemail.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Wed, 20 Feb 2013 02:14:31 +0000 (13:14 +1100)]
mm: set zone->present_pages to number of existing pages in the zone
Now all users of "number of pages managed by the buddy system" have been
converted to use zone->managed_pages, so set zone->present_pages to what
it should be:
present_pages = spanned_pages - absent_pages;
Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Maciej Rutecki <maciej.rutecki@gmail.com> Cc: Chris Clayton <chris2553@googlemail.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Wed, 20 Feb 2013 02:14:30 +0000 (13:14 +1100)]
mm: use zone->present_pages instead of zone->managed_pages where appropriate
Now we have zone->managed_pages for "pages managed by the buddy system in
the zone", so replace zone->present_pages with zone->managed_pages where
what the user really wants is the number of allocatable pages.
Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Maciej Rutecki <maciej.rutecki@gmail.com> Cc: Chris Clayton <chris2553@googlemail.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Wed, 20 Feb 2013 02:14:30 +0000 (13:14 +1100)]
mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region().
The definition of struct movablecore_map is protected by
CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region() is
not. So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of
movablecore_map in memblock_overlaps_region().
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Wed, 20 Feb 2013 02:14:29 +0000 (13:14 +1100)]
acpi, memory-hotplug: support getting hotplug info from SRAT
We now provide an option for users who don't want to specify physical
memory addresses on the kernel command line.
/*
 * For movablemem_map=acpi:
 *
 * SRAT:            |_____| |_____| |_________| |_________| ......
 * node id:             0       1        1           2
 * hotpluggable:        n       y        y           n
 * movablemem_map:          |_____| |_________|
 *
 * Using movablemem_map, we can prevent memblock from allocating memory
 * on ZONE_MOVABLE at boot time.
 */
So the user just specifies movablemem_map=acpi, and the kernel will use the
hotpluggable info in the SRAT to determine which memory ranges should be set
as ZONE_MOVABLE.
NOTE: Using this option will degrade NUMA performance, because the whole node
will be set as ZONE_MOVABLE and the kernel cannot use its memory.
Users who don't want to lose NUMA performance should simply not use it.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Len Brown <lenb@kernel.org> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Wed, 20 Feb 2013 02:14:28 +0000 (13:14 +1100)]
acpi, memory-hotplug: extend movablemem_map ranges to the end of node
When implementing the movablemem_map boot option, we introduced an array
movablemem_map.map[] to store the memory ranges to be set as ZONE_MOVABLE.
Since ZONE_MOVABLE is the last zone of a node, if the user didn't specify the
whole node's memory range, we need to extend it to the node end so that we
can use it to prevent memblock from allocating memory in the ranges the user
didn't specify.
We now implement movablemem_map boot option like this:
/*
 * For movablemem_map=nn[KMG]@ss[KMG]:
 *
 * SRAT:            |_____| |_____| |_________| |_________| ......
 * node id:             0       1        1           2
 * user specified:             |__|       |___|
 * movablemem_map:             |___| |_________| |______| ......
 *
 * Using movablemem_map, we can prevent memblock from allocating memory
 * on ZONE_MOVABLE at boot time.
 *
 * NOTE: In this case, SRAT info will be ignored.
 */
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Len Brown <lenb@kernel.org> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Wed, 20 Feb 2013 02:14:28 +0000 (13:14 +1100)]
acpi, movablemem_map: Do not zero numa_meminfo in numa_init().
early_parse_srat() is called before numa_init() and has already initialized
numa_meminfo. So do not zero numa_meminfo in numa_init(); otherwise
we would lose the memory NUMA info.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reported-by: Li Shaohua <shli@kernel.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Wed, 20 Feb 2013 02:14:28 +0000 (13:14 +1100)]
acpi, memory-hotplug: parse SRAT before memblock is ready fix
allnoconfig complains:
arch/x86/kernel/setup.c: In function `setup_arch':
arch/x86/kernel/setup.c:917: error: implicit declaration of function `early_parse_srat'
because early_parse_srat() is not declared for !CONFIG_ACPI. Moreover, it
is defined only for CONFIG_ACPI_NUMA.
I am not sure what the correct way to fix this is, but I guess that
providing an empty definition for !CONFIG_ACPI_NUMA is OK.
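One plausible shape for the fix (the exact signature of early_parse_srat() is an assumption here):

#ifdef CONFIG_ACPI_NUMA
void __init early_parse_srat(void);
#else
static inline void early_parse_srat(void)
{
        /* no ACPI NUMA: nothing to parse early */
}
#endif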
Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Wed, 20 Feb 2013 02:14:27 +0000 (13:14 +1100)]
acpi, memory-hotplug: parse SRAT before memblock is ready
On Linux, pages used by the kernel cannot be migrated. As a result, if
a memory range is used by the kernel, it cannot be hot-removed. So if we
want to hot-remove memory, we should prevent the kernel from using it.
The way currently used to prevent this is to specify a memory range via the
movablemem_map boot option and set it as ZONE_MOVABLE.
But while the system is booting, memblock allocates and reserves memory
for the kernel, and it is already working before we have parsed the SRAT
and learned the node memory ranges. So it may allocate memory in ranges
to be set as ZONE_MOVABLE, and that memory can then be used by the kernel
and never be freed.
So, let's parse the SRAT before memblock is first called. That is early enough.
The first call of memblock_find_in_range_node() is in:
setup_arch()
|-->setup_real_mode()
So this patch adds a function early_parse_srat() to parse the SRAT, and calls
it before setup_real_mode() is called.
NOTE:
1) Do not clear numa_nodes_parsed in numa_init(), because the SRAT was
parsed earlier.
2) I don't know why the original acpi_numa_init() used the count of memory
affinities parsed from the SRAT as its return value, so I add a static
variable srat_mem_cnt to remember this count and use it as the return
value of the new acpi_numa_init()
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Wen Congyang <wency@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Len Brown <lenb@kernel.org> Cc: "Brown, Len" <len.brown@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Wed, 20 Feb 2013 02:14:27 +0000 (13:14 +1100)]
page_alloc: make movablemem_map have higher priority
If kernelcore or movablecore is specified at the same time as
movablemem_map, movablemem_map will have higher priority to be
satisfied. This patch makes find_zone_movable_pfns_for_nodes()
calculate zone_movable_pfn[] with the limit from zone_movable_limit[].
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Wen Congyang <wency@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Tested-by: Lin Feng <linfeng@cn.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Wed, 20 Feb 2013 02:14:26 +0000 (13:14 +1100)]
page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes
Introduce a new array zone_movable_limit[] to store the ZONE_MOVABLE limit
from the movablemem_map boot option for all nodes. The function
sanitize_zone_movable_limit() will find out to which node the ranges in
movablemem_map.map[] belong, and calculate the low boundary of ZONE_MOVABLE
for each node.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Liu Jiang <jiang.liu@huawei.com> Reviewed-by: Wen Congyang <wency@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com> Tested-by: Lin Feng <linfeng@cn.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Wed, 20 Feb 2013 02:14:24 +0000 (13:14 +1100)]
page_alloc: add movable_memmap kernel parameter
Add functions to parse the movablecore_map boot option. Since the option
could be specified more than once, all the maps will be stored in the
global movablecore_map.map array.
Also, we keep the array in monotonically increasing order by start_pfn
and merge all overlapping ranges.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Reviewed-by: Wen Congyang <wency@cn.fujitsu.com> Tested-by: Lin Feng <linfeng@cn.fujitsu.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During the implementation of SRAT support, we met a problem.
In setup_arch(), we have the following call series:
1) memblock is ready;
2) some functions use memblock to allocate memory;
3) parse ACPI tables, such as SRAT.
Before 3), we don't know which memory is hotpluggable, and as a result we
cannot prevent memblock from allocating hotpluggable memory. So in 2),
there could be some hotpluggable memory allocated by memblock.
Now we are trying to parse the SRAT earlier, before memblock is ready, but
I think we need more investigation on this topic. So in this v5 I dropped
all the SRAT support; v5 is just the same as v3, and it is based on
3.8-rc3.
As planned, we will eventually support getting this info from the SRAT
without the user's participation, and we will post another patch-set to do so.
Also, I think for now we can add this boot option as the first step of
supporting movable nodes. Since Linux cannot migrate direct-mapped pages,
the only way for now is to limit a whole node to containing only movable memory.
Using the SRAT is one way, but even if we can use the SRAT, users still need an
interface to enable/disable this functionality if they don't want to lose
NUMA performance. So I think a user interface is always needed.
For now, users can disable this functionality by not specifying the boot
option. Later, we will post SRAT support and add another option value,
"movablecore_map=acpi", to use the SRAT.
This patch:
If the system can create a movable node, in which all memory of the node is
allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
the node's pg_data_t. So use memblock_alloc_try_nid() instead of
memblock_alloc_nid() to retry when the first allocation fails.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Cc: Wu Jianguo <wujianguo@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Wed, 20 Feb 2013 02:14:23 +0000 (13:14 +1100)]
sched: do not use cpu_to_node() to find an offlined cpu's node.
If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu) will
return -1. As a result, cpumask_of_node(nid) will return NULL. In this
case, find_next_bit() in for_each_cpu will get a NULL pointer and cause a
panic.
There was a hrtimer process sleeping whose cpu had already been offlined.
When it was woken up, it tried to find another cpu to run on and got a -1
nid. As a result, cpumask_of_node(-1) returned NULL and caused a kernel
panic.
This patch fixes the problem by checking whether the nid is -1. If the nid
is not -1, a cpu on the same node will be picked; otherwise, an online cpu
on another node will be picked.
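A sketch of the resulting fallback logic, simplified from select_fallback_rq() (the real function also handles the cpuset fallback):

static int fallback_cpu(int cpu, struct task_struct *p)
{
        int nid = cpu_to_node(cpu);
        int dest_cpu;

        /*
         * If the cpu's node has been offlined, cpu_to_node() returns
         * -1 and cpumask_of_node(-1) would be NULL, so skip the
         * same-node search in that case.
         */
        if (nid != -1) {
                /* look for an allowed, online cpu on the same node */
                for_each_cpu(dest_cpu, cpumask_of_node(nid))
                        if (cpu_online(dest_cpu) &&
                            cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
                                return dest_cpu;
        }

        /* otherwise any allowed, online cpu will do */
        for_each_cpu(dest_cpu, tsk_cpus_allowed(p))
                if (cpu_online(dest_cpu))
                        return dest_cpu;

        return -1;
}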
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
numa_clear_node() and numa_set_node() can no longer be __cpuinit.
WARNING: vmlinux.o(.text+0x222702): Section mismatch in reference from the function check_and_unmap_cpu_on_node() to the function .cpuinit.text:numa_clear_node()
The function check_and_unmap_cpu_on_node() references
the function __cpuinit numa_clear_node().
This is often because check_and_unmap_cpu_on_node lacks a __cpuinit
annotation or the annotation of numa_clear_node is wrong.
Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wen Congyang [Wed, 20 Feb 2013 02:14:23 +0000 (13:14 +1100)]
cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node
When a node is offlined, there is no memory/cpu on it. If a sleeping
task runs on a cpu of this node, it will be migrated to a cpu on
another node. So we can clear the cpu-to-node mapping.
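A sketch of the clearing step at node-offline time (the helper name is illustrative):

static void unmap_cpus_on_node(int nid)
{
        int cpu;

        /* the node is going away: forget every cpu's stale nid */
        for_each_possible_cpu(cpu)
                if (cpu_to_node(cpu) == nid)
                        numa_clear_node(cpu);
}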
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Wed, 20 Feb 2013 02:14:22 +0000 (13:14 +1100)]
memory-hotplug: export the function try_offline_node() fix
"memory-hotplug: export the function try_offline_node()" declares
try_offline_node() for CONFIG_MEMORY_HOTPLUG, but this function is only
defined for CONFIG_MEMORY_HOTREMOVE:
Wen Congyang [Wed, 20 Feb 2013 02:14:22 +0000 (13:14 +1100)]
memory-hotplug: export the function try_offline_node()
try_offline_node() will be needed in the tristate
drivers/acpi/processor_driver.c.
The node will be offlined when all memory/cpu on the node have been
hot-removed, so we need try_offline_node() in the cpu-hotplug path.
If memory-hotplug is disabled and cpu-hotplug is enabled:
1. no memory on the node
we don't online the node, and the cpu's node is the nearest node.
2. the node contains some memory
the node has been onlined, and the cpu's node is still needed
to migrate a sleeping task on the cpu to the same node.
So we do nothing in try_offline_node() in this case.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Len Brown <lenb@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
__apicid_to_node can no longer be __cpuinit as it is referred to from
acpi_unmap_lsapic().
>> WARNING: vmlinux.o(.text+0x43773): Section mismatch in reference from the function acpi_unmap_lsapic() to the variable .cpuinit.data:__apicid_to_node
The function acpi_unmap_lsapic() references
the variable __cpuinitdata __apicid_to_node.
This is often because acpi_unmap_lsapic lacks a __cpuinitdata
annotation or the annotation of __apicid_to_node is wrong.
Reported-by: Wu Fengguang <fengguang.wu@intel.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wen Congyang [Wed, 20 Feb 2013 02:14:21 +0000 (13:14 +1100)]
cpu_hotplug: clear apicid to node when the cpu is hotremoved
When a cpu is hotpluged, we call acpi_map_cpu2node() in _acpi_map_lsapic()
to store the cpu's node and apicid's node. But we don't clear the cpu's
node in acpi_unmap_lsapic() when this cpu is hotremoved. If the node is
also hotremoved, we will get the following messages:
The reason is that the cpu's node is not NUMA_NO_NODE, we will call
alloc_pages_exact_node() to alloc memory on the node, but the node is
offlined.
If the node is onlined, we still need cpu's node. For example: a task on
the cpu is sleeped when the cpu is hotremoved. We will choose another cpu
to run this task when it is waked up. If we know the cpu's node, we will
choose the cpu on the same node first. So we should clear cpu-to-node
mapping when the node is offlined.
This patch only clears apicid-to-node mapping when the cpu is hotremoved.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Jiang Liu <liuj97@gmail.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lai Jiangshan [Wed, 20 Feb 2013 02:14:21 +0000 (13:14 +1100)]
mempolicy: fix is_valid_nodemask()
is_valid_nodemask() was introduced by commit 19770b32 ("mm: filter based on a
nodemask as well as a gfp_mask"), but it does not match its comments,
because it does not check zones above policy_zone.
Also, commit b377fd ("Apply memory policies to top two highest zones when
highest zone is ZONE_MOVABLE") tells us that if the highest zone is
ZONE_MOVABLE, we should also apply memory policies to it, so ZONE_MOVABLE
should be a valid zone for policies. is_valid_nodemask() needs to be changed
to match this.
Fix: check all zones, even those with zoneid > policy_zone. Use
nodes_intersects() instead of open-coding the check.
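With nodes_intersects(), the fixed check collapses to a one-liner; a sketch (N_MEMORY is the node state for nodes that have memory):

static int is_valid_nodemask(const nodemask_t *nodemask)
{
        /* valid iff the mask hits at least one node that has memory */
        return nodes_intersects(*nodemask, node_states[N_MEMORY]);
}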
Reported-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wen Congyang [Wed, 20 Feb 2013 02:14:21 +0000 (13:14 +1100)]
memory-hotplug: consider compound pages when free memmap
The usemap could also be allocated as compound pages, so compound pages
should also be considered when freeing the memmap.
If we don't fix this, there could be problems when we free vmemmap
pagetables that are stored in compound pages: the old pagetables will
not be freed properly, and when we add the memory again, no new pagetables
will be created. Then the old pagetable entries are reused, and the kernel
will panic.
Tang Chen [Wed, 20 Feb 2013 02:14:20 +0000 (13:14 +1100)]
memory-hotplug: do not allocate pgdat if it was not freed when offline.
Since there is no way to guarantee that the address of a pgdat/zone is not
on the stack of any kernel thread, or used by other kernel objects, without
reference counting or some other synchronizing method, we cannot reset
node_data and free the pgdat when offlining a node. Just reset the pgdat to
0 and reuse the memory when the node is onlined again.
The problem was raised by Kamezawa Hiroyuki; the idea is from Wen
Congyang.
NOTE: If we don't reset the pgdat to 0, the WARN_ON in free_area_init_node()
will be triggered.
Tang Chen [Wed, 20 Feb 2013 02:14:19 +0000 (13:14 +1100)]
memory-hotplug: remove sysfs file of node
Introduce a new function try_offline_node() to remove the sysfs file of a
node when all memory sections of the node have been removed. If some memory
sections of the node are not removed, this function does nothing.
Defconfig for x86_64 complains:
arch/x86/mm/init_64.c: In function `vmemmap_free':
arch/x86/mm/init_64.c:1317: error: implicit declaration of function `remove_pagetable'
vmemmap_free() is only used for CONFIG_MEMORY_HOTPLUG, so let's move it
inside the ifdef.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Tested-by: Lin Feng <linfeng@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Wed, 20 Feb 2013 02:14:18 +0000 (13:14 +1100)]
memory-hotplug: remove memmap of sparse-vmemmap
Introduce a new API vmemmap_free() to free and remove vmemmap pagetables.
Since pagetable implementations differ, each architecture has to provide
its own version of vmemmap_free(), just like vmemmap_populate().
Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc.
Tang Chen [Wed, 20 Feb 2013 02:14:16 +0000 (13:14 +1100)]
Bug fix: Do not split pages when freeing pagetable pages.
The old way we freed pagetable pages was wrong.
The old way was: when we got a hugepage, we split it into smaller pages,
and sometimes we only needed to free some of the smaller pages because the
others were still in use; the whole larger page was freed once none of the
smaller pages were in use. All of this was done in remove_pmd/pud_table().
But there is no way to know whether the larger page has been split, and as
a result, there is no way to decide when to split.
Actually, there is no need to split larger pages into smaller ones.
We do it in the following new way:
1) For direct mapped pages, all the pages were freed when they were offlined.
And since memory offline is done section by section, all the memory ranges
being removed are aligned to PAGE_SIZE. So we only need to deal with unaligned
pages when freeing vmemmap pages.
2) For vmemmap pages used to store struct pages, if part of the larger
page is still in use, just fill the unused part with 0xFD. When the
whole page is filled with 0xFD, free the larger page.
This problem is caused by the following related patches:
memory-hotplug-common-apis-to-support-page-tables-hot-remove.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix.patch
memory-hotplug-common-apis-to-support-page-tables-hot-remove-fix-fix-fix-fix.patch
This patch will fix the problem based on the above patches.
Tang Chen [Wed, 20 Feb 2013 02:14:16 +0000 (13:14 +1100)]
Bug fix: Do not free page split from hugepage one by one.
When we split a larger page into smaller ones, we should not free them one
by one, because only the _count of the first page struct makes sense;
otherwise, the kernel will panic.
So fill the unused small pages with 0xFD, and when the whole larger
page is filled with 0xFD, free the whole larger page.
Tang Chen [Wed, 20 Feb 2013 02:14:16 +0000 (13:14 +1100)]
Bug fix: Do not free direct mapping pages twice.
Direct mapped pages were freed when they were offlined, or they were never
allocated. So we only need to free vmemmap pages; there is no need to free
direct mapped pages.
Wen Congyang [Wed, 20 Feb 2013 02:14:15 +0000 (13:14 +1100)]
memory-hotplug: common APIs to support page tables hot-remove
When memory is removed, the corresponding pagetables should also be
removed. This patch introduces some common APIs to support removing
vmemmap pagetables and x86_64 architecture pagetables.
All pages of a virtual mapping in removed memory cannot be freed if some
pages used as PGD/PUD include not only the removed memory but also other
memory. So the patch uses the following way to check whether a page can be
freed:
1. When removing memory, the page structs of the removed memory are filled
with 0xFD.
2. When all page structs on a PT/PMD page are filled with 0xFD, the PT/PMD
can be cleared. In this case, the page used as the PT/PMD can be freed.
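A sketch of the fill-and-check scheme (0xFD is the marker described above; the helper name is illustrative):

#define PAGE_INUSE 0xFD

static bool pagetable_page_can_be_freed(void *page_addr)
{
        /*
         * memchr_inv() returns NULL when every byte of the page is
         * PAGE_INUSE, i.e. no live entries remain, so the page used
         * as PT/PMD can be cleared and freed.
         */
        return !memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE);
}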
Tang Chen [Wed, 20 Feb 2013 02:14:14 +0000 (13:14 +1100)]
memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section()
In __remove_section(), we took pgdat_resize_lock when calling
sparse_remove_one_section(). This lock disables irqs. But we don't
need to lock the whole function: if we do some work to free pagetables in
free_section_usemap(), we need to call flush_tlb_all(), which needs irqs
enabled; otherwise the WARN_ON_ONCE() in smp_call_function_many() will be
triggered.
If we lock the whole of sparse_remove_one_section(), we come to this call trace:
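A sketch of the narrowed locking in sparse_remove_one_section():

void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
{
        struct page *memmap = NULL;
        unsigned long *usemap = NULL, flags;
        struct pglist_data *pgdat = zone->zone_pgdat;

        /* only the section bookkeeping needs the irq-disabling lock */
        pgdat_resize_lock(pgdat, &flags);
        if (ms->section_mem_map) {
                usemap = ms->pageblock_flags;
                memmap = sparse_decode_mem_map(ms->section_mem_map,
                                               __section_nr(ms));
                ms->section_mem_map = 0;
                ms->pageblock_flags = NULL;
        }
        pgdat_resize_unlock(pgdat, &flags);

        /* irqs are enabled again here, so flush_tlb_all() is safe */
        free_section_usemap(memmap, usemap);
}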
Lin Feng [Wed, 20 Feb 2013 02:14:14 +0000 (13:14 +1100)]
memory-hotplug: mm/Kconfig: move auto selects from MEMORY_HOTPLUG to MEMORY_HOTREMOVE as needed
Since we have two config options, MEMORY_HOTPLUG and MEMORY_HOTREMOVE,
used for memory hot-add and hot-remove respectively, and the code in
register_page_bootmem_info_node() is only used for collecting information
for hot-remove (commit 04753278), move it under MEMORY_HOTREMOVE.
Besides, page_isolation.c, selected by MEMORY_ISOLATION under MEMORY_HOTPLUG,
is also such a case; move it too.
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Wed, 20 Feb 2013 02:14:14 +0000 (13:14 +1100)]
memory-hotplug: cleanup: removing the arch specific functions without any implementation
After introducing the CONFIG_HAVE_BOOTMEM_INFO_NODE Kconfig option, the
related arch-specific functions became confusing; remove them.
Anyone who wants to implement the memory-hotplug feature for this part on
such archs should look into register_page_bootmem_info_node() and flesh it
out from top to bottom.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lin Feng [Wed, 20 Feb 2013 02:14:13 +0000 (13:14 +1100)]
memory-hotplug: introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node() when platform not support
It's implemented by adding a new Kconfig option named
CONFIG_HAVE_BOOTMEM_INFO_NODE, which will be automatically selected by
archs that fully support the memory-hotplug feature (currently only x86_64).
Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com> Reported-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap
For removing a memmap region of sparse-vmemmap that was allocated by bootmem,
the region needs to be registered by get_page_bootmem(). So the patch
searches the pages of the virtual mapping and registers them via
get_page_bootmem().
Note: register_page_bootmem_memmap() is not implemented for ia64, ppc, s390,
and sparc.
Wen Congyang [Wed, 20 Feb 2013 02:14:12 +0000 (13:14 +1100)]
memory-hotplug: introduce new arch_remove_memory() for removing page table
For removing memory, we need to remove the page tables. But this depends on
the architecture, so the patch introduces arch_remove_memory() for removing
page tables. For now it only calls __remove_pages().
Note: __remove_pages() is not implemented for some architectures
(I don't know how to implement it for s390).
Tang Chen [Wed, 20 Feb 2013 02:14:11 +0000 (13:14 +1100)]
Bug fix: Reuse the storage of /sys/firmware/memmap/X/ allocated by bootmem.
We currently don't free a firmware_map_entry that was allocated by bootmem,
because there is no way to do so once the system is up. But we can at least
remember the address of that memory and reuse the storage when the memory is
added next time.
This patch introduces a new list, map_entries_bootmem, to link the map entries
allocated by bootmem when they are removed, and a lock to protect it.
These entries will be reused when the memory is hot-added again.
The idea was suggested by Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Wed, 20 Feb 2013 02:14:11 +0000 (13:14 +1100)]
Bug fix: Hold spinlock across find|remove /sys/firmware/memmap/X operation.
It is unsafe to return an entry pointer and then release the
map_entries_lock, so we should not take the map_entries_lock separately in
firmware_map_find_entry() and firmware_map_remove_entry(). Instead, hold the
map_entries_lock across the whole find-and-remove /sys/firmware/memmap/X
operation.
Also, users of these two functions need to be careful to hold the lock
when calling them.
The suggestion is from Andrew Morton <akpm@linux-foundation.org>
When (hot)adding memory into the system, the /sys/firmware/memmap/X/{end,
start, type} sysfs files are created, but there is no code to remove these
files. This patch implements the function to remove them.
The code does not free a firmware_map_entry allocated by bootmem,
which is a slight memory leak. But that memory is reused when the sysfs
file is recreated, so the leak is bounded.
Wen Congyang [Wed, 20 Feb 2013 02:14:10 +0000 (13:14 +1100)]
memory-hotplug: remove redundant codes
Offlining memory blocks and checking whether memory blocks are offlined
are very similar operations. This patch introduces a new function to remove
the redundant code.
memory-hotplug: check whether all memory blocks are offlined or not when removing memory
We remove the memory like this:
1. lock memory hotplug
2. offline a memory block
3. unlock memory hotplug
4. repeat 1-3 to offline all memory blocks
5. lock memory hotplug
6. remove memory (TODO)
7. unlock memory hotplug
All memory blocks must be offlined before removing the memory, but we don't
hold the lock over the whole operation, so we should check whether all
memory blocks are offlined before step 6; otherwise the kernel may panic.
Offlining a memory block and removing a memory device can be two different
operations: users can offline some memory blocks without removing the
memory device. For this purpose, the kernel holds lock_memory_hotplug() in
__offline_pages(). To reuse that code for memory hot-remove, we repeat
steps 1-3 to offline all the memory blocks, repeatedly locking and unlocking
memory hotplug, but do not hold the memory hotplug lock across the whole
operation.
Wen Congyang [Wed, 20 Feb 2013 02:14:10 +0000 (13:14 +1100)]
memory-hotplug: try to offline the memory twice to avoid dependence
Memory can't be offlined when CONFIG_MEMCG is selected. For example:
there is a memory device on node 1 with address range [1G, 1.5G). You
will find 4 new directories, memory8, memory9, memory10 and memory11, under
/sys/devices/system/memory/.
If CONFIG_MEMCG is selected, we allocate memory to store page cgroups
when we online pages. When we online memory8, the memory that stores its
page cgroups is not provided by this memory device; but when we online
memory9, the memory that stores its page cgroups may be provided by memory8,
so we can't offline memory8 now. We should offline the memory in the
reverse order.
When the memory device is hot-removed, we will automatically offline the
memory provided by this memory device. But we don't know which memory was
onlined first, so offlining may fail. In such cases, iterate twice to
offline the memory: 1st iteration: offline every non-primary memory block;
2nd iteration: offline the primary (i.e. first added) memory block.
mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool
do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE multiple,
but there was no equivalent code in mm_populate(), which caused issues.
This could be fixed by introducing the same rounding in mm_populate(),
but I think it's preferable to make do_mmap_pgoff() return populate
as a size rather than as a boolean, so we don't have to duplicate the
size-rounding logic in mm_populate().
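A sketch of what a caller looks like after the change, mirroring vm_mmap_pgoff(); the populate out-parameter is the new interface:

static unsigned long mmap_and_populate(struct file *file, unsigned long addr,
                                       unsigned long len, unsigned long prot,
                                       unsigned long flag, unsigned long pgoff)
{
        struct mm_struct *mm = current->mm;
        unsigned long populate;
        unsigned long ret;

        down_write(&mm->mmap_sem);
        ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff, &populate);
        up_write(&mm->mmap_sem);

        /* populate comes back already rounded to a PAGE_SIZE multiple */
        if (!IS_ERR_VALUE(ret) && populate)
                mm_populate(ret, populate);
        return ret;
}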
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: introduce VM_POPULATE flag to better deal with racy userspace programs
The vm_populate() code populates user mappings without constantly
holding the mmap_sem. This makes it susceptible to racy userspace
programs: the user mappings may change while vm_populate() is running,
and in this case vm_populate() may end up populating the new mapping
instead of the old one.
In order to reduce the possibility of userspace getting surprised by
this behavior, this change introduces the VM_POPULATE vma flag which
gets set on vmas we want vm_populate() to work on. This way
vm_populate() may still end up populating the new mapping after such a
race, but only if the new mapping is also one that the user has
requested (using MAP_SHARED, MAP_LOCKED or mlock) to be populated.
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: directly use __mlock_vma_pages_range() in find_extend_vma()
In find_extend_vma(), we don't need mlock_vma_pages_range() to verify the
vma type - we know we're working with a stack. So, we can call directly
into __mlock_vma_pages_range(), and remove the last make_pages_present()
call site.
Note that we don't use mm_populate() here, so we can't release the mmap_sem
while allocating new stack pages. This is deemed acceptable, because the
stack vmas grow by a bounded number of pages at a time, and these are
anon pages so we don't have to read from disk to populate them.
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Wed, 20 Feb 2013 02:14:08 +0000 (13:14 +1100)]
mm-remove-flags-argument-to-mmap_region-fix
remove double parens
Cc: Andy Lutomirski <luto@amacapital.net> Cc: David Howells <dhowells@redhat.com> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: Michel Lespinasse <walken@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After the MAP_POPULATE handling has been moved to mmap_region() call sites,
the only remaining use of the flags argument is to pass the MAP_NORESERVE
flag. This can be just as easily handled by do_mmap_pgoff(), so do that
and remove the mmap_region() flags parameter.
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: use mm_populate() for mremap() of VM_LOCKED vmas
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: use mm_populate() when adjusting brk with MCL_FUTURE in effect
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: use mm_populate() for blocking remap_file_pages()
Signed-off-by: Michel Lespinasse <walken@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: introduce mm_populate() for populating new vmas
When creating new mappings using the MAP_POPULATE / MAP_LOCKED flags
(or with MCL_FUTURE in effect), we want to populate the pages within the
newly created vmas. This may take a while as we may have to read pages
from disk, so ideally we want to do this outside of the write-locked
mmap_sem region.
This change introduces mm_populate(), which is used to defer populating
such mappings until after the mmap_sem write lock has been released.
This is implemented as a generalization of the former do_mlock_pages(),
which accomplished the same task but was used during mlock() / mlockall().
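The resulting interface, sketched from include/linux/mm.h:

extern int __mm_populate(unsigned long addr, unsigned long len,
                         int ignore_errors);

static inline void mm_populate(unsigned long addr, unsigned long len)
{
        /* population is best-effort; errors are deliberately ignored */
        (void) __mm_populate(addr, len, 1);
}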
Signed-off-by: Michel Lespinasse <walken@google.com> Reported-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We have many vma manipulation functions that are fast in the typical case,
but can optionally be instructed to populate an unbounded number of ptes
within the region they work on:
- mmap with MAP_POPULATE or MAP_LOCKED flags;
- remap_file_pages() with MAP_NONBLOCK not set or when working on a
VM_LOCKED vma;
- mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in effect;
- brk() when mlock(MCL_FUTURE) is in effect.
Current code handles these pte operations locally, while the surrounding
code has to hold the mmap_sem write side since it's manipulating vmas.
This means we're doing an unbounded amount of pte population work with
mmap_sem held, and this causes problems, as Andy Lutomirski reported (we've
hit this at Google as well, though it's not entirely clear why people keep
trying to use mlock(MCL_FUTURE) in the first place).
I propose introducing a new mm_populate() function to do this pte
population work after the mmap_sem has been released. mm_populate() does
need to acquire the mmap_sem read side, but critically, it doesn't need to
hold it continuously for the entire duration of the operation - it can
drop it whenever things take too long (such as when hitting disk for a
file read) and re-acquire it later on.
The following patches are included
- Patch 1 fixes some issues I noticed while working on the existing code.
If needed, it could potentially go in before the rest of the patches.
- Patch 2 introduces the new mm_populate() function and changes
mmap_region() call sites to use it after they drop mmap_sem. This is
inspired by Andy Lutomirski's proposal and is built as an extension
of the work I had previously done for mlock() and mlockall() around
v2.6.38-rc1. I had tried doing something similar at the time but had
given up as there were so many do_mmap() call sites; the recent cleanups
by Linus and Viro are a tremendous help here.
- Patches 3-5 convert some of the less-obvious places doing unbounded
pte populates to the new mm_populate() mechanism.
- Patches 6-7 are code cleanups that are made possible by the
mm_populate() work. In particular, they remove more code than the
entire patch series added, which should be a good thing :)
- Patch 8 is optional to this entire series. It only helps to deal more
nicely with racy userspace programs that might modify their mappings
while we're trying to populate them. It adds a new VM_POPULATE flag
on the mappings we do want to populate, so that if userspace replaces
them with mappings it doesn't want populated, mm_populate() won't
populate those replacement mappings.
This patch:
Assorted small fixes. The first two are quite small:
- Move check for vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR)
within existing if (!(vma->vm_flags & VM_NONLINEAR)) block.
Purely cosmetic.
- In the VM_LOCKED case, when dropping PG_Mlocked for the over-mapped
range, make sure we own the mmap_sem write lock around the
munlock_vma_pages_range call as this manipulates the vma's vm_flags.
The last fix requires a longer explanation. remap_file_pages() can do its
work either through VM_NONLINEAR manipulation or by creating extra vmas.
These two cases were inconsistent with each other (and ultimately, both wrong)
as to exactly when they faulted in the newly mapped file pages:
- In the VM_NONLINEAR case, new file pages would be populated if
the MAP_NONBLOCK flag wasn't passed. If MAP_NONBLOCK was passed,
new file pages wouldn't be populated even if the vma is already
marked as VM_LOCKED.
- In the linear (emulated) case, the work is passed to the mmap_region()
function which would populate the pages if the vma is marked as
VM_LOCKED, and would not otherwise - regardless of the value of the
MAP_NONBLOCK flag, because MAP_POPULATE wasn't being passed to
mmap_region().
The desired behavior is that we want the pages to be populated and locked
if the vma is marked as VM_LOCKED, or to be populated if the MAP_NONBLOCK
flag is not passed to remap_file_pages().
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Tested-by: Andy Lutomirski <luto@amacapital.net> Cc: Greg Ungerer <gregungerer@westnet.com.au> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zlatko Calusic [Wed, 20 Feb 2013 02:14:06 +0000 (13:14 +1100)]
mm: avoid calling pgdat_balanced() needlessly
Now that balance_pgdat() is slightly tidied up, thanks to the more capable
pgdat_balanced(), it has become obvious that pgdat_balanced() is called to
check the status and break the loop if the pgdat is balanced, only to be
called again immediately. The second call is completely unnecessary, of
course.
The patch introduces a pgdat_is_balanced boolean, which resolves the
above suboptimal behavior, with the added benefit of slightly better
documenting one other place in the function where we jump and skip lots of
code.
Signed-off-by: Zlatko Calusic <zlatko.calusic@iskon.hr> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Wed, 20 Feb 2013 02:14:05 +0000 (13:14 +1100)]
mm: compaction: make __compact_pgdat() and compact_pgdat() return void
These functions always return 0. Formalise this.
Cc: Jason Liu <r64343@freescale.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sasha Levin [Wed, 20 Feb 2013 02:14:05 +0000 (13:14 +1100)]
mm: fix BUG on madvise early failure
Commit "mm: make madvise(MADV_WILLNEED) support swap file prefetch" has
allowed a situation where blk_finish_plug() would be called without
blk_start_plug() being called before that, which would lead to a BUG: