Xi Wang [Wed, 20 Feb 2013 02:15:05 +0000 (13:15 +1100)]
mm/dmapool.c: fix null dev in dma_pool_create()
A few drivers invoke dma_pool_create() with a null dev. Note that dev is
dereferenced in dev_to_node(dev), causing a null pointer dereference.
A long term solution is to disallow null dev. Once the drivers are fixed,
we can simplify the core code here. For now we add WARN_ON(!dev) to
notify the driver maintainers and avoid the null pointer dereference.
Signed-off-by: Xi Wang <xi.wang@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xi Wang [Wed, 20 Feb 2013 02:15:05 +0000 (13:15 +1100)]
drivers/usb/gadget/amd5536udc.c: avoid calling dma_pool_create() with NULL dev
Calling dma_pool_create() with dev==NULL will oops on a NUMA machine.
Rather than changing dma_pool_create() we wish to disallow passing
dev==NULL. This requires fixing up the small number of drivers which are
passing in dev==NULL.
Use &dev->pdev->dev instead of NULL.
Signed-off-by: Xi Wang <xi.wang@gmail.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Wed, 20 Feb 2013 02:15:05 +0000 (13:15 +1100)]
drop_caches: add some documentation and info message
I would like to resurrect Dave's patch. The last time it was posted was
here https://lkml.org/lkml/2010/9/16/250 and there didn't seem to be any
strong opposition.
Kosaki was worried about possible excessive logging when somebody drops
caches too often (but then he claimed he didn't have a strong opinion on
that) but I would say opposite. If somebody does that then I would really
like to know that from the log when supporting a system because it almost
for sure means that there is something fishy going on. It is also worth
mentioning that only root can write drop caches so this is not an flooding
attack vector.
I am bringing that up again because this can be really helpful when
chasing strange performance issues which (surprise surprise) turn out to
be related to artificially dropped caches done because the admin thinks
this would help...
I have just refreshed the original patch on top of the current mm tree
but I could live with KERN_INFO as well if people think that KERN_NOTICE
is too hysterical.
: From: Dave Hansen <dave@linux.vnet.ibm.com>
: Date: Fri, 12 Oct 2012 14:30:54 +0200
:
: There is plenty of anecdotal evidence and a load of blog posts
: suggesting that using "drop_caches" periodically keeps your system
: running in "tip top shape". Perhaps adding some kernel
: documentation will increase the amount of accurate data on its use.
:
: If we are not shrinking caches effectively, then we have real bugs.
: Using drop_caches will simply mask the bugs and make them harder
: to find, but certainly does not fix them, nor is it an appropriate
: "workaround" to limit the size of the caches.
:
: It's a great debugging tool, and is really handy for doing things
: like repeatable benchmark runs. So, add a bit more documentation
: about it, and add a little KERN_NOTICE. It should help developers
: who are chasing down reclaim-related bugs.
[mhocko@suse.cz: refreshed to current -mm tree]
[akpm@linux-foundation.org: checkpatch fixes] Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 20 Feb 2013 02:15:05 +0000 (13:15 +1100)]
mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard all pages
Rob van der Heij reported the following (paraphrased) on private mail.
The scenario is that I want to avoid backups to fill up the page
cache and purge stuff that is more likely to be used again (this is
with s390x Linux on z/VM, so I don't give it as much memory that
we don't care anymore). So I have something with LD_PRELOAD that
intercepts the close() call (from tar, in this case) and issues
a posix_fadvise() just before closing the file.
This mostly works, except for small files (less than 14 pages)
that remains in page cache after the face.
Unfortunately Rob has not had a chance to test this exact patch but the
test program below should be reproducing the problem he described.
The issue is the per-cpu pagevecs for LRU additions. If the pages are added
by one CPU but fadvise() is called on another then the pages remain resident
as the invalidate_mapping_pages() only drains the local pagevecs via its
call to pagevec_release(). The user-visible effect is that a program that
uses fadvise() properly is not obeyed.
A possible fix for this is to put the necessary smarts into
invalidate_mapping_pages() to globally drain the LRU pagevecs if a pagevec
page could not be discarded. The downside with this is that an inode cache
shrink would send a global IPI and memory pressure potentially causing
global IPI storms is very undesirable.
Instead, this patch adds a check during fadvise(POSIX_FADV_DONTNEED) to
check if invalidate_mapping_pages() discarded all the requested pages. If a
subset of pages are discarded it drains the LRU pagevecs and tries again. If
the second attempt fails, it assumes it is due to the pages being mapped,
locked or dirty and does not care. With this patch, an application using
fadvise() correctly will be obeyed but there is a downside that a malicious
application can force the kernel to send global IPIs and increase overhead.
If accepted, I would like this to be considered as a -stable candidate.
It's not an urgent issue but it's a system call that is not working as
advertised which is weak.
The following test program demonstrates the problem. It should never
report that pages are still resident but will without this patch. It
assumes that CPU 0 and 1 exist.
int main() {
int fd;
int pagesize = getpagesize();
ssize_t written = 0, expected;
char *buf;
unsigned char *vec;
int resident, i;
cpu_set_t set;
/* Prepare the mincore vec */
vec = malloc(FILESIZE_PAGES);
if (vec == NULL) {
printf("ENOMEM\n");
exit(EXIT_FAILURE);
}
/* Bind ourselves to CPU 0 */
CPU_ZERO(&set);
CPU_SET(0, &set);
if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
perror("sched_setaffinity");
exit(EXIT_FAILURE);
}
/* open file, unlink and write buffer */
fd = open("fadvise-test-file", O_CREAT|O_EXCL|O_RDWR);
if (fd == -1) {
perror("open");
exit(EXIT_FAILURE);
}
unlink("fadvise-test-file");
while (written < expected) {
ssize_t this_write;
this_write = write(fd, buf + written, expected - written);
if (this_write == -1) {
perror("write");
exit(EXIT_FAILURE);
}
written += this_write;
}
free(buf);
/*
* Force ourselves to another CPU. If fadvise only flushes the local
* CPUs pagevecs then the fadvise will fail to discard all file pages
*/
CPU_ZERO(&set);
CPU_SET(1, &set);
if (sched_setaffinity(getpid(), sizeof(set), &set) == -1) {
perror("sched_setaffinity");
exit(EXIT_FAILURE);
}
/* sync and fadvise to discard the page cache */
fsync(fd);
if (posix_fadvise(fd, 0, expected, POSIX_FADV_DONTNEED) == -1) {
perror("posix_fadvise");
exit(EXIT_FAILURE);
}
/* map the file and use mincore to see which parts of it are resident */
buf = mmap(NULL, expected, PROT_READ, MAP_SHARED, fd, 0);
if (buf == NULL) {
perror("mmap");
exit(EXIT_FAILURE);
}
if (mincore(buf, expected, vec) == -1) {
perror("mincore");
exit(EXIT_FAILURE);
}
/* Check residency */
for (i = 0, resident = 0; i < FILESIZE_PAGES; i++) {
if (vec[i])
resident++;
}
if (resident != 0) {
printf("Nr unexpected pages resident: %d\n", resident);
exit(EXIT_FAILURE);
}
Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Rob van der Heij <rvdheij@gmail.com> Tested-by: Rob van der Heij <rvdheij@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cliff Wickman [Wed, 20 Feb 2013 02:15:04 +0000 (13:15 +1100)]
mm: export mmu notifier invalidates
We at SGI have a need to address some very high physical address ranges
with our GRU (global reference unit), sometimes across partitioned machine
boundaries and sometimes with larger addresses than the cpu supports. We
do this with the aid of our own 'extended vma' module which mimics the
vma. When something (either unmap or exit) frees an 'extended vma' we use
the mmu notifiers to clean them up.
We had been able to mimic the functions
__mmu_notifier_invalidate_range_start() and
__mmu_notifier_invalidate_range_end() by locking the per-mm lock and
walking the per-mm notifier list. But with the change to a global srcu
lock (static in mmu_notifier.c) we can no longer do that. Our module has
no access to that lock.
So we request that these two functions be exported.
Signed-off-by: Cliff Wickman <cpw@sgi.com> Acked-by: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
munlock_vma_pages_range() was always incrementing addresses by PAGE_SIZE
at a time. When munlocking THP pages (or the huge zero page), this
resulted in taking the mm->page_table_lock 512 times in a row.
We can do better by making use of the page_mask returned by
follow_page_mask (for the huge zero page case), or the size of the page
munlock_vma_page() operated on (for the true THP page case).
Note - I am sending this as RFC only for now as I can't currently put my
finger on what if anything prevents split_huge_page() from operating
concurrently on the same page as munlock_vma_page(), which would mess up
our NR_MLOCK statistics. Is this a latent bug or is there a subtle point
I missed here ?
Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: accelerate mm_populate() treatment of THP pages
This change adds a follow_page_mask function which is equivalent to
follow_page, but with an extra page_mask argument.
follow_page_mask sets *page_mask to HPAGE_PMD_NR - 1 when it encounters a
THP page, and to 0 in other cases.
__get_user_pages() makes use of this in order to accelerate populating THP
ranges - that is, when both the pages and vmas arrays are NULL, we don't
need to iterate HPAGE_PMD_NR times to cover a single THP page (and we also
avoid taking mm->page_table_lock that many times).
Signed-off-by: Michel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zhang Yanfei [Wed, 20 Feb 2013 02:15:03 +0000 (13:15 +1100)]
mm: accurately document nr_free_*_pages functions with code comments
nr_free_zone_pages(), nr_free_buffer_pages() and nr_free_pagecache_pages()
are horribly badly named, so accurately document them with code comments
in case of the misuse of them.
Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Wed, 20 Feb 2013 02:15:02 +0000 (13:15 +1100)]
HWPOISON: change order of error_states[]'s elements
error_states[] has two separate states "unevictable LRU page" and "mlocked
LRU page", and the former one has the higher priority now. But because of
that the latter one is rarely chosen because pages with PageMlocked highly
likely have PG_unevictable set. On the other hand, PG_unevictable without
PageMlocked is common for ramfs or SHM_LOCKed shared memory, so reversing
the priority of these two states helps us clearly distinguish them.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Chen Gong <gong.chen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Naoya Horiguchi [Wed, 20 Feb 2013 02:15:02 +0000 (13:15 +1100)]
HWPOISON: fix misjudgement of page_action() for errors on mlocked pages
memory_failure() can't handle memory errors on mlocked pages correctly,
because page_action() judges such errors as ones on "unknown pages"
instead of ones on "unevictable LRU page" or "mlocked LRU page". In order
to determine page_state page_action() checks page flags at the timing of
the judgement, but such page flags are not the same with those just after
memory_failure() is called, because memory_failure() does unmapping of the
error pages before doing page_action(). This unmapping changes the page
state, especially page_remove_rmap() (called from try_to_unmap_one())
clears PG_mlocked, so page_action() can't catch mlocked pages after that.
With this patch, we store the page flag of the error page before doing
unmap, and (only) if the first check with page flags at the time decided
the error page is unknown, we do the second check with the stored page
flag. This implementation doesn't change error handling for the page
types for which the first check can determine the page state correctly.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Chen Gong <gong.chen@linux.intel.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 20 Feb 2013 02:15:01 +0000 (13:15 +1100)]
memcg: stop warning on memcg_propagate_kmem
Whilst I run the risk of a flogging for disloyalty to the Lord of Sealand,
I do have CONFIG_MEMCG=y CONFIG_MEMCG_KMEM not set, and grow tired of the
"mm/memcontrol.c:4972:12: warning: `memcg_propagate_kmem' defined but not
used [-Wunused-function]" seen in 3.8-rc: move the #ifdef outwards.
Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zhang Yanfei [Wed, 20 Feb 2013 02:15:01 +0000 (13:15 +1100)]
net: change type of virtio_chan->p9_max_pages
This member of struct virtio_chan is calculated from nr_free_buffer_pages
so change its type to unsigned long in case of overflow.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: David Miller <davem@davemloft.net> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zhang Yanfei [Wed, 20 Feb 2013 02:14:59 +0000 (13:14 +1100)]
mm: fix return type for functions nr_free_*_pages
Currently, the amount of RAM that functions nr_free_*_pages return
is held in unsigned int. But in machines with big memory (exceeding
16TB), the amount may be incorrect because of overflow, so fix it.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Simon Horman <horms@verge.net.au> Cc: Julian Anastasov <ja@ssi.bg> Cc: David Miller <davem@davemloft.net> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Wed, 20 Feb 2013 02:14:59 +0000 (13:14 +1100)]
memcg: cleanup mem_cgroup_init comment
We should encourage all memcg controller initialization independent on a
specific mem_cgroup to be done here rather than exploit css_alloc callback
and assume that nothing happens before root cgroup is created.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <htejun@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Wed, 20 Feb 2013 02:14:59 +0000 (13:14 +1100)]
memcg: move memcg_stock initialization to mem_cgroup_init
memcg_stock are currently initialized during the root cgroup allocation
which is OK but it pointlessly pollutes memcg allocation code with
something that can be called when the memcg subsystem is initialized by
mem_cgroup_init along with other controller specific parts.
This patch wraps the current memcg_stock initialization code into a helper
calls it from the controller subsystem initialization code.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <htejun@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Wed, 20 Feb 2013 02:14:58 +0000 (13:14 +1100)]
memcg: move mem_cgroup_soft_limit_tree_init to mem_cgroup_init
Per-node-zone soft limit tree is currently initialized when the root
cgroup is created which is OK but it pointlessly pollutes memcg allocation
code with something that can be called when the memcg subsystem is
initialized by mem_cgroup_init along with other controller specific parts.
While we are at it let's make mem_cgroup_soft_limit_tree_init void because
it doesn't make much sense to report memory failure because if we fail to
allocate memory that early during the boot then we are screwed anyway
(this saves some code).
Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <htejun@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Minchan Kim [Wed, 20 Feb 2013 02:14:58 +0000 (13:14 +1100)]
mm: use up free swap space before reaching OOM kill
Recently, Luigi reported there are lots of free swap space when OOM
happens. It's easily reproduced on zram-over-swap, where many instance of
memory hogs are running and laptop_mode is enabled. He said there was no
problem when he disabled laptop_mode. The problem when I investigate
problem is following as.
Assumption for easy explanation: There are no page cache page in system
because they all are already reclaimed.
1. try_to_free_pages disable may_writepage when laptop_mode is enabled.
2. shrink_inactive_list isolates victim pages from inactive anon lru list.
3. shrink_page_list adds them to swapcache via add_to_swap but it doesn't
pageout because sc->may_writepage is 0 so the page is rotated back into
inactive anon lru list. The add_to_swap made the page Dirty by SetPageDirty.
4. 3 couldn't reclaim any pages so do_try_to_free_pages increase priority and
retry reclaim with higher priority.
5. shrink_inactlive_list try to isolate victim pages from inactive anon lru list
but got failed because it try to isolate pages with ISOLATE_CLEAN mode but
inactive anon lru list is full of dirty pages by 3 so it just returns
without any reclaim progress.
6. do_try_to_free_pages doesn't set may_writepage due to zero total_scanned.
Because sc->nr_scanned is increased by shrink_page_list but we don't call
shrink_page_list in 5 due to short of isolated pages.
Above loop is continued until OOM happens.
The problem didn't happen before [1] was merged because old logic's
isolatation in shrink_inactive_list was successful and tried to call
shrink_page_list to pageout them but it still ends up failed to page out
by may_writepage. But important point is that sc->nr_scanned was
increased although we couldn't swap out them so do_try_to_free_pages could
set may_writepages.
Since f80c067 ("mm: zone_reclaim: make isolate_lru_page() filter-aware")
was introduced, it's not a good idea any more to depends on only the
number of scanned pages for setting may_writepage. So this patch adds new
trigger point of setting may_writepage as below DEF_PRIOIRTY - 2 which is
used to show the significant memory pressure in VM so it's good fit for
our purpose which would be better to lose power saving or clickety rather
than OOM killing.
Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Luigi Semenzato <semenzato@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Robin Holt [Wed, 20 Feb 2013 02:14:57 +0000 (13:14 +1100)]
mmu_notifier_unregister NULL Pointer deref and multiple ->release() callouts
There is a race condition between mmu_notifier_unregister() and
__mmu_notifier_release().
Assume two tasks, one calling mmu_notifier_unregister() as a result of a
filp_close() ->flush() callout (task A), and the other calling
mmu_notifier_release() from an mmput() (task B).
A B
t1 srcu_read_lock()
t2 if (!hlist_unhashed())
t3 srcu_read_unlock()
t4 srcu_read_lock()
t5 hlist_del_init_rcu()
t6 synchronize_srcu()
t7 srcu_read_unlock()
t8 hlist_del_rcu() <--- NULL pointer deref.
Additionally, the list traversal in __mmu_notifier_release() is not
protected by the by the mmu_notifier_mm->hlist_lock which can result in
callouts to the ->release() notifier from both mmu_notifier_unregister()
and __mmu_notifier_release().
-stable suggestions:
The stable trees prior to 3.7.y need commits 21a9273 and 7040030
cherry-picked in that order prior to cherry-picking this commit. The
3.7.y tree already has those two commits.
Signed-off-by: Robin Holt <holt@sgi.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Sagi Grimberg <sagig@mellanox.co.il> Cc: Haggai Eran <haggaie@mellanox.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cody P Schafer [Wed, 20 Feb 2013 02:14:56 +0000 (13:14 +1100)]
mm: add helper ensure_zone_is_initialized()
ensure_zone_is_initialized() checks if a zone is in a empty & not
initialized state (typically occuring after it is created in memory
hotplugging), and, if so, calls init_currently_empty_zone() to initialize
the zone.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cody P Schafer [Wed, 20 Feb 2013 02:14:55 +0000 (13:14 +1100)]
mm/page_alloc: add informative debugging message in page_outside_zone_boundaries()
Add a debug message which prints when a page is found outside of the
boundaries of the zone it should belong to. Format is:
"page $pfn outside zone [ $start_pfn - $end_pfn ]"
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cody P Schafer [Wed, 20 Feb 2013 02:14:54 +0000 (13:14 +1100)]
mm: add & use zone_end_pfn() and zone_spans_pfn()
Add 2 helpers (zone_end_pfn() and zone_spans_pfn()) to reduce code
duplication.
This also switches to using them in compaction (where an additional
variable needed to be renamed), page_alloc, vmstat, memory_hotplug, and
kmemleak.
Note that in compaction.c I avoid calling zone_end_pfn() repeatedly because I
expect at some point the sycronization issues with start_pfn & spanned_pages
will need fixing, either by actually using the seqlock or clever memory barrier
usage.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: David Hansen <dave@linux.vnet.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 20 Feb 2013 02:14:53 +0000 (13:14 +1100)]
mm: refactor inactive_file_is_low() to use get_lru_size()
An inactive file list is considered low when its active counterpart is
bigger, regardless of whether it is a global zone LRU list or a memcg zone
LRU list. The only difference is in how the LRU size is assessed.
get_lru_size() does the right thing for both global and memcg reclaim
situations.
Get rid of inactive_file_is_low_global() and
mem_cgroup_inactive_file_is_low() by using get_lru_size() and compare the
numbers in common code.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 20 Feb 2013 02:14:52 +0000 (13:14 +1100)]
ksm: stop hotremove lockdep warning
Complaints are rare, but lockdep still does not understand the way
ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and holds
it until the ksm_memory_callback(MEM_OFFLINE): that appears to be a
problem because notifier callbacks are made under down_read of
blocking_notifier_head->rwsem (so first the mutex is taken while holding
the rwsem, then later the rwsem is taken while still holding the mutex);
but is not in fact a problem because mem_hotplug_mutex is held throughout
the dance.
There was an attempt to fix this with mutex_lock_nested(); but if that
happened to fool lockdep two years ago, apparently it does so no longer.
I had hoped to eradicate this issue in extending KSM page migration not to
need the ksm_thread_mutex. But then realized that although the page
migration itself is safe, we do still need to lock out ksmd and other
users of get_ksm_page() while offlining memory - at some point between
MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may vanish,
and get_ksm_page()'s accesses to them become a violation.
So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to
MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining()
checks, to achieve the same lockout without being caught by lockdep. This
is less elegant for KSM, but it's more important to keep lockdep useful to
other users - and I apologize for how long it took to fix.
Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 20 Feb 2013 02:14:52 +0000 (13:14 +1100)]
mm: remove offlining arg to migrate_pages
No functional change, but the only purpose of the offlining argument to
migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a
KSM page for memory hotremove (which took ksm_thread_mutex) but not for
other callers. Now all cases are safe, remove the arg.
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 20 Feb 2013 02:14:51 +0000 (13:14 +1100)]
ksm: enable KSM page migration
Migration of KSM pages is now safe: remove the PageKsm restrictions from
mempolicy.c and migrate.c.
But keep PageKsm out of __unmap_and_move()'s anon_vma contortions, which
are irrelevant to KSM: it looks as if that code was preventing hotremove
migration of KSM pages, unless they happened to be in swapcache.
There is some question as to whether enforcing a NUMA mempolicy migration
ought to migrate KSM pages, mapped into entirely unrelated processes; but
moving page_mapcount > 1 is only permitted with MPOL_MF_MOVE_ALL anyway,
and it seems reasonable to assume that you wouldn't set MADV_MERGEABLE on
any area where this is a worry.
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 20 Feb 2013 02:14:51 +0000 (13:14 +1100)]
ksm: make !merge_across_nodes migration safe
The new KSM NUMA merge_across_nodes knob introduces a problem, when it's
set to non-default 0: if a KSM page is migrated to a different NUMA node,
how do we migrate its stable node to the right tree? And what if that
collides with an existing stable node?
ksm_migrate_page() can do no more than it's already doing, updating
stable_node->kpfn: the stable tree itself cannot be manipulated without
holding ksm_thread_mutex. So accept that a stable tree may temporarily
indicate a page belonging to the wrong NUMA node, leave updating until the
next pass of ksmd, just be careful not to merge other pages on to a
misplaced page. Note nid of holding tree in stable_node, and recognize
that it will not always match nid of kpfn.
A misplaced KSM page is discovered, either when ksm_do_scan() next comes
around to one of its rmap_items (we now have to go to cmp_and_merge_page
even on pages in a stable tree), or when stable_tree_search() arrives at a
matching node for another page, and this node page is found misplaced.
In each case, move the misplaced stable_node to a list of migrate_nodes
(and use the address of migrate_nodes as magic by which to identify them):
we don't need them in a tree. If stable_tree_search() finds no match for
a page, but it's currently exiled to this list, then slot its stable_node
right there into the tree, bringing all of its mappings with it; otherwise
they get migrated one by one to the original page of the colliding node.
stable_tree_search() is now modelled more like stable_tree_insert(), in
order to handle these insertions of migrated nodes.
remove_node_from_stable_tree(), remove_all_stable_nodes() and
ksm_check_stable_tree() have to handle the migrate_nodes list as well as
the stable tree itself. Less obviously, we do need to prune the list of
stale entries from time to time (scan_get_next_rmap_item() does it once
each full scan): whereas stale nodes in the stable tree get naturally
pruned as searches try to brush past them, these migrate_nodes may get
forgotten and accumulate.
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 20 Feb 2013 02:14:51 +0000 (13:14 +1100)]
ksm: make KSM page migration possible
KSM page migration is already supported in the case of memory hotremove,
which takes the ksm_thread_mutex across all its migrations to keep life
simple.
But the new KSM NUMA merge_across_nodes knob introduces a problem, when
it's set to non-default 0: if a KSM page is migrated to a different NUMA
node, how do we migrate its stable node to the right tree? And what if
that collides with an existing stable node?
So far there's no provision for that, and this patch does not attempt to
deal with it either. But how will I test a solution, when I don't know
how to hotremove memory? The best answer is to enable KSM page migration
in all cases now, and test more common cases. With THP and compaction
added since KSM came in, page migration is now mainstream, and it's a
shame that a KSM page can frustrate freeing a page block.
Without worrying about merge_across_nodes 0 for now, this patch gets KSM
page migration working reliably for default merge_across_nodes 1 (but
leave the patch enabling it until near the end of the series).
It's much simpler than I'd originally imagined, and does not require an
additional tier of locking: page migration relies on the page lock, KSM
page reclaim relies on the page lock, the page lock is enough for KSM page
migration too.
Almost all the care has to be in get_ksm_page(): that's the function which
worries about when a stable node is stale and should be freed, now it also
has to worry about the KSM page being migrated.
The only new overhead is an additional put/get/lock/unlock_page when
stable_tree_search() arrives at a matching node: to make sure migration
respects the raised page count, and so does not migrate the page while
we're busy with it here. That's probably avoidable, either by changing
internal interfaces from using kpage to stable_node, or by moving the
ksm_migrate_page() callsite into a page_freeze_refs() section (even if not
swapcache); but this works well, I've no urge to pull it apart now.
(Descents of the stable tree may pass through nodes whose KSM pages are
under migration: being unlocked, the raised page count does not prevent
that, nor need it: it's safe to memcmp against either old or new page.)
You might worry about mremap, and whether page migration's rmap_walk to
remove migration entries will find all the KSM locations where it inserted
earlier: that should already be handled, by the satisfyingly heavy hammer
of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,).
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 20 Feb 2013 02:14:51 +0000 (13:14 +1100)]
ksm: remove old stable nodes more thoroughly
Switching merge_across_nodes after running KSM is liable to oops on stale
nodes still left over from the previous stable tree. It's not something
that people will often want to do, but it would be lame to demand a reboot
when they're trying to determine which merge_across_nodes setting is best.
How can this happen? We only permit switching merge_across_nodes when
pages_shared is 0, and usually set run 2 to force that beforehand, which
ought to unmerge everything: yet oopses still occur when you then run 1.
Three causes:
1. The old stable tree (built according to the inverse
merge_across_nodes) has not been fully torn down. A stable node
lingers until get_ksm_page() notices that the page it references no
longer references it: but the page is not necessarily freed as soon as
expected, particularly when swapcache.
Fix this with a pass through the old stable tree, applying
get_ksm_page() to each of the remaining nodes (most found stale and
removed immediately), with forced removal of any left over. Unless the
page is still mapped: I've not seen that case, it shouldn't occur, but
better to WARN_ON_ONCE and EBUSY than BUG.
2. __ksm_enter() has a nice little optimization, to insert the new mm
just behind ksmd's cursor, so there's a full pass for it to stabilize
(or be removed) before ksmd addresses it. Nice when ksmd is running,
but not so nice when we're trying to unmerge all mms: we were missing
those mms forked and inserted behind the unmerge cursor. Easily fixed
by inserting at the end when KSM_RUN_UNMERGE.
3. It is possible for a KSM page to be faulted back from swapcache
into an mm, just after unmerge_and_remove_all_rmap_items() scanned past
it. Fix this by copying on fault when KSM_RUN_UNMERGE: but that is
private to ksm.c, so dissolve the distinction between
ksm_might_need_to_copy() and ksm_does_need_to_copy(), doing it all in
the one call into ksm.c.
A long outstanding, unrelated bugfix sneaks in with that third fix:
ksm_does_need_to_copy() would copy from a !PageUptodate page (implying I/O
error when read in from swap) to a page which it then marks Uptodate. Fix
this case by not copying, letting do_swap_page() discover the error.
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 20 Feb 2013 02:14:50 +0000 (13:14 +1100)]
ksm: get_ksm_page locked
In some places where get_ksm_page() is used, we need the page to be locked.
When KSM migration is fully enabled, we shall want that to make sure that
the page just acquired cannot be migrated beneath us (raised page count is
only effective when there is serialization to make sure migration
notices). Whereas when navigating through the stable tree, we certainly
do not want to lock each node (raised page count is enough to guarantee
the memcmps, even if page is migrated to another node).
Since we're about to add another use case, add the locked argument to
get_ksm_page() now.
Hmm, what's that rcu_read_lock() about? Complete misunderstanding, I
really got the wrong end of the stick on that! There's a configuration in
which page_cache_get_speculative() can do something cheaper than
get_page_unless_zero(), relying on its caller's rcu_read_lock() to have
disabled preemption for it. There's no need for rcu_read_lock() around
get_page_unless_zero() (and mapping checks) here. Cut out that silliness
before making this any harder to understand.
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 20 Feb 2013 02:14:50 +0000 (13:14 +1100)]
ksm: reorganize ksm_check_stable_tree
Memory hotremove's ksm_check_stable_tree() is pitifully inefficient
(restarting whenever it finds a stale node to remove), but rearrange so
that at least it does not needlessly restart from nid 0 each time. And
add a couple of comments: here is why we keep pfn instead of page.
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 20 Feb 2013 02:14:49 +0000 (13:14 +1100)]
ksm: trivial tidyups
Add NUMA() and DO_NUMA() macros to minimize blight of #ifdef CONFIG_NUMAs
(but indeed we don't want to expand struct rmap_item by nid when not
NUMA). Add comment, remove "unsigned" from rmap_item->nid, as "int nid"
elsewhere. Define ksm_merge_across_nodes 1U when #ifndef NUMA to help
optimizing out. Use ?: in get_kpfn_nid(). Adjust a few comments noticed
in ongoing work.
Leave stable_tree_insert()'s rb_linkage until after the node has been set
up, as unstable_tree_search_insert() does: ksm_thread_mutex and page lock
make either way safe, but we're going to copy and I prefer this precedent.
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Petr Holasek [Wed, 20 Feb 2013 02:14:49 +0000 (13:14 +1100)]
ksm: allow trees per NUMA node
Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with
Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues
we had with that, fully enabling KSM page migration on the way.
(A different kind of KSM/NUMA issue which I've certainly not begun to
address here: when KSM pages are unmerged, there's usually no sense
in preferring to allocate the new pages local to the caller's node.)
This patch:
Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes
which control merging pages across different numa nodes. When it is set
to zero only pages from the same node are merged, otherwise pages from all
nodes can be merged together (default behavior).
Typical use-case could be a lot of KVM guests on NUMA machine and cpus
from more distant nodes would have significant increase of access latency
to the merged ksm page. Sysfs knob was choosen for higher variability
when some users still prefers higher amount of saved physical memory
regardless of access latency.
Every numa node has its own stable & unstable trees because of faster
searching and inserting. Changing of merge_across_nodes value is possible
only when there are not any ksm shared pages in system.
I've tested this patch on numa machines with 2, 4 and 8 nodes and measured
speed of memory access inside of KVM guests with memory pinned to one of
nodes with this benchmark:
http://pholasek.fedorapeople.org/alloc_pg.c
Population standard deviations of access times in percentage of average
were following:
Hugh notes that this patch brings two problems, whose solution needs
further support in mm/ksm.c, which follows in subsequent patches:
1) switching merge_across_nodes after running KSM is liable to oops
on stale nodes still left over from the previous stable tree;
2) memory hotremove may migrate KSM pages, but there is no provision
here for !merge_across_nodes to migrate nodes to the proper tree.
Signed-off-by: Petr Holasek <pholasek@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 20 Feb 2013 02:14:48 +0000 (13:14 +1100)]
mm: rename page struct field helpers
The function names page_xchg_last_nid(), page_last_nid() and
reset_page_last_nid() were judged to be inconsistent so rename them to a
struct_field_op style pattern. As it looked jarring to have
reset_page_mapcount() and page_nid_reset_last() beside each other in
memmap_init_zone(), this patch also renames reset_page_mapcount() to
page_mapcount_reset(). There are others like init_page_count() but as it
is used throughout the arch code a rename would likely cause more
conflicts than it is worth.
Signed-off-by: Mel Gorman <mgorman@suse.de> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Yoknis [Wed, 20 Feb 2013 02:14:48 +0000 (13:14 +1100)]
mm: memmap_init_zone() performance improvement
We have what we call an "architectural simulator". It is a computer
program that pretends that it is a computer system. We use it to test the
firmware before real hardware is available. We have booted Linux on our
simulator. As you would expect it takes longer to boot on the simulator
than it does on real hardware.
With my patch - boot time 41 minutes
Without patch - boot time 94 minutes
These numbers do not scale linearly to real hardware. But indicate to me
a place where Linux can be improved.
memmap_init_zone() loops through every Page Frame Number (pfn), including
pfn values that are within the gaps between existing memory sections. The
unneeded looping will become a boot performance issue when machines
configure larger memory ranges that will contain larger and more numerous
gaps.
The code will skip across invalid pfn values to reduce the number of loops
executed.
Signed-off-by: Mike Yoknis <mike.yoknis@hp.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Wed, 20 Feb 2013 02:14:47 +0000 (13:14 +1100)]
memcg: avoid dangling reference count in creation failure.
When use_hierarchy is enabled, we acquire an extra reference count in our
parent during cgroup creation. We don't release it, though, if any
failure exist in the creation process.
Signed-off-by: Glauber Costa <glommer@parallels.com> Reported-by: Michal Hocko <mhocko@suse.cz> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Wed, 20 Feb 2013 02:14:47 +0000 (13:14 +1100)]
memcg: increment static branch right after limit set
We were deferring the kmemcg static branch increment to a later time, due
to a nasty dependency between the cpu_hotplug lock, taken by the jump
label update, and the cgroup_lock.
Now we no longer take the cgroup lock, and we can save ourselves the
trouble.
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Wed, 20 Feb 2013 02:14:46 +0000 (13:14 +1100)]
memcg: replace cgroup_lock with memcg specific memcg_lock
After the preparation work done in earlier patches, the cgroup_lock can be
trivially replaced with a memcg-specific lock. This is an automatic
translation at every site where the values involved were queried.
The sites where values are written, however, used to be naturally called
under cgroup_lock. This is the case for instance in the css_online
callback. For those, we now need to explicitly add the memcg lock.
With this, all the calls to cgroup_lock outside cgroup core are gone.
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Wed, 20 Feb 2013 02:14:45 +0000 (13:14 +1100)]
memcg: fast hierarchy-aware child test
Currently, we use cgroups' provided list of children to verify if it is
safe to proceed with any value change that is dependent on the cgroup
being empty.
This is less than ideal, because it enforces a dependency over cgroup core
that we would be better off without. The solution proposed here is to
iterate over the child cgroups and if any is found that is already online,
we bounce and return: we don't really care how many children we have, only
if we have any.
This is also made to be hierarchy aware. IOW, cgroups with hierarchy
disabled, while they still exist, will be considered for the purpose of
this interface as having no children.
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Wed, 20 Feb 2013 02:14:45 +0000 (13:14 +1100)]
memcg: split part of memcg creation to css_online
This patch is a preparatory work for later locking rework to get rid of
big cgroup lock from memory controller code.
The memory controller uses some tunables to adjust its operation. Those
tunables are inherited from parent to children upon children
intialization. For most of them, the value cannot be changed after the
parent has a new children.
cgroup core splits initialization in two phases: css_alloc and css_online.
After css_alloc, the memory allocation and basic initialization are done.
But the new group is not yet visible anywhere, not even for cgroup core
code. It is only somewhere between css_alloc and css_online that it is
inserted into the internal children lists. Copying tunable values in
css_alloc will lead to inconsistent values: the children will copy the old
parent values, that can change between the copy and the moment in which
the groups is linked to any data structure that can indicate the presence
of children.
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Wed, 20 Feb 2013 02:14:45 +0000 (13:14 +1100)]
memcg: prevent changes to move_charge_at_immigrate during task attach
In memcg, we use the cgroup_lock basically to synchronize against
attaching new children to a cgroup. We do this because we rely on cgroup
core to provide us with this information.
We need to guarantee that upon child creation, our tunables are
consistent. For those, the calls to cgroup_lock() all live in handlers
like mem_cgroup_hierarchy_write(), where we change a tunable in the group
that is hierarchy-related. For instance, the use_hierarchy flag cannot be
changed if the cgroup already have children.
Furthermore, those values are propagated from the parent to the child when
a new child is created. So if we don't lock like this, we can end up with
the following situation:
A B
memcg_css_alloc() mem_cgroup_hierarchy_write()
copy use hierarchy from parent change use hierarchy in parent
finish creation.
This is mainly because during create, we are still not fully connected to
the css tree. So all iterators and the such that we could use, will fail
to show that the group has children.
My observation is that all of creation can proceed in parallel with those
tasks, except value assignment. So what this patch series does is to first
move all value assignment that is dependent on parent values from
css_alloc to css_online, where the iterators all work, and then we lock
only the value assignment. This will guarantee that parent and children
always have consistent values. Together with an online test, that can be
derived from the observation that the refcount of an online memcg can be
made to be always positive, we should be able to synchronize our side
without the cgroup lock.
This patch:
Currently, we rely on the cgroup_lock() to prevent changes to
move_charge_at_immigrate during task migration. However, this is only
needed because the current strategy keeps checking this value throughout
the whole process. Since all we need is serialization, one needs only to
guarantee that whatever decision we made in the beginning of a specific
migration is respected throughout the process.
We can achieve this by just saving it in mc. By doing this, no kind of
locking is needed.
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Wed, 20 Feb 2013 02:14:44 +0000 (13:14 +1100)]
memcg: reduce the size of struct memcg 244-fold.
In order to maintain all the memcg bookkeeping, we need per-node
descriptors, which will in turn contain a per-zone descriptor.
Because we want to statically allocate those, this array ends up being
very big. Part of the reason is that we allocate something large enough
to hold MAX_NUMNODES, the compile time constant that holds the maximum
number of nodes we would ever consider.
However, we can do better in some cases if the firmware help us. This is
true for modern x86 machines; coincidentally one of the architectures in
which MAX_NUMNODES tends to be very big.
By using the firmware-provided maximum number of nodes instead of
MAX_NUMNODES, we can reduce the memory footprint of struct memcg
considerably. In the extreme case in which we have only one node, this
reduces the size of the structure from ~ 64k to ~2k. This is particularly
important because it means that we will no longer resort to the vmalloc
area for the struct memcg on defconfigs. We also have enough room for an
extra node and still be outside vmalloc.
One also has to keep in mind that with the industry's ability to fit more
processors in a die as fast as the FED prints money, a nodes = 2
configuration is already respectably big.
Signed-off-by: Glauber Costa <glommer@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 20 Feb 2013 02:14:44 +0000 (13:14 +1100)]
mm: init: report on last-nid information stored in page->flags
Answering the question "how much space remains in the page->flags" is
time-consuming. mminit_loglevel can help answer the question but it does
not take last_nid information into account. This patch corrects it and
while there it corrects the messages related to page flag usage, pgshifts
and node/zone id. When applied the relevant output looks something like
this but will depend on the kernel configuration.
Mel Gorman [Wed, 20 Feb 2013 02:14:43 +0000 (13:14 +1100)]
mm: uninline page_xchg_last_nid()
Andrew Morton pointed out that page_xchg_last_nid() and
reset_page_last_nid() were "getting nuttily large" and asked that it be
investigated.
reset_page_last_nid() is on the page free path and it would be unfortunate
to make that path more expensive than it needs to be. Due to the internal
use of page_xchg_last_nid() it is already too expensive but fortunately,
it should also be impossible for the page->flags to be updated in parallel
when we call reset_page_last_nid(). Instead of unlining the function, it
uses a simplier implementation that assumes no parallel updates and should
now be sufficiently short for inlining.
page_xchg_last_nid() is called in paths that are already quite expensive
(splitting huge page, fault handling, migration) and it is reasonable to
uninline. There was not really a good place to place the function but
mm/mmzone.c was the closest fit IMO.
This patch saved 128 bytes of text in the vmlinux file for the kernel
configuration I used for testing automatic NUMA balancing.
Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Wed, 20 Feb 2013 02:14:43 +0000 (13:14 +1100)]
memcg: clean up swap accounting initialization code
Memcg swap accounting is currently enabled by enable_swap_cgroup when the
root cgroup is created. mem_cgroup_init acts as a memcg subsystem
initializer which sounds like a much better place for enable_swap_cgroup
as well. We already register memsw files from there so it makes a lot of
sense to merge those two into a single enable_swap_cgroup function.
This patch doesn't introduce any semantic changes.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Zhouping Liu <zliu@redhat.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Li Zefan <lizefan@huawei.com> Cc: CAI Qian <caiqian@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Wed, 20 Feb 2013 02:14:43 +0000 (13:14 +1100)]
memcg: do not create memsw files if swap accounting is disabled
Zhouping Liu has reported that memsw files are exported even though swap
accounting is runtime disabled if CONFIG_MEMCG_SWAP is enabled. This
behavior has been introduced by af36f906 (memcg: always create memsw files
if CONFIG_CGROUP_MEM_RES_CTLR_SWAP) and it causes any attempt to open the
file to return EOPNOTSUPP. Although EOPNOTSUPP should say be clear that
memsw operations are not supported in the given configuration it is fair
to say that this behavior could be quite confusing.
Let's tear memsw files out of default cgroup files and add them only if
the swap accounting is really enabled (either by CONFIG_MEMCG_SWAP_ENABLED
or swapaccount=1 boot parameter). We can hook into mem_cgroup_init which
is called when the memcg subsystem is initialized and which happens after
boot command line is processed.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Zhouping Liu <zliu@redhat.com> Tested-by: Zhouping Liu <zliu@redhat.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Li Zefan <lizefan@huawei.com> Cc: CAI Qian <caiqian@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
comment in 4fc3f1d66b1ef0d ("mm/rmap, migration: Make rmap_walk_anon() and
try_to_unmap_anon() more scalable") says:
| Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
| to make it clearer that it's an exclusive write-lock in
| that case - suggested by Rik van Riel.
But that commit renames only anon_vma_lock()
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Ingo Molnar <mingo@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fix building errors like:
> arch/sparc/mm/init_32.c: In function 'show_mem':
> arch/sparc/mm/init_32.c:60:23: error: invalid operands to binary << (have 'atomic_long_t' and 'int')
Signed-off-by: Shaohua Li <shli@fusionio.com> Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Arnd Bergmann [Wed, 20 Feb 2013 02:14:40 +0000 (13:14 +1100)]
swap: fix "add per-partition lock for swapfile" for nommu
The patch "swap: add per-partition lock for swapfile" made the
nr_swap_pages variable unaccessible but forgot to change the
mm/nommu.c file that uses it. This does the trivial conversion
to let us build nommu kernels again
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Shaohua Li <shli@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Wed, 20 Feb 2013 02:14:40 +0000 (13:14 +1100)]
swap-add-per-partition-lock-for-swapfile-fix-fix
> arch/sparc/mm/init_32.c: In function 'show_mem':
> arch/sparc/mm/init_32.c:60:23: error: invalid operands to binary << (have 'atomic_long_t' and 'int')
>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Wed, 20 Feb 2013 02:14:40 +0000 (13:14 +1100)]
swap: add per-partition lock for swapfile
swap_lock is heavily contended when I test swap to 3 fast SSD (even
slightly slower than swap to 2 such SSD). The main contention comes from
swap_info_get(). This patch tries to fix the gap with adding a new
per-partition lock.
Global data like nr_swapfiles, total_swap_pages, least_priority and
swap_list are still protected by swap_lock.
nr_swap_pages is an atomic now, it can be changed without swap_lock. In
theory, it's possible get_swap_page() finds no swap pages but actually
there are free swap pages. But sounds not a big problem.
Accessing partition specific data (like scan_swap_map and so on) is only
protected by swap_info_struct.lock.
Changing swap_info_struct.flags need hold swap_lock and
swap_info_struct.lock, because scan_scan_map() will check it. read the
flags is ok with either the locks hold.
If both swap_lock and swap_info_struct.lock must be hold, we always hold
the former first to avoid deadlock.
swap_entry_free() can change swap_list. To delete that code, we add a new
highest_priority_index. Whenever get_swap_page() is called, we check it.
If it's valid, we use it.
It's a pity get_swap_page() still holds swap_lock(). But in practice,
swap_lock() isn't heavily contended in my test with this patch (or I can
say there are other much more heavier bottlenecks like TLB flush). And
BTW, looks get_swap_page() doesn't really need the lock. We never free
swap_info[] and we check SWAP_WRITEOK flag. The only risk without the
lock is we could swapout to some low priority swap, but we can quickly
recover after several rounds of swap, so sounds not a big deal to me. But
I'd prefer to fix this if it's a real problem.
"swap: make each swap partition have one address_space" improved the
swapout speed from 1.7G/s to 2G/s. This patch further improves the speed
to 2.3G/s, so around 15% improvement. It's a multi-process test, so TLB
flush isn't the biggest bottleneck before the patches.
Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fusionio.com> Cc: Shaohua Li <shli@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shaohua Li [Wed, 20 Feb 2013 02:14:39 +0000 (13:14 +1100)]
swap: make each swap partition have one address_space
When I use several fast SSD to do swap, swapper_space.tree_lock is heavily
contended. This makes each swap partition have one address_space to
reduce the lock contention. There is an array of address_space for swap.
The swap entry type is the index to the array.
In my test with 3 SSD, this increases the swapout throughput 20%.
Signed-off-by: Shaohua Li <shli@fusionio.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Shaohua Li <shli@fusionio.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Wed, 20 Feb 2013 02:14:38 +0000 (13:14 +1100)]
mm: numa: Cleanup flow of transhuge page migration
When correcting commit 04fa5d6a ("mm: migrate: check page_count of THP
before migrating") Hugh Dickins noted that the control flow for transhuge
migration was difficult to follow. Unconditionally calling put_page() in
numamigrate_isolate_page() made the failure paths of both
migrate_misplaced_transhuge_page() and migrate_misplaced_page() more
complex that they should be. Further, he was extremely wary that an
unlock_page() should ever happen after a put_page() even if the put_page()
should never be the final put_page.
Hugh implemented the following cleanup to simplify the path by calling
putback_lru_page() inside numamigrate_isolate_page() if it failed to
isolate and always calling unlock_page() within
migrate_misplaced_transhuge_page(). There is no functional change after
this patch is applied but the code is easier to follow and unlock_page()
always happens before put_page().
[mgorman@suse.de: changelog only] Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Zijlstra [Wed, 20 Feb 2013 02:14:38 +0000 (13:14 +1100)]
mm: Fold page->_last_nid into page->flags where possible
page->_last_nid fits into page->flags on 64-bit. The unlikely 32-bit NUMA
configuration with NUMA Balancing will still need an extra page field. As
Peter notes "Completely dropping 32bit support for CONFIG_NUMA_BALANCING
would simplify things, but it would also remove the warning if we grow
enough 64bit only page-flags to push the last-cpu out."
[mgorman@suse.de: minor modifications] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Peter Zijlstra [Wed, 20 Feb 2013 02:14:37 +0000 (13:14 +1100)]
mm: move page flags layout to separate header
This is a preparation patch for moving page->_last_nid into page->flags
that moves page flag layout information to a separate header. This patch
is necessary because otherwise there would be a circular dependency
between mm_types.h and mm.h.
Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 20 Feb 2013 02:14:37 +0000 (13:14 +1100)]
mm: numa: handle side-effects in count_vm_numa_events() for !CONFIG_NUMA_BALANCING
The current definitions for count_vm_numa_events() is wrong for
!CONFIG_NUMA_BALANCING as the following would miss the side-effect.
count_vm_numa_events(NUMA_FOO, bar++);
There are no such users of count_vm_numa_events() but this patch fixes it
as it is a potential pitfall. Ideally both would be converted to static
inline but NUMA_PTE_UPDATES is not defined if !CONFIG_NUMA_BALANCING and
creating dummy constants just to have a static inline would be similarly
clumsy.
Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Wed, 20 Feb 2013 02:14:37 +0000 (13:14 +1100)]
mm: numa: take THP into account when migrating pages for NUMA balancing
Wanpeng Li pointed out that numamigrate_isolate_page() assumes that only
one base page is being migrated when in fact it can also be checking THP.
The consequences are that a migration will be attempted when a target node
is nearly full and fail later. It's unlikely to be user-visible but it
should be fixed. While we are there, migrate_balanced_pgdat() should
treat nr_migrate_pages as an unsigned long as it is treated as a
watermark.
Signed-off-by: Mel Gorman <mgorman@suse.de> Suggested-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ming Lei [Wed, 20 Feb 2013 02:14:36 +0000 (13:14 +1100)]
usb: forbid memory allocation with I/O during bus reset
If one storage interface or usb network interface(iSCSI case) exists in
current configuration, memory allocation with GFP_KERNEL during
usb_device_reset() might trigger I/O transfer on the storage interface
itself and cause deadlock because the 'us->dev_mutex' is held in
.pre_reset() and the storage interface can't do I/O transfer when the
reset is triggered by other interface, or the error handling can't be
completed if the reset is triggered by the storage itself (error handling
path).
Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Reviewed-by: Jiri Kosina <jkosina@suse.cz> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ming Lei [Wed, 20 Feb 2013 02:14:35 +0000 (13:14 +1100)]
pm / runtime: force memory allocation with no I/O during Runtime PM callbcack
Apply the introduced memalloc_noio_save() and memalloc_noio_restore() to
force memory allocation with no I/O during runtime_resume/runtime_suspend
callback on device with the flag of 'memalloc_noio' set.
Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ming Lei [Wed, 20 Feb 2013 02:14:35 +0000 (13:14 +1100)]
net/core: apply pm_runtime_set_memalloc_noio on network devices
Deadlock might be caused by allocating memory with GFP_KERNEL in
runtime_resume and runtime_suspend callback of network devices in iSCSI
situation, so mark network devices and its ancestor as 'memalloc_noio'
with the introduced pm_runtime_set_memalloc_noio().
Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ming Lei [Wed, 20 Feb 2013 02:14:35 +0000 (13:14 +1100)]
block/genhd.c: apply pm_runtime_set_memalloc_noio on block devices
Apply the introduced pm_runtime_set_memalloc_noio on block device so that
PM core will teach mm to not allocate memory with GFP_IOFS when calling
the runtime_resume and runtime_suspend callback for block devices and its
ancestors.
Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Introduce the flag memalloc_noio in 'struct dev_pm_info' to help PM core
to teach mm not allocating memory with GFP_KERNEL flag for avoiding
probable deadlock.
As explained in the comment, any GFP_KERNEL allocation inside
runtime_resume() or runtime_suspend() on any one of device in the path
from one block or network device to the root device in the device tree may
cause deadlock, the introduced pm_runtime_set_memalloc_noio() sets or
clears the flag on device in the path recursively.
Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ming Lei [Wed, 20 Feb 2013 02:14:34 +0000 (13:14 +1100)]
mm: teach mm by current context info to not do I/O during memory allocation
This patch introduces PF_MEMALLOC_NOIO on process flag('flags' field of
'struct task_struct'), so that the flag can be set by one task to avoid
doing I/O inside memory allocation in the task's context.
The patch trys to solve one deadlock problem caused by block device, and
the problem may happen at least in the below situations:
- during block device runtime resume, if memory allocation with
GFP_KERNEL is called inside runtime resume callback of any one of its
ancestors(or the block device itself), the deadlock may be triggered
inside the memory allocation since it might not complete until the block
device becomes active and the involed page I/O finishes. The situation
is pointed out first by Alan Stern. It is not a good approach to
convert all GFP_KERNEL[1] in the path into GFP_NOIO because several
subsystems may be involved(for example, PCI, USB and SCSI may be
involved for usb mass stoarage device, network devices involved too in
the iSCSI case)
- during block device runtime suspend, because runtime resume need to
wait for completion of concurrent runtime suspend.
- during error handling of usb mass storage deivce, USB bus reset will
be put on the device, so there shouldn't have any memory allocation with
GFP_KERNEL during USB bus reset, otherwise the deadlock similar with
above may be triggered. Unfortunately, any usb device may include one
mass storage interface in theory, so it requires all usb interface
drivers to handle the situation. In fact, most usb drivers don't know
how to handle bus reset on the device and don't provide .pre_set() and
.post_reset() callback at all, so USB core has to unbind and bind driver
for these devices. So it is still not practical to resort to GFP_NOIO
for solving the problem.
Also the introduced solution can be used by block subsystem or block
drivers too, for example, set the PF_MEMALLOC_NOIO flag before doing
actual I/O transfer.
It is not a good idea to convert all these GFP_KERNEL in the affected path
into GFP_NOIO because these functions doing that may be implemented as
library and will be called in many other contexts.
In fact, memalloc_noio_flags() can convert some of current static GFP_NOIO
allocation into GFP_KERNEL back in other non-affected contexts, at least
almost all GFP_NOIO in USB subsystem can be converted into GFP_KERNEL
after applying the approach and make allocation with GFP_NOIO only happen
in runtime resume/bus reset/block I/O transfer contexts generally.
[1], several GFP_KERNEL allocation examples in runtime resume path
- some individual usb drivers
usblp, uvc, gspca, most of dvb-usb-v2 media drivers, cpia2, az6007, ....
That is just what I have found. Unfortunately, this allocation can only
be found by human being now, and there should be many not found since any
function in the resume path(call tree) may allocate memory with
GFP_KERNEL.
Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Oliver Neukum <oneukum@suse.de> Cc: Jiri Kosina <jiri.kosina@suse.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Greg KH <greg@kroah.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Decotigny <david.decotigny@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>