Jiang Liu [Fri, 9 Nov 2012 03:03:42 +0000 (14:03 +1100)]
mm: fix a regression with HIGHMEM
Changeset 7f1290f2f2 ("mm: fix-up zone present pages") tried to fix an
issue in the calculation of zone->present_pages, but it causes a regression
on 32-bit systems with HIGHMEM. With that changeset,
reset_zone_present_pages() resets all zone->present_pages to zero, and
fixup_zone_present_pages() recalculates zone->present_pages when the
bootmem allocator frees core memory pages into the buddy allocator.
Because highmem pages are not freed by the bootmem allocator, all highmem
zones' present_pages become zero.

Actually there is no need to recalculate present_pages for highmem zones,
because the bootmem allocator never allocates pages from them. So fix the
regression by skipping highmem zones in reset_zone_present_pages() and
fixup_zone_present_pages().
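A minimal sketch of that skip, assuming the zone iteration helpers
available in this kernel (illustrative only, not the literal patch):

    void reset_zone_present_pages(void)
    {
            struct zone *z;
            int i, nid;

            for_each_node_state(nid, N_HIGH_MEMORY) {
                    for (i = 0; i < MAX_NR_ZONES; i++) {
                            z = NODE_DATA(nid)->node_zones + i;
                            if (!is_highmem(z))     /* leave highmem zones untouched */
                                    z->present_pages = 0;
                    }
            }
    }

fixup_zone_present_pages() gets the same is_highmem() test, so only zones
that the bootmem allocator actually frees into the buddy allocator are
recalculated.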
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
Reported-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Tested-by: Maciej Rutecki <maciej.rutecki@gmail.com>
Tested-by: Chris Clayton <chris2553@googlemail.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Fri, 9 Nov 2012 03:03:42 +0000 (14:03 +1100)]
doc: describe memcg swappiness more precisely
Since fe35004f ("mm: avoid swapping out with swappiness==0"), memcg reclaim
stops swapping out anon pages completely when swappiness is set to 0.
Although this is somewhat expected, it hasn't behaved this way for a really
long time, so it is probably better to be explicit about the effect.
Moreover, global reclaim swaps out even when swappiness is 0, in order to
prevent the OOM killer from firing.
The original issue (the wrong tasks getting killed in a small group with
memcg swappiness=0) has been reported on top of our 3.0-based kernel (with
fe35004f backported). I have tried to replicate it with the test case
mentioned at https://lkml.org/lkml/2012/10/10/223.

As David correctly pointed out (https://lkml.org/lkml/2012/10/10/418), a
significant role was played by the fact that all the processes in the group
have CAP_SYS_ADMIN, but oom_score_adj has a similar effect. Say there is 2G
of swap space, which is 524288 pages. If you add the CAP_SYS_ADMIN bonus,
the bias is -15728 points. This means that all tasks with less than 60M get
the minimum score, and it is then task ordering which determines who gets
killed as a result.
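For concreteness, a tiny standalone computation of those numbers, assuming
4KiB pages and the 3% CAP_SYS_ADMIN discount applied by oom_badness() at
the time:

    #include <stdio.h>

    int main(void)
    {
            /* 2G of swap expressed in 4KiB pages: 524288 */
            unsigned long totalpages = (2UL << 30) >> 12;
            /* the CAP_SYS_ADMIN bonus is 3% of totalpages: 15728 */
            unsigned long bonus = 30 * totalpages / 1000;

            printf("totalpages = %lu pages\n", totalpages);
            printf("bonus = %lu pages (~%lu MB)\n",
                   bonus, bonus * 4096 / (1024 * 1024));  /* ~61 MB, the "60M" above */
            return 0;
    }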
To summarize: users of small groups (relative to the swap size) with
CAP_SYS_ADMIN tasks or a non-zero oom_score_adj are affected the most;
others might see an unexpected oom_badness calculation. Whether this is a
representative workload I don't know, but I think it is worth fixing and
pushing to stable as well.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Fri, 9 Nov 2012 03:03:42 +0000 (14:03 +1100)]
memcg: oom: fix totalpages calculation for memory.swappiness==0
oom_badness() takes a totalpages argument which says how many pages are
available, and it uses it as the base for the score calculation. The value
is calculated by mem_cgroup_get_limit(), which considers both the limit and
total_swap_pages (or the memsw portion of it).

This is usually correct, but since fe35004f ("mm: avoid swapping out with
swappiness==0") we do not swap when swappiness is 0, which means that we
cannot really use up all of totalpages. This in turn confuses the oom score
calculation when the memcg limit is much smaller than the available swap,
because the used memory (capped by the limit) is negligible compared to
totalpages, so the resulting score is too small whenever adj != 0
(typically a task with CAP_SYS_ADMIN or a non-zero oom_score_adj). The
wrong process might be selected as a result.

The problem can be worked around by checking for mem_cgroup_swappiness == 0
and not considering swap at all in that case.
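A sketch of that workaround, following the general shape of
mem_cgroup_get_limit() (illustrative, not necessarily the literal patch):

    u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
    {
            u64 limit = res_counter_read_u64(&memcg->res, RES_LIMIT);

            /*
             * With swappiness == 0 anonymous memory is never swapped out,
             * so swap space must not inflate the base used by oom_badness().
             */
            if (mem_cgroup_swappiness(memcg)) {
                    u64 memsw;

                    limit += total_swap_pages << PAGE_SHIFT;
                    memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
                    /* memsw, if set, can only further restrict memory + swap */
                    limit = min(limit, memsw);
            }
            return limit;
    }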
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Fri, 9 Nov 2012 03:03:41 +0000 (14:03 +1100)]
mm: fix build warning for uninitialized value
do_wp_page() sets mmun_called if mmun_start and mmun_end were initialized
and, if so, may call mmu_notifier_invalidate_range_end() with these
values. This doesn't prevent gcc from emitting a build warning though:
mm/memory.c: In function `do_wp_page':
mm/memory.c:2530: warning: `mmun_start' may be used uninitialized in this function
mm/memory.c:2531: warning: `mmun_end' may be used uninitialized in this function
It's much easier to initialize the variables to impossible values and do a
simple comparison to determine whether they were initialized, which removes
the bool entirely.
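A sketch of the resulting pattern (an illustrative excerpt; only the names
from the warning above are taken from the source):

    unsigned long mmun_start = 0;   /* an empty range means "not initialized" */
    unsigned long mmun_end = 0;

    /* ... the copy path may or may not set mmun_start/mmun_end ... */

    if (mmun_end > mmun_start)      /* a real range always has end > start */
            mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);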
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Iterating over vma->anon_vma_chain without the anon_vma lock may cause a
NULL pointer dereference in anon_vma_interval_tree_verify(), because a node
in the chain might have been removed concurrently.
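A sketch of the locking implied here, assuming the
anon_vma_lock()/anon_vma_unlock() helpers of this kernel series
(illustrative only):

    struct anon_vma *anon_vma = vma->anon_vma;
    struct anon_vma_chain *avc;

    if (anon_vma) {
            anon_vma_lock(anon_vma);        /* keep the chain stable while verifying */
            list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
                    anon_vma_interval_tree_verify(avc);
            anon_vma_unlock(anon_vma);
    }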
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Bob Liu <lliubbo@gmail.com>
Signed-off-by: Michel Lespinasse <walken@google.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Fri, 9 Nov 2012 03:03:41 +0000 (14:03 +1100)]
tmpfs: change final i_blocks BUG to WARNING
Under a particular load on one machine, I have hit shmem_evict_inode()'s
BUG_ON(inode->i_blocks), enough times to narrow it down to a particular
race between swapout and eviction.
It comes from the "if (freed > 0)" asymmetry in shmem_recalc_inode(), and
the lack of coherent locking between mapping's nrpages and shmem's swapped
count. There's a window in shmem_writepage(), between lowering nrpages in
shmem_delete_from_page_cache() and then raising swapped count, when the
freed count appears to be +1 when it should be 0, and then the asymmetry
stops it from being corrected with -1 before hitting the BUG.
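For reference, a simplified sketch of the accounting in question (the real
shmem_recalc_inode() also returns the freed blocks to the superblock
counters):

    static void shmem_recalc_inode(struct inode *inode)
    {
            struct shmem_inode_info *info = SHMEM_I(inode);
            long freed;

            freed = info->alloced - info->swapped - inode->i_mapping->nrpages;
            if (freed > 0) {        /* the asymmetry: negative values are ignored */
                    info->alloced -= freed;
                    inode->i_blocks -= freed * BLOCKS_PER_PAGE;
            }
    }

In the shmem_writepage() window described above, nrpages has already
dropped but swapped has not yet risen, so freed momentarily reads as +1,
and the "> 0" test then prevents the later -1 from cancelling it.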
One answer is coherent locking: using tree_lock throughout, without
info->lock; reasonable, but the raw_spin_lock in percpu_counter_add() on
used_blocks makes that messier than expected. Another answer may be a
further effort to eliminate the weird shmem_recalc_inode() altogether, but
previous attempts at that failed.
So far undecided, but for now change the BUG_ON to WARN_ON: in usual
circumstances it remains a useful consistency check.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Thanks to Johannes for pointing to truncation: free_swap_and_cache() only
does a trylock on the page, so the page lock we've held since before
confirming swap is not enough to protect against truncation.
What cleanup is needed in this case? Just delete_from_swap_cache(), which
takes care of the memcg uncharge.
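A hypothetical sketch of that cleanup; the truncation check is invented for
illustration, only delete_from_swap_cache() comes from the text:

    /*
     * The page lock is not enough: free_swap_and_cache() only trylocks
     * the page, so truncation can remove our swap entry even while we
     * hold the lock. If that happened, drop the page from swap cache;
     * delete_from_swap_cache() also takes care of the memcg uncharge.
     */
    if (raced_with_truncation(mapping, index, swap)) {      /* hypothetical check */
            delete_from_swap_cache(page);
            error = -EEXIST;        /* retry from the top */
    }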
Reported-by: Dave Jones <davej@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>