7b540d0646ce ("proc_map_files_readdir(): don't bother with grabbing
files") switched proc_map_files_readdir() to use @f_mode directly instead of
grabbing @file reference, but same time the test for @vm_file presence was
lost leading to nil dereference. The patch brings the test back.
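The shape of the restored check, as a hedged sketch against the 3.7-era vma
walk (not the literal diff):

    /* Sketch only: anonymous mappings have no backing file, so skip
     * them before dereferencing vma->vm_file.
     */
    for (vma = mm->mmap; vma; vma = vma->vm_next) {
            if (!vma->vm_file)
                    continue;
            /* ... safe to consult vma->vm_file->f_mode here ... */
    }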
The whole proc_map_files feature is wrapped in CONFIG_CHECKPOINT_RESTORE
(which is set to 'n' by default), so the bug doesn't affect regular
kernels.
The regression is 3.7-rc1 only, as far as I can tell.
[gorcunov@openvz.org: provided changelog] Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Fri, 9 Nov 2012 03:03:42 +0000 (14:03 +1100)]
mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim
Jiri Slaby reported the following:
(It's an effective revert of "mm: vmscan: scale number of pages
reclaimed by reclaim/compaction based on failures".)
Given that kswapd had hours of runtime in ps/top output yesterday in the
morning, and that after the revert it's now 2 minutes in sum for the last
24h, I would say it's gone.
The intention of the patch in question was to compensate for the loss of
lumpy reclaim. Part of the reason lumpy reclaim worked is because it
aggressively reclaimed pages and this patch was meant to be a sane
compromise.
When compaction fails, it gets deferred, and both compaction and
reclaim/compaction are deferred to avoid excessive reclaim. However, since
commit c6543459 ("mm: remove __GFP_NO_KSWAPD"), kswapd is woken up each
time and continues reclaiming, which was not taken into account when the
patch was developed.
As this path does not take deferred compaction into account, kswapd scans
aggressively before bailing out at the compaction_deferred() check in
compaction_ready(). This patch stops kswapd from scaling the number of
pages to reclaim and leaves the aggressive reclaim to the process
attempting the THP allocation.
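A hedged sketch of the change, using the mm/vmscan.c names of that era
(should_continue_reclaim(), compact_defer_shift); it shows the shape of the
fix, not the literal diff:

    /* Scale the reclaim target by the compaction deferral shift only
     * for direct reclaim; kswapd is woken on every failed THP
     * allocation and would otherwise keep scanning aggressively.
     */
    pages_for_compaction = (2UL << sc->order);
    if (COMPACTION_BUILD && sc->order && !current_is_kswapd())
            pages_for_compaction <<= zone->compact_defer_shift;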
Signed-off-by: Mel Gorman <mgorman@suse.de> Reported-by: Jiri Slaby <jslaby@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiang Liu [Fri, 9 Nov 2012 03:03:42 +0000 (14:03 +1100)]
mm: fix a regression with HIGHMEM
Changeset 7f1290f2f2 ("mm: fix-up zone present pages") tried to fix an
issue in the calculation of zone->present_pages, but it causes a regression
on 32-bit systems with HIGHMEM. With that changeset,
reset_zone_present_pages() resets all zone->present_pages to zero, and
fixup_zone_present_pages() is called to recalculate zone->present_pages
when the boot allocator frees core memory pages into the buddy allocator.
Because highmem pages are not freed by the bootmem allocator, all highmem
zones' present_pages become zero.
Actually there's no need to recalculate present_pages for highmem zones
because the bootmem allocator never allocates pages from them. So fix the
regression by skipping highmem zones in reset_zone_present_pages() and
fixup_zone_present_pages().
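A minimal sketch of the skip, assuming the usual zone iterators (not the
literal patch):

    /* Leave highmem zones alone: bootmem never frees pages from them,
     * so their present_pages would be zeroed and never recalculated.
     */
    for_each_zone(zone) {
            if (is_highmem(zone))
                    continue;
            zone->present_pages = 0;  /* recalculated as bootmem frees pages */
    }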
Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Reported-by: Maciej Rutecki <maciej.rutecki@gmail.com> Tested-by: Maciej Rutecki <maciej.rutecki@gmail.com> Tested-by: Chris Clayton <chris2553@googlemail.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Fri, 9 Nov 2012 03:03:42 +0000 (14:03 +1100)]
doc: describe memcg swappiness more precisely
Since fe35004f ("mm: avoid swapping out with swappiness==0"), memcg reclaim
stops swapping out anon pages completely when a value of 0 is used.
Although this is somewhat expected, it hadn't behaved that way for a really
long time, so it is probably better to be explicit about the effect.
Moreover, global reclaim swaps out even when swappiness is 0, to avoid
invoking the OOM killer.
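A rough model of these semantics (vmscan_swappiness() and global_reclaim()
are real helpers of that era; near_oom() is a hypothetical stand-in for the
actual watermark check):

    if (!vmscan_swappiness(sc)) {
            if (!global_reclaim(sc))
                    goto skip_anon;      /* memcg reclaim: never swap */
            if (!near_oom(zone))         /* hypothetical helper */
                    goto skip_anon;      /* global reclaim: swap only near OOM */
    }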
The original issue (the wrong tasks get killed in a small group with memcg
swappiness=0) was reported on top of our 3.0-based kernel (with fe35004f
backported). I have tried to replicate it with the test case mentioned at
https://lkml.org/lkml/2012/10/10/223.
As David correctly pointed out (https://lkml.org/lkml/2012/10/10/418), a
significant role is played by the fact that all the processes in the group
have CAP_SYS_ADMIN, but oom_score_adj has a similar effect. Say there are
2G of swap space, which is 524288 pages. If you add the CAP_SYS_ADMIN
bonus, then you have a bias of -15728 on the score. This means that all
tasks using less than about 60M get the minimum score, and it is task
ordering which determines who gets killed as a result.
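The numbers can be checked with a small stand-alone program (assuming 4K
pages and the 3% CAP_SYS_ADMIN bonus oom_badness() applied at the time):

    #include <stdio.h>

    int main(void)
    {
            long totalpages = 2L * 1024 * 1024 / 4; /* 2G of swap in 4K pages: 524288 */
            long bonus = totalpages * 30 / 1000;    /* 3% bonus: 15728 pages */

            /* the bias hides any task using less than bonus pages, ~61M */
            printf("totalpages=%ld bias=-%ld (~%ldM)\n",
                   totalpages, bonus, bonus * 4 / 1024);
            return 0;
    }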
To summarize: users of small groups (relative to the swap size) with
CAP_SYS_ADMIN tasks or a non-zero oom_score_adj are affected the most;
others might see an unexpected oom_badness calculation. Whether this
workload is representative I don't know, but I think it is worth fixing and
pushing to stable as well.
Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Fri, 9 Nov 2012 03:03:42 +0000 (14:03 +1100)]
memcg: oom: fix totalpages calculation for memory.swappiness==0
oom_badness() takes a totalpages argument which says how many pages are
available, and uses it as a base for the score calculation. The value is
calculated by mem_cgroup_get_limit(), which considers both the limit and
total_swap_pages (resp. the memsw portion of it).
This is usually correct, but since fe35004f ("mm: avoid swapping out with
swappiness==0") we do not swap when swappiness is 0, which means that we
cannot really use up all the totalpages pages. This in turn confuses the
oom score calculation when the memcg limit is much smaller than the
available swap, because the used memory (capped by the limit) is negligible
compared to totalpages, so the resulting score is too small if adj != 0
(typically a task with CAP_SYS_ADMIN or a non-zero oom_score_adj). The
wrong process might be selected as a result.
The problem can be worked around by checking mem_cgroup_swappiness==0 and
not considering swap at all in such a case.
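A sketch of the workaround, approximating the 3.7-era mem_cgroup_get_limit()
(not the literal patch):

    /* Only count swap towards the badness base when the group can
     * actually use it; with swappiness==0 it cannot.
     */
    limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
    if (mem_cgroup_swappiness(memcg)) {
            limit += total_swap_pages << PAGE_SHIFT;
            limit = min(limit,
                        res_counter_read_u64(&memcg->memsw, RES_LIMIT));
    }
    return limit;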
Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Fri, 9 Nov 2012 03:03:41 +0000 (14:03 +1100)]
mm: fix build warning for uninitialized value
do_wp_page() sets mmun_called if mmun_start and mmun_end were initialized
and, if so, may call mmu_notifier_invalidate_range_end() with these
values. This doesn't prevent gcc from emitting a build warning though:
mm/memory.c: In function `do_wp_page':
mm/memory.c:2530: warning: `mmun_start' may be used uninitialized in this function
mm/memory.c:2531: warning: `mmun_end' may be used uninitialized in this function
It's much easier to initialize the variables to impossible values and do a
simple comparison to determine whether they were initialized, removing the
bool entirely.
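A sketch of the resulting pattern (mmu_notifier_invalidate_range_end() is
the real callee; the surrounding do_wp_page() code is elided):

    unsigned long mmun_start = 0;   /* any real range has start < end, */
    unsigned long mmun_end = 0;     /* so 0/0 means "never set"        */

    /* ... paths that need the notifier fill in mmun_start/mmun_end
     * and call mmu_notifier_invalidate_range_start() ... */

    if (mmun_end > mmun_start)
            mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);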
Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Iterating over vma->anon_vma_chain without anon_vma_lock may cause a NULL
pointer dereference in anon_vma_interval_tree_verify(), because a node in
the chain might have been removed.
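A hedged sketch of the fix (anon_vma_lock()/anon_vma_unlock() are the
3.7-era lock helpers; not necessarily the literal diff):

    if (vma->anon_vma) {
            anon_vma_lock(vma->anon_vma);
            list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
                    anon_vma_interval_tree_verify(avc);
            anon_vma_unlock(vma->anon_vma);
    }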
Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Bob Liu <lliubbo@gmail.com> Signed-off-by: Michel Lespinasse <walken@google.com> Reviewed-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hugh Dickins [Fri, 9 Nov 2012 03:03:41 +0000 (14:03 +1100)]
tmpfs: change final i_blocks BUG to WARNING
Under a particular load on one machine, I have hit shmem_evict_inode()'s
BUG_ON(inode->i_blocks), enough times to narrow it down to a particular
race between swapout and eviction.
It comes from the "if (freed > 0)" asymmetry in shmem_recalc_inode(), and
the lack of coherent locking between mapping's nrpages and shmem's swapped
count. There's a window in shmem_writepage(), between lowering nrpages in
shmem_delete_from_page_cache() and then raising swapped count, when the
freed count appears to be +1 when it should be 0, and then the asymmetry
stops it from being corrected with -1 before hitting the BUG.
One answer is coherent locking: using tree_lock throughout, without
info->lock; reasonable, but the raw_spin_lock in percpu_counter_add() on
used_blocks makes that messier than expected. Another answer may be a
further effort to eliminate the weird shmem_recalc_inode() altogether, but
previous attempts at that failed.
So far undecided, but for now change the BUG_ON to WARN_ON: in usual
circumstances it remains a useful consistency check.
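The change itself amounts to a one-liner in shmem_evict_inode():

    WARN_ON(inode->i_blocks);       /* was BUG_ON(inode->i_blocks) */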
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Thanks to Johannes for pointing to truncation: free_swap_and_cache() only
does a trylock on the page, so the page lock we've held since before
confirming swap is not enough to protect against truncation.
What cleanup is needed in this case? Just delete_from_swap_cache(), which
takes care of the memcg uncharge.
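A sketch of the resulting cleanup (the condition is a hypothetical stand-in
for the actual truncation check; delete_from_swap_cache(), unlock_page()
and page_cache_release() are the real calls):

    if (truncated_under_us) {               /* hypothetical condition */
            delete_from_swap_cache(page);   /* also uncharges the memcg */
            unlock_page(page);
            page_cache_release(page);
            error = -EEXIST;                /* caller retries the lookup */
    }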
Reported-by: Dave Jones <davej@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>