mm: vmscan: fix the page state calculation in too_many_isolated

It has been observed that multiple tasks sometimes get blocked for a long
time in the congestion_wait loop below, in shrink_inactive_list. This
happens because the vm_stat values are not synced.
    (__schedule) from [<c0a03328>]
    (schedule_timeout) from [<c0a04940>]
    (io_schedule_timeout) from [<c01d585c>]
    (congestion_wait) from [<c01cc9d8>]
    (shrink_inactive_list) from [<c01cd034>]
    (shrink_zone) from [<c01cdd08>]
    (try_to_free_pages) from [<c01c442c>]
    (__alloc_pages_nodemask) from [<c01f1884>]
    (new_slab) from [<c09fcf60>]
    (__slab_alloc) from [<c01f1a6c>]
In one such instance, zone_page_state(zone, NR_ISOLATED_FILE) had returned
14, zone_page_state(zone, NR_INACTIVE_FILE) had returned 92, and GFP_IOFS
was set, which resulted in too_many_isolated returning true. But one of
the CPUs' pageset vm_stat_diff had NR_ISOLATED_FILE as "-14", so the
actual isolated count was zero. As there weren't any more updates to
NR_ISOLATED_FILE and the vmstat_update deferred work had not been
scheduled yet, 7 tasks were spinning in the congestion_wait loop for
around 4 seconds, in the direct reclaim path.
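The numbers above can be reproduced with a small userspace model. This is
only a sketch, not kernel code: struct counter, read_cheap, read_snapshot,
over_limit and NR_CPUS are hypothetical names, and the scaling of the
inactive count by 8 when GFP_IOFS is set is an assumption about the limit
check in too_many_isolated at the time of this patch.

```c
#define NR_CPUS 4	/* hypothetical CPU count for this model */

/* Simplified stand-in for one vmstat counter of a zone. */
struct counter {
	long global;		/* value already folded into the zone */
	long cpu_diff[NR_CPUS];	/* per-CPU deltas not yet folded in */
};

/* Cheap read: global value only, in the spirit of zone_page_state(). */
unsigned long read_cheap(const struct counter *c)
{
	return c->global < 0 ? 0 : (unsigned long)c->global;
}

/*
 * Accurate read: also adds the pending per-CPU deltas, in the spirit of
 * zone_page_state_snapshot(). It has to walk every CPU, so it costs more.
 */
unsigned long read_snapshot(const struct counter *c)
{
	long x = c->global;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		x += c->cpu_diff[cpu];
	return x < 0 ? 0 : (unsigned long)x;
}

/* Limit check; the shift by 3 under GFP_IOFS is an assumption (see above). */
int over_limit(unsigned long isolated, unsigned long inactive, int gfp_iofs)
{
	if (gfp_iofs)
		inactive >>= 3;
	return isolated > inactive;
}
```

With a global isolated count of 14, a pending delta of -14 on one CPU and
an inactive count of 92, the cheap read compares 14 against 92 >> 3 = 11
and reports too many isolated pages, while the snapshot read compares 0
against 11 and does not.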
This patch uses zone_page_state_snapshot instead, but restricts its usage
to avoid a performance penalty.
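One way to picture the restricted usage is a two-stage check: the cheap
counters are consulted first, and the per-CPU deltas are folded in only
when the cheap result says too many pages are isolated. The following is a
userspace sketch under that assumption, not the code of this patch;
struct zone_model, read_counter, check_isolated, model_too_many_isolated
and NR_CPUS are hypothetical names.

```c
#define NR_CPUS 4	/* hypothetical CPU count for this model */

/* Hypothetical stand-in for the two zone counters involved. */
struct zone_model {
	long isolated, inactive;	/* values already folded into the zone */
	long isolated_diff[NR_CPUS];	/* pending per-CPU deltas */
	long inactive_diff[NR_CPUS];
	unsigned int snapshot_checks;	/* how often the slow path ran */
};

/* Read a counter; when safe is set, also add the pending per-CPU deltas. */
static unsigned long read_counter(long global, const long *diff, int safe)
{
	long x = global;
	int cpu;

	if (safe)
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			x += diff[cpu];
	return x < 0 ? 0 : (unsigned long)x;
}

/* One limit check, using either the cheap or the snapshot-style read. */
static int check_isolated(struct zone_model *z, int gfp_iofs, int safe)
{
	unsigned long isolated = read_counter(z->isolated, z->isolated_diff, safe);
	unsigned long inactive = read_counter(z->inactive, z->inactive_diff, safe);

	if (safe)
		z->snapshot_checks++;
	if (gfp_iofs)
		inactive >>= 3;	/* assumed scaling when GFP_IOFS is set */
	return isolated > inactive;
}

/* Cheap check first; only a positive result is re-verified accurately. */
int model_too_many_isolated(struct zone_model *z, int gfp_iofs)
{
	if (!check_isolated(z, gfp_iofs, 0))
		return 0;
	return check_isolated(z, gfp_iofs, 1);
}
```

In this model the common case, where the cheap counters are already below
the limit, never pays for the walk over all CPUs; the slow path runs only
when a task would otherwise be sent into the wait loop.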
The vmstat sync interval is HZ (sysctl_stat_interval), but since
vmstat_work is declared as deferrable work, the timer trigger can be
deferred to the next non-deferrable timer expiry on a CPU which is idle.
This results in the vmstat syncing on an idle CPU being delayed by
seconds. Maybe in most cases this behavior is fine, except in cases like
this.
[akpm@linux-foundation.org: move zone_page_state_snapshot() fallback logic into too_many_isolated()]
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>