From: Vladimir Davydov
Date: Fri, 8 Feb 2013 07:10:46 +0000 (+0400)
Subject: sched: Initialize cfs_rq->runtime_remaining to non-zero on cfs bw set
X-Git-Tag: next-20130218~38^2~25^2
X-Git-Url: https://git.karo-electronics.de/?a=commitdiff_plain;h=0a702bb8af3c1b2dff355fb3c27e7f7d5285e30b;p=karo-tx-linux.git

If cfs_rq->runtime_remaining is <= 0, then either:
 - the cfs_rq is throttled and waiting for quota redistribution, or
 - the cfs_rq is currently executing and will be throttled on
   put_prev_entity, or
 - the cfs_rq is not throttled and has not executed since its quota was
   set (runtime_remaining is set to 0 on cfs bandwidth reconfiguration).

The last case is clearly an exception to the rule "runtime_remaining <= 0
iff the cfs_rq is throttled or will be throttled as soon as it finishes
its execution". Moreover, it can lead to a task hang, as follows.

If put_prev_task() is called immediately after the first pick_next_task()
after the quota was set, "immediately" meaning that rq->clock is the same
in both functions, then the corresponding cfs_rq will be throttled.
Besides being unfair (the cfs_rq has not in fact executed), this is
dangerous: the quota refilling timer may be idle at that time, and it
won't be activated on put_prev_task(), because update_curr() calls
account_cfs_rq_runtime(), which activates the timer, only if delta_exec
is strictly positive. As a result, we can get a task "running" inside a
throttled cfs_rq which will probably never be unthrottled.

To avoid the problem, this patch makes tg_set_cfs_bandwidth() initialize
runtime_remaining of each cfs_rq to 1 instead of 0, so that a cfs_rq will
be throttled only if it has executed for some positive number of
nanoseconds.

Our customers have encountered such hangs inside a VM several times (it
seems something is wrong, or rather different, in time accounting there).
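
The failure mode can be illustrated with a minimal userspace sketch in C.
It is a heavily simplified model, not kernel code: the model_* structures
and functions are hypothetical stand-ins for cfs_rq, cfs_bandwidth,
update_curr()/account_cfs_rq_runtime(), assign_cfs_rq_runtime() and
put_prev_entity()/check_cfs_rq_runtime(). It ignores locking, runtime
expiry, rescheduling, the task-group hierarchy and the unthrottling done
from the period timer, and it assumes, as described above, that runtime
is only accounted (and an idle timer only restarted) when delta_exec > 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct cfs_bandwidth. */
struct model_bandwidth {
	bool timer_active;   /* models cfs_bandwidth.timer_active */
	int64_t runtime;     /* quota left in the global pool (ns) */
};

/* Hypothetical, simplified stand-in for struct cfs_rq. */
struct model_cfs_rq {
	struct model_bandwidth *bw;
	bool runtime_enabled;
	bool throttled;
	int64_t runtime_remaining;
};

/* Roughly assign_cfs_rq_runtime(): restart an idle timer, then pull up
 * to 'slice' ns from the global pool into the local cfs_rq. */
static void model_assign_runtime(struct model_cfs_rq *cfs_rq, int64_t slice)
{
	struct model_bandwidth *bw = cfs_rq->bw;
	int64_t amount;

	assert(slice > 0);
	if (!bw->timer_active)
		bw->timer_active = true;

	amount = bw->runtime < slice ? bw->runtime : slice;
	if (amount > 0) {
		bw->runtime -= amount;
		cfs_rq->runtime_remaining += amount;
	}
}

/* Roughly update_curr() + account_cfs_rq_runtime(): when delta_exec is
 * 0, nothing is accounted and the timer is not touched. */
static void model_update_curr(struct model_cfs_rq *cfs_rq,
			      uint64_t delta_exec, int64_t slice)
{
	if (delta_exec == 0)
		return;

	cfs_rq->runtime_remaining -= (int64_t)delta_exec;
	if (cfs_rq->runtime_remaining <= 0)
		model_assign_runtime(cfs_rq, slice);
}

/* Roughly put_prev_entity() -> check_cfs_rq_runtime(). */
static void model_put_prev(struct model_cfs_rq *cfs_rq,
			   uint64_t delta_exec, int64_t slice)
{
	model_update_curr(cfs_rq, delta_exec, slice);
	if (cfs_rq->runtime_enabled && !cfs_rq->throttled &&
	    cfs_rq->runtime_remaining <= 0)
		cfs_rq->throttled = true;
}

/* Reconfigure bandwidth with the given initial runtime_remaining while
 * the timer is idle, then call put_prev with delta_exec == 0 (same clock
 * as pick_next). Returns true if the cfs_rq ends up throttled with no
 * active timer left to unthrottle it. */
bool model_hangs(int64_t initial_remaining)
{
	struct model_bandwidth bw = { .timer_active = false, .runtime = 10000000 };
	struct model_cfs_rq cfs_rq = {
		.bw = &bw,
		.runtime_enabled = true,
		.throttled = false,
		.runtime_remaining = initial_remaining,
	};

	model_put_prev(&cfs_rq, 0, 5000000);
	return cfs_rq.throttled && !bw.timer_active;
}
```

Under these assumptions, model_hangs(0) reports a throttled cfs_rq with
no active timer, while model_hangs(1) leaves the cfs_rq unthrottled. Once
the cfs_rq really runs (delta_exec > 0), model_update_curr() goes through
model_assign_runtime(), which restarts the timer, so the initial value of
1 only matters for the zero-delta case.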

Analyzing crash dumps revealed that the hung tasks were running inside
cfs_rq's with the following setup:

  cfs_rq->throttled=1
  cfs_rq->runtime_enabled=1
  cfs_rq->runtime_remaining=0
  cfs_rq->tg->cfs_bandwidth.idle=1
  cfs_rq->tg->cfs_bandwidth.timer_active=0

which matches the explanation given above quite well.

Signed-off-by: Vladimir Davydov
Cc: 
Cc: Peter Zijlstra
Cc: Paul Turner
Link: http://lkml.kernel.org/r/1360307446-26978-1-git-send-email-vdavydov@parallels.com
Signed-off-by: Ingo Molnar
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0bebba..c7a078f39bb7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7686,7 +7686,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
 
 		raw_spin_lock_irq(&rq->lock);
 		cfs_rq->runtime_enabled = runtime_enabled;
-		cfs_rq->runtime_remaining = 0;
+		cfs_rq->runtime_remaining = 1;
 
 		if (cfs_rq->throttled)
 			unthrottle_cfs_rq(cfs_rq);