cgroup: Fix task counter common ancestor logic

The task counter subsystem has been written assuming that
can_attach_task/attach_task/cancel_attach_task calls are serialized per
task. This is true when we attach only one task but not when we attach a
whole thread group, in which case the sequence is:

	for each thread
		if (can_attach_task() < 0)
			goto rollback

	for each thread
		attach_task()

rollback:
	for each thread
		cancel_attach_task()

The common ancestor, looked up in task_counter_attach_task(), can thus
change between these calls for a given task. This breaks when some
threads in the group do not originate from the same cgroup: the uncharge
made in attach_task(), or the rollback in cancel_attach_task(), is then
propagated up to the wrong ancestor.
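
To make the dependency on the origin concrete, here is a minimal userspace
sketch. It is not kernel code: struct toy_cgroup, toy_depth() and
toy_common_ancestor() are hypothetical names, and the only assumption is a
single tree linked by parent pointers. It shows that the ancestor is a
function of both the origin and the destination, so two threads of the same
group can legitimately need different ancestors for the same move.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a cgroup: only the parent link matters here. */
struct toy_cgroup {
	struct toy_cgroup *parent;	/* NULL for the root */
};

/* Depth of a cgroup in the tree (the root has depth 0). */
static int toy_depth(const struct toy_cgroup *cg)
{
	int depth = 0;

	for (; cg->parent; cg = cg->parent)
		depth++;
	return depth;
}

/* Closest cgroup that is an ancestor of, or equal to, both arguments. */
static const struct toy_cgroup *
toy_common_ancestor(const struct toy_cgroup *a, const struct toy_cgroup *b)
{
	int da = toy_depth(a), db = toy_depth(b);

	/* Bring both pointers to the same depth first. */
	for (; da > db; da--)
		a = a->parent;
	for (; db > da; db--)
		b = b->parent;
	/* Then climb in lockstep until they meet. */
	while (a != b) {
		assert(a->parent && b->parent);	/* both must share a root */
		a = a->parent;
		b = b->parent;
	}
	return a;
}

/* Hierarchy matching the reproducer layout: root -> cgroup0 */
static struct toy_cgroup toy_root = { NULL };
static struct toy_cgroup toy_cgroup0 = { &toy_root };
```

For a move to toy_cgroup0, a thread coming from toy_root gets toy_root as
its ancestor, while a thread already in toy_cgroup0 gets toy_cgroup0 itself.
An ancestor computed for one of them cannot be reused for the other.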
This can even break badly in some scenarios. For example, with $PID
being the pid of a multithreaded process:

	mkdir /dev/cgroup/cgroup0
	echo $PID > /dev/cgroup/cgroup0/cgroup.procs
	echo $PID > /dev/cgroup/tasks
	echo $PID > /dev/cgroup/cgroup0/cgroup.procs

On the last move, attach_task() is called on the thread leader with
the wrong common_ancestor, leading to a crash because we uncharge
a res_counter that doesn't exist:

[ 149.805063] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
[ 149.806013] IP: [<ffffffff810a0172>] __lock_acquire+0x62/0x15d0
[ 149.806013] PGD 51d38067 PUD 5119e067 PMD 0
[ 149.806013] Oops: 0000 [#1] PREEMPT SMP
[ 149.806013] Dumping ftrace buffer:
[ 149.806013] (ftrace buffer empty)
[ 149.806013] CPU 3
[ 149.806013] Modules linked in:
[ 149.806013]
[ 149.806013] Pid: 1111, comm: spread_thread_g Not tainted 3.1.0-rc3+ #165 FUJITSU SIEMENS AMD690VM-FMH/AMD690VM-FMH
[ 149.806013] RIP: 0010:[<ffffffff810a0172>] [<ffffffff810a0172>] __lock_acquire+0x62/0x15d0
[ 149.806013] RSP: 0018:ffff880051479b38 EFLAGS: 00010046
[ 149.806013] RAX: 0000000000000046 RBX: 0000000000000040 RCX: 0000000000000000
[ 149.868002] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000040
[ 149.868002] RBP: ffff880051479c08 R08: 0000000000000002 R09: 0000000000000001
[ 149.868002] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
[ 149.868002] R13: 0000000000000000 R14: 0000000000000000 R15: ffff880051f120a0
[ 149.868002] FS: 00007f1e01dd7700(0000) GS:ffff880057d80000(0000) knlGS:0000000000000000
[ 149.868002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 149.868002] CR2: 0000000000000040 CR3: 0000000051c59000 CR4: 00000000000006e0
[ 149.868002] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 149.868002] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 149.868002] Process spread_thread_g (pid: 1111, threadinfo ffff880051478000, task ffff880051f120a0)
[ 149.868002] Stack:
[ 149.868002] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 149.868002] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 149.868002] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 149.868002] Call Trace:
[ 149.868002] [<ffffffff810a1d32>] lock_acquire+0xa2/0x1a0
[ 149.868002] [<ffffffff810c373c>] ? res_counter_uncharge_until+0x4c/0xb0
[ 149.868002] [<ffffffff8180802b>] _raw_spin_lock+0x3b/0x50
[ 149.868002] [<ffffffff810c373c>] ? res_counter_uncharge_until+0x4c/0xb0
[ 149.868002] [<ffffffff810c373c>] res_counter_uncharge_until+0x4c/0xb0
[ 149.868002] [<ffffffff810c26c4>] task_counter_attach_task+0x44/0x50
[ 149.868002] [<ffffffff810bffcd>] cgroup_attach_proc+0x5ad/0x720
[ 149.868002] [<ffffffff810bfa99>] ? cgroup_attach_proc+0x79/0x720
[ 149.868002] [<ffffffff810c01cf>] attach_task_by_pid+0x8f/0x220
[ 149.868002] [<ffffffff810c0230>] ? attach_task_by_pid+0xf0/0x220
[ 149.868002] [<ffffffff810c0230>] ? attach_task_by_pid+0xf0/0x220
[ 149.868002] [<ffffffff810c0388>] cgroup_procs_write+0x28/0x40
[ 149.868002] [<ffffffff810c0bd9>] cgroup_file_write+0x209/0x2f0
[ 149.868002] [<ffffffff812b8d08>] ? apparmor_file_permission+0x18/0x20
[ 149.868002] [<ffffffff8127ef43>] ? security_file_permission+0x23/0x90
[ 149.868002] [<ffffffff81157038>] vfs_write+0xc8/0x190
[ 149.868002] [<ffffffff811571f1>] sys_write+0x51/0x90
[ 149.868002] [<ffffffff818102c2>] system_call_fastpath+0x16/0x1b

To solve this, cache the original cgroup of each thread of the thread
group in the flex array and pass it to can_attach_task(), attach_task()
and cancel_attach_task(), so that the correct common ancestor between the
old and the new cgroup can be safely retrieved for each task.
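
The approach can be sketched in userspace as follows. This is a toy model,
not the patch itself: struct toy_cg, every toy_*() helper and the demo_*
data are hypothetical, a plain array stands in for the flex array, and the
accounting rules (charge the destination side up to the common ancestor on
can_attach, uncharge the origin side on attach, uncharge the destination
side on cancel) are an assumption about how a hierarchical task counter
behaves. Unlike the simplified sequence above, the rollback here only
cancels the threads that were actually charged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy cgroup carrying a task counter; not a kernel structure. */
struct toy_cg {
	struct toy_cg *parent;	/* NULL for the root */
	int usage;		/* tasks charged at this level */
	int limit;		/* maximum usage allowed at this level */
};

static int toy_cg_depth(const struct toy_cg *cg)
{
	int depth = 0;

	for (; cg->parent; cg = cg->parent)
		depth++;
	return depth;
}

/* Closest cgroup containing both a and b (assumes one shared root). */
static struct toy_cg *toy_cg_ancestor(struct toy_cg *a, struct toy_cg *b)
{
	int da = toy_cg_depth(a), db = toy_cg_depth(b);

	for (; da > db; da--)
		a = a->parent;
	for (; db > da; db--)
		b = b->parent;
	while (a != b) {
		assert(a->parent && b->parent);
		a = a->parent;
		b = b->parent;
	}
	return a;
}

/* Drop one task from every level in [from, until). */
static void toy_uncharge_until(struct toy_cg *from, struct toy_cg *until)
{
	for (; from != until; from = from->parent)
		from->usage--;
}

/* Charge the destination side, up to the ancestor shared with the origin. */
static int toy_can_attach_task(struct toy_cg *dst, struct toy_cg *origin)
{
	struct toy_cg *until = toy_cg_ancestor(origin, dst);
	struct toy_cg *cg;

	for (cg = dst; cg != until; cg = cg->parent) {
		if (cg->usage >= cg->limit) {
			toy_uncharge_until(dst, cg);	/* undo the partial charge */
			return -1;
		}
		cg->usage++;
	}
	return 0;
}

/* Commit: release the charge still held on the origin side. */
static void toy_attach_task(struct toy_cg *dst, struct toy_cg *origin)
{
	toy_uncharge_until(origin, toy_cg_ancestor(origin, dst));
}

/* Rollback: release what toy_can_attach_task() charged. */
static void toy_cancel_attach_task(struct toy_cg *dst, struct toy_cg *origin)
{
	toy_uncharge_until(dst, toy_cg_ancestor(origin, dst));
}

/* Move n threads to dst; origin[i] was cached for thread i before any move. */
static int toy_attach_proc(struct toy_cg *dst, struct toy_cg **origin, int n)
{
	int i, charged;

	for (charged = 0; charged < n; charged++)
		if (toy_can_attach_task(dst, origin[charged]) < 0)
			goto rollback;
	for (i = 0; i < n; i++)
		toy_attach_task(dst, origin[i]);
	return 0;
rollback:
	for (i = 0; i < charged; i++)
		toy_cancel_attach_task(dst, origin[i]);
	return -1;
}

/* Demo data: two threads, one sitting in the root and one in demo_cg0. */
static struct toy_cg demo_root = { NULL, 2, 8 };
static struct toy_cg demo_cg0 = { &demo_root, 1, 8 };
static struct toy_cg demo_small = { &demo_root, 0, 1 };	/* room for one task */
static struct toy_cg *demo_origins[] = { &demo_root, &demo_cg0 };
```

Calling toy_attach_proc(&demo_small, demo_origins, 2) fails on the second
thread and leaves every counter as it was. Calling
toy_attach_proc(&demo_cg0, demo_origins, 2) then succeeds and charges
demo_cg0 only for the thread that came from the root, because each callback
derives the ancestor from that thread's own cached origin.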
This is inspired by a previous patch from Li Zefan:
"[PATCH] cgroups: don't cache common ancestor in task counter subsys".
Reported-by: Ben Blum <bblum@andrew.cmu.edu>
Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Tim Hockin <thockin@hockin.org>
Cc: Tejun Heo <htejun@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>