Effective NUMA scheduling problem statement, described formally:
 * minimize interconnect traffic
For each task 't_i' we have memory; this memory can be spread over multiple
physical nodes. Let us denote this as 'p_i,k': the memory task 't_i' has on
node 'k', in [pages].
If a task shares memory with another task, let us denote this as
's_i,k': the memory shared between tasks including 't_i' residing on node
'k'.
Let 'M' be the distribution that governs all 'p' and 's', i.e. the page placement.
Similarly, let us define 'fp_i,k' and 'fs_i,k' respectively as the (average)
usage frequency over those memory regions [1/s], such that the products give
(average) bandwidths 'bp' and 'bs' in [pages/s].
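
(For concreteness, a minimal C sketch of the bookkeeping these definitions
imply; the structure and all field names here are made up for illustration,
not existing kernel data structures.)

  #define MAX_NODES 8   /* illustrative; the kernel would use MAX_NUMNODES */

  struct numa_task_stats {
          unsigned long p[MAX_NODES];  /* p_i,k: private pages on node k [pages] */
          unsigned long s[MAX_NODES];  /* s_i,k: shared pages on node k  [pages] */
          unsigned long fp[MAX_NODES]; /* fp_i,k: private usage frequency [1/s]  */
          unsigned long fs[MAX_NODES]; /* fs_i,k: shared usage frequency  [1/s]  */
  };

  /* bp_i,k = p_i,k * fp_i,k and bs_i,k = s_i,k * fs_i,k, in [pages/s] */
  static inline unsigned long bp(struct numa_task_stats *st, int k)
  {
          return st->p[k] * st->fp[k];
  }

  static inline unsigned long bs(struct numa_task_stats *st, int k)
  {
          return st->s[k] * st->fs[k];
  }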
(note: multiple tasks sharing memory naturally avoid duplicate accounting
       because each task will have its own access frequency 'fs')
(pjt: I think this frequency is more numerically consistent if you explicitly
      restrict p/s above to be the working-set. (It also makes explicit the
      requirement for <C0,M0> to change in response to a change in the
      working set.)

      Doing this does have the nice property that it lets you use your frequency
      measurement as a weak ordering for the benefit a task would receive when
      we can't fit everything.

      e.g. task1 has working set 10mb, f=90%
           task2 has working set 90mb, f=10%

      Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit
      from task1 being on the right node than task2. )
Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i':

  C: t_i -> {c_i, n_i}
This gives us the total interconnect traffic between nodes 'k' and 'l',
'T_k,l', as:

  T_k,l = \Sum_i (bp_i,l + bs_i,l) + \Sum_j (bp_j,k + bs_j,k)
          where n_i == k, n_j == l
And our goal is to obtain C0 and M0 such that:

  T_k,l(C0, M0) <= T_k,l(C, M) for all C, M where k != l
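
(Again purely illustrative: a direct transcription of the 'T_k,l' sum over a
hypothetical task array, reusing the sketch structure above and assuming a
'node' field holding 'n_i'.)

  struct task {
          struct numa_task_stats stats;
          int node;                    /* n_i, as given by the map C */
  };

  /* traffic between nodes k and l: tasks on k touching l, and vice versa */
  static unsigned long T(struct task *tasks, int nr_tasks, int k, int l)
  {
          unsigned long traffic = 0;
          int i;

          for (i = 0; i < nr_tasks; i++) {
                  if (tasks[i].node == k)
                          traffic += bp(&tasks[i].stats, l) +
                                     bs(&tasks[i].stats, l);
                  if (tasks[i].node == l)
                          traffic += bp(&tasks[i].stats, k) +
                                     bs(&tasks[i].stats, k);
          }
          return traffic;
  }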
(note: we could introduce 'nc(k,l)' as the cost function of accessing memory
       on node 'l' from node 'k'; this would be useful for bigger NUMA systems

 pjt:  I agree it is nice to have, but intuition suggests diminishing returns
       on more usual systems given factors like Haswell's enormous 35mb l3
       cache and QPI being able to do a direct fetch.)
(note: do we need a limit on the total memory per node?)


 * fairness
For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu
'c_n' has a compute capacity 'P_n'. Again using our map 'C', we can formulate
a load metric 'L_n' for each cpu:

  L_n = 1/P_n * \Sum_i w_i for all c_i = n
Using that we can formulate a load difference between CPUs:

  L_n,m = | L_n - L_m |
Which allows us to state the fairness goal like:

  L_n,m(C0) <= L_n,m(C) for all C, n != m
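
(A sketch of the load terms; the real scheduler uses fixed-point arithmetic
and per-entity load tracking, all omitted here. 'weight_sum[]' and
'capacity[]' are stand-ins for the per-cpu aggregates.)

  /* L_n = 1/P_n * \Sum_i w_i for all c_i = n */
  static unsigned long load(unsigned long *weight_sum,
                            unsigned long *capacity, int n)
  {
          return weight_sum[n] / capacity[n];
  }

  /* L_n,m = | L_n - L_m | */
  static unsigned long load_diff(unsigned long *weight_sum,
                                 unsigned long *capacity, int n, int m)
  {
          unsigned long ln = load(weight_sum, capacity, n);
          unsigned long lm = load(weight_sum, capacity, m);

          return ln > lm ? ln - lm : lm - ln;
  }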
(pjt: It can also be usefully stated that, having converged at C0:

   | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) |

      Where G_n,m is the greedy partition of tasks between L_n and L_m. This is
      the "worst" partition we should accept; but having it gives us a useful
      bound on how much we can reasonably adjust L_n/L_m at a Pareto point to
      favor T_k,l. )
Together they give us the complete multi-objective optimization problem:

  min_C,M [ L_n,m(C), T_k,l(C,M) ]


Notes:
 - the memory bandwidth problem is very much an inter-process problem; in
   particular, there is no such concept as a process in the above problem.
 - the naive solution would completely prefer fairness over interconnect
   traffic; the more complicated solution could pick another Pareto point using
   an aggregate objective function, such that we balance the loss of work
   efficiency against the gain of better placement. We'd want to more or less
   require a fixed bound on the error from the Pareto line for any such
   solution.
  http://en.wikipedia.org/wiki/Mathematical_optimization
  http://en.wikipedia.org/wiki/Multi-objective_optimization
 * warning, significant hand-waving ahead, improvements welcome *
Partial solutions / approximations:
 1) have task node placement be a pure preference from the 'fairness' pov.
This means we always prefer fairness over interconnect bandwidth. This
reduces the problem to:

  min_C,M [ T_k,l(C,M) ]
 2a) migrate memory towards 'n_i' (the task's node).
This creates memory movement such that 'p_i,k for k != n_i' becomes 0 --
provided 'n_i' stays stable enough and there's sufficient memory (it looks
like we might need memory limits for this).
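
(A sketch of the fault-driven migration in 2a; the types and helpers are
illustrative, not the actual sched/numa code.)

  struct page_info {
          int node;                    /* node the page currently lives on */
  };

  extern void migrate_page_to_node(struct page_info *page, int node);

  /* on a NUMA hinting fault, pull the page towards the faulting task */
  static void hinting_fault(struct task *t, struct page_info *page)
  {
          if (page->node != t->node)
                  migrate_page_to_node(page, t->node);
          /* over time, p_i,k for k != n_i tends to 0 */
  }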
This does not, however, provide us with any 's_i' (shared) information. It
does remove 'M', since it defines memory placement in terms of task
placement.
XXX properties of this M vs a potential optimal
 2b) migrate memory towards 'n_i' using 2 samples.
This separates pages into those that will migrate and those that will not,
due to the two samples not matching. We could consider the first set to be
'p_i' (private) and the second to be 's_i' (shared).
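
(A sketch of the two-sample test: remember which node faulted on the page
last scan and only migrate when the two samples agree. The 'last_node' field
is illustrative, assumed added to the page_info sketch above.)

  static void hinting_fault_2s(struct task *t, struct page_info *page)
  {
          int match = page->last_node == t->node;

          page->last_node = t->node;

          /* two samples agreeing suggests private ('p_i'): migrate;
             a mismatch suggests shared ('s_i'): leave the page alone */
          if (match && page->node != t->node)
                  migrate_page_to_node(page, t->node);
  }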
This interpretation can be motivated by the previously observed property that
'p_i,k for k != n_i' should become 0 given sufficient memory, leaving only
's_i' (shared). (Here we lose the need for memory limits again, since private
memory that doesn't fit becomes indistinguishable from shared.)
XXX include the statistical babble on double sampling somewhere near
This reduces the problem further; we lose 'M' as per 2a, and it further
reduces the 'T_k,l' (interconnect traffic) term to only include shared memory
(since per the above all private memory will be local):
  T_k,l = \Sum_i bs_i,l for every n_i = k, l != k
[ This more or less matches the state of sched/numa and describes its
  remaining problems and assumptions. It should work well for tasks without
  significant shared memory usage between tasks. ]
Possible future directions:
Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
can evaluate it:
 3a) add per-task per-node counters
At fault time, count the number of pages the task faults on for each node.
This should give an approximation of 'p_i' for the local node and of 's_i,k'
for all the other nodes.
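
(Sketch: extend the illustrative task structure with a made-up per-node
'faults[]' array and bump it on every hinting fault.)

  /* one counter per node, incremented at fault time */
  static void account_fault(struct task *t, int page_node)
  {
          t->faults[page_node]++;      /* ~ p_i when page_node == n_i,
                                          ~ s_i,k for the other nodes  */
  }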
While these numbers provide pages per scan, and so have the unit [pages/s],
they don't count repeat accesses and thus aren't actually representative of
our frequency terms.
 3b) additional frequency term
Additionally (or instead, if it turns out we don't need the raw 'p' and 's'
numbers) we can approximate the repeat accesses by using the time since the
pages were last marked as an indication of the access frequency.
Let 'I' be the interval of marking pages and 'e' the elapsed time since the
last marking; then we could estimate the number of accesses 'a' as 'a = I / e'.
If we then increment the node counters using 'a' instead of 1, we might get
a better estimate of the bandwidth terms.
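
(Sketch of the weighted accounting; 'last_marked' is a hypothetical per-page
timestamp taken when the page was marked.)

  static void account_fault_freq(struct task *t, struct page_info *page,
                                 unsigned long now, unsigned long I)
  {
          unsigned long e = now - page->last_marked;
          unsigned long a = I / (e ? e : 1);  /* a = I / e, clamped for e == 0 */

          t->faults[page->node] += a;         /* weigh by estimated accesses */
  }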
 3c) additional averaging; can be applied on top of either a/b.
[ Rik argues that decaying averages on 3a might be sufficient for bandwidth,
  since the decaying avg includes the old accesses and therefore has a measure
  of repeat accesses.
  Rik also argued that the sample frequency is too low to get accurate access
  frequency measurements; I'm not entirely convinced: even at low sample
  frequencies the avg elapsed time 'e' over multiple samples should still
  give us a fair approximation of the avg access frequency 'a'.
  So doing both b and c has a fair chance of working and allowing us to
  distinguish between important and less important memory accesses.
  Experimentation has shown no benefit from the added frequency term so far. ]
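
(Sketch of 3c; an exponential decay that halves the old average each scan
period is one simple choice, with 'faults_avg[]' as a hypothetical field.)

  /* decay the running average and fold in the new scan's counts */
  static void decay_faults(struct task *t, int nr_nodes)
  {
          int k;

          for (k = 0; k < nr_nodes; k++) {
                  t->faults_avg[k] = t->faults_avg[k] / 2 + t->faults[k];
                  t->faults[k] = 0;
          }
  }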
This will give us 'bp_i' and 'bs_i,k' so that we can approximately compute
'T_k,l'. Our optimization problem now reads:
  min_C [ \Sum_i bs_i,l for every n_i = k, l != k ]
This includes only shared terms, which makes sense since all task-private
memory will become local as per 2.
This suggests that if there is significant shared memory, we should try and
move towards it.
 4) move towards where 'most' memory is
The simplest significance test is comparing the biggest shared 's_i,k'
against the private 'p_i'. If we have more shared than private, move towards
it.
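
(Sketch of the significance test; 'shared_faults[]' and 'private_faults' are
illustrative stand-ins for the 's_i,k' and 'p_i' approximations from 3.
Scanning nodes in ascending order with a strict '>' gives the lower-node
tie-break mentioned in the note below.)

  static int preferred_node(struct task *t, int nr_nodes)
  {
          unsigned long max_shared = 0;
          int k, best = t->node;

          /* find the biggest shared pool */
          for (k = 0; k < nr_nodes; k++) {
                  if (t->shared_faults[k] > max_shared) {
                          max_shared = t->shared_faults[k];
                          best = k;
                  }
          }

          /* only move if shared outweighs private */
          return max_shared > t->private_faults ? best : t->node;
  }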
This effectively makes us move towards where most of our memory is, and forms
a feedback loop with 2: we migrate memory towards us, and we migrate towards
where 'most' memory is.
(Note: even if there were two tasks fully thrashing the same shared memory,
 it is very rare for there to be a 50/50 split in memory; lacking a perfect
 split, the smaller side will move towards the larger. In case of a perfect
 split, we tie-break towards the lower node number.)
 5) 'throttle' 4's node placement
Since per 2b our 's_i,k' and 'p_i' require at least two scans to 'stabilize'
and show representative numbers, we should limit node migration to not be
faster than this.
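
(Sketch of the throttle; 'last_migration' is a hypothetical per-task
timestamp, and two scan periods is the minimum per the argument above.)

  static int node_migrate_allowed(struct task *t, unsigned long now,
                                  unsigned long scan_period)
  {
          /* require at least two full scans between node migrations */
          if (now - t->last_migration < 2 * scan_period)
                  return 0;

          t->last_migration = now;
          return 1;
  }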
 n) poke holes in the previous points that require more stuff, and describe it.