Effective NUMA scheduling problem statement, described formally:
 * minimize interconnect traffic
For each task 't_i' we have memory; this memory can be spread over multiple
physical nodes. Let us denote this as 'p_i,k': the memory task 't_i' has on
node 'k', in [pages].
If a task shares memory with another task, let us denote this as
's_i,k': the memory shared between tasks including 't_i' residing on node
'k'.
Let 'M' be the distribution that governs all 'p' and 's', i.e. the page placement.
Similarly, let us define 'fp_i,k' and 'fs_i,k' respectively as the (average)
usage frequency over those memory regions [1/s], such that the products give
(average) bandwidths 'bp' and 'bs' in [pages/s].
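
(For concreteness, a minimal C sketch of the bookkeeping these definitions
imply; the structure and all field names here are made up for illustration,
not existing kernel data structures.)

  #define MAX_NODES 8   /* illustrative; the kernel would use MAX_NUMNODES */

  struct numa_task_stats {
          unsigned long p[MAX_NODES];  /* p_i,k: private pages on node k [pages] */
          unsigned long s[MAX_NODES];  /* s_i,k: shared pages on node k  [pages] */
          unsigned long fp[MAX_NODES]; /* fp_i,k: private usage frequency [1/s]  */
          unsigned long fs[MAX_NODES]; /* fs_i,k: shared usage frequency  [1/s]  */
  };

  /* bp_i,k = p_i,k * fp_i,k and bs_i,k = s_i,k * fs_i,k, in [pages/s] */
  static inline unsigned long bp(struct numa_task_stats *st, int k)
  {
          return st->p[k] * st->fp[k];
  }

  static inline unsigned long bs(struct numa_task_stats *st, int k)
  {
          return st->s[k] * st->fs[k];
  }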
(note: multiple tasks sharing memory naturally avoid duplicate accounting
       because each task will have its own access frequency 'fs')
(pjt: I think this frequency is more numerically consistent if you explicitly
      restrict p/s above to be the working-set. (It also makes explicit the
      requirement for <C0,M0> to change in response to a change in the
      working set.)

      Doing this does have the nice property that it lets you use your frequency
      measurement as a weak ordering for the benefit a task would receive when
      we can't fit everything.

      e.g. task1 has working set 10mb, f=90%
           task2 has working set 90mb, f=10%

      Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit
      from task1 being on the right node than task2. )
Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i':

  C: t_i -> {c_i, n_i}
This gives us the total interconnect traffic between nodes 'k' and 'l',
'T_k,l', as:

  T_k,l = \Sum_i (bp_i,l + bs_i,l) + \Sum_j (bp_j,k + bs_j,k)
          where n_i == k, n_j == l
And our goal is to obtain C0 and M0 such that:

  T_k,l(C0, M0) <= T_k,l(C, M) for all C, M where k != l
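
(Again purely illustrative: a direct transcription of the 'T_k,l' sum over a
hypothetical task array, reusing the sketch structure above and assuming a
'node' field holding 'n_i'.)

  struct task {
          struct numa_task_stats stats;
          int node;                    /* n_i, as given by the map C */
  };

  /* traffic between nodes k and l: tasks on k touching l, and vice versa */
  static unsigned long T(struct task *tasks, int nr_tasks, int k, int l)
  {
          unsigned long traffic = 0;
          int i;

          for (i = 0; i < nr_tasks; i++) {
                  if (tasks[i].node == k)
                          traffic += bp(&tasks[i].stats, l) +
                                     bs(&tasks[i].stats, l);
                  if (tasks[i].node == l)
                          traffic += bp(&tasks[i].stats, k) +
                                     bs(&tasks[i].stats, k);
          }
          return traffic;
  }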
(note: we could introduce 'nc(k,l)' as the cost function of accessing memory
       on node 'l' from node 'k'; this would be useful for bigger NUMA systems

 pjt:  I agree it is nice to have, but intuition suggests diminishing returns
       on more usual systems given factors like Haswell's enormous 35mb l3
       cache and QPI being able to do a direct fetch.)
(note: do we need a limit on the total memory per node?)


 * fairness
For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu
'c_n' has a compute capacity 'P_n'. Again using our map 'C', we can formulate
a load metric 'L_n' for each cpu:

  L_n = 1/P_n * \Sum_i w_i for all c_i = n
Using that we can formulate a load difference between CPUs:

  L_n,m = | L_n - L_m |
Which allows us to state the fairness goal like:

  L_n,m(C0) <= L_n,m(C) for all C, n != m
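
(A sketch of the load terms; the real scheduler uses fixed-point arithmetic
and per-entity load tracking, all omitted here. 'weight_sum[]' and
'capacity[]' are stand-ins for the per-cpu aggregates.)

  /* L_n = 1/P_n * \Sum_i w_i for all c_i = n */
  static unsigned long load(unsigned long *weight_sum,
                            unsigned long *capacity, int n)
  {
          return weight_sum[n] / capacity[n];
  }

  /* L_n,m = | L_n - L_m | */
  static unsigned long load_diff(unsigned long *weight_sum,
                                 unsigned long *capacity, int n, int m)
  {
          unsigned long ln = load(weight_sum, capacity, n);
          unsigned long lm = load(weight_sum, capacity, m);

          return ln > lm ? ln - lm : lm - ln;
  }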
(pjt: It can also be usefully stated that, having converged at C0:

   | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) |

      Where G_n,m is the greedy partition of tasks between L_n and L_m. This is
      the "worst" partition we should accept; but having it gives us a useful
      bound on how much we can reasonably adjust L_n/L_m at a Pareto point to
      favor T_k,l. )
Together they give us the complete multi-objective optimization problem:

  min_C,M [ L_n,m(C), T_k,l(C,M) ]


Notes:
 - the memory bandwidth problem is very much an inter-process problem; in
   particular, there is no such concept as a process in the above problem.
 - the naive solution would completely prefer fairness over interconnect
   traffic; the more complicated solution could pick another Pareto point using
   an aggregate objective function, such that we balance the loss of work
   efficiency against the gain of better placement. We'd want to more or less
   require a fixed bound on the error from the Pareto line for any such
   solution.
  http://en.wikipedia.org/wiki/Mathematical_optimization
  http://en.wikipedia.org/wiki/Multi-objective_optimization
 * warning, significant hand-waving ahead, improvements welcome *
Partial solutions / approximations:
 1) have task node placement be a pure preference from the 'fairness' pov.
This means we always prefer fairness over interconnect bandwidth. This
reduces the problem to:

  min_C,M [ T_k,l(C,M) ]
 2a) migrate memory towards 'n_i' (the task's node).
This creates memory movement such that 'p_i,k for k != n_i' becomes 0 --
provided 'n_i' stays stable enough and there's sufficient memory (it looks
like we might need memory limits for this).
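
(A sketch of the fault-driven migration in 2a; the types and helpers are
illustrative, not the actual sched/numa code.)

  struct page_info {
          int node;                    /* node the page currently lives on */
  };

  extern void migrate_page_to_node(struct page_info *page, int node);

  /* on a NUMA hinting fault, pull the page towards the faulting task */
  static void hinting_fault(struct task *t, struct page_info *page)
  {
          if (page->node != t->node)
                  migrate_page_to_node(page, t->node);
          /* over time, p_i,k for k != n_i tends to 0 */
  }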
This does not, however, provide us with any 's_i' (shared) information. It
does remove 'M', since it defines memory placement in terms of task
placement.
XXX properties of this M vs a potential optimal
 2b) migrate memory towards 'n_i' using 2 samples.
This separates pages into those that will migrate and those that will not,
due to the two samples not matching. We could consider the first set to be
'p_i' (private) and the second to be 's_i' (shared).
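
(A sketch of the two-sample test: remember which node faulted on the page
last scan and only migrate when the two samples agree. The 'last_node' field
is illustrative, assumed added to the page_info sketch above.)

  static void hinting_fault_2s(struct task *t, struct page_info *page)
  {
          int match = page->last_node == t->node;

          page->last_node = t->node;

          /* two samples agreeing suggests private ('p_i'): migrate;
             a mismatch suggests shared ('s_i'): leave the page alone */
          if (match && page->node != t->node)
                  migrate_page_to_node(page, t->node);
  }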
This interpretation can be motivated by the previously observed property that
'p_i,k for k != n_i' should become 0 given sufficient memory, leaving only
's_i' (shared). (Here we lose the need for memory limits again, since private
memory that doesn't fit becomes indistinguishable from shared.)
XXX include the statistical babble on double sampling somewhere near
This reduces the problem further; we lose 'M' as per 2a, and it further
reduces the 'T_k,l' (interconnect traffic) term to only include shared memory
(since per the above all private memory will be local):
  T_k,l = \Sum_i bs_i,l for every n_i = k, l != k
[ This more or less matches the state of sched/numa and describes its
  remaining problems and assumptions. It should work well for tasks without
  significant shared memory usage between tasks. ]
Possible future directions:
Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
can evaluate it:
 3a) add per-task per-node counters
At fault time, count the number of pages the task faults on for each node.
This should give an approximation of 'p_i' for the local node and of 's_i,k'
for all the other nodes.
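
(Sketch: extend the illustrative task structure with a made-up per-node
'faults[]' array and bump it on every hinting fault.)

  /* one counter per node, incremented at fault time */
  static void account_fault(struct task *t, int page_node)
  {
          t->faults[page_node]++;      /* ~ p_i when page_node == n_i,
                                          ~ s_i,k for the other nodes  */
  }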
While these numbers provide pages per scan, and so have the unit [pages/s],
they don't count repeat accesses and thus aren't actually representative of
our frequency terms.
 3b) additional frequency term
Additionally (or instead, if it turns out we don't need the raw 'p' and 's'
numbers) we can approximate the repeat accesses by using the time since the
pages were last marked as an indication of the access frequency.
Let 'I' be the interval of marking pages and 'e' the elapsed time since the
last marking; then we could estimate the number of accesses 'a' as 'a = I / e'.
If we then increment the node counters using 'a' instead of 1, we might get
a better estimate of the bandwidth terms.
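
(Sketch of the weighted accounting; 'last_marked' is a hypothetical per-page
timestamp taken when the page was marked.)

  static void account_fault_freq(struct task *t, struct page_info *page,
                                 unsigned long now, unsigned long I)
  {
          unsigned long e = now - page->last_marked;
          unsigned long a = I / (e ? e : 1);  /* a = I / e, clamped for e == 0 */

          t->faults[page->node] += a;         /* weigh by estimated accesses */
  }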
 3c) additional averaging; can be applied on top of either a/b.
[ Rik argues that decaying averages on 3a might be sufficient for bandwidth,
  since the decaying avg includes the old accesses and therefore has a measure
  of repeat accesses.
  Rik also argued that the sample frequency is too low to get accurate access
  frequency measurements; I'm not entirely convinced: even at low sample
  frequencies the avg elapsed time 'e' over multiple samples should still
  give us a fair approximation of the avg access frequency 'a'.
  So doing both b and c has a fair chance of working and allowing us to
  distinguish between important and less important memory accesses.
  Experimentation has shown no benefit from the added frequency term so far. ]
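
(Sketch of 3c; an exponential decay that halves the old average each scan
period is one simple choice, with 'faults_avg[]' as a hypothetical field.)

  /* decay the running average and fold in the new scan's counts */
  static void decay_faults(struct task *t, int nr_nodes)
  {
          int k;

          for (k = 0; k < nr_nodes; k++) {
                  t->faults_avg[k] = t->faults_avg[k] / 2 + t->faults[k];
                  t->faults[k] = 0;
          }
  }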
This will give us 'bp_i' and 'bs_i,k' so that we can approximately compute
'T_k,l'. Our optimization problem now reads:
  min_C [ \Sum_i bs_i,l for every n_i = k, l != k ]
This includes only shared terms, which makes sense since all task-private
memory will become local as per 2.
This suggests that if there is significant shared memory, we should try and
move towards it.
 4) move towards where 'most' memory is
The simplest significance test is comparing the biggest shared 's_i,k'
against the private 'p_i'. If we have more shared than private, move towards
it.
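
(Sketch of the significance test; 'shared_faults[]' and 'private_faults' are
illustrative stand-ins for the 's_i,k' and 'p_i' approximations from 3.
Scanning nodes in ascending order with a strict '>' gives the lower-node
tie-break mentioned in the note below.)

  static int preferred_node(struct task *t, int nr_nodes)
  {
          unsigned long max_shared = 0;
          int k, best = t->node;

          /* find the biggest shared pool */
          for (k = 0; k < nr_nodes; k++) {
                  if (t->shared_faults[k] > max_shared) {
                          max_shared = t->shared_faults[k];
                          best = k;
                  }
          }

          /* only move if shared outweighs private */
          return max_shared > t->private_faults ? best : t->node;
  }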
This effectively makes us move towards where most of our memory is, and forms
a feedback loop with 2: we migrate memory towards us, and we migrate towards
where 'most' memory is.
(Note: even if there were two tasks fully thrashing the same shared memory,
 it is very rare for there to be a 50/50 split in memory; lacking a perfect
 split, the smaller side will move towards the larger. In case of a perfect
 split, we tie-break towards the lower node number.)
 5) 'throttle' 4's node placement
Since per 2b our 's_i,k' and 'p_i' require at least two scans to 'stabilize'
and show representative numbers, we should limit node migration to not be
faster than this.
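
(Sketch of the throttle; 'last_migration' is a hypothetical per-task
timestamp, and two scan periods is the minimum per the argument above.)

  static int node_migrate_allowed(struct task *t, unsigned long now,
                                  unsigned long scan_period)
  {
          /* require at least two full scans between node migrations */
          if (now - t->last_migration < 2 * scan_period)
                  return 0;

          t->last_migration = now;
          return 1;
  }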
 n) poke holes in the previous points that require more stuff, and describe it.