X-Git-Url: https://git.karo-electronics.de/?a=blobdiff_plain;f=Documentation%2Fcpusets.txt;h=ec9de6917f01f34c088cd612121caaadc3edf08b;hb=fc8a327db6c46de783b1a4276d846841b9abc24c;hp=30c41459953c3ca04e6dd93ddf8702ced7cb89c4;hpb=11ed56fb7899f9eb9eaef8e5919db1bf08f1b07e;p=karo-tx-linux.git diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index 30c41459953c..ec9de6917f01 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt @@ -18,7 +18,8 @@ CONTENTS: 1.4 What are exclusive cpusets ? 1.5 What does notify_on_release do ? 1.6 What is memory_pressure ? - 1.7 How do I use cpusets ? + 1.7 What is memory spread ? + 1.8 How do I use cpusets ? 2. Usage Examples and Syntax 2.1 Basic Usage 2.2 Adding/removing cpus @@ -34,7 +35,8 @@ CONTENTS: ---------------------- Cpusets provide a mechanism for assigning a set of CPUs and Memory -Nodes to a set of tasks. +Nodes to a set of tasks. In this document "Memory Node" refers to +an on-line node that contains memory. Cpusets constrain the CPU and Memory placement of tasks to only the resources within a tasks current cpuset. They form a nested @@ -85,9 +87,6 @@ This can be especially valuable on: and a database), or * NUMA systems running large HPC applications with demanding performance characteristics. - * Also cpu_exclusive cpusets are useful for servers running orthogonal - workloads such as RT applications requiring low latency and HPC - applications that are throughput sensitive These subsets, or "soft partitions" must be able to be dynamically adjusted, as the job mix changes, without impacting other concurrently @@ -130,8 +129,6 @@ Cpusets extends these two mechanisms as follows: - A cpuset may be marked exclusive, which ensures that no other cpuset (except direct ancestors and descendents) may contain any overlapping CPUs or Memory Nodes. - Also a cpu_exclusive cpuset would be associated with a sched - domain. - You can list all the tasks (by pid) attached to any cpuset. The implementation of cpusets requires a few, simple hooks @@ -143,9 +140,6 @@ into the rest of the kernel, none in performance critical paths: allowed in that tasks cpuset. - in sched.c migrate_all_tasks(), to keep migrating tasks within the CPUs allowed by their cpuset, if possible. - - in sched.c, a new API partition_sched_domains for handling - sched domain changes associated with cpu_exclusive cpusets - and related changes in both sched.c and arch/ia64/kernel/domain.c - in the mbind and set_mempolicy system calls, to mask the requested Memory Nodes by what's allowed in that tasks cpuset. - in page_alloc.c, to restrict memory to allowed nodes. @@ -216,6 +210,12 @@ exclusive cpuset. Also, the use of a Linux virtual file system (vfs) to represent the cpuset hierarchy provides for a familiar permission and name space for cpusets, with a minimum of additional kernel code. +The cpus and mems files in the root (top_cpuset) cpuset are +read-only. The cpus file automatically tracks the value of +cpu_online_map using a CPU hotplug notifier, and the mems file +automatically tracks the value of node_states[N_MEMORY]--i.e., +nodes with memory--using the cpuset_track_online_nodes() hook. + 1.4 What are exclusive cpusets ? -------------------------------- @@ -224,15 +224,6 @@ If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct ancestor or descendent, may share any of the same CPUs or Memory Nodes. -A cpuset that is cpu_exclusive has a scheduler (sched) domain -associated with it. The sched domain consists of all CPUs in the -current cpuset that are not part of any exclusive child cpusets. -This ensures that the scheduler load balancing code only balances -against the CPUs that are in the sched domain as defined above and -not all of the CPUs in the system. This removes any overhead due to -load balancing code trying to pull tasks outside of the cpu_exclusive -cpuset only to be prevented by the tasks' cpus_allowed mask. - A cpuset that is mem_exclusive restricts kernel allocations for page, buffer and other data commonly shared by the kernel across multiple users. All cpusets, whether mem_exclusive or not, restrict @@ -317,7 +308,78 @@ the tasks in the cpuset, in units of reclaims attempted per second, times 1000. -1.7 How do I use cpusets ? +1.7 What is memory spread ? +--------------------------- +There are two boolean flag files per cpuset that control where the +kernel allocates pages for the file system buffers and related in +kernel data structures. They are called 'memory_spread_page' and +'memory_spread_slab'. + +If the per-cpuset boolean flag file 'memory_spread_page' is set, then +the kernel will spread the file system buffers (page cache) evenly +over all the nodes that the faulting task is allowed to use, instead +of preferring to put those pages on the node where the task is running. + +If the per-cpuset boolean flag file 'memory_spread_slab' is set, +then the kernel will spread some file system related slab caches, +such as for inodes and dentries evenly over all the nodes that the +faulting task is allowed to use, instead of preferring to put those +pages on the node where the task is running. + +The setting of these flags does not affect anonymous data segment or +stack segment pages of a task. + +By default, both kinds of memory spreading are off, and memory +pages are allocated on the node local to where the task is running, +except perhaps as modified by the tasks NUMA mempolicy or cpuset +configuration, so long as sufficient free memory pages are available. + +When new cpusets are created, they inherit the memory spread settings +of their parent. + +Setting memory spreading causes allocations for the affected page +or slab caches to ignore the tasks NUMA mempolicy and be spread +instead. Tasks using mbind() or set_mempolicy() calls to set NUMA +mempolicies will not notice any change in these calls as a result of +their containing tasks memory spread settings. If memory spreading +is turned off, then the currently specified NUMA mempolicy once again +applies to memory page allocations. + +Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag +files. By default they contain "0", meaning that the feature is off +for that cpuset. If a "1" is written to that file, then that turns +the named feature on. + +The implementation is simple. + +Setting the flag 'memory_spread_page' turns on a per-process flag +PF_SPREAD_PAGE for each task that is in that cpuset or subsequently +joins that cpuset. The page allocation calls for the page cache +is modified to perform an inline check for this PF_SPREAD_PAGE task +flag, and if set, a call to a new routine cpuset_mem_spread_node() +returns the node to prefer for the allocation. + +Similarly, setting 'memory_spread_cache' turns on the flag +PF_SPREAD_SLAB, and appropriately marked slab caches will allocate +pages from the node returned by cpuset_mem_spread_node(). + +The cpuset_mem_spread_node() routine is also simple. It uses the +value of a per-task rotor cpuset_mem_spread_rotor to select the next +node in the current tasks mems_allowed to prefer for the allocation. + +This memory placement policy is also known (in other contexts) as +round-robin or interleave. + +This policy can provide substantial improvements for jobs that need +to place thread local data on the corresponding node, but that need +to access large file system data sets that need to be spread across +the several nodes in the jobs cpuset in order to fit. Without this +policy, especially for jobs that might have one thread reading in the +data set, the memory allocation across the nodes in the jobs cpuset +can become very uneven. + + +1.8 How do I use cpusets ? -------------------------- In order to minimize the impact of cpusets on critical kernel @@ -479,6 +541,9 @@ Set some flags: Add some cpus: # /bin/echo 0-7 > cpus +Add some mems: +# /bin/echo 0-7 > mems + Now attach your shell to this cpuset: # /bin/echo $$ > tasks