# Kernel Self-Protection

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case, we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a _privileged_
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and tested if they are to be made part of the upstream kernel.

## Attack Surface Reduction

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This includes
limiting the exposed APIs available to userspace, making in-kernel
APIs hard to use incorrectly, and minimizing the areas of writable kernel
memory.

### Strict kernel memory permissions

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets,
the kernel needs to protect its memory with a tight set of permissions.

#### Executable code and read-only data must not be writable

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are CONFIG_STRICT_KERNEL_RWX and
CONFIG_STRICT_MODULE_RWX, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

Most architectures have these options on by default and not user selectable.
For some architectures like arm that wish to have these be selectable,
the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable
a Kconfig prompt. CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT determines
the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled.

#### Function pointers and sensitive variables must not be writable

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by declaring them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.

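
The "const" approach can be sketched in user-space C as follows. This is only an illustration: `demo_ops`, `demo_open`, `demo_dispatch`, and `demo_get_ops` are hypothetical names, not kernel APIs. Declaring the table `static const` asks the toolchain to place it in a read-only section, so the function pointer it holds cannot be overwritten once strict memory permissions are applied.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table, loosely modeled on kernel "ops" structures. */
struct demo_ops {
	int (*open)(int id);
};

static int demo_open(int id)
{
	return id + 1;
}

/*
 * "const" asks the toolchain to place the table in a read-only section
 * (.rodata on typical ELF targets) rather than in .data.
 */
static const struct demo_ops demo_default_ops = {
	.open = demo_open,
};

/* Callers only see a pointer-to-const, so they cannot modify the table. */
int demo_dispatch(const struct demo_ops *ops, int id)
{
	assert(ops != NULL);	/* passing no table is a programming error */
	if (ops->open == NULL)
		return -1;
	return ops->open(id);
}

/* Accessor for the read-only table. */
const struct demo_ops *demo_get_ops(void)
{
	return &demo_default_ops;
}
```

A caller would use `demo_dispatch(demo_get_ops(), id)`. Any attempt to assign to `demo_get_ops()->open` is rejected at compile time, and a write through a cast pointer would fault at runtime on targets that enforce read-only data.
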
For variables that are initialized once at __init time, these can
be marked with the (new and under development) __ro_after_init
attribute.

What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)

#### Segregation of kernel memory from userspace memory

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

### Reduced access to syscalls

One trivial way to eliminate many syscalls for 64-bit systems is building
without CONFIG_COMPAT. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.

An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the more regular set normally available to unprivileged
userspace.

### Restricting access to kernel modules

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
CONFIG_MODULE_SIG_FORCE, or dm-crypt with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.

## Memory integrity

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

### Stack buffer overflow

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (CONFIG_CC_STACKPROTECTOR), which is verified just before
the function returns. Other defenses include things like shadow stacks.

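
The canary idea can be modeled in plain C without touching a real stack. The sketch below is a simulation under stated assumptions: `fake_frame` and its helpers are hypothetical, the layout is fixed by hand rather than by the compiler, and the "overflow" is a bounded copy over the frame's own bytes. It shows why a linear overflow of the buffer has to destroy the canary before it can reach the saved return address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Simulated stack frame (hypothetical layout): the canary sits between
 * the local buffer and the saved return address.
 */
struct fake_frame {
	unsigned char buf[8];
	uint64_t canary;
	uint64_t ret;
};

/* Function prologue: store the secret canary and the return address. */
void frame_enter(struct fake_frame *f, uint64_t secret, uint64_t ret)
{
	memset(f->buf, 0, sizeof(f->buf));
	f->canary = secret;
	f->ret = ret;
}

/*
 * Models an unchecked copy into buf. Writing through the frame's byte
 * representation keeps the simulation inside one object while letting
 * len exceed sizeof(buf), as a real overflow would.
 */
void frame_unchecked_copy(struct fake_frame *f, const void *src, size_t len)
{
	assert(len <= sizeof(*f));	/* stay inside the simulated frame */
	memcpy((unsigned char *)f, src, len);
}

/* Function epilogue: ret is trusted only if the canary survived. */
bool frame_canary_intact(const struct fake_frame *f, uint64_t secret)
{
	return f->canary == secret;
}
```

A copy of at most `sizeof(buf)` bytes leaves the canary intact. Any longer copy changes it unless the attacker already knows the secret, which is why the check in the epilogue catches the corruption before the overwritten return address is used.
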
### Stack depth overflow

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations. With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

### Heap memory integrity

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.

### Counter integrity

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.

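
One way to picture this protection is a reference counter that refuses to wrap. The sketch below uses C11 atomics in user space and saturates instead of trapping, which is a substitution made for illustration. `sat_ref` and its functions are hypothetical names, not the kernel's own counter types.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical saturating reference counter (not a kernel type). */
struct sat_ref {
	atomic_uint count;
};

void sat_ref_init(struct sat_ref *r, unsigned int n)
{
	assert(n > 0);	/* a live object starts with at least one reference */
	atomic_init(&r->count, n);
}

/*
 * Take a reference. Returns false, leaving the counter untouched, if the
 * object is already dead (0) or the counter is saturated (UINT_MAX).
 * A saturated counter may leak the object, but it can never wrap to 0
 * and cause a premature free.
 */
bool sat_ref_get(struct sat_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0 || old == UINT_MAX)
			return false;
	} while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));
	return true;
}

/*
 * Drop a reference. Returns true only when the caller dropped the last
 * reference and may free the object. A put on 0 (underflow) is refused,
 * and a saturated counter stays pinned.
 */
bool sat_ref_put(struct sat_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0 || old == UINT_MAX)
			return false;
	} while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));
	return old == 1;
}

unsigned int sat_ref_read(struct sat_ref *r)
{
	return atomic_load(&r->count);
}
```

The design trades a possible memory leak (a pinned object is never freed) for the guarantee that an attacker who can drive the counter up or down cannot turn that into a use-after-free.
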
### Size calculation overflow detection

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.

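
A minimal user-space sketch of runtime overflow detection for a size calculation follows. It assumes GCC or Clang, since `__builtin_mul_overflow` is a compiler builtin rather than standard C, and `checked_array_size` and `checked_alloc_array` are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Compute count * elem_size, reporting overflow instead of wrapping.
 * Returns true and stores the product in *bytes when it fits in size_t.
 */
bool checked_array_size(size_t count, size_t elem_size, size_t *bytes)
{
	assert(bytes != NULL);
	return !__builtin_mul_overflow(count, elem_size, bytes);
}

/* Allocate an array, refusing requests whose size calculation overflows. */
void *checked_alloc_array(size_t count, size_t elem_size)
{
	size_t bytes;

	if (!checked_array_size(count, elem_size, &bytes))
		return NULL;
	return malloc(bytes);
}
```

Without the check, a request such as `SIZE_MAX / 2 + 1` elements of 2 bytes would wrap to a zero-byte allocation, and later writes sized by the element count would run past the end of the buffer.
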
## Statistical defenses

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

### Canaries, blinding, and other secrets

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.

It is critical that the secret values used be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.

### Kernel Address Space Layout Randomization (KASLR)

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

#### Text and module base

By relocating the physical and virtual base address of the kernel at
boot-time (CONFIG_RANDOMIZE_BASE), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

#### Stack base

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

#### Dynamic memory base

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

#### Structure layout

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.

## Preventing Information Exposures

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

### Unique identifiers

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.

### Memory initialization

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.

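
The explicit memset() case can be sketched as below. `demo_reply` and `demo_fill_reply` are hypothetical, and the presence and size of the padding hole depend on the target ABI. The point is that assigning each member is not enough, because the bytes between members are only cleared when the whole object is.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical reply structure. On common ABIs the compiler inserts
 * padding after "type" so that "value" is naturally aligned. Member
 * assignments alone leave that hole holding whatever was there before.
 */
struct demo_reply {
	char type;
	long value;
};

/*
 * Fill a reply that will be copied out wholesale. Clearing the entire
 * object first keeps stale memory contents out of the padding.
 */
void demo_fill_reply(struct demo_reply *r, char type, long value)
{
	assert(r != NULL);
	memset(r, 0, sizeof(*r));
	r->type = type;
	r->value = value;
}
```

In kernel code the filled structure would then be handed to a copy-to-userspace routine. Copying `sizeof(struct demo_reply)` bytes without the memset() would disclose the previous contents of the hole.
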
### Memory poisoning

When releasing memory, it is best to poison the contents (clear stack on
syscall return, wipe heap memory on a free), to avoid reuse attacks that
rely on the old contents of memory. This frustrates many uninitialized
variable attacks, stack content exposures, heap content exposures, and
use-after-free attacks.

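
Wiping heap memory on free can be illustrated with a small user-space wrapper. This is a sketch only: `demo_poison`, `demo_poison_and_free`, and the poison value are choices made for the example, and the caller is assumed to know the allocation size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_POISON_BYTE 0x6b	/* arbitrary value chosen for this sketch */

/*
 * Overwrite a buffer with the poison byte. The volatile-qualified pointer
 * discourages the compiler from discarding the stores as dead writes.
 */
void demo_poison(void *buf, size_t len)
{
	volatile unsigned char *p = buf;

	while (len--)
		*p++ = DEMO_POISON_BYTE;
}

/* Poison the old contents, then release the memory. */
void demo_poison_and_free(void *buf, size_t len)
{
	assert(buf != NULL || len == 0);
	demo_poison(buf, len);
	free(buf);
}
```

After the wrapper runs, a later allocation that reuses the same memory sees only the poison pattern, so stale pointers, keys, or canary values from the previous owner are no longer available to a reuse attack.
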
### Destination tracking

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed /proc files),
it should automatically censor sensitive values.