Ard Biesheuvel [Tue, 9 Feb 2016 23:13:35 +0000 (10:13 +1100)]
kallsyms: don't overload absolute symbol type for percpu symbols
Commit c6bda7c988a5 ("kallsyms: fix percpu vars on x86-64 with
relocation") overloaded the 'A' (absolute) symbol type to signify that a
symbol is not subject to dynamic relocation. However, the original A type
does not imply that at all, and depending on the version of the toolchain,
many A type symbols are emitted that are in fact relative to the kernel
text, i.e., if the kernel is relocated at runtime, these symbols should be
updated as well.
For instance, on sparc32, the following symbols are emitted as absolute
(kindly provided by Guenter Roeck):
Even if only a couple of them pass the symbol range check that causes them
to be taken into account for the final kallsyms symbol table, it is
obvious that 'A' does not mean the symbol does not need to be updated at
relocation time, and overloading its meaning to signify that is perhaps
not a good idea.
So instead, add a new percpu_absolute member to struct sym_entry, and when
--absolute-percpu is in effect, use it to record symbols whose addresses
should be emitted as final values rather than values that still require
relocation at runtime. That way, we can drop the check against the 'A'
type.
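A minimal sketch of the struct change in scripts/kallsyms.c (surrounding fields abbreviated to the commonly known ones; treat this as an illustration, not the exact patch):

struct sym_entry {
	unsigned long long addr;
	unsigned int len;
	unsigned int start_pos;
	unsigned char *sym;
	unsigned int percpu_absolute;	/* set when --absolute-percpu applies:
					 * emit addr as a final value */
};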
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michal Marek <mmarek@suse.cz> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ard Biesheuvel [Tue, 9 Feb 2016 23:13:34 +0000 (10:13 +1100)]
x86: kallsyms: disable absolute percpu symbols on !SMP
scripts/kallsyms.c has a special --absolute-percpu command line
option which deals with the zero-based per-CPU offsets that are
used when building for SMP on x86_64. This means that the option
should only be passed in that case, so add a Kconfig symbol with
the correct predicate, and use that instead.
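A sketch of the Kconfig predicate (the symbol name KALLSYMS_ABSOLUTE_PERCPU follows the commit's description; the build scripts would then pass --absolute-percpu only when it is set):

config KALLSYMS_ABSOLUTE_PERCPU
	bool
	default X86_64 && SMP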
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Tested-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Kees Cook <keescook@chromium.org> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ingo Molnar <mingo@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michal Marek <mmarek@suse.cz> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Lutomirski [Tue, 9 Feb 2016 23:13:33 +0000 (10:13 +1100)]
x86/compat: remove is_compat_task()
x86's is_compat_task always checked the current syscall type, not the task
type. It has no non-arch users any more, so just remove it to avoid
confusion.
On x86, nothing should really be checking the task ABI. There are
legitimate users for the syscall ABI and for the mm ABI.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Lutomirski [Tue, 9 Feb 2016 23:13:33 +0000 (10:13 +1100)]
input: redefine INPUT_COMPAT_TEST as in_compat_syscall()
The input compat code should work like all other compat code: for 32-bit
syscalls, use the 32-bit ABI and for 64-bit syscalls, use the 64-bit ABI.
We have a helper for that (in_compat_syscall()): just use it.
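The redefinition then amounts to a one-liner (a sketch, assuming the macro stays in drivers/input/input-compat.h):

#define INPUT_COMPAT_TEST	in_compat_syscall()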
Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Lutomirski [Tue, 9 Feb 2016 23:13:32 +0000 (10:13 +1100)]
drivers/gpu/drm/amd/amdkfd: use in_compat_syscall to check open() caller type
amdkfd wants to know syscall type, not task type. Check directly.
Unfortunately, amdkfd is making nasty assumptions that a process' bitness
is a well-defined constant thing. This isn't the case on x86. I don't
know how much this matters, but this patch has no effect on generated code
on x86, so amdkfd is equally broken with and without this patch.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Oded Gabbay <oded.gabbay@gmail.com> Cc: David Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Lutomirski [Tue, 9 Feb 2016 23:13:32 +0000 (10:13 +1100)]
drivers/firmware/efi/efivars.c: use in_compat_syscall() to check for compat callers
This should make no difference on any architecture, as x86's historical
is_compat_task behavior really did check whether the calling syscall was a
compat syscall. x86's is_compat_task is going away, though.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Lutomirski [Tue, 9 Feb 2016 23:13:31 +0000 (10:13 +1100)]
firewire: Use in_compat_syscall to check ioctl compatness
Firewire was using is_compat_task to check whether it was in a
compat ioctl or a non-compat ioctl. Use in_compat_syscall instead
so it works properly on all architectures.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Lutomirski [Tue, 9 Feb 2016 23:13:31 +0000 (10:13 +1100)]
net/sctp: Use in_compat_syscall for sctp_getsockopt_connectx3
SCTP unfortunately has a different ABI for SCTP_SOCKOPT_CONNECTX3
for 32-bit and 64-bit callers. Use in_compat_syscall to correctly
distinguish them on all architectures.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Lutomirski [Tue, 9 Feb 2016 23:13:30 +0000 (10:13 +1100)]
ext4: In ext4_dir_llseek, check syscall bitness directly
ext4 treats directory offsets differently for 32-bit and 64-bit
callers. Check the caller type using in_compat_syscall, not
is_compat_task. This changes behavior on SPARC slightly.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Lutomirski [Tue, 9 Feb 2016 23:13:29 +0000 (10:13 +1100)]
auditsc: for seccomp events, log syscall compat state using in_compat_syscall
Except on SPARC, this is what the code always did. SPARC compat
seccomp was buggy, although the impact of the bug was limited
because SPARC 32-bit and 64-bit syscall numbers are the same.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Paul Moore <paul@paul-moore.com> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Lutomirski [Tue, 9 Feb 2016 23:13:29 +0000 (10:13 +1100)]
seccomp: check in_compat_syscall, not is_compat_task, in strict mode
Seccomp wants to know the syscall bitness, not the caller task bitness,
when it selects the syscall whitelist.
As far as I know, this makes no difference on any architecture, so it's
not a security problem. (It generates identical code everywhere except
sparc, and, on sparc, the syscall numbering is the same for both ABIs.)
Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Lutomirski [Tue, 9 Feb 2016 23:13:28 +0000 (10:13 +1100)]
sparc/syscall: fix syscall_get_arch
Sparc's syscall_get_arch was buggy: it returned the task arch, not the
syscall arch. This could confuse seccomp and audit.
I don't think this is as bad for seccomp as it looks: sparc's 32-bit and
64-bit syscalls are numbered the same.
Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Lutomirski [Tue, 9 Feb 2016 23:13:27 +0000 (10:13 +1100)]
sparc/compat: provide an accurate in_compat_syscall implementation
On sparc64 compat-enabled kernels, any task can make 32-bit and 64-bit
syscalls. is_compat_task returns true in 32-bit tasks, which does not
necessarily imply that the current syscall is 32-bit.
Provide an in_compat_syscall implementation that checks whether the
current syscall is compat.
As far as I know, sparc is the only architecture on which is_compat_task
checks the compat status of the task and on which the compat status of a
syscall can differ from the compat status of the task. On x86,
is_compat_task checks the syscall type, not the task type.
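A sketch of what such an implementation can look like (assuming sparc's trap type records which syscall entry path was used; vector 0x110 is sparc's LINUX_32BIT_SYSCALL_TRAP):

static inline bool in_compat_syscall(void)
{
	/* Vector 0x110 is LINUX_32BIT_SYSCALL_TRAP */
	return pt_regs_trap_type(current_pt_regs()) == 0x110;
}
#define in_compat_syscall in_compat_syscall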
Signed-off-by: Andy Lutomirski <luto@kernel.org> Acked-by: David S. Miller <davem@davemloft.net> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Marian Chereji [Tue, 9 Feb 2016 23:13:26 +0000 (10:13 +1100)]
lib: Add CRC64 ECMA module
Add implementation of CRC64 ECMA checksum.
We have an IP Acceleration driver for Freescale network processors which
is using this CRC64. However, it still needs some work in order for it to
become upstreamable.
Signed-off-by: Marian Chereji <marian.chereji@freescale.com> Reviewed-by: Varvara Andrei-B21317 <andrei.varvara@freescale.com> Reviewed-by: Fleming Andrew-AFLEMING <AFLEMING@freescale.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kstrimdup() creates a whitespace-trimmed duplicate of the passed in
null-terminated string. This is useful for strings coming from sysfs that
often include trailing whitespace due to user input.
Thanks to Joe Perches for this implementation.
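A minimal sketch of such a helper, assuming the usual kernel string/ctype helpers (skip_spaces(), isspace()):

char *kstrimdup(const char *s, gfp_t gfp)
{
	char *buf;
	const char *begin = skip_spaces(s);
	size_t len = strlen(begin);

	/* trim trailing whitespace before duplicating */
	while (len && isspace(begin[len - 1]))
		len--;

	buf = kmalloc(len + 1, gfp);
	if (!buf)
		return NULL;

	memcpy(buf, begin, len);
	buf[len] = '\0';
	return buf;
}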
Signed-off-by: Sebastian Capella <sebastian.capella@linaro.org> Cc: Joe Perches <joe@perches.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kees Cook [Tue, 9 Feb 2016 23:13:24 +0000 (10:13 +1100)]
lib: update single-char callers of strtobool()
Some callers of strtobool() were passing a pointer to unterminated
strings. In preparation for adding multi-character processing to
kstrtobool(), update the callers to not pass single-character pointers,
and switch to using the new kstrtobool_from_user() helper where possible.
Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Amitkumar Karwar <akarwar@marvell.com> Cc: Nishant Sarmukadam <nishants@marvell.com> Cc: Kalle Valo <kvalo@codeaurora.org> Cc: Steve French <sfrench@samba.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Joe Perches <joe@perches.com> Cc: Kees Cook <keescook@chromium.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kees Cook [Tue, 9 Feb 2016 23:13:24 +0000 (10:13 +1100)]
lib: move strtobool() to kstrtobool()
Create the kstrtobool_from_user() helper and move the strtobool() logic
into the new kstrtobool() (matching all the other kstrto* functions).
Provide an inline wrapper for existing strtobool() callers.
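In sketch form, the resulting API, with the wrapper keeping existing strtobool() callers building unchanged:

int kstrtobool(const char *s, bool *res);
int kstrtobool_from_user(const char __user *s, size_t count, bool *res);

static inline int strtobool(const char *s, bool *res)
{
	return kstrtobool(s, res);
}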
Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Joe Perches <joe@perches.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Amitkumar Karwar <akarwar@marvell.com> Cc: Nishant Sarmukadam <nishants@marvell.com> Cc: Kalle Valo <kvalo@codeaurora.org> Cc: Steve French <sfrench@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Denys Vlasenko [Tue, 9 Feb 2016 23:13:23 +0000 (10:13 +1100)]
include/linux/unaligned: force inlining of byteswap operations
Sometimes gcc mysteriously doesn't inline
very small functions we expect to be inlined. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
the following functions get deinlined many times.
Examples of disassembly:
Denys Vlasenko [Tue, 9 Feb 2016 23:13:23 +0000 (10:13 +1100)]
include/uapi/linux/byteorder, swab: force inlining of some byteswap operations
Sometimes gcc mysteriously doesn't inline
very small functions we expect to be inlined. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
the following functions get deinlined many times.
Examples of disassembly:
Denys Vlasenko [Tue, 9 Feb 2016 23:13:23 +0000 (10:13 +1100)]
include/asm-generic/atomic-long.h: force inlining of some atomic_long operations
Sometimes gcc mysteriously doesn't inline
very small functions we expect to be inlined. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
atomic_long_inc(), atomic_long_dec() and atomic_long_add()
functions get deinlined about 40 times. Examples of disassembly:
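The fix itself, sketched for one accessor, is to force inlining with the __always_inline attribute (ATOMIC_LONG_PFX is the header's existing type-dispatch macro):

/* forcing inlining so gcc cannot deinline the tiny accessor */
static __always_inline void atomic_long_inc(atomic_long_t *l)
{
	ATOMIC_LONG_PFX(_inc)((ATOMIC_LONG_PFX(_t) *)l);
}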
Andrzej Hajda [Tue, 9 Feb 2016 23:13:22 +0000 (10:13 +1100)]
err.h: allow IS_ERR_VALUE to handle properly more types
The current implementation of IS_ERR_VALUE works correctly only with the
following types:
- unsigned long,
- short, int, long.
Other types are handled incorrectly on 32-bit, on 64-bit, or on both
architectures.
The patch fixes it by comparing the argument with MAX_ERRNO cast to the
argument's type for unsigned types, and comparing with zero for signed
types. As a result, all integer types bigger than char are handled
properly.
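A sketch matching that description (not necessarily the exact patch): unsigned types compare against MAX_ERRNO cast to their own type, signed types compare against zero:

#define IS_ERR_VALUE(x)						\
	((typeof(x))(-1) < 0					\
		? unlikely((x) < 0)				\
		: unlikely((x) >= (typeof(x))(-MAX_ERRNO)))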
I have analyzed usage of IS_ERR_VALUE using coccinelle, and in about 35
cases it is used incorrectly, i.e. it can hide errors depending on whether
the architecture is 32-bit or 64-bit. Instead of fixing the usage I
propose to enhance the macro to cover more types.
And just for the record: the macro is used 101 times with signed
variables. I am not sure it should be preferred over the simple comparison
"ret < 0", but the new version can handle that as well.
Below is the list of detected potential errors:
drivers/char/mem.c:698:45-46: WARNING: incorrect argument type in IS_ERR_VALUE(( unsigned long long ) offset)
drivers/media/platform/soc_camera/atmel-isi.c:1089:21-22: WARNING: incorrect argument type in IS_ERR_VALUE(irq)
drivers/net/ethernet/freescale/fs_enet/mac-scc.c:149:36-37: WARNING: incorrect argument type in IS_ERR_VALUE(fep -> ring_mem_addr)
drivers/net/ethernet/freescale/ucc_geth.c:2237:48-49: WARNING: incorrect argument type in IS_ERR_VALUE(ugeth -> tx_bd_ring_offset [ j ])
drivers/net/ethernet/freescale/ucc_geth.c:2314:48-49: WARNING: incorrect argument type in IS_ERR_VALUE(ugeth -> rx_bd_ring_offset [ j ])
drivers/net/ethernet/freescale/ucc_geth.c:2524:44-45: WARNING: incorrect argument type in IS_ERR_VALUE(ugeth -> tx_glbl_pram_offset)
drivers/net/ethernet/freescale/ucc_geth.c:2544:45-46: WARNING: incorrect argument type in IS_ERR_VALUE(ugeth -> thread_dat_tx_offset)
drivers/net/ethernet/freescale/ucc_geth.c:2571:46-47: WARNING: incorrect argument type in IS_ERR_VALUE(ugeth -> send_q_mem_reg_offset)
drivers/net/ethernet/freescale/ucc_geth.c:2612:42-43: WARNING: incorrect argument type in IS_ERR_VALUE(ugeth -> scheduler_offset)
drivers/net/ethernet/freescale/ucc_geth.c:2659:54-55: WARNING: incorrect argument type in IS_ERR_VALUE(ugeth -> tx_fw_statistics_pram_offset)
drivers/net/ethernet/freescale/ucc_geth.c:2696:44-45: WARNING: incorrect argument type in IS_ERR_VALUE(ugeth -> rx_glbl_pram_offset)
drivers/net/ethernet/freescale/ucc_geth.c:2715:45-46: WARNING: incorrect argument type in IS_ERR_VALUE(ugeth -> thread_dat_rx_offset)
drivers/net/ethernet/freescale/ucc_geth.c:2736:54-55: WARNING: incorrect argument type in IS_ERR_VALUE(ugeth -> rx_fw_statistics_pram_offset)
drivers/net/ethernet/freescale/ucc_geth.c:2756:53-54: WARNING: incorrect argument type in IS_ERR_VALUE(ugeth -> rx_irq_coalescing_tbl_offset)
drivers/net/ethernet/freescale/ucc_geth.c:2822:44-45: WARNING: incorrect argument type in IS_ERR_VALUE(ugeth -> rx_bd_qs_tbl_offset)
drivers/net/ethernet/freescale/ucc_geth.c:2908:47-48: WARNING: incorrect argument type in IS_ERR_VALUE(ugeth -> exf_glbl_param_offset)
drivers/net/ethernet/freescale/ucc_geth.c:292:36-37: WARNING: incorrect argument type in IS_ERR_VALUE(init_enet_offset)
drivers/net/ethernet/freescale/ucc_geth.c:3042:39-40: WARNING: incorrect argument type in IS_ERR_VALUE(init_enet_pram_offset)
drivers/soc/fsl/qe/ucc_fast.c:271:60-61: WARNING: incorrect argument type in IS_ERR_VALUE(uccf -> ucc_fast_tx_virtual_fifo_base_offset)
drivers/soc/fsl/qe/ucc_fast.c:284:60-61: WARNING: incorrect argument type in IS_ERR_VALUE(uccf -> ucc_fast_rx_virtual_fifo_base_offset)
drivers/soc/fsl/qe/ucc_slow.c:186:38-39: WARNING: incorrect argument type in IS_ERR_VALUE(uccs -> us_pram_offset)
drivers/soc/fsl/qe/ucc_slow.c:213:38-39: WARNING: incorrect argument type in IS_ERR_VALUE(uccs -> rx_base_offset)
drivers/soc/fsl/qe/ucc_slow.c:224:38-39: WARNING: incorrect argument type in IS_ERR_VALUE(uccs -> tx_base_offset)
drivers/tty/serial/clps711x.c:471:29-30: WARNING: incorrect argument type in IS_ERR_VALUE(s -> port . irq)
drivers/tty/serial/digicolor-usart.c:485:30-31: WARNING: incorrect argument type in IS_ERR_VALUE(dp -> port . irq)
drivers/usb/gadget/udc/fsl_qe_udc.c:2369:26-27: WARNING: incorrect argument type in IS_ERR_VALUE(tmp_addr)
drivers/video/fbdev/exynos/exynos_mipi_dsi.c:406:27-28: WARNING: incorrect argument type in IS_ERR_VALUE(dsim -> irq)
net/ipv4/netfilter/arp_tables.c:1427:39-40: WARNING: incorrect argument type in IS_ERR_VALUE(iter1 -> counters . pcnt)
net/ipv4/netfilter/arp_tables.c:530:34-35: WARNING: incorrect argument type in IS_ERR_VALUE(e -> counters . pcnt)
net/ipv4/netfilter/ip_tables.c:1614:34-35: WARNING: incorrect argument type in IS_ERR_VALUE(e -> counters . pcnt)
net/ipv4/netfilter/ip_tables.c:674:34-35: WARNING: incorrect argument type in IS_ERR_VALUE(e -> counters . pcnt)
net/ipv6/netfilter/ip6_tables.c:1624:34-35: WARNING: incorrect argument type in IS_ERR_VALUE(e -> counters . pcnt)
net/ipv6/netfilter/ip6_tables.c:687:34-35: WARNING: incorrect argument type in IS_ERR_VALUE(e -> counters . pcnt)
drivers/net/ethernet/freescale/fs_enet/mac-fcc.c:110:35-36: WARNING: unknown argument type in IS_ERR_VALUE(fpi -> dpram_offset)
Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Shevchenko [Tue, 9 Feb 2016 23:13:21 +0000 (10:13 +1100)]
ide: hpt366: convert to use match_string() helper
The new helper returns the index of the matching string in an array. Use
it here.
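For reference, the helper's shape and the typical call pattern (a hedged sketch; `names' and `s' are placeholder identifiers):

int match_string(const char * const *array, size_t n, const char *string);

i = match_string(names, ARRAY_SIZE(names), s);
if (i < 0)
	return i;	/* no match: a negative errno */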
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Shevchenko [Tue, 9 Feb 2016 23:13:20 +0000 (10:13 +1100)]
power: charger_manager: convert to use match_string() helper
The new helper returns the index of the matching string in an array. Use
it here.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Sebastian Reichel <sre@kernel.org> Cc: Dmitry Eremin-Solenikov <dbaryshkov@gmail.com> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Shevchenko [Tue, 9 Feb 2016 23:13:20 +0000 (10:13 +1100)]
drm/edid: convert to use match_string() helper
The new helper returns the index of the matching string in an array. Use
it here.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: David Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Shevchenko [Tue, 9 Feb 2016 23:13:19 +0000 (10:13 +1100)]
device property: convert to use match_string() helper
The new helper returns the index of the matching string in an array. Use
it here.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ard Biesheuvel [Tue, 9 Feb 2016 23:13:17 +0000 (10:13 +1100)]
arm64: switch to relative exception tables
Instead of using absolute addresses for both the exception location and
the fixup, use offsets relative to the exception table entry values. Not
only does this cut the size of the exception table in half, it is also a
prerequisite for KASLR, since absolute exception table entries are subject
to dynamic relocation, which is incompatible with the sorting of the
exception table that occurs at build time.
This patch also introduces the _ASM_EXTABLE preprocessor macro (which
exists on x86 as well) and its _asm_extable assembly counterpart, as
shorthands to emit exception table entries.
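A sketch of such an emitter for the C side (each .long stores an offset from the entry's own location, which is what makes the entries relocation-free):

#define _ASM_EXTABLE(from, to)					\
	"	.pushsection	__ex_table, \"a\"\n"		\
	"	.align		3\n"				\
	"	.long		(" #from " - .), (" #to " - .)\n"	\
	"	.popsection\n"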
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ard Biesheuvel [Tue, 9 Feb 2016 23:13:17 +0000 (10:13 +1100)]
ia64/extable: use generic search and sort routines
Replace the arch specific versions of search_extable() and sort_extable()
with calls to the generic ones, which now support relative exception
tables as well.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ard Biesheuvel [Tue, 9 Feb 2016 23:13:17 +0000 (10:13 +1100)]
x86/extable: use generic search and sort routines
Replace the arch specific versions of search_extable() and sort_extable()
with calls to the generic ones, which now support relative exception
tables as well.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ard Biesheuvel [Tue, 9 Feb 2016 23:13:16 +0000 (10:13 +1100)]
s390/extable: use generic search and sort routines
Replace the arch specific versions of search_extable() and sort_extable()
with calls to the generic ones, which now support relative exception
tables as well.
Ard Biesheuvel [Tue, 9 Feb 2016 23:13:16 +0000 (10:13 +1100)]
alpha/extable: use generic search and sort routines
Replace the arch specific versions of search_extable() and sort_extable()
with calls to the generic ones, which now support relative exception
tables as well.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ard Biesheuvel [Tue, 9 Feb 2016 23:13:15 +0000 (10:13 +1100)]
extable: add support for relative extables to search and sort routines
There are currently four architectures (x86, ia64, alpha and s390) whose
user-access exception tables are relative to the table entry address
rather than absolute. Each of these architectures has its own
search_extable() and sort_extable() implementation, which are not only
mostly identical to each other, but also deviate very little from the
generic absolute implementations in lib/extable.c that they override.
So before making arm64 the fifth architecture that reimplements this,
let's refactor the existing code so that all of these architectures use
common code for searching and sorting the relative extables. Archs may
set ARCH_HAS_RELATIVE_EXTABLE to indicate that the table consists of a
pair of relative ints, and may define swap_ex_entry_fixup() if the fixup
member needs special treatment in the swapping step of the sorting routine
(such as alpha).
This patch (of 6):
Add support to the generic search_extable() and sort_extable()
implementations for dealing with exception table entries whose fields
contain relative offsets rather than absolute addresses.
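A sketch of how the generic routines can resolve an entry's address either way, keyed off ARCH_HAS_RELATIVE_EXTABLE (ex_to_insn is an internal helper name used for illustration):

#ifndef ARCH_HAS_RELATIVE_EXTABLE
#define ex_to_insn(x)	((x)->insn)
#else
static inline unsigned long ex_to_insn(const struct exception_table_entry *x)
{
	/* the stored value is an offset from the field's own address */
	return (unsigned long)&x->insn + x->insn;
}
#endif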
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Helge Deller <deller@gmx.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: H. Peter Anvin <hpa@linux.intel.com> Acked-by: Tony Luck <tony.luck@intel.com> Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Richard Henderson <rth@twiddle.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After calling radix_tree_iter_retry(), 'slot' will be set to NULL. This
can cause radix_tree_next_slot() to dereference the NULL pointer. Add
Konstantin Khlebnikov's test to the regression framework.
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Reported-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
shmem likes to occasionally drop the lock, schedule, then reacquire the
lock and continue with the iteration from the last place it left off.
This is currently done with a pretty ugly goto. Introduce
radix_tree_iter_next() and use it throughout shmem.c.
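A sketch of the helper: it arranges for the iterator to resume at the next index on the following loop pass:

static inline __must_check
void **radix_tree_iter_next(struct radix_tree_iter *iter)
{
	iter->next_index = iter->index + 1;
	iter->tags = 0;
	return NULL;
}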
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox [Tue, 9 Feb 2016 23:13:14 +0000 (10:13 +1100)]
mm: use radix_tree_iter_retry()
Instead of a 'goto restart', we can now use radix_tree_iter_retry() to
restart from our current position. This will make a difference when there
are more ways to come across an indirect pointer. And it eliminates some
confusing gotos.
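The resulting idiom, sketched (mapping->page_tree stands in for whichever tree is being iterated):

radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
	struct page *page = radix_tree_deref_slot(slot);

	if (radix_tree_deref_retry(page)) {
		/* restart from the current position, no goto needed */
		slot = radix_tree_iter_retry(&iter);
		continue;
	}
	/* ... process page ... */
}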
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Tue, 9 Feb 2016 23:13:13 +0000 (10:13 +1100)]
btrfs-use-radix_tree_iter_retry-fix
fs/btrfs/tests/btrfs-tests.c: In function 'btrfs_free_dummy_fs_info':
fs/btrfs/tests/btrfs-tests.c:149: error: incompatible type for argument 1 of 'radix_tree_iter_retry'
include/linux/radix-tree.h:399: note: expected 'struct radix_tree_iter *' but argument is of type 'struct radix_tree_iter'
Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: David Sterba <dsterba@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox [Tue, 9 Feb 2016 23:13:13 +0000 (10:13 +1100)]
btrfs: use radix_tree_iter_retry()
Even though this is a 'can't happen' situation, use the new
radix_tree_iter_retry() pattern to eliminate a goto.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: David Sterba <dsterba@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox [Tue, 9 Feb 2016 23:13:13 +0000 (10:13 +1100)]
radix_tree: add radix_tree_dump
This is debug code which is #if 0'd out.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox [Tue, 9 Feb 2016 23:13:12 +0000 (10:13 +1100)]
radix_tree: add support for multi-order entries
With huge pages, it is convenient to have the radix tree be able to return
an entry that covers multiple indices. Previous attempts to deal with the
problem have involved inserting N duplicate entries, which is a waste of
memory and leads to problems trying to handle aliased tags, or probing the
tree multiple times to find alternative entries which might cover the
requested index.
This approach inserts one canonical entry into the tree for a given range
of indices, and may also insert other entries in order to ensure that
lookups find the canonical entry.
This solution only tolerates inserting powers of two that are greater than
the fanout of the tree. If we wish to expand the radix tree's abilities
to support large-ish pages that are smaller than the fanout at the
penultimate level of the tree, then we would need to add one more step in
lookup to ensure that any sibling nodes in the final level of the tree are
dereferenced and we return the canonical entry that they reference.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox [Tue, 9 Feb 2016 23:13:12 +0000 (10:13 +1100)]
radix_tree: loop based on shift count, not height
When we introduce entries that can cover multiple indices, we will need to
stop in __radix_tree_create based on the shift, not the height. Split out
for ease of bisect.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox [Tue, 9 Feb 2016 23:13:12 +0000 (10:13 +1100)]
radix_tree: tag all internal tree nodes as indirect pointers
Set the 'indirect_ptr' bit on all the pointers to internal nodes, not just
on the root node. This enables the following patches to support
multi-order entries in the radix tree. This patch is split out for ease
of bisection.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox [Tue, 9 Feb 2016 23:13:11 +0000 (10:13 +1100)]
radix tree test harness
This code is mostly from Andrew Morton and Nick Piggin; tarball downloaded
from http://ozlabs.org/~akpm/rtth.tar.gz with sha1sum 0ce679db9ec047296b5d1ff7a1dfaa03a7bef1bd
Some small modifications were necessary to the test harness to fix the
build with the current Linux source code.
I also made minor modifications to automatically test the radix-tree.c and
radix-tree.h files that are in the current source tree, as opposed to a
copied and slightly modified version. I am sure more could be done to
tidy up the harness, as well as adding more tests.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Shuah Khan <shuahkh@osg.samsung.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew Wilcox [Tue, 9 Feb 2016 23:13:11 +0000 (10:13 +1100)]
radix-tree: add an explicit include of bitops.h
The radix-tree header uses the __ffs() function, which is defined in
bitops.h. The current kernel headers implicitly include bitops.h, but the
userspace test harness does not.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Heiko Carstens [Tue, 9 Feb 2016 23:13:10 +0000 (10:13 +1100)]
lib/bug.c: make panic_on_warn available for all architectures
Christian Borntraeger reported that panic_on_warn doesn't have any effect
on s390.
The panic_on_warn feature was introduced with 9e3961a09798 ("kernel: add
panic_on_warn"). However, it only handled the case when
WANT_WARN_ON_SLOWPATH is defined. This in turn is only the case for
architectures which do not have their own __WARN_TAINT defined.
Other architectures which do have __WARN_TAINT defined call report_bug()
for warnings within lib/bug.c which does not call panic() in case
panic_on_warn is set.
Let's simply enable the panic_on_warn feature by adding the same code
that was added to warn_slowpath_common() in panic.c.
This enables panic_on_warn also for arm64, parisc, powerpc, s390 and sh.
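The addition mirrors what warn_slowpath_common() already does; a sketch of the code dropped into report_bug() for the warning case:

if (panic_on_warn) {
	/*
	 * Clear it first: this thread may hit another WARN()
	 * on the panic path, which would otherwise recurse.
	 */
	panic_on_warn = 0;
	panic("panic_on_warn set ...\n");
}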
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Petr Mladek [Tue, 9 Feb 2016 23:13:09 +0000 (10:13 +1100)]
printk/nmi: increase the size of NMI buffer and make it configurable
Testing has shown that the backtrace sometimes does not fit into the 4kB
temporary buffer that is used in NMI context. The warnings are gone when
I double the temporary buffer size.
This patch doubles the buffer size and makes it configurable.
Note that this problem existed even in the x86-specific implementation
that was added by the commit a9edc8809328 ("x86/nmi: Perform a safe NMI
stack trace on all CPUs"). Nobody noticed it because it did not print any
warnings.
Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: Jiri Kosina <jkosina@suse.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: David Miller <davem@davemloft.net> Cc: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Petr Mladek [Tue, 9 Feb 2016 23:13:09 +0000 (10:13 +1100)]
printk/nmi: warn when some message has been lost in NMI context
We cannot resize the temporary buffer in NMI context, so let's warn if a
message is lost.
This is rather theoretical. printk() should not be used in NMI. The only
sensible use is when we want to print a backtrace from all CPUs. The
current buffer should be enough for this purpose.
[akpm@linux-foundation.org: whitespace fixlet] Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: Jiri Kosina <jkosina@suse.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: David Miller <davem@davemloft.net> Cc: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Petr Mladek [Tue, 9 Feb 2016 23:13:09 +0000 (10:13 +1100)]
printk/nmi: use IRQ work only when ready
NMIs could happen at any time. This patch makes sure that the safe
printk() in NMI will schedule IRQ work only when the related structs are
initialized.
All pending messages are flushed when the IRQ work is being initialized.
Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: Jiri Kosina <jkosina@suse.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: David Miller <davem@davemloft.net> Cc: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Petr Mladek [Tue, 9 Feb 2016 23:13:08 +0000 (10:13 +1100)]
printk/nmi: generic solution for safe printk in NMI
printk() takes some locks and cannot be used safely in NMI context.
The chance of a deadlock is real especially when printing stacks from all
CPUs. This particular problem has been addressed on x86 by the commit a9edc8809328 ("x86/nmi: Perform a safe NMI stack trace on all CPUs").
The patchset brings two big advantages. First, it makes the NMI
backtraces safe on all architectures for free. Second, it makes all NMI
messages almost safe on all architectures (the temporary buffer is
limited, so we should still keep the number of messages in NMI context to
a minimum).
Note that there already are several messages printed in NMI context:
WARN_ON(in_nmi()), BUG_ON(in_nmi()), anything being printed out from MCE
handlers. These are not easy to avoid.
This patch reuses most of the code and makes it generic. It is useful for
all messages and architectures that support NMI.
The alternative printk_func is set when entering and is reset when
leaving NMI context. It queues IRQ work to copy the messages into the
main ring buffer in a safe context.
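Sketched, the per-CPU switch of the printk function around NMI context (function and variable names per this series' description):

void printk_nmi_enter(void)
{
	this_cpu_write(printk_func, vprintk_nmi);
}

void printk_nmi_exit(void)
{
	this_cpu_write(printk_func, vprintk_default);
}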
__printk_nmi_flush() copies all available messages and resets the buffer.
We can then use simple cmpxchg operations to synchronize with writers.
A spinlock is also used to synchronize with other flushers.
We no longer use seq_buf because it depends on an external lock. It
would be hard to make all supported operations safe for lockless use, and
it would be confusing and error prone to make only some operations safe.
The code is put into separate printk/nmi.c as suggested by Steven Rostedt.
It needs a per-CPU buffer and is compiled only on architectures that call
nmi_enter(). This is achieved by the new HAVE_NMI Kconfig flag.
One exception is arm where the deferred printing is used for printing
backtraces even without NMI. For this purpose, we define NEED_PRINTK_NMI
Kconfig flag. The alternative printk_func is explicitly set when
IPI_CPU_BACKTRACE is handled.
The other exceptions are MN10300 and Xtensa architectures. We need to
clean up NMI handling there first. Let's do it separately.
The patch is heavily based on the draft from Peter Zijlstra, see
https://lkml.org/lkml/2015/6/10/327
[arnd@arndb.de: printk-nmi: use %zu format string for size_t]
[akpm@linux-foundation.org: min_t->min - all types are size_t here] Signed-off-by: Petr Mladek <pmladek@suse.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Cc: Jan Kara <jack@suse.cz> Cc: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Daniel Thompson <daniel.thompson@linaro.org> Cc: Jiri Kosina <jkosina@suse.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: David Miller <davem@davemloft.net> Cc: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tetsuo Handa [Tue, 9 Feb 2016 23:13:08 +0000 (10:13 +1100)]
kernel/hung_task.c: use timeout diff when timeout is updated
When a new timeout is written to /proc/sys/kernel/hung_task_timeout_secs,
khungtaskd is interrupted and again sleeps for the full timeout duration.
This means that hung tasks will not be checked if a new timeout is written
periodically within the old timeout duration, and/or the check for hung
tasks will be delayed for up to the previous timeout duration. Fix this
by remembering the last time khungtaskd checked for hung tasks.
This change will allow other watchdog tasks (if any) to share khungtaskd
by sleeping for the minimal timeout diff of all watchdog tasks. Running
more watchdog tasks from khungtaskd will reduce the possibility of
printk() collisions between multiple watchdog threads.
Michal Hocko [Tue, 9 Feb 2016 23:13:06 +0000 (10:13 +1100)]
mm: use watermark checks for __GFP_REPEAT high order allocations
__alloc_pages_slowpath retries costly allocations until at least an order
worth of pages has been reclaimed, or until the watermark check for at
least one zone would succeed after reclaiming all remaining pages, if the
reclaim hasn't made any progress.
The first condition was added by a41f24ea9fd6 ("page allocator: smarter
retry of costly-order allocations") and it assumed that lumpy reclaim
could have created a page of the sufficient order. Lumpy reclaim has been
removed quite some time ago so the assumption doesn't hold anymore.
would be more appropriate to check the compaction progress instead but
this patch simply removes the check and relies solely on the watermark
check.
To prevent too many retries, the no_progress_loops counter is not reset
after a reclaim round which made progress, because we cannot assume it
helped the high order situation. Only costly allocation requests depended
on pages_reclaimed, so we can drop it.
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Tue, 9 Feb 2016 23:13:06 +0000 (10:13 +1100)]
mm: throttle on IO only when there are too many dirty and writeback pages
wait_iff_congested has been used to throttle the allocator before it
retried another round of direct reclaim, to allow the writeback to make
some progress and to prevent reclaim from looping over dirty/writeback
pages without making any progress. We used to do congestion_wait before 0e093d99763e ("writeback: do not sleep on the congestion queue if there
are no congested BDIs or if significant congestion is not being
encountered in the current zone") but that led to undesirable stalls and
sleeping for the full timeout even when the BDI wasn't congested. Hence
wait_iff_congested was used instead. But it seems that even
wait_iff_congested doesn't work as expected. We might have a small file
LRU list with all pages dirty/writeback and yet the bdi is not congested,
so this ends up being just a cond_resched() and can trigger a premature
OOM.
This patch replaces the unconditional wait_iff_congested by
congestion_wait which is executed only if we _know_ that the last round of
direct reclaim didn't make any progress and dirty+writeback pages are more
than a half of the reclaimable pages on the zone which might be usable for
our target allocation. This shouldn't reintroduce the stalls fixed by 0e093d99763e because congestion_wait is called only when we are getting
hopeless and sleeping is a better choice than OOMing with many pages under
IO.
We have to preserve the logic introduced by "mm, vmstat: allow WQ
concurrency to discover memory reclaim doesn't make any progress" in
__alloc_pages_slowpath now that wait_iff_congested is not used anymore.
As the only remaining user of wait_iff_congested is shrink_inactive_list,
we can remove the WQ specific short sleep from wait_iff_congested because
the sleep only needs to be done once in the allocation retry cycle.
Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Tue, 9 Feb 2016 23:13:06 +0000 (10:13 +1100)]
mm-oom-rework-oom-detection-checkpatch-fixes
Cc: David Rientjes <rientjes@google.com>
WARNING: line over 80 characters
#99: FILE: mm/page_alloc.c:2965:
+ * zone list (with a backoff mechanism which is a function of no_progress_loops).
WARNING: line over 80 characters
#129: FILE: mm/page_alloc.c:2995:
+ * Keep reclaiming pages while there is a chance this will lead somewhere.
WARNING: line over 80 characters
#134: FILE: mm/page_alloc.c:3000:
+ for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) {
WARNING: line over 80 characters
#138: FILE: mm/page_alloc.c:3004:
+ available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES);
WARNING: line over 80 characters
#142: FILE: mm/page_alloc.c:3008:
+ * Would the allocation succeed if we reclaimed the whole available?
WARNING: line over 80 characters
#146: FILE: mm/page_alloc.c:3012:
+ /* Wait for some write requests to complete then retry */
total: 0 errors, 6 warnings, 202 lines checked
./patches/mm-oom-rework-oom-detection.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: David Rientjes <rientjes@google.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Michal Hocko [Tue, 9 Feb 2016 23:13:05 +0000 (10:13 +1100)]
mm, oom: rework oom detection
As pointed out by Linus [2][3], relying on zone_reclaimable as a way to
communicate the reclaim progress is rather dubious. I tend to agree: not
only is it really obscure, it is also not hard to imagine cases where a
single page freed in the loop keeps all the reclaimers looping without
making any progress, because their gfp_mask wouldn't allow them to get
that page anyway (e.g. a single GFP_ATOMIC alloc-and-free loop). This is
rather rare so it doesn't happen in practice, but the current logic is
obscure, hard to follow and also non-deterministic.
This is an attempt to make the OOM detection more deterministic and easier
to follow, because each reclaimer basically tracks its own progress, which
is implemented at the page allocator layer rather than spread out between
the allocator and the reclaim. More on the implementation is described in
the first patch.
I have tested several different scenarios, but it should be clear that
testing the OOM killer in a representative way is quite hard. There is
usually a tiny gap between almost-OOM and full-blown OOM, which is often
time sensitive. Anyway, I have tested the following 3 scenarios and I
would appreciate it if there are more to test.
Testing environment: a virtual machine with 2G of RAM and 2CPUs without
any swap to make the OOM more deterministic.
1) 2 writers (each doing dd with 4M blocks to a 1G xfs partition,
removing the files and starting over again) running in parallel for 10s
to build up a lot of dirty pages, then 100 parallel mem_eaters (anon
private populated mmap which waits until it gets a signal), 80M
each.
This causes an OOM flood of course, and I have compared both patched
and unpatched kernels. The test is considered finished when there
are no OOM conditions detected. This should tell us whether there are
any excessive kills, or whether some of them were premature:
I have performed two runs this time, each after a fresh boot.
The number of OOM invocations is consistent with my last measurements but
the runtime is way too different (it took 800+s). One thing that could
have skewed the results was that I was tail -f'ing the serial log on the
host system to watch the progress. I have stopped doing that. The results
are more consistent now but still too different from the last time. This
is really weird, so I've retested with the last 4.2 mmotm again and I am
getting a consistent ~220s which is really close to the above. If I apply
the WQ vmstat patch on top I am getting close to 160s, so the stale vmstat
counters made a difference, which is to be expected. I have a new SSD in
my laptop which might have made a difference, but I wouldn't expect it to
be that large.
So the number of OOM killer invocations is the same, but the overall
runtime of the test was much longer with the patched kernel. This can be
attributed to more retries in general. The results from the base kernel
are quite inconsistent and I think that consistency is better here.
2) 2 writers again with a 10s run, and then 10 mem_eaters to consume as
much memory as possible without triggering the OOM killer. This required
a lot of tuning, but I've considered 3 consecutive runs without OOM as a
success.
* base kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(15*1024)}' /proc/meminfo)
It was -14M for the base 4.2 kernel and -7500M for the patched 4.2 kernel in
my last measurements.
The patched kernel handled the low memory conditions better and fired the
OOM killer later.
3) Costly high-order allocations with a limited amount of memory.
Start 10 memeaters in parallel each with
size=$(awk '/MemTotal/{printf "%d\n", $2/10}' /proc/meminfo)
This will trigger the OOM killer, which will kill one of them, freeing up
200M, and then try to use all the remaining space for hugetlb pages. See
how many of them pass, kill everything, wait 2s and try again.
This tests whether we do not fail __GFP_REPEAT costly allocations too early
now.
* base kernel
$ sort base-hugepages.log | uniq -c
1 64
13 65
6 66
20 Trying to allocate 73
This also doesn't look very bad but this particular test is quite timing
sensitive.
The above results do seem optimistic, but more loads should obviously be
tested. I would really appreciate feedback on the approach I have chosen
before I go into more tuning. Is this a viable way to go?
__alloc_pages_slowpath has traditionally relied on the direct reclaim and
did_some_progress as an indicator that it makes sense to retry allocation
rather than declaring OOM. shrink_zones had to rely on zone_reclaimable
if shrink_zone didn't make any progress, to prevent a premature OOM
killer invocation - the LRU might be full of dirty or writeback pages and
direct reclaim cannot clean those up.
zone_reclaimable allows rescanning the reclaimable lists several times,
restarting if a page is freed. This is really subtle behavior and it
might lead to a livelock when a single freed page keeps the allocator
looping but the current task will not be able to allocate that single
page. The OOM killer would be more appropriate than looping without any
progress for an unbounded amount of time.
This patch changes the OOM detection logic and pulls it out of
shrink_zone, which is at too low a level for any high level decision such
as OOM, which is a per-zonelist property. It is __alloc_pages_slowpath
which knows how many attempts have been made and what the progress has
been so far, therefore it is more appropriate to implement this logic
there.
The new heuristic is implemented in the should_reclaim_retry helper,
called from __alloc_pages_slowpath. It tries to be more deterministic and
easier
to follow. It builds on an assumption that retrying makes sense only if
the currently reclaimable memory + free pages would allow the current
allocation request to succeed (as per __zone_watermark_ok) at least for
one zone in the usable zonelist.
This alone wouldn't be sufficient, though, because the writeback might get
stuck and reclaimable pages might be pinned for a really long time or even
depend on the current allocation context. Therefore there is a feedback
mechanism implemented which reduces the reclaim target after each reclaim
round without any progress. This means that we should eventually converge
to only NR_FREE_PAGES as the target and fail on the wmark check and
proceed to OOM. The backoff is simple and linear with 1/16 of the
reclaimable pages for each round without any progress. We are optimistic
and reset counter for successful reclaim rounds.
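A sketch of the backoff arithmetic, with MAX_RECLAIM_RETRIES assumed to be 16 to match the 1/16-per-round figure above (the checkpatch output quoted earlier in this series shows the corresponding line):

unsigned long available = reclaimable + free;

/* each round without progress shrinks the target by 1/16 */
available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES);

/* retry only while the watermark check could still pass with "available" */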
Costly high order allocations mostly preserve their semantics: those
without __GFP_REPEAT fail right away, while those which have the flag set
will back off after the amount of reclaimable pages reaches the
equivalent of the requested order. The only difference is that if there
was no progress during the reclaim we rely on the zone watermark check.
This is a more logical thing to do than the previous 1<<order attempts,
which were a result of zone_reclaimable faking the progress.
[hannes@cmpxchg.org: separate the heuristic into should_reclaim_retry]
[rientjes@google.com: use zone_page_state_snapshot for NR_FREE_PAGES]
[rientjes@google.com: shrink_zones doesn't need to return anything] Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tetsuo Handa [Tue, 9 Feb 2016 23:13:05 +0000 (10:13 +1100)]
mm,oom: do not loop !__GFP_FS allocation if the OOM killer is disabled.
After the OOM killer is disabled during suspend operation, any
!__GFP_NOFAIL && __GFP_FS allocations are forced to fail. Thus, any
!__GFP_NOFAIL && !__GFP_FS allocations should be forced to fail as well.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tetsuo Handa [Tue, 9 Feb 2016 23:13:04 +0000 (10:13 +1100)]
mm,oom: make oom_killer_disable() killable.
While oom_killer_disable() is called by freeze_processes() after all user
threads except the current thread are frozen, it is possible that kernel
threads invoke the OOM killer and send SIGKILL to the current thread
because it shares the thawed victim's memory. Therefore, checking for
SIGKILL is preferable to checking TIF_MEMDIE.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
I've been asked several very simple questions:
a) How can I ensure that zram uses (or used) several compression
streams?
b) What is the current number of comp streams (how much memory
does zram *actually* use for compression streams, if there are
more than one stream)?
zram, indeed, does not provide any such info and does not answer these
questions. Reading from `max_comp_streams' lets one estimate only the
theoretical compression stream memory consumption, which assumes that
zram will allocate max_comp_streams. However, it's possible that the real
number of compression streams will never reach that max value, for
various reasons, e.g. max_comp_streams is too high, etc.
The patch adds an `avail_streams' column to the /sys/block/zram<id>/mm_stat
device file. For a single compression stream backend it's always 1; for a
multi-stream backend it shows the actual ->avail_strm value.
The number of allocated compression streams answers several
questions:
a) the current `level of concurrency' that the device has
experienced
b) the amount of memory used by compression streams (by multiplying
the `avail_streams' column value by the ->buffer size and the
algorithm's specific scratch buffer size; the last two are easy to
find out, unlike `avail_streams').
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
khugepaged does swap-in during collapse under the anon_vma lock. This
causes a complaint from lockdep. The trace below shows the following
scenario:
- khugepaged tries to swap in a page under mmap_sem and anon_vma lock;
- do_swap_page() calls swapin_readahead() with GFP_HIGHUSER_MOVABLE;
- __read_swap_cache_async() tries to allocate the page for swap in;
- lockdep_trace_alloc() in __alloc_pages_nodemask() notices that with
given gfp_mask we could end up in direct relaim.
- Lockdep already knows that reclaim sometimes (e.g. in case of
split_huge_page()) wants to take anon_vma lock on its own.
Therefore deadlock is possible.
The fix is to take anon_vma lock after swap in.
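A sketch of the reordering in the collapse path (function names as used
by the khugepaged code; surrounding logic elided):

    /* Swap-in first, outside the anon_vma lock, so its allocations
     * can enter reclaim without inverting the lock order: */
    __collapse_huge_page_swapin(mm, vma, address, pmd);

    /* Only after all pages are resident do we take the lock. */
    anon_vma_lock_write(vma->anon_vma);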
[18344.236625] =================================
[18344.236628] [ INFO: inconsistent lock state ]
[18344.236633] 4.3.0-rc1-next-20150918-dbg-00014-ge5128d0-dirty #361 Not tainted
[18344.236636] ---------------------------------
[18344.236640] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
[18344.236645] khugepaged/32 [HC0[0]:SC0[0]:HE1:SE1] takes:
[18344.236648] (&anon_vma->rwsem){++++?.}, at: [<ffffffff81134403>] khugepaged+0x8b0/0x1987
[18344.236662] {IN-RECLAIM_FS-W} state was registered at:
[18344.236666] [<ffffffff8107d747>] __lock_acquire+0x8e2/0x1183
[18344.236673] [<ffffffff8107e7ac>] lock_acquire+0x10b/0x1a6
[18344.236678] [<ffffffff8150a367>] down_write+0x3b/0x6a
[18344.236686] [<ffffffff811360d8>] split_huge_page_to_list+0x5b/0x61f
[18344.236689] [<ffffffff811224b3>] add_to_swap+0x37/0x78
[18344.236691] [<ffffffff810fd650>] shrink_page_list+0x4c2/0xb9a
[18344.236694] [<ffffffff810fe47c>] shrink_inactive_list+0x371/0x5d9
[18344.236696] [<ffffffff810fee2f>] shrink_lruvec+0x410/0x5ae
[18344.236698] [<ffffffff810ff024>] shrink_zone+0x57/0x140
[18344.236700] [<ffffffff810ffc79>] kswapd+0x6a5/0x91b
[18344.236702] [<ffffffff81059588>] kthread+0x107/0x10f
[18344.236706] [<ffffffff8150c7bf>] ret_from_fork+0x3f/0x70
[18344.236708] irq event stamp: 6517947
[18344.236709] hardirqs last enabled at (6517947): [<ffffffff810f2d0c>] get_page_from_freelist+0x362/0x59e
[18344.236713] hardirqs last disabled at (6517946): [<ffffffff8150ba41>] _raw_spin_lock_irqsave+0x18/0x51
[18344.236715] softirqs last enabled at (6507072): [<ffffffff81041cb0>] __do_softirq+0x2df/0x3f5
[18344.236719] softirqs last disabled at (6507055): [<ffffffff81041fb5>] irq_exit+0x40/0x94
[18344.236722]
other info that might help us debug this:
[18344.236723] Possible unsafe locking scenario:
Ebru Akagunduz [Tue, 9 Feb 2016 23:13:02 +0000 (10:13 +1100)]
mm: make swapin readahead to improve thp collapse rate
This patch performs swapin readahead to improve the THP collapse rate.
When khugepaged scans pages, a few of them may be in the swap area. With
the patch, khugepaged can collapse 4kB pages into a THP when there are up
to max_ptes_swap swap PTEs in a 2MB range.
The patch was tested with a test program that allocates 400MB of memory,
writes to it, and then sleeps. I force the system to swap out all of it.
Afterwards, the test program touches the area by writing again, skipping
one page in every 20 pages of the area.
Without the patch, the system did not do swapin readahead; the THP
coverage stayed at 65% of the program's memory and did not change over
time. With this patch, after 10 minutes of waiting, khugepaged had
collapsed 99% of the program's memory.
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/khugepaged: fix scan not aborted on SCAN_EXCEED_SWAP_PTE
This patch fixes a typo in khugepaged_scan_pmd(): instead of setting
"result" to SCAN_EXCEED_SWAP_PTE we set "ret". Setting "ret" results in
an attempt to collapse a huge page although we meant to abort the scan.
As a result, we can call khugepaged_find_target_node() with all entries
in the khugepaged_node_load array being zeros. The latter is not ready
for that and might return an offline node on such input, which leads to a
warning followed by a kernel panic.
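A minimal sketch of the one-word fix in khugepaged_scan_pmd()
(surrounding code assumed):

    if (++unmapped > khugepaged_max_ptes_swap) {
            /* was: ret = SCAN_EXCEED_SWAP_PTE, which let the collapse
             * proceed; setting "result" aborts the scan as intended. */
            result = SCAN_EXCEED_SWAP_PTE;
            goto out_unmap;
    }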
Ebru Akagunduz [Tue, 9 Feb 2016 23:13:01 +0000 (10:13 +1100)]
mm: make optimistic check for swapin readahead
Introduce a new sysfs integer knob,
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap, which
enables an optimistic check for swapin readahead to increase the THP
collapse rate. Before bringing swapped-out pages back to memory,
khugepaged checks them and allows up to max_ptes_swap of them. It also
reports the number of unmapped PTEs via tracepoints.
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ksm: introduce ksm_max_page_sharing per page deduplication limit
Without a max deduplication limit for each KSM page, the list of the
rmap_items associated with each stable_node can grow infinitely large.
During the rmap walk each entry can take up to ~10usec to process because
of IPIs for the TLB flushing (both for the primary MMU and the secondary
MMUs with the MMU notifier). With only 16GB of address space shared in
the same KSM page, that would amount to dozens of seconds of kernel
runtime.
A ~256 max deduplication factor will reduce the latencies of the rmap
walks on KSM pages to the order of a few msec. Just doing the
cond_resched() during the rmap walks is not enough; the list size must
have a limit too, otherwise the caller could get blocked in
(schedule-friendly) kernel computations for seconds, unexpectedly.
There's room for optimization to significantly reduce the IPI delivery
cost during page_referenced(), but at least for page migration in the
KSM case (used by hard NUMA bindings, compaction and NUMA balancing) it
may be inevitable to send lots of IPIs if each rmap_item->mm is active on
a different CPU and there are lots of CPUs. Even if we ignore the IPI
delivery cost, we still have to walk the whole KSM rmap list, so we can't
allow an unlimited (millions or billions) number of entries in the KSM
stable_node rmap_item lists.
The limit is enforced efficiently by adding a second dimension to the
stable rbtree. So there are three types of stable_nodes: the regular ones
(identical as before, living in the first flat dimension of the stable
rbtree), the "chains" and the "dups".
Every "chain" and all "dups" linked into a "chain" enforce the invariant
that they represent the same write protected memory content, even if each
"dup" will be pointed by a different KSM page copy of that content. This
way the stable rbtree lookup computational complexity is unaffected if
compared to an unlimited max_sharing_limit. It is still enforced that
there cannot be KSM page content duplicates in the stable rbtree itself.
Adding the second dimension to the stable rbtree only after the
max_page_sharing limit is hit provides for a zero memory footprint
increase on 64-bit archs. The memory overhead of the per-KSM-page
stable_tree and per-virtual-mapping rmap_item is unchanged. Only after
the max_page_sharing limit is hit do we need to allocate a stable_tree
"chain" and rb_replace() the "regular" stable_node with the newly
allocated stable_node "chain". After that we simply add the "regular"
stable_node to the chain as a stable_node "dup" by linking hlist_dup into
the stable_node_chain->hlist. This way the "regular" (flat) stable_node
is converted to a stable_node "dup" living in the second dimension of the
stable rbtree.
During stable rbtree lookups the stable_node "chain" is identified as
stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
is_stable_node_chain()).
When dropping stable_nodes, the stable_node "dup" is identified as
stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).
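A sketch of the two identity checks (struct layout abbreviated; the
sentinel values shown are assumptions consistent with the description
above and below):

    extern struct list_head migrate_nodes;

    #define STABLE_NODE_CHAIN    -1024  /* any negative value works */
    #define STABLE_NODE_DUP_HEAD ((struct list_head *)&migrate_nodes.prev)

    struct stable_node {
            struct list_head *head;     /* abbreviated layout */
            struct hlist_head hlist;
            int rmap_hlist_len;
    };

    static inline bool is_stable_node_chain(struct stable_node *node)
    {
            return node->rmap_hlist_len == STABLE_NODE_CHAIN;
    }

    static inline bool is_stable_node_dup(struct stable_node *node)
    {
            return node->head == STABLE_NODE_DUP_HEAD;
    }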
STABLE_NODE_DUP_HEAD must be a unique valid pointer never used elsewhere
in any stable_node->head/node, to avoid clashes with the
stable_node->node.rb_parent_color pointer, and it must differ from
&migrate_nodes. So the second field of &migrate_nodes is picked and
verified as always safe with a BUILD_BUG_ON in case the list_head
implementation changes in the future.
STABLE_NODE_CHAIN is picked as a random negative value in
stable_node->rmap_hlist_len; rmap_hlist_len cannot become negative when
the node is a "regular" stable_node or a stable_node "dup".
The stable_node_chain->nid is irrelevant. The stable_node_chain->kpfn is
aliased in a union with a time field used to rate limit the
stable_node_chain->hlist prunes.
The garbage collection of the stable_node_chain happens lazily during
stable rbtree lookups (as for all other kinds of stable_nodes), or while
disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run", which collects the
entire stable rbtree.
While the "regular" stable_nodes and the stable_node "dups" must wait for
their underlying tree_page to be freed before they can be freed
themselves, the stable_node "chains" can be freed immediately if the
stable_node->hlist turns empty. This is because the "chains" are never
pointed by any page->mapping and they're effectively stable rbtree KSM
self contained metadata.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Petr Holasek <pholasek@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Counts how many times we put a THP in the split queue. Currently, this
happens on partial unmap of a THP.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: workingset: make shadow node shrinker memcg aware
Workingset code was recently made memcg aware, but shadow node shrinker is
still global. As a result, one small cgroup can consume all memory
available for shadow nodes, possibly hurting other cgroups by reclaiming
their shadow nodes, even though reclaim distances stored in its shadow
nodes have no effect. To avoid this, we need to make shadow node shrinker
memcg aware.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
A page is activated on refault if the refault distance stored in the
corresponding shadow entry is less than the number of active file pages.
Since active file pages can't occupy more than half of memory, we assume
that the maximal effective refault distance can't be greater than half the
number of present pages, and size the shadow nodes lru list appropriately.
Generally speaking, this assumption is correct, but it can result in
wasting a considerable chunk of memory on stale shadow nodes in case the
portion of file pages is small, e.g. if a workload mostly uses anonymous
memory.
To sort this out, we need to compute the size of the shadow nodes lru
based not on the maximal possible, but on the current size of the file
cache. We could take the size of the active file lru as the maximal
refault distance, but the active lru is pretty unstable - it can shrink
dramatically at runtime, possibly disrupting the workingset detection
logic.
Instead we assume that the maximal refault distance equals half the total
number of file cache pages, as sketched below. This will protect us
against active file lru size fluctuations while still being correct,
because the size of the active lru is normally maintained lower than the
size of the inactive lru.
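A sketch of the sizing rule (the vmstat helpers named here are
assumptions from the global stats API of that era):

    /* Maximal effective refault distance: half of the file cache,
     * which is stable against active/inactive rebalancing. */
    unsigned long file_pages = global_page_state(NR_ACTIVE_FILE) +
                               global_page_state(NR_INACTIVE_FILE);
    unsigned long max_refault_distance = file_pages / 2;
    /* The shadow node lru is then sized to hold roughly the
     * radix_tree_nodes backing that many shadow entries. */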
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
radix-tree: account radix_tree_node to memory cgroup
Allocation of radix_tree_node objects can be easily triggered from
userspace, so we should account them to the memory cgroup. Besides, we
need them accounted to make the shadow node shrinker per-memcg (see
mm/workingset.c).
A tricky thing about accounting radix_tree_node objects is that they are
mostly allocated through radix_tree_preload(), so we can't just set
SLAB_ACCOUNT for radix_tree_node_cachep - that would likely result in a
lot of unrelated cgroups using objects from each other's caches.
One way to overcome this would be making radix tree preloads per memcg,
but that would probably look cumbersome and overcomplicated.
Instead, we make radix_tree_node_alloc() first try to allocate from the
cache with __GFP_ACCOUNT, no matter whether the caller has preloaded or
not, and fall back on the per-cpu preloads only if that fails. This
should make most allocations accounted; see the sketch below.
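A sketch of that allocation order (error paths and the
non-blocking-context handling trimmed; take_from_preload() is an assumed
stand-in for the existing per-cpu preload path):

    static struct radix_tree_node *radix_tree_node_alloc(gfp_t gfp_mask)
    {
            struct radix_tree_node *ret;

            /* Charge the caller's cgroup whenever possible, even if
             * the caller went through radix_tree_preload(). */
            ret = kmem_cache_alloc(radix_tree_node_cachep,
                                   gfp_mask | __GFP_ACCOUNT | __GFP_NOWARN);
            if (ret)
                    return ret;

            /* Fall back on the (uncharged) per-cpu preload. */
            return take_from_preload();
    }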
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As kmem accounting is now either enabled for all cgroups or disabled
system-wide, there's no point in having the memcg_kmem_online() helper -
instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as
shrink_slab() now does.
There are only two places left where this helper is used -
__memcg_kmem_charge() and memcg_create_kmem_cache(). The former can only
be called if memcg_kmem_enabled() returned true. Since the cgroup it
operates on is online, a mem_cgroup_is_root() check will be enough.
memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead of
memcg_kmem_online(), because it relies on the fact that in
memcg_offline_kmem() memcg->kmem_state is changed before
memcg_deactivate_kmem_caches() is called, but there we can just open-code
the check.
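Sketches of the two open-coded replacements (helper and enum spellings
here are assumptions based on the kmem_state machinery, not the literal
patch):

    /* __memcg_kmem_charge(): memcg_kmem_enabled() is already known
     * true and the cgroup is online, so root is the only case left
     * to filter out: */
    if (!mem_cgroup_is_root(memcg))
            ret = memcg_charge_kmem(memcg, gfp, nr_pages);

    /* memcg_create_kmem_cache(): rely on kmem_state being changed
     * before memcg_deactivate_kmem_caches() runs: */
    if (memcg->kmem_state != KMEM_ONLINE)
            return;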
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: vmscan: pass root_mem_cgroup instead of NULL to memcg aware shrinker
It is simply convenient to implement a memcg-aware shrinker when you know
that shrink_control->memcg cannot be NULL unless memcg_kmem_enabled()
returns false.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: memcontrol: enable kmem accounting for all cgroups in the legacy hierarchy
Workingset code was recently made memcg aware, but shadow node shrinker is
still global. As a result, one small cgroup can consume all memory
available for shadow nodes, possibly hurting other cgroups by reclaiming
their shadow nodes, even though reclaim distances stored in its shadow
nodes have no effect. To avoid this, we need to make shadow node shrinker
memcg aware.
The actual work is done in patch 6 of the series. Patches 1 and 2 prepare
memcg/shrinker infrastructure for the change. Patch 3 is just a
collateral cleanup. Patch 4 makes radix_tree_node accounted, which is
necessary for making the shadow node shrinker memcg aware. Patch 5
reduces shadow node overhead in case the workload mostly uses anonymous
pages.
This patch:
Currently, in the legacy hierarchy kmem accounting is off for all cgroups
by default and must be enabled explicitly by writing something to
memory.kmem.limit_in_bytes. Since we don't support reclaim on hitting
the kmem limit, nor do we have any plans to implement it, the value
written is likely to be -1, which merely enables kmem accounting so that
kernel memory consumption is limited by memory.limit_in_bytes along with
user memory.
This user API was introduced when the implementation of kmem accounting
lacked slab shrinker support and hence was useless in practice. Things
have changed since then - slab shrinkers were made memcg aware, the
accounting overhead seems to be negligible, and a failure to charge a kmem
allocation should not have critical consequences, because we only account
those kernel objects that should be safe to fail. That's why kmem
accounting is enabled by default for all cgroups in the default hierarchy,
which will eventually replace the legacy one.
The ability to enable kmem accounting for some cgroups while keeping it
disabled for others is getting difficult to maintain. E.g. to make
shadow node shrinker memcg aware (see mm/workingset.c), we need to know
the relationship between the number of shadow nodes allocated for a cgroup
and the size of its lru list. If kmem accounting is enabled for all
cgroups there is no problem, but what should we do if kmem accounting is
enabled only for half of the cgroups? We would have no choice but to use
global lru stats while scanning the root cgroup's shadow nodes, but that
would be wrong if kmem accounting were in fact enabled for all cgroups
(which is the case if the unified hierarchy is used), in which case we
should use the lru stats of the root cgroup's lruvec.
That being said, let's enable kmem accounting for all memory cgroups by
default. If one finds it unstable or too costly, it can always be
disabled system-wide by passing cgroup.memory=nokmem to the kernel at boot
time.
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Denys Vlasenko [Tue, 9 Feb 2016 23:12:57 +0000 (10:12 +1100)]
include/linux/page-flags.h: force inlining of selected page flag modifications
Sometimes gcc mysteriously doesn't inline
very small functions we expect to be inlined. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
the following functions get deinlined many times - each translation unit
that uses them can end up carrying its own out-of-line copy, as the
disassembly shows.
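A sketch of the fix pattern for the page-flag helpers (macro shape
approximated from include/linux/page-flags.h):

    /* Replace plain `inline', which gcc may ignore under -Os, with
     * __always_inline for these one-instruction helpers: */
    #define SETPAGEFLAG(uname, lname)                                \
    static __always_inline void SetPage##uname(struct page *page)    \
            { set_bit(PG_##lname, &page->flags); }

    #define CLEARPAGEFLAG(uname, lname)                              \
    static __always_inline void ClearPage##uname(struct page *page)  \
            { clear_bit(PG_##lname, &page->flags); }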
Denys Vlasenko [Tue, 9 Feb 2016 23:12:56 +0000 (10:12 +1100)]
bufferhead: Force inlining of buffer head flag operations
With both gcc 4.7.2 and 4.9.2, sometimes gcc mysteriously doesn't inline
very small functions we expect to be inlined. See
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122
With this .config:
http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
set_buffer_foo(), clear_buffer_foo() and similar functions get deinlined
about 60 times - each object file that uses them can end up carrying its
own out-of-line copy, as the disassembly shows.
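The same pattern applies to the buffer-head helpers (approximated from
include/linux/buffer_head.h):

    #define BUFFER_FNS(bit, name)                                         \
    static __always_inline void set_buffer_##name(struct buffer_head *bh) \
    {                                                                     \
            set_bit(BH_##bit, &(bh)->b_state);                            \
    }                                                                     \
    static __always_inline void clear_buffer_##name(struct buffer_head *bh) \
    {                                                                     \
            clear_bit(BH_##bit, &(bh)->b_state);                          \
    }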
tools/vm/page-types.c: add memory cgroup dumping and filtering
On Sat, Feb 06, 2016 at 01:06:29PM +0300, Konstantin Khlebnikov wrote:
...
> static int opt_list; /* list pages (in ranges) */
> static int opt_no_summary; /* don't show summary */
> static pid_t opt_pid; /* process to walk */
> -const char * opt_file;
> +const char * opt_file; /* file or directory path */
> +static int64_t opt_cgroup = -1;/* cgroup inode */
ino should be a positive number, so we could use uint64_t here. Of
course, ino=0 could be used for filtering pages not charged to any
cgroup (as it is in this patch), but I doubt this would be useful.
Also, this patch conflicts with the recent change by Naoya introducing
support for dumping swap entries - https://lkml.org/lkml/2016/2/4/50
I attached a fixlet that addresses these two issues. What do you think
about it?
Other than that the patch looks good to me,
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dan Williams [Tue, 9 Feb 2016 23:12:55 +0000 (10:12 +1100)]
mm: CONFIG_NR_ZONES_EXTENDED
ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new mm
zones that are bumping up against the current maximum limit of 4 zones,
i.e. 2 bits in page->flags. When adding a zone this equation still needs
to be satisified:
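A sketch of the constraint, as enforced by the build-time check on the
page->flags layout (exact macro names vary across kernel versions):

    /* All of these fields must fit in page->flags alongside the
     * flag bits themselves: */
    #if SECTIONS_WIDTH + NODES_WIDTH + ZONES_WIDTH + LAST_CPUPID_SHIFT \
            > BITS_PER_LONG - NR_PAGEFLAGS
    #error "page->flags layout overflows"
    #endif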
ZONE_DEVICE currently tries to satisfy this equation by requiring that
ZONE_DMA be disabled, but this is untenable given generic kernels want to
support ZONE_DEVICE and ZONE_DMA simultaneously. ZONE_CMA would like to
increase the amount of memory covered per section, but that limits the
minimum granularity at which consecutive memory ranges can be added via
devm_memremap_pages().
The trade-off of what is acceptable to sacrifice depends heavily on the
platform. For example, ZONE_CMA is targeted for 32-bit platforms where
page->flags is constrained, but those platforms likely do not care about
the minimum granularity of memory hotplug. A big iron machine with 1024
numa nodes can likely sacrifice ZONE_DMA where a general purpose
distribution kernel cannot.
CONFIG_NR_ZONES_EXTENDED is a configuration symbol that gets selected when
the number of configured zones exceeds 4. It documents the configuration
symbols and definitions that get modified when ZONES_WIDTH is greater than
2.
For now, it steals a bit from NODES_SHIFT. Later on it can be used to
document the definitions that get modified when a 32-bit configuration
wants more zone bits.
Note that GFP_ZONE_TABLE poses an interesting constraint since
include/linux/gfp.h gets included by the 32-bit portion of a 64-bit build.
We need to be careful to only build the table for zones that have a
corresponding gfp_t flag. GFP_ZONES_SHIFT is introduced for this purpose.
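A plausible shape of the new definition (an assumption sketched from the
description, not the literal patch):

    /* The GFP zone table must use a stride that both the 32-bit and
     * 64-bit sides of the build agree on, covering only zones that
     * have a corresponding gfp_t flag: */
    #if defined(CONFIG_ZONE_DEVICE) && (MAX_NR_ZONES - 1) <= 4
    /* ZONE_DEVICE has no gfp_t flag and must not widen the stride */
    #define GFP_ZONES_SHIFT 2
    #else
    #define GFP_ZONES_SHIFT ZONES_SHIFT
    #endif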
This patch does not attempt to solve the problem of adding a new zone
that also has a corresponding GFP_ flag.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931 Fixes: 033fbae988fc ("mm: ZONE_DEVICE for "device memory"") Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Mark <markk@clara.co.uk> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>