Vlastimil Babka [Tue, 9 Feb 2016 23:12:17 +0000 (10:12 +1100)]
mm, page_alloc: print symbolic gfp_flags on allocation failure
It would be useful to translate gfp_flags into string representation when
printing in case of an allocation failure, especially as the flags have
been undergoing some changes recently and the script
./scripts/gfp-translate needs a matching source version to be accurate.
Vlastimil Babka [Tue, 9 Feb 2016 23:12:17 +0000 (10:12 +1100)]
mm, debug: replace dump_flags() with the new printk formats
With the new printk format strings for flags, we can get rid of
dump_flags() in mm/debug.c.
This also fixes dump_vma() which used dump_flags() for printing vma flags.
However dump_flags() did a page-flags specific filtering of bits higher
than NR_PAGEFLAGS in order to remove the zone id part. For dump_vma()
this resulted in removing several VM_* flags from the symbolic
translation.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Due to rebasing mistake, mmdebug.h keeps including tracepoint.h, causing
header dependency issues on some arches.
Remove the include, and related declarations of flags arrays, which reside
in mm/internal.h and lib/vsprintf.c already includes that header.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Tue, 9 Feb 2016 23:12:16 +0000 (10:12 +1100)]
mm, printk: introduce new format string for flags
In mm we use several kinds of flags bitfields that are sometimes printed
for debugging purposes, or exported to userspace via sysfs. To make them
easier to interpret independently on kernel version and config, we want to
dump also the symbolic flag names. So far this has been done with
repeated calls to pr_cont(), which is unreliable on SMP, and not usable
for e.g. sysfs export.
To get a more reliable and universal solution, this patch extends printk()
format string for pointers to handle the page flags (%pGp), gfp_flags
(%pGg) and vma flags (%pGv). Existing users of dump_flag_names() are
converted and simplified.
It would be possible to pass flags by value instead of pointer, but the %p
format string for pointers already has extensions for various kernel
structures, so it's a good fit, and the extra indirection in a
non-critical path is negligible.
[linux@rasmusvillemoes.dk: lots of good implementation suggestions] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Tue, 9 Feb 2016 23:12:16 +0000 (10:12 +1100)]
mm, tracing: unify mm flags handling in tracepoints and printk
In tracepoints, it's possible to print gfp flags in a human-friendly
format through a macro show_gfp_flags(), which defines a translation array
and passes is to __print_flags(). Since the following patch will
introduce support for gfp flags printing in printk(), it would be nice to
reuse the array. This is not straightforward, since __print_flags() can't
simply reference an array defined in a .c file such as mm/debug.c - it has
to be a macro to allow the macro magic to communicate the format to
userspace tools such as trace-cmd.
The solution is to create a macro __def_gfpflag_names which is used both
in show_gfp_flags(), and to define the gfpflag_names[] array in
mm/debug.c.
On the other hand, mm/debug.c also defines translation tables for page
flags and vma flags, and desire was expressed (but not implemented in this
series) to use these also from tracepoints. Thus, this patch also renames
the events/gfpflags.h file to events/mmflags.h and moves the table
definitions there, using the same macro approach as for gfpflags. This
allows translating all three kinds of mm-specific flags both in
tracepoints and printk.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Michal Hocko <mhocko@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Tue, 9 Feb 2016 23:12:15 +0000 (10:12 +1100)]
tools, perf: make gfp_compact_table up to date
When updating tracing's show_gfp_flags() I have noticed that perf's
gfp_compact_table is also outdated. Fill in the missing flags and place a
note in gfp.h to increase chance that future updates are synced. Convert
the __GFP_X flags from "GFP_X" to "__GFP_X" strings in line with the
previous patch.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Tue, 9 Feb 2016 23:12:15 +0000 (10:12 +1100)]
mm, tracing: make show_gfp_flags() up to date
The show_gfp_flags() macro provides human-friendly printing of gfp flags
in tracepoints. However, it is somewhat out of date and missing several
flags. This patches fills in the missing flags, and distinguishes
properly between GFP_ATOMIC and __GFP_ATOMIC which were both translated to
"GFP_ATOMIC". More generally, all __GFP_X flags which were previously
printed as GFP_X, are now printed as __GFP_X, since ommiting the
underscores results in output that doesn't actually match the source code,
and can only lead to confusion. Where both variants are defined equal
(e.g. _DMA and _DMA32), the variant without underscores are preferred.
Also add a note in gfp.h so hopefully future changes will be synced
better.
__GFP_MOVABLE is defined twice in include/linux/gfp.h with different
comments. Leave just the newer one, which was intended to replace the old
one.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlastimil Babka [Tue, 9 Feb 2016 23:12:14 +0000 (10:12 +1100)]
tracepoints: move trace_print_flags definitions to tracepoint-defs.h
The following patch will need to declare array of struct trace_print_flags
in a header. To prevent this header from pulling in all of RCU through
trace_events.h, move the struct trace_print_flags{_64} definitions to the
new lightweight tracepoint-defs.h header.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Tue, 9 Feb 2016 23:12:14 +0000 (10:12 +1100)]
mm: filemap: avoid unnecessary calls to lock_page when waiting for IO to complete during a read
In the generic read paths the kernel looks up a page in the page cache and
if it's up to date, it is used. If not, the page lock is acquired to wait
for IO to complete and then check the page. If multiple processes are
waiting on IO, they all serialise against the lock and duplicate the
checks. This is unnecessary.
The page lock in itself does not give any guarantees to the callers about
the page state as it can be immediately truncated or reclaimed after the
page is unlocked. It's sufficient to wait_on_page_locked and then
continue if the page is up to date on wakeup.
It is possible that a truncated but up-to-date page is returned but the
reference taken during read prevents it disappearing underneath the caller
and the data is still valid if PageUptodate.
The overall impact is small as even if processes serialise on the lock,
the lock section is tiny once the IO is complete. Profiles indicated that
unlock_page and friends are generally a tiny portion of a read-intensive
workload. An artificial test was created that had instances of dd access
a cache-cold file on an ext4 filesystem and measure how long the read
took.
Mel Gorman [Tue, 9 Feb 2016 23:12:14 +0000 (10:12 +1100)]
mm: filemap: remove redundant code in do_read_cache_page
do_read_cache_page and __read_cache_page duplicate page filler code when
filling the page for the first time. This patch simply removes the
duplicate logic.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The difference between v1 is that the prot variable is reset to
reqprot for each loop iteration (thanks to Konstantin Khlebnikov for
pointing this out).
rier means "(current->personality & [R]EAD_[I]MPLIES_[E]XEC) &&
(prot & PROT_[R]EAD)".
Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/mprotect.c: don't imply PROT_EXEC on non-exec fs
The mprotect(PROT_READ) fails when called by the READ_IMPLIES_EXEC binary
on a memory mapped file located on non-exec fs. The mprotect does not
check whether fs is _executable_ or not. The PROT_EXEC flag is set
automatically even if a memory mapped file is located on non-exec fs. Fix
it by checking whether a memory mapped file is located on a non-exec fs.
If so the PROT_EXEC is not implied by the PROT_READ. The implementation
uses the VM_MAYEXEC flag set properly in mmap. Now it is consistent with
mmap.
I did the isolated tests (PT_GNU_STACK X/NX, multiple VMAs, X/NX fs). I
also patched the official 3.19.0-47-generic Ubuntu 14.04 kernel and it
seems to work.
Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andreas Ziegler [Tue, 9 Feb 2016 23:12:12 +0000 (10:12 +1100)]
mm: fix two typos in comments for to_vmem_altmap()
Commit 4b94ffdc4163 ("x86, mm: introduce vmem_altmap to augment
vmemmap_populate()"), introduced the to_vmem_altmap() function.
The comments in this function contain two typos (one misspelling of the
Kconfig option CONFIG_SPARSEMEM_VMEMMAP, and one missing letter 'n'),
let's fix them up.
Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
WARNING: 'occuring' may be misspelled - perhaps 'occurring'?
#57: FILE: mm/Kconfig.debug:74:
+ zeros. This makes it harder to detect when errors are occuring
total: 0 errors, 1 warnings, 47 lines checked
./patches/mm-page_poisoningc-allow-for-zero-poisoning.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Laura Abbott <labbott@fedoraproject.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Laura Abbott [Tue, 9 Feb 2016 23:12:12 +0000 (10:12 +1100)]
mm/page_poisoning.c: allow for zero poisoning
By default, page poisoning uses a poison value (0xaa) on free. If this is
changed to 0, the page is not only sanitized but zeroing on alloc with
__GFP_ZERO can be skipped as well. The tradeoff is that detecting
corruption from the poisoning is harder to detect. This feature also
cannot be used with hibernation since pages are not guaranteed to be
zeroed after hibernation.
Credit to Mathias Krause and grsecurity for original work
Signed-off-by: Laura Abbott <labbott@fedoraproject.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mathias Krause <minipli@googlemail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Laura Abbott [Tue, 9 Feb 2016 23:12:11 +0000 (10:12 +1100)]
mm/page_poison.c: enable PAGE_POISONING as a separate option
Page poisoning is currently set up as a feature if architectures don't
have architecture debug page_alloc to allow unmapping of pages. It has
uses apart from that though. Clearing of the pages on free provides an
increase in security as it helps to limit the risk of information leaks.
Allow page poisoning to be enabled as a separate option independent of any
other debug feature. Because of how hiberanation is implemented, the
checks on alloc cannot occur if hibernation is enabled. This option can
also be set on !HIBERNATION as well.
Credit to Mathias Krause and grsecurity for original work
Signed-off-by: Laura Abbott <labbott@fedoraproject.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mathias Krause <minipli@googlemail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ERROR: code indent should use tabs where possible
#251: FILE: mm/page_poison.c:15:
+ if (!buf)$
WARNING: please, no spaces at the start of a line
#251: FILE: mm/page_poison.c:15:
+ if (!buf)$
ERROR: code indent should use tabs where possible
#252: FILE: mm/page_poison.c:16:
+ return -EINVAL;$
WARNING: please, no spaces at the start of a line
#252: FILE: mm/page_poison.c:16:
+ return -EINVAL;$
ERROR: code indent should use tabs where possible
#254: FILE: mm/page_poison.c:18:
+ if (strcmp(buf, "on") == 0)$
WARNING: please, no spaces at the start of a line
#254: FILE: mm/page_poison.c:18:
+ if (strcmp(buf, "on") == 0)$
ERROR: code indent should use tabs where possible
#255: FILE: mm/page_poison.c:19:
+ want_page_poisoning = true;$
WARNING: please, no spaces at the start of a line
#255: FILE: mm/page_poison.c:19:
+ want_page_poisoning = true;$
ERROR: code indent should use tabs where possible
#259: FILE: mm/page_poison.c:23:
+ return 0;$
WARNING: please, no spaces at the start of a line
#259: FILE: mm/page_poison.c:23:
+ return 0;$
WARNING: Prefer [subsystem eg: netdev]_err([subsystem]dev, ... then dev_err(dev, ... then pr_err(... to printk(KERN_ERR ...
#352: FILE: mm/page_poison.c:116:
+ printk(KERN_ERR "pagealloc: single bit error\n");
WARNING: Prefer [subsystem eg: netdev]_err([subsystem]dev, ... then dev_err(dev, ... then pr_err(... to printk(KERN_ERR ...
#354: FILE: mm/page_poison.c:118:
+ printk(KERN_ERR "pagealloc: memory corruption\n");
total: 6 errors, 7 warnings, 311 lines checked
NOTE: Whitespace errors detected.
You may wish to use scripts/cleanpatch or scripts/cleanfile
./patches/mm-debug-pageallocc-split-out-page-poisoning-from-debug-page_alloc.patch has style problems, please review.
NOTE: If any of the errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Please run checkpatch prior to sending patches
Cc: Laura Abbott <labbott@fedoraproject.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Laura Abbott [Tue, 9 Feb 2016 23:12:10 +0000 (10:12 +1100)]
mm/debug-pagealloc.c: split out page poisoning from debug page_alloc
This is an implementation of page poisoning/sanitization for all arches.
It takes advantage of the existing implementation for
!ARCH_SUPPORTS_DEBUG_PAGEALLOC arches. This is a different approach than
what the Grsecurity patches were taking but should provide equivalent
functionality.
For those who aren't familiar with this, the goal of sanitization is to
reduce the severity of use after free and uninitialized data bugs. Memory
is cleared on free so any sensitive data is no longer available.
Discussion of sanitization was brough up in a thread about CVEs
(lkml.kernel.org/g/<20160119112812.GA10818@mwanda>)
I eventually expect Kconfig names will want to be changed and or moved if
this is going to be used for security but that can happen later.
Credit to Mathias Krause for the version in grsecurity
This patch (of 3):
For architectures that do not have debug page_alloc
(!ARCH_SUPPORTS_DEBUG_PAGEALLOC), page poisoning is used instead. Even
architectures that do have DEBUG_PAGEALLOC may want to take advantage of
the poisoning feature. Separate out page poisoning into a separate file.
This does not change the default behavior for
!ARCH_SUPPORTS_DEBUG_PAGEALLOC.
Credit to Mathias Krause and grsecurity for original work
Signed-off-by: Laura Abbott <labbott@fedoraproject.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mathias Krause <minipli@googlemail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/debug_pagealloc: Ask users for default setting of debug_pagealloc
Since commit 031bc5743f158 ("mm/debug-pagealloc: make debug-pagealloc
boottime configurable") CONFIG_DEBUG_PAGEALLOC is by default not adding
any page debugging.
as this behaviour change was not even documented in Kconfig.
Let's provide a new Kconfig symbol that allows to change the default back
to enabled, e.g. for debug kernels. This also makes the change obvious
to kernel packagers.
Let's also change the Kconfig description for CONFIG_DEBUG_PAGEALLOC,
to indicate that there are two stages of overhead.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Calculation of dirty_ratelimit sometimes is not correct.
E.g. initial values of dirty_ratelimit == INIT_BW and step == 0,
lead to the following result:
UBSAN: Undefined behaviour in ../mm/page-writeback.c:1286:7
shift exponent 25600 is too large for 64-bit type 'long unsigned int'
The fix is straightforward - make step 0 if the shift exponent is too big.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Tejun Heo <tj@kernel.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch extends existing "kernelcore" option and introduces
kernelcore=mirror option. By specifying "mirror" instead of specifying
the amount of memory, non-mirrored (non-reliable) region will be arranged
into ZONE_MOVABLE.
Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com> Tested-by: Sudeep Holla <sudeep.holla@arm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Taku Izumi [Tue, 9 Feb 2016 23:12:07 +0000 (10:12 +1100)]
mm/page_alloc.c: calculate zone_start_pfn at zone_spanned_pages_in_node()
Xeon E7 v3 based systems supports Address Range Mirroring and UEFI BIOS
complied with UEFI spec 2.5 can notify which ranges are mirrored
(reliable) via EFI memory map. Now Linux kernel utilize its information
and allocates boot time memory from reliable region.
My requirement is:
- allocate kernel memory from mirrored region
- allocate user memory from non-mirrored region
In order to meet my requirement, ZONE_MOVABLE is useful. By arranging
non-mirrored range into ZONE_MOVABLE, mirrored memory is used for kernel
allocations.
My idea is to extend existing "kernelcore" option and introduces
kernelcore=mirror option. By specifying "mirror" instead of specifying
the amount of memory, non-mirrored region will be arranged into
ZONE_MOVABLE.
Earlier discussions are at:
https://lkml.org/lkml/2015/10/9/24
https://lkml.org/lkml/2015/10/15/9
https://lkml.org/lkml/2015/11/27/18
https://lkml.org/lkml/2015/12/8/836
For example, suppose 2-nodes system with the following memory range:
node 0 [mem 0x0000000000001000-0x000000109fffffff]
node 1 [mem 0x00000010a0000000-0x000000209fffffff]
and the following ranges are marked as reliable (mirrored):
[0x0000000000000000-0x0000000100000000]
[0x0000000100000000-0x0000000180000000]
[0x0000000800000000-0x0000000880000000]
[0x00000010a0000000-0x0000001120000000]
[0x00000017a0000000-0x0000001820000000]
If you specify kernelcore=mirror, ZONE_NORMAL and ZONE_MOVABLE are
arranged like bellow:
In overlapped range, pages to be ZONE_MOVABLE in ZONE_NORMAL are treated
as absent pages, and vice versa.
This patch (of 2):
Currently each zone's zone_start_pfn is calculated at
free_area_init_core(). However zone's range is fixed at the time when
invoking zone_spanned_pages_in_node().
This patch changes how each zone->zone_start_pfn is calculated in
zone_spanned_pages_in_node().
Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Steve Capper <steve.capper@linaro.org> Cc: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:12:06 +0000 (10:12 +1100)]
mm/slub: support left redzone
SLUB already has a redzone debugging feature. But it is only positioned
at the end of object (aka right redzone) so it cannot catch left oob.
Although current object's right redzone acts as left redzone of next
object, first object in a slab cannot take advantage of this effect. This
patch explicitly adds a left red zone to each object to detect left oob
more precisely.
Background:
Someone complained to me that left OOB doesn't catch even if KASAN is
enabled which does page allocation debugging. That page is out of our
control so it would be allocated when left OOB happens and, in this
case, we can't find OOB. Moreover, SLUB debugging feature can be
enabled without page allocator debugging and, in this case, we will
miss that OOB.
Before trying to implement, I expected that changes would be too
complex, but, it doesn't look that complex to me now. Almost changes
are applied to debug specific functions so I feel okay.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:12:06 +0000 (10:12 +1100)]
mm/slab: re-implement pfmemalloc support
Current implementation of pfmemalloc handling in SLAB has some problems.
1) pfmemalloc_active is set to true when there is just one or more
pfmemalloc slabs in the system, but it is cleared when there is no
pfmemalloc slab in one arbitrary kmem_cache. So, pfmemalloc_active
could be wrongly cleared.
2) Search to partial and free list doesn't happen when non-pfmemalloc
object are not found in cpu cache. Instead, allocating new slab
happens and it is not optimal.
3) Even after sk_memalloc_socks() is disabled, cpu cache would keep
pfmemalloc objects tagged with SLAB_OBJ_PFMEMALLOC. It isn't cleared
if sk_memalloc_socks() is disabled so it could cause problem.
4) If cpu cache is filled with pfmemalloc objects, it would cause slow
down non-pfmemalloc allocation.
To me, current pointer tagging approach looks complex and fragile
so this patch re-implement whole thing instead of fixing problems
one by one.
Design principle for new implementation is that
1) Don't disrupt non-pfmemalloc allocation in fast path even if
sk_memalloc_socks() is enabled. It's more likely case than pfmemalloc
allocation.
2) Ensure that pfmemalloc slab is used only for pfmemalloc allocation.
3) Don't consider performance of pfmemalloc allocation in memory
deficiency state.
As a result, all pfmemalloc alloc/free in memory tight state will
be handled in slow-path. If there is non-pfmemalloc free object,
it will be returned first even for pfmemalloc user in fast-path so that
performance of pfmemalloc user isn't affected in normal case and
pfmemalloc objects will be kept as long as possible.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:12:06 +0000 (10:12 +1100)]
fix SLAB_DESTROTY_BY_RCU
Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:12:05 +0000 (10:12 +1100)]
mm/slab: introduce new slab management type, OBJFREELIST_SLAB
SLAB needs a array to manage freed objects in a slab. It is only used if
some objects are freed so we can use free object itself as this array.
This requires additional branch in somewhat critical lock path to check if
it is first freed object or not but that's all we need. Benefits is that
we can save extra memory usage and reduce some computational overhead by
allocating a management array when new slab is created.
Code change is rather complex than what we can expect from the idea, in
order to handle debugging feature efficiently. If you want to see core
idea only, please remove '#if DEBUG' block in the patch.
Although this idea can apply to all caches whose size is larger than
management array size, it isn't applied to caches which have a
constructor. If such cache's object is used for management array,
constructor should be called for it before that object is returned to
user. I guess that overhead overwhelm benefit in that case so this idea
doesn't applied to them at least now.
For summary, from now on, slab management type is determined by
following logic.
1) if management array size is smaller than object size and no ctor, it
becomes OBJFREELIST_SLAB.
2) if management array size is smaller than leftover, it becomes
NORMAL_SLAB which uses leftover as a array.
3) if OFF_SLAB help to save memory than way 4), it becomes OFF_SLAB.
It allocate a management array from the other cache so memory waste
happens.
4) others become NORMAL_SLAB. It uses dedicated internal memory in a
slab as a management array so it causes memory waste.
In my system, without enabling CONFIG_DEBUG_SLAB, Almost caches become
OBJFREELIST_SLAB and NORMAL_SLAB (using leftover) which doesn't waste
memory. Following is the result of number of caches with specific slab
management type.
TOTAL = OBJFREELIST + NORMAL(leftover) + NORMAL + OFF
/Before/
126 = 0 + 60 + 25 + 41
/After/
126 = 97 + 12 + 15 + 2
Result shows that number of caches that doesn't waste memory increase
from 60 to 109.
I did some benchmarking and it looks that benefit are more than loss.
In fact, I tested another idea implementing OBJFREELIST_SLAB with
extendable linked array through another freed object. It can remove
memory waste completely but it causes more computational overhead in
critical lock path and it seems that overhead outweigh benefit. So, this
patch doesn't include it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:12:05 +0000 (10:12 +1100)]
mm/slab: factor out debugging initialization in cache_init_objs()
cache_init_objs() will be changed in following patch and current form
doesn't fit well for that change. So, before doing it, this patch
separates debugging initialization. This would cause two loop iteration
when debugging is enabled, but, this overhead seems too light than debug
feature itself so effect may not be visible. This patch will greatly
simplify changes in cache_init_objs() in following patch.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:12:04 +0000 (10:12 +1100)]
mm/slab: factor out slab list fixup code
Slab list should be fixed up after object is detached from the slab and
this happens at two places. They do exactly same thing. They will be
changed in the following patch, so, to reduce code duplication, this patch
factor out them and make it common function.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:12:04 +0000 (10:12 +1100)]
mm/slab: make criteria for off slab determination robust and simple
To become an off slab, there are some constraints to avoid bootstrapping
problem and recursive call. This can be avoided differently by simply
checking that corresponding kmalloc cache is ready and it's not a off
slab. It would be more robust because static size checking can be
affected by cache size change or architecture type but dynamic checking
isn't.
One check 'freelist_cache->size > cachep->size / 2' is added to check
benefit of choosing off slab, because, now, there is no size constraint
which ensures enough advantage when selecting off slab.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:12:04 +0000 (10:12 +1100)]
mm/slab: do not change cache size if debug pagealloc isn't possible
We can fail to setup off slab in some conditions. Even in this case,
debug pagealloc increases cache size to PAGE_SIZE in advance and it is
waste because debug pagealloc cannot work for it when it isn't the off
slab. To improve this situation, this patch checks first that this cache
with increased size is suitable for off slab. It actually increases cache
size when it is suitable for off-slab, so possible waste is removed.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:12:03 +0000 (10:12 +1100)]
mm/slab: clean up cache type determination
Current cache type determination code is open-code and looks not
understandable. Following patch will introduce one more cache type and it
would make code more complex. So, before it happens, this patch abstracts
these codes.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:12:03 +0000 (10:12 +1100)]
mm/slab: align cache size first before determination of OFF_SLAB candidate
Finding suitable OFF_SLAB candidate is more related to aligned cache size
rather than original size. Same reasoning can be applied to the debug
pagealloc candidate. So, this patch moves up alignment fixup to proper
position. From that point, size is aligned so we can remove some
alignment fixups.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:12:02 +0000 (10:12 +1100)]
mm/slab: put the freelist at the end of slab page
Currently, the freelist is at the front of slab page. This requires extra
space to meet object alignment requirement. If we put the freelist at the
end of slab page, object could start at page boundary and will be at
correct alignment. This is possible because freelist has no alignment
constraint itself.
This gives us two benefits. It removes extra memory space for the
freelist alignment and remove complex calculation at cache initialization
step. I can't think notable drawback here.
I mentioned that this would reduce extra memory space, but, this benefit
is rather theoretical because it can be applied to very few cases.
Following is the example cache type that can get benefit from this change.
before means whole size for objects and aligned freelist before applying
patch and after shows the result of this patch.
Since before is more than 4096, number of object should decrease and
memory waste happens.
Anyway, this patch removes complex calculation so looks beneficial to me.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:12:02 +0000 (10:12 +1100)]
mm/slab: remove object status buffer for DEBUG_SLAB_LEAK
Now, we don't use object status buffer in any setup. Remove it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:12:01 +0000 (10:12 +1100)]
mm/slab: alternative implementation for DEBUG_SLAB_LEAK
DEBUG_SLAB_LEAK is a debug option. It's current implementation requires
status buffer so we need more memory to use it. And, it cause kmem_cache
initialization step more complex.
To remove this extra memory usage and to simplify initialization step,
this patch implement this feature with another way.
When user requests to get slab object owner information, it marks that
getting information is started. And then, all free objects in caches are
flushed to corresponding slab page. Now, we can distinguish all freed
object so we can know all allocated objects, too. After collecting slab
object owner information on allocated objects, mark is checked that there
is no free during the processing. If true, we can be sure that our
information is correct so information is returned to user.
Although this way is rather complex, it has two important benefits
mentioned above. So, I think it is worth changing.
There is one drawback that it takes more time to get slab object owner
information but it is just a debug option so it doesn't matter at all.
To help review, this patch implements new way only. Following patch will
remove useless code.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:12:00 +0000 (10:12 +1100)]
mm/slab: clean up DEBUG_PAGEALLOC processing code
Currently, open code for checking DEBUG_PAGEALLOC cache is spread to some
sites. It makes code unreadable and hard to change. This patch clean-up
these code. Following patch will change the criteria for DEBUG_PAGEALLOC
cache so this clean-up will help it, too.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:12:00 +0000 (10:12 +1100)]
mm/slab: use more appropriate condition check for debug_pagealloc
debug_pagealloc debugging is related to SLAB_POISON flag rather than
FORCED_DEBUG option, although FORCED_DEBUG option will enable SLAB_POISON.
Fix it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:12:00 +0000 (10:12 +1100)]
mm/slab: activate debug_pagealloc in SLAB when it is actually enabled
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:11:59 +0000 (10:11 +1100)]
mm/slab: remove the checks for slab implementation bug
Some of "#if DEBUG" are for reporting slab implementation bug rather than
user usecase bug. It's not really needed because slab is stable for a
quite long time and it makes code too dirty. This patch remove it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:11:59 +0000 (10:11 +1100)]
mm/slab: remove useless structure define
It is obsolete so remove it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joonsoo Kim [Tue, 9 Feb 2016 23:11:58 +0000 (10:11 +1100)]
mm/slab: fix stale code comment
This patchset implements new freed object management way, that is,
OBJFREELIST_SLAB. Purpose of it is to reduce memory overhead in SLAB.
SLAB needs a array to manage freed objects in a slab. If there is
leftover after objects are packed into a slab, we can use it as a
management array, and, in this case, there is no memory waste. But, in
the other cases, we need to allocate extra memory for a management array
or utilize dedicated internal memory in a slab for it. Both cases causes
memory waste so it's not good.
With this patchset, freed object itself can be used for a management
array. So, memory waste could be reduced. Detailed idea and numbers are
described in last patch's commit description. Please refer it.
In fact, I tested another idea implementing OBJFREELIST_SLAB with
extendable linked array through another freed object. It can remove
memory waste completely but it causes more computational overhead in
critical lock path and it seems that overhead outweigh benefit. So, this
patchset doesn't include it. I will attach prototype just for a
reference.
This patch (of 16):
We use freelist_idx_t type for free object management whose size would be
smaller than size of unsigned int. Fix it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fixup trivial spelling errors, noticed while reading the code.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch introduce a new API call kfree_bulk() for bulk freeing memory
objects not bound to a single kmem_cache.
Christoph pointed out that it is possible to implement freeing of objects,
without knowing the kmem_cache pointer as that information is available
from the object's page->slab_cache. Proposing to remove the kmem_cache
argument from the bulk free API.
Jesper demonstrated that these extra steps per object comes at a
performance cost. It is only in the case CONFIG_MEMCG_KMEM is compiled in
and activated runtime that these steps are done anyhow. The extra cost is
most visible for SLAB allocator, because the SLUB allocator does the page
lookup (virt_to_head_page()) anyhow.
Thus, the conclusion was to keep the kmem_cache free bulk API with a
kmem_cache pointer, but we can still implement a kfree_bulk() API fairly
easily. Simply by handling if kmem_cache_free_bulk() gets called with a
kmem_cache NULL pointer.
This does increase the code size a bit, but implementing a separate
kfree_bulk() call would likely increase code size even more.
Below benchmarks cost of alloc+free (obj size 256 bytes) on
CPU i7-4790K @ 4.00GHz, no PREEMPT and CONFIG_MEMCG_KMEM=y.
Code size increase for SLAB:
add/remove: 0/0 grow/shrink: 1/0 up/down: 74/0 (74)
function old new delta
kmem_cache_free_bulk 660 734 +74
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch implements the free side of bulk API for the SLAB allocator
kmem_cache_free_bulk(), and concludes the implementation of optimized bulk
API for SLAB allocator.
Benchmarked[1] cost of alloc+free (obj size 256 bytes) on CPU i7-4790K @
4.00GHz, with no debug options, no PREEMPT and CONFIG_MEMCG_KMEM=y but no
active user of kmemcg.
SLAB single alloc+free cost: 87 cycles(tsc) 21.814 ns with this
optimized config.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
slab: avoid running debug SLAB code with IRQs disabled for alloc_bulk
Move the call to cache_alloc_debugcheck_after() outside the IRQ disabled
section in kmem_cache_alloc_bulk().
When CONFIG_DEBUG_SLAB is disabled the compiler should remove this code.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This patch implements the alloc side of bulk API for the SLAB allocator.
Further optimization are still possible by changing the call to
__do_cache_alloc() into something that can return multiple objects. This
optimization is left for later, given end results already show in the area
of 80% speedup.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
slab: use slab_post_alloc_hook in SLAB allocator shared with SLUB
Reviewers notice that the order in slab_post_alloc_hook() of
kmemcheck_slab_alloc() and kmemleak_alloc_recursive() gets swapped
compared to slab.c / SLAB allocator.
Also notice memset now occurs before calling kmemcheck_slab_alloc()
and kmemleak_alloc_recursive().
I assume this reordering of kmemcheck, kmemleak and memset is okay because
this is the order they are used by the SLUB allocator.
This patch completes the sharing of alloc_hook's between SLUB and SLAB.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: kmemcheck skip object if slab allocation failed
In the SLAB allocator kmemcheck_slab_alloc() is guarded against being
called in case the object is NULL. In SLUB allocator this NULL pointer
invocation can happen, which seems like an oversight.
Move the NULL pointer check into kmemcheck code (kmemcheck_slab_alloc) so
the check gets moved out of the fastpath, when not compiled with
CONFIG_KMEMCHECK.
This is a step towards sharing post_alloc_hook between SLUB and SLAB,
because slab_post_alloc_hook() does not perform this check before calling
kmemcheck_slab_alloc().
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Second fix is for correct masking of allowed GFP flags (gfp_allowed_mask),
in SLAB allocator. This triggered a WARN, by percpu_init_late ->
pcpu_mem_zalloc invoking kzalloc with GFP_KERNEL flags. The linux-next
commit needing this fix is a1fd55538cae ("slab: use
slab_pre_alloc_hook in SLAB allocator shared with SLUB").
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
slab: use slab_pre_alloc_hook in SLAB allocator shared with SLUB
Deduplicate code in SLAB allocator functions slab_alloc() and
slab_alloc_node() by using the slab_pre_alloc_hook() call, which is now
shared between SLUB and SLAB.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
First fix is for compiling with CONFIG_FAILSLAB. The linux-next commit
needing this fix is 074b6f53c320 ("mm: fault-inject take over
bootstrap kmem_cache check").
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: fault-inject take over bootstrap kmem_cache check
Remove the SLAB specific function slab_should_failslab(), by moving the
check against fault-injection for the bootstrap slab, into the shared
function should_failslab() (used by both SLAB and SLUB).
This is a step towards sharing alloc_hook's between SLUB and SLAB.
This bootstrap slab "kmem_cache" is used for allocating struct kmem_cache
objects to the allocator itself.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/slab: move SLUB alloc hooks to common mm/slab.h
First step towards sharing alloc_hook's between SLUB and SLAB allocators.
Move the SLUB allocators *_alloc_hook to the common mm/slab.h for internal
slab definitions.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
slub: cleanup code for kmem cgroup support to kmem_cache_free_bulk
This change is primarily an attempt to make it easier to realize the
optimizations the compiler performs in-case CONFIG_MEMCG_KMEM is not
enabled.
Performance wise, even when CONFIG_MEMCG_KMEM is compiled in, the overhead
is zero. This is because, as long as no process have enabled kmem cgroups
accounting, the assignment is replaced by asm-NOP operations. This is
possible because memcg_kmem_enabled() uses a static_key_false() construct.
It also helps readability as it avoid accessing the p[] array like: p[size
- 1] which "expose" that the array is processed backwards inside helper
function build_detached_freelist().
Lastly this also makes the code more robust, in error case like passing
NULL pointers in the array. Which were previously handled before commit 033745189b1b ("slub: add missing kmem cgroup support to
kmem_cache_free_bulk").
Fixes: 033745189b1b ("slub: add missing kmem cgroup support to kmem_cache_free_bulk") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Josh Hunt [Tue, 9 Feb 2016 23:11:53 +0000 (10:11 +1100)]
block: restore /proc/partitions to not display non-partitionable removable devices
We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before.
Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduces this change in behavior. It's not
clear to me from the commit's changelog if this change was intentional or
not. This comment still remains: /* Don't show non-partitionable
removeable devices or empty devices */ so I've decided to send a patch to
restore the behavior of not printing unpartitionable removable devices.
Signed-off-by: Josh Hunt <johunt@akamai.com> Cc: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
San Mehat [Tue, 9 Feb 2016 23:11:52 +0000 (10:11 +1100)]
block: partition: add partition specific uevent callbacks for partition info
This patch has been carried in the Android tree for quite some time and is
one of the few patches required to get a mainline kernel up and running
with an exsiting Android userspace. So I wanted to submit it for review
and consideration if it should be merged.
For partitions, add new uevent parameters 'PARTN' which specifies the
partitions index in the table, and 'PARTNAME', which specifies PARTNAME
specifices the partition name of a partition device.
Android's userspace uses this for creating device node links from the
partition name and number: ie:
/dev/block/platform/soc/by-name/system
or
/dev/block/platform/soc/by-num/p1
One can see its usage here:
https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#355
and
https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#494
[john.stultz@linaro.org: dropped NPARTS and reworded commit message for context] Signed-off-by: Dima Zavin <dima@android.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Rom Lemarchand <romlem@google.com> Cc: Android Kernel Team <kernel-team@android.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: <harald@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kay Sievers <kay@vrfy.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: xuejiufei <xuejiufei@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
xuejiufei [Tue, 9 Feb 2016 23:11:51 +0000 (10:11 +1100)]
ocfs2/dlm: move lock to the tail of grant queue while doing in-place convert
We have found a bug when two nodes doing umount one after another.
1) Node 1 migrate a lockres that has 3 locks in grant queue such as
N2(PR)<->N3(NL)<->N4(PR) to N2. After migration, lvb of the lock
N3(NL) and N4(PR) are empty on node 2 because migration target do not
copy lvb to these two lock.
2) Node 3 want to convert to PR, it can be granted in
__dlmconvert_master(), and the order of these locks is unchanged. The
lvb of the lock N3(PR) on node 2 is copyed from lockres in function
dlm_update_lvb() while the lvb of lock N4(PR) is still empty.
3) Node 2 want to leave domain, it will migrate this lockres to node 3.
Then node 2 will trigger the BUG in dlm_prepare_lvb_for_migration()
when adding the lock N4(PR) to mres with the following message because
the lvb of mres is already copied from lock N3(PR), but the lvb of lock
N4(PR) is empty.
"Mismatched lvb in lock cookie=%u:%llu, name=%.*s, node=%u"
Signed-off-by: xuejiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Tue, 9 Feb 2016 23:11:50 +0000 (10:11 +1100)]
ocfs2: o2hb: fix hb hung time
hr_last_timeout_start should be set as the last time where hb is still OK.
When hb write timeout, hung time will be (jiffies -
hr_last_timeout_start).
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Tue, 9 Feb 2016 23:11:50 +0000 (10:11 +1100)]
ocfs2: o2hb: don't negotiate if last hb fail
Sometimes io error is returned when storage is down for a while. Like for
iscsi device, stroage is made offline when session timeout, and this will
make all io return -EIO. For this case, nodes shouldn't do negotiate
timeout but should fence self. So let nodes fence self when
o2hb_do_disk_heartbeat return an error, this is the same behavior with
o2hb without negotiate timer.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Tue, 9 Feb 2016 23:11:50 +0000 (10:11 +1100)]
ocfs2: o2hb: add some user/debug log
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Tue, 9 Feb 2016 23:11:49 +0000 (10:11 +1100)]
ocfs2: o2hb: add NEGOTIATE_APPROVE message
This message is used to re-queue write timeout timer and negotiate timer
when all nodes suffer a write hung to storage, this makes node not fence
self if storage down.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Tue, 9 Feb 2016 23:11:49 +0000 (10:11 +1100)]
ocfs2: o2hb: add NEGO_TIMEOUT message
This message is sent to master node when non-master nodes's negotiate
timer expired. Master node records these nodes in a bitmap which is used
to do write timeout timer re-queue decision.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Tue, 9 Feb 2016 23:11:49 +0000 (10:11 +1100)]
ocfs2: o2hb: add negotiate timer
This series of patches is to fix the issue that when storage down, all
nodes will fence self due to write timeout.
With this patch set, all nodes will keep going until storage back online,
except if the following issue happens, then all nodes will do as before to
fence self.
1. io error got
2. network between nodes down
3. nodes panic
This patch (of 6):
When storage down, all nodes will fence self due to write timeout. The
negotiate timer is designed to avoid this, with it node will wait until
storage up again.
Negotiate timer working in the following way:
1. The timer expires before write timeout timer, its timeout is half
of write timeout now. It is re-queued along with write timeout timer.
If expires, it will send NEGO_TIMEOUT message to master node(node with
lowest node number). This message does nothing but marks a bit in a
bitmap recording which nodes are negotiating timeout on master node.
2. If storage down, nodes will send this message to master node, then
when master node finds its bitmap including all online nodes, it sends
NEGO_APPROVL message to all nodes one by one, this message will
re-queue write timeout timer and negotiate timer. For any node doesn't
receive this message or meets some issue when handling this message, it
will be fenced. If storage up at any time, o2hb_thread will run and
re-queue all the timer, nothing will be affected by these two steps.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Ryan Ding <ryan.ding@oracle.com> Cc: Gang He <ghe@suse.com> Cc: rwxybh <rwxybh@126.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gang He [Tue, 9 Feb 2016 23:11:48 +0000 (10:11 +1100)]
ocfs2: add feature document for online file check
This document will describe OCFS2 online file check feature. OCFS2 is
often used in high-availaibility systems. However, OCFS2 usually converts
the filesystem to read-only when encounters an error. This may not be
necessary, since turning the filesystem read-only would affect other
running processes as well, decreasing availability.
Then, a mount option (errors=continue) is introduced, which would return
the -EIO errno to the calling process and terminate furhter processing so
that the filesystem is not corrupted further. The filesystem is not
converted to read-only, and the problematic file's inode number is
reported in the kernel log. The user can try to check/fix this file via
online filecheck feature.
Signed-off-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gang He [Tue, 9 Feb 2016 23:11:48 +0000 (10:11 +1100)]
ocfs2: check/fix inode block for online file check
Implement online check or fix inode block during reading a inode block to
memory.
Signed-off-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gang He [Tue, 9 Feb 2016 23:11:47 +0000 (10:11 +1100)]
ocfs2: create/remove sysfile for online file check
Create online file check sysfile when ocfs2 mount, remove the related
sysfile when ocfs2 umount.
Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gang He [Tue, 9 Feb 2016 23:11:47 +0000 (10:11 +1100)]
ocfs2: sysfile interfaces for online file check
Implement online file check sysfile interfaces, e.g. how to create the
related sysfile according to device name, how to display/handle file check
request from the sysfile.
Signed-off-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gang He [Tue, 9 Feb 2016 23:11:47 +0000 (10:11 +1100)]
ocfs2: export ocfs2_kset for online file check
When there are errors in the ocfs2 filesystem,
they are usually accompanied by the inode number which caused the error.
This inode number would be the input to fixing the file.
One of these options could be considered:
A file in the sys filesytem which would accept inode numbers.
This could be used to communication back what has to be fixed or is fixed.
You could write:
Compare with second version, I re-design filecheck sysfs interfaces, there
are three sysfs files (check, fix and set) under filecheck directory (see
above), sysfs will accept only one argument <inode>. Second, I adjust
some code in ocfs2_filecheck_repair_inode_block() function according to
upstream feedback, we cannot just add VALID_FL flag back as a inode block
fix, then we will not fix this field corruption currently until having a
complete solution. Compare with first version, I use strncasecmp instead
of double strncmp functions. Second, update the source file contribution
vendor.
This patch (of 4):
Export ocfs2_kset object from ocfs2_stackglue kernel module, then
online file check code will create the related sysfiles under
ocfs2_kset object. We're exporting this because it's built in
ocfs2_stackglue.ko.
Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
jiangyiwen [Tue, 9 Feb 2016 23:11:46 +0000 (10:11 +1100)]
ocfs2: solve a problem of crossing the boundary in updating backups
In update_backups() there exists a problem of crossing the boundary
as follows:
we assume that lun will be resized to 1TB(cluster_size is 32kb),
it will include 0~33554431 cluster, in update_backups func,
it will backup super block in location of 1TB which is the 33554432th
cluster, so the phenomenon of crossing the boundary happens.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
jiangyiwen [Tue, 9 Feb 2016 23:11:46 +0000 (10:11 +1100)]
ocfs2: fix occurring deadlock by changing ocfs2_wq from global to local
This patch fixes a deadlock, as follows:
Node 1 Node 2 Node 3
1)volume a and b are only mount vol a only mount vol b
mounted
2) start to mount b start to mount a
3) check hb of Node 3 check hb of Node 2
in vol a, qs_holds++ in vol b, qs_holds++
4) -------------------- all nodes' network down --------------------
5) progress of mount b the same situation as
failed, and then call Node 2
ocfs2_dismount_volume.
but the process is hung,
since there is a work
in ocfs2_wq cannot beo
completed. This work is
about vol a, because
ocfs2_wq is global wq.
BTW, this work which is
scheduled in ocfs2_wq is
ocfs2_orphan_scan_work,
and the context in this work
needs to take inode lock
of orphan_dir, because
lockres owner are Node 1 and
all nodes' nework has been down
at the same time, so it can't
get the inode lock.
6) Why can't this node be fenced
when network disconnected?
Because the process of
mount is hung what caused qs_holds
is not equal 0.
Because all works in the ocfs2_wq are relative to the super block.
The solution is to change the ocfs2_wq from global to local. In other
words, move it into struct ocfs2_super.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xue jiufei [Tue, 9 Feb 2016 23:11:45 +0000 (10:11 +1100)]
ocfs2: extend enough credits for freeing one truncate record while replaying truncate records
Now function ocfs2_replay_truncate_records() first modifies tl_used, then
calls ocfs2_extend_trans() to extend transactions for gd and alloc inode
used for freeing clusters. jbd2_journal_restart() may be called and it
may happen that tl_used in truncate log is decreased but the clusters are
not freed, which means these clusters are lost. So we should avoid
extending transactions in these two operations.
Signed-off-by: joyce.xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xue jiufei [Tue, 9 Feb 2016 23:11:45 +0000 (10:11 +1100)]
ocfs2: extend transaction for ocfs2_remove_rightmost_path() and ocfs2_update_edge_lengths() before to avoid inconsistency between inode and et
I found that jbd2_journal_restart() is called in some places without
keeping things consistently before. However, jbd2_journal_restart() may
commit the handle's transaction and restart another one. If the first
transaction is committed successfully while another not, it may cause
filesystem inconsistency or read only. This is an effort to fix this kind
of problems.
This patch (of 3):
The following functions will be called while truncating an extent:
ocfs2_remove_btree_range
-> ocfs2_start_trans
-> ocfs2_remove_extent
-> ocfs2_truncate_rec
-> ocfs2_extend_rotate_transaction
-> jbd2_journal_restart if jbd2_journal_extend fail
-> ocfs2_rotate_tree_left
-> ocfs2_remove_rightmost_path
-> ocfs2_extend_rotate_transaction
-> ocfs2_unlink_subtree
-> ocfs2_update_edge_lengths
-> ocfs2_extend_trans
-> jbd2_journal_restart if jbd2_journal_extend fail
-> ocfs2_et_update_clusters
-> ocfs2_commit_trans
jbd2_journal_restart() may be called and it may happened that the buffers
dirtied in ocfs2_truncate_rec() are committed while buffers dirtied in
ocfs2_et_update_clusters() are not, the total clusters on extent tree and
i_clusters in ocfs2_dinode is inconsistency. So the clusters got from
ocfs2_dinode is incorrect, and it also cause read-only problem when call
ocfs2_commit_truncate() with the error message: "Inode %llu has empty
extent block at %llu".
We should extend enough credits for function ocfs2_remove_rightmost_path
and ocfs2_update_edge_lengths to avoid this inconsistency.
Signed-off-by: joyce.xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joseph Qi [Tue, 9 Feb 2016 23:11:45 +0000 (10:11 +1100)]
ocfs2/dlm: fix BUG in dlm_move_lockres_to_recovery_list
When master handles convert request, it queues ast first and then returns
status. This may happen that the ast is sent before the request status
because the above two messages are sent by two threads. And right after
the ast is sent, if master down, it may trigger BUG in
dlm_move_lockres_to_recovery_list in the requested node because ast
handler moves it to grant list without clear lock->convert_pending. So
remove BUG_ON statement and check if the ast is processed in
dlmconvert_remote.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Tariq Saeed <tariq.x.saeed@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joseph Qi [Tue, 9 Feb 2016 23:11:43 +0000 (10:11 +1100)]
ocfs2/dlm: fix race between convert and recovery
There is a race window between dlmconvert_remote and
dlm_move_lockres_to_recovery_list, which will cause a lock with
OCFS2_LOCK_BUSY in grant list, thus system hangs.
status = dlm_send_remote_convert_request();
>>>>>> race window, master has queued ast and return DLM_NORMAL,
and then down before sending ast.
this node detects master down and calls
dlm_move_lockres_to_recovery_list, which will revert the
lock to grant list.
Then OCFS2_LOCK_BUSY won't be cleared as new master won't
send ast any more because it thinks already be authorized.
In this case, check if res->state has DLM_LOCK_RES_RECOVERING bit set (res
is still in recovering) or res master changed (new master has finished
recovery), reset the status to DLM_RECOVERING, then it will retry convert.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Tariq Saeed <tariq.x.saeed@oracle.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:43 +0000 (10:11 +1100)]
ocfs2: fix a deadlock issue in ocfs2_dio_end_io_write()
Should call ocfs2_free_alloc_context() to free meta_ac & data_ac before
calling ocfs2_run_deallocs(). Because ocfs2_run_deallocs() will acquire
the system inode's i_mutex hold by meta_ac. So try to release the lock
before ocfs2_run_deallocs().
Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.") Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Acked-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:43 +0000 (10:11 +1100)]
ocfs2: fix disk file size and memory file size mismatch
When doing append direct write in an already allocated cluster, and fast
path in ocfs2_dio_get_block() is triggered, function
ocfs2_dio_end_io_write() will be skipped as there is no context allocated.
So disk file size will not be changed as it should be. The solution is
to skip fast path when we are about to change file size.
Fixes: af1310367f41 ("ocfs2: fix sparse file & data ordering issue in direct io.") Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Acked-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:42 +0000 (10:11 +1100)]
ocfs2: take ip_alloc_sem in ocfs2_dio_get_block & ocfs2_dio_end_io_write
Take ip_alloc_sem to prevent concurrent access to extent tree, which may
cause the extent tree in an unstable state.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Ryan Ding <ryan.ding@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:41 +0000 (10:11 +1100)]
ocfs2: fix ip_unaligned_aio deadlock with dio work queue
In the current implementation of unaligned aio+dio, lock order behave as
follow:
in user process context:
-> call io_submit()
-> get i_mutex
<== window1
-> get ip_unaligned_aio
-> submit direct io to block device
-> release i_mutex
-> io_submit() return
in dio work queue context(the work queue is created in __blockdev_direct_IO):
-> release ip_unaligned_aio
<== window2
-> get i_mutex
-> clear unwritten flag & change i_size
-> release i_mutex
There is a limitation to the thread number of dio work queue. 256 at
default. If all 256 thread are in the above 'window2' stage, and there is
a user process in the 'window1' stage, the system will became deadlock.
Since the user process hold i_mutex to wait ip_unaligned_aio lock, while
there is a direct bio hold ip_unaligned_aio mutex who is waiting for a dio
work queue thread to be schedule. But all the dio work queue thread is
waiting for i_mutex lock in 'window2'.
This case only happened in a test which send a large number(more than 256)
of aio at one io_submit() call.
My design is to remove ip_unaligned_aio lock. Change it to a sync io
instead. Just like ip_unaligned_aio lock, serialize the unaligned aio
dio.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Tue, 9 Feb 2016 23:11:41 +0000 (10:11 +1100)]
ocfs2-code-clean-up-for-direct-io-fix
remove unused label
Reported-by: kbuild test robot <fengguang.wu@intel.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Ryan Ding <ryan.ding@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:41 +0000 (10:11 +1100)]
ocfs2: code clean up for direct io
Clean up ocfs2_file_write_iter & ocfs2_prepare_inode_for_write:
* remove append dio check: it will be checked in ocfs2_direct_IO()
* remove file hole check: file hole is supported for now
* remove inline data check: it will be checked in ocfs2_direct_IO()
* remove the full_coherence check when append dio: we will get the inode_lock
in ocfs2_dio_get_block, there is no need to fall back to buffer io to ensure
the coherence semantics.
Now the drop dio procedure is gone. :)
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:40 +0000 (10:11 +1100)]
ocfs2: fix sparse file & data ordering issue in direct io
There are mainly 3 issue in the direct io code path after commit 24c40b329e03 ("ocfs2: implement ocfs2_direct_IO_write"):
* Do not support sparse file.
* Do not support data ordering. eg: when write to a file hole, it will alloc
extent first. If system crashed before io finished, data will corrupt.
* Potential risk when doing aio+dio. The -EIOCBQUEUED return value is likely
to be ignored by ocfs2_direct_IO_write().
To resolve above problems, re-design direct io code with following ideas:
* Use buffer io to fill in holes. And this will make better performance also.
* Clear unwritten after direct write finished. So we can make sure meta data
changes after data write to disk. (Unwritten extent is invisible to user,
from user's view, meta data is not changed when allocate an unwritten
extent.)
* Clear ocfs2_direct_IO_write(). Do all ending work in end_io.
This patch has passed fs,dio,ltp-aiodio.part1,ltp-aiodio.part2,ltp-aiodio.part4
test cases of ltp.
For performance improvement, see following test result:
ocfs2 cluster size 1MB, ocfs2 volume is mounted on /mnt/.
The original way:
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB) copied, 1707.83 s, 2.5 MB/s
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct
16384+0 records in
16384+0 records out 4294967296 bytes (4.3 GB) copied, 582.705 s, 7.4 MB/s
After this patch:
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=4K count=1048576 oflag=direct 1048576+0 records in 1048576+0 records out 4294967296 bytes (4.3 GB) copied, 64.6412 s, 66.4 MB/s
+ rm /mnt/test.img -f
+ dd if=/dev/zero of=/mnt/test.img bs=256K count=16384 oflag=direct
16384+0 records in
16384+0 records out 4294967296 bytes (4.3 GB) copied, 34.7611 s, 124 MB/s
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:40 +0000 (10:11 +1100)]
ocfs2: record UNWRITTEN extents when populate write desc
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
There is still one issue in the direct write procedure.
phase 1: alloc extent with UNWRITTEN flag
phase 2: submit direct data to disk, add zero page to page cache
phase 3: clear UNWRITTEN flag when data has been written to disk
When there are 2 direct write A(0~3KB),B(4~7KB) writing to the same
cluster 0~7KB (cluster size 8KB). Write request A arrive phase 2 first,
it will zero the region (4~7KB). Before request A enter to phase 3,
request B arrive phase 2, it will zero region (0~3KB). This is just like
request B steps request A.
To resolve this issue, we should let request B knows this cluster is already
under zero, to prevent it from steps the previous write request.
This patch will add function ocfs2_unwritten_check() to do this job. It
will record all clusters that are under direct write(it will be recorded
in the 'ip_unwritten_list' member of inode info), and prevent the later
direct write writing to the same cluster to do the zero work again.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:39 +0000 (10:11 +1100)]
ocfs2: return the physical address in ocfs2_write_cluster
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
Direct io needs to get the physical address from write_begin, to map the
user page. This patch is to change the arg 'phys' of ocfs2_write_cluster
to a pointer, so it can be retrieved to write_begin. And we can retrieve
it to the direct io procedure.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:39 +0000 (10:11 +1100)]
ocfs2: do not change i_size in write_end for direct io
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
Append direct io do not change i_size in get block phase. It only move to
orphan when starting write. After data is written to disk, it will delete
itself from orphan and update i_size. So skip i_size change section in
write_begin for direct io.
And when there is no extents alloc, no meta data changes needed for direct
io (since write_begin start trans for 2 reason: alloc extents & change
i_size. Now none of them needed). So we can skip start trans procedure.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:39 +0000 (10:11 +1100)]
ocfs2: test target page before change it
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
Direct io data will not appear in buffer. The w_target_page member will
not be filled by direct io. So avoid to use it when it's NULL. Unlinke
buffer io and mmap, direct io will call write_begin with more than 1 page
a time. So the target_index is not sufficient to describe the actual
data. change it to a range start at target_index, end in end_index.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:38 +0000 (10:11 +1100)]
ocfs2: use c_new to indicate newly allocated extents
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
There is a problem in ocfs2's direct io implement: if system crashed after
extents allocated, and before data return, we will get a extent with dirty
data on disk. This problem violate the journal=order semantics, which
means meta changes take effect after data written to disk. To resolve
this issue, direct write can use the UNWRITTEN flag to describe a extent
during direct data writeback. The direct write procedure should act in
the following order:
phase 1: alloc extent with UNWRITTEN flag
phase 2: submit direct data to disk, add zero page to page cache
phase 3: clear UNWRITTEN flag when data has been written to disk
This patch is to change the 'c_unwritten' member of
ocfs2_write_cluster_desc to 'c_clear_unwritten'. Means whether to clear
the unwritten flag. It do not care if a extent is allocated or not. And
use 'c_new' to specify a newly allocated extent. So the direct io
procedure can use c_clear_unwritten to control the UNWRITTEN bit on
extent.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryan Ding [Tue, 9 Feb 2016 23:11:38 +0000 (10:11 +1100)]
ocfs2: add ocfs2_write_type_t type to identify the caller of write
Patchset: fix ocfs2 direct io code patch to support sparse file and data
ordering semantics
The idea is to use buffer io(more precisely use the interface
ocfs2_write_begin_nolock & ocfs2_write_end_nolock) to do the zero work
beyond block size. And clear UNWRITTEN flag until direct io data has been
written to disk, which can prevent data corruption when system crashed
during direct write.
And we will also archive a better performance:
eg. dd direct write new file with block size 4KB:
before this patchset:
2.5 MB/s
after this patchset:
66.4 MB/s
This patch (of 8):
To support direct io in ocfs2_write_begin_nolock & ocfs2_write_end_nolock.
Remove unused args filp & flags. Add new arg type. The type is one of
buffer/direct/mmap. Indicate 3 way to perform write. buffer/mmap type
has implemented. direct type will be implemented later.
Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
xuejiufei [Tue, 9 Feb 2016 23:11:37 +0000 (10:11 +1100)]
ocfs2/dlm: return EINVAL when the lockres on migration target is in DROPPING_REF state
If master migrate this lock resource to node when it happened to purge it,
a new lock resource will be created and inserted into hash list. If then
master goes down, the lock resource being purged is recovered, so there
exist two lock resource with different owner. So return error to master
if the lock resource is in DROPPING state, master will retry to migrate
this lock resource.
Signed-off-by: xuejiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>