Tang Chen [Mon, 16 Dec 2013 23:45:10 +0000 (10:45 +1100)]
x86, numa, acpi, memory-hotplug: make movable_node have higher priority
If users specify the original movablecore=nn@ss boot option, the kernel
will arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot
option is similar except it specifies ZONE_NORMAL ranges.
Now, if users specify "movable_node" on the kernel command line, the kernel
will arrange hotpluggable memory in SRAT as ZONE_MOVABLE. If users do
this, all other movablecore=nn@ss and kernelcore=nn@ss options are
ignored.
For those who don't want this, just specify nothing. The kernel will act
as before.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Chen Tang <imtangchen@gmail.com> Cc: Gong Chen <gong.chen@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Liu Jiang <jiang.liu@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Renninger <trenn@suse.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
WARNING: line over 80 characters
#83: FILE: include/linux/memblock.h:83:
+static inline bool memblock_is_hotpluggable(struct memblock_region *m){ return false; }
ERROR: space required before the open brace '{'
#83: FILE: include/linux/memblock.h:83:
+static inline bool memblock_is_hotpluggable(struct memblock_region *m){ return false; }
total: 1 errors, 1 warnings, 67 lines checked
./patches/memblock-mem_hotplug-make-memblock-skip-hotpluggable-regions-if-needed.patch has style problems, please review.
If any of these errors are false positives, please report
them to the maintainer, see CHECKPATCH in MAINTAINERS.
Tang Chen [Mon, 16 Dec 2013 23:45:09 +0000 (10:45 +1100)]
memblock, mem_hotplug: make memblock skip hotpluggable regions if needed
The Linux kernel cannot migrate pages used by the kernel. As a result,
hotpluggable memory used by the kernel cannot be hot-removed.
To solve this problem, the basic idea is to prevent memblock from
allocating hotpluggable memory for the kernel early in boot, and to arrange
all hotpluggable memory in the ACPI SRAT (System Resource Affinity Table) as
ZONE_MOVABLE when initializing zones.
In the previous patches, we have marked hotpluggable memory regions with
MEMBLOCK_HOTPLUG flag in memblock.memory.
In this patch, we make memblock skip these hotpluggable memory regions in
the default top-down allocation function if the movable_node boot option is
specified.
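As a rough illustration of the skipping described above, the following is a minimal user-space sketch, not the kernel's implementation. The struct, the flag value and the function name are hypothetical stand-ins for memblock's types, and the sketch ignores reserved ranges and alignment.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the kernel's memblock types. */
#define SKETCH_HOTPLUG 0x1UL

struct sketch_region {
	uint64_t base;
	uint64_t size;
	unsigned long flags;
};

/*
 * Walk the regions from the highest address down and return the base of
 * the topmost block of `size` bytes that fits in a single region. When
 * movable_node is true, regions flagged SKETCH_HOTPLUG are skipped.
 * Returns 0 if nothing fits. Regions are assumed to be sorted by
 * ascending base address.
 */
uint64_t sketch_alloc_top_down(const struct sketch_region *r, size_t n,
			       uint64_t size, bool movable_node)
{
	for (size_t i = n; i-- > 0; ) {
		if (movable_node && (r[i].flags & SKETCH_HOTPLUG))
			continue;
		if (r[i].size >= size)
			return r[i].base + r[i].size - size;
	}
	return 0;
}
```

With one ordinary low region and one flagged high region, the sketch returns an address in the high region by default and falls back to the low region once movable_node is set.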
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Chen Tang <imtangchen@gmail.com> Cc: Gong Chen <gong.chen@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Liu Jiang <jiang.liu@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Renninger <trenn@suse.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Mon, 16 Dec 2013 23:45:09 +0000 (10:45 +1100)]
acpi, numa, mem_hotplug: mark all nodes the kernel resides un-hotpluggable
Very early in boot, the kernel has to use some memory, for example to load
the kernel image. We cannot prevent this. So any node the kernel
resides in should be un-hotpluggable.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Chen Tang <imtangchen@gmail.com> Cc: Gong Chen <gong.chen@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Liu Jiang <jiang.liu@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Renninger <trenn@suse.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Mon, 16 Dec 2013 23:45:09 +0000 (10:45 +1100)]
acpi, numa, mem_hotplug: mark hotpluggable memory in memblock
When parsing SRAT, we know which memory areas are hotpluggable. So we
invoke the function memblock_mark_hotplug(), introduced by the previous
patch, to mark hotpluggable memory in memblock.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Chen Tang <imtangchen@gmail.com> Cc: Gong Chen <gong.chen@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Liu Jiang <jiang.liu@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Renninger <trenn@suse.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Mon, 16 Dec 2013 23:45:08 +0000 (10:45 +1100)]
memblock, mem_hotplug: introduce MEMBLOCK_HOTPLUG flag to mark hotpluggable regions
In find_hotpluggable_memory, once we find a memory region which is
hotpluggable, we want to mark it in memblock.memory, so that we can later
tell the memblock allocator not to allocate hotpluggable memory for the
kernel.
To achieve this goal, we introduce MEMBLOCK_HOTPLUG flag to indicate the
hotpluggable memory regions in memblock and a function
memblock_mark_hotplug() to mark hotpluggable memory if we find one.
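The marking step can be pictured with a small user-space sketch. Everything here is an assumption for illustration: the struct, the flag value and sketch_mark_hotplug() are hypothetical, and unlike a real implementation this version only marks regions that lie entirely inside the given range instead of splitting partially overlapping ones.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HOTPLUG 0x1UL	/* hypothetical flag value */

struct sketch_region {
	uint64_t base;
	uint64_t size;
	unsigned long flags;
};

/*
 * Set SKETCH_HOTPLUG on every region lying entirely inside
 * [base, base + size) and return how many regions were marked.
 * Regions that only partly overlap the range are left untouched.
 */
size_t sketch_mark_hotplug(struct sketch_region *r, size_t n,
			   uint64_t base, uint64_t size)
{
	uint64_t end = base + size;
	size_t marked = 0;

	for (size_t i = 0; i < n; i++) {
		if (r[i].base >= base && r[i].base + r[i].size <= end) {
			r[i].flags |= SKETCH_HOTPLUG;
			marked++;
		}
	}
	return marked;
}
```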
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Chen Tang <imtangchen@gmail.com> Cc: Gong Chen <gong.chen@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Liu Jiang <jiang.liu@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Renninger <trenn@suse.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tang Chen [Mon, 16 Dec 2013 23:45:08 +0000 (10:45 +1100)]
memblock, numa: introduce flags field into memblock
There is no flag in memblock to describe what type the memory is.
Sometimes, we may use memblock to reserve some memory for special usage,
and we want to know what kind of memory it is. So we need a way to mark it.
In a hotplug environment, we want to reserve hotpluggable memory so the
kernel won't be able to use it. And when the system is up, we have to
free this hotpluggable memory to the buddy allocator. So we need to mark
this memory first.
In order to do so, we need to mark out this special memory in memblock.
In this patch, we introduce a new "flags" member into memblock_region:
    struct memblock_region {
            phys_addr_t base;
            phys_addr_t size;
            unsigned long flags;    /* This is new. */
    #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
            int nid;
    #endif
    };
This patch does the following things:
1) Add a "flags" member to memblock_region.
2) Modify the prototypes of the following APIs:
   memblock_add_region()
   memblock_insert_region()
3) Add memblock_reserve_region() to support reserving memory with flags, and
   keep memblock_reserve()'s prototype unmodified.
4) Modify other APIs to support flags, but keep their prototypes unmodified.
The idea is from Wen Congyang <wency@cn.fujitsu.com> and Liu Jiang <jiang.liu@huawei.com>.
Suggested-by: Wen Congyang <wency@cn.fujitsu.com> Suggested-by: Liu Jiang <jiang.liu@huawei.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Chen Tang <imtangchen@gmail.com> Cc: Gong Chen <gong.chen@linux.intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Renninger <trenn@suse.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If the system can create a movable node, in which all memory of the node is
allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
the node's pg_data_t. So, invoke memblock_alloc_nid(...MAX_NUMNODES)
again to retry when the first allocation fails. Otherwise, the system
could fail to boot. (We don't use memblock_alloc_try_nid() to retry
because that function panics the system if the allocation fails.)
The node_data could be on a hotpluggable node, and so could the pagetable
and vmemmap. But for now, doing so would break the memory hot-remove path.
A node could have several memory devices, and the device which holds the
node data should be hot-removed last. But at the NUMA level, we don't
know which memory_block (/sys/devices/system/node/nodeX/memoryXXX) belongs
to which memory device. We only have the node. So we can only do node
hotplug.
But in virtualization, developers are now developing memory hotplug in
qemu, which supports hotplug of a single memory device. So whole-node
hotplug will not satisfy virtualization users.
So in the end, we concluded that we'd better do memory hotplug and local
node things (local node_data, pagetable, vmemmap, ...) in two steps.
Please refer to https://lkml.org/lkml/2013/6/19/73
For now, we put node_data of movable node to another node, and then
improve it in the future.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Jiang Liu <jiang.liu@huawei.com> Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Toshi Kani <toshi.kani@hp.com> Cc: Tejun Heo <tj@kernel.org> CC: "Rafael J . Wysocki" <rjw@sisk.pl> Cc: Len Brown <lenb@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Thomas Renninger <trenn@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Gong Chen <gong.chen@linux.intel.com> Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Chen Tang <imtangchen@gmail.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/memblock: debug: correct displaying of upper memory boundary
Current memblock APIs don't work on 32-bit PAE or LPAE architectures where
the physical memory start address is beyond 4GB. The problem was discussed
here [3] where Tejun and Yinghai (thanks) proposed a way forward with
memblock interfaces. Based on the proposal, this series adds the necessary
memblock interfaces and converts the core kernel code to use them.
Architectures already converted to NO_BOOTMEM use these new interfaces; for
others that still use bootmem, the new interfaces just fall back to the
existing bootmem APIs.
So there is no functional change in behavior. In the long run, once all
architectures move to NO_BOOTMEM, we can get rid of the bootmem layer
completely. This is one step towards removing the core code's dependency on
bootmem and also gives architectures a path to move away from bootmem.
Testing was done on 32-bit ARM LPAE machines with the normal as well as
the sparse (faked) memory model.
This patch (of 23):
When debugging is enabled (cmdline has "memblock=debug"), memblock
displays the wrong upper memory boundary for each allocated/freed memory
range. For example:
0x0000009e7ed000 is displayed instead of 0x0000009e7ecfff
Hence, correct this by changing the formula used to calculate the upper
memory boundary to (u64)base + size - 1 instead of (u64)base + size
everywhere in the debug messages.
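The off-by-one can be checked with a small user-space sketch. The base and size values in the usage note are made up so that base + size matches the address quoted above; the function names and the exact format string are assumptions for illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Last byte address covered by [base, base + size); size must be non-zero. */
uint64_t range_last_byte(uint64_t base, uint64_t size)
{
	return base + size - 1;
}

/* Format a range as an inclusive [start-end] pair for a debug message. */
int format_range(char *buf, size_t len, uint64_t base, uint64_t size)
{
	return snprintf(buf, len, "[%#016llx-%#016llx]",
			(unsigned long long)base,
			(unsigned long long)range_last_byte(base, size));
}
```

For a hypothetical range with base 0x9e7ec000 and size 0x1000, the exclusive end is 0x9e7ed000 while the last byte actually covered is 0x9e7ecfff, which is what the corrected messages print.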
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com> Cc: Yinghai Lu <yinghai@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Paul Walmsley <paul@pwsan.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Tony Lindgren <tony@atomide.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Davidlohr Bueso [Mon, 16 Dec 2013 23:45:07 +0000 (10:45 +1100)]
mm/mlock: prepare params outside critical region
All mlock related syscalls prepare lock limits, lengths and start
parameters with the mmap_sem held. Move this logic outside of the
critical region. For the case of mlock, continue incrementing the amount
already locked by mm->locked_vm with the rwsem taken.
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Michel Lespinasse <walken@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Davidlohr Bueso [Mon, 16 Dec 2013 23:45:07 +0000 (10:45 +1100)]
mm/mmap.c: add mlock_future_check() helper
Both do_brk and do_mmap_pgoff verify that we are actually capable of
locking future pages if the corresponding VM_LOCKED flags are used.
Encapsulate this logic into a single mlock_future_check() helper function.
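A user-space model of such a helper might look like the sketch below. It is not the kernel code: the struct, the flag bit and the page size are hypothetical stand-ins, and the privileged-override field stands in for whatever capability check the real callers perform.

```c
#include <errno.h>
#include <stdbool.h>

#define SKETCH_VM_LOCKED  0x2000UL	/* hypothetical flag bit */
#define SKETCH_PAGE_SHIFT 12		/* assumes 4 KiB pages */

/* Hypothetical stand-in for the per-process state the check consults. */
struct sketch_mm {
	unsigned long locked_vm;	/* pages already locked */
	unsigned long memlock_limit;	/* lock limit in bytes */
	bool can_ipc_lock;		/* privileged override */
};

/*
 * Return 0 if a future mapping of `len` bytes may be locked,
 * or -EAGAIN if it would exceed the lock limit.
 * The check only applies when SKETCH_VM_LOCKED is requested.
 */
int sketch_mlock_future_check(const struct sketch_mm *mm,
			      unsigned long flags, unsigned long len)
{
	if (flags & SKETCH_VM_LOCKED) {
		unsigned long locked = (len >> SKETCH_PAGE_SHIFT) + mm->locked_vm;
		unsigned long limit = mm->memlock_limit >> SKETCH_PAGE_SHIFT;

		if (locked > limit && !mm->can_ipc_lock)
			return -EAGAIN;
	}
	return 0;
}
```

Keeping the check in one function means both callers share the same limit arithmetic and the same error code.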
Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jerome Marchand [Mon, 16 Dec 2013 23:45:06 +0000 (10:45 +1100)]
mm: add overcommit_kbytes sysctl variable
Some applications that run on HPC clusters are designed around the
availability of RAM and the overcommit ratio is fine tuned to get the
maximum usage of memory without swapping. With growing memory, the
1%-of-all-RAM grain provided by overcommit_ratio has become too coarse for
these workloads (on a 2TB machine it represents no less than 20GB).
This patch adds the new overcommit_kbytes sysctl variable, which allows a
much finer grain.
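To make the granularity argument concrete, here is a minimal sketch of how a commit limit could be derived from either knob. It is an assumption-laden simplification: the function name is hypothetical, swap and hugetlb pages are ignored, and a zero kbytes value is taken to mean "fall back to the ratio".

```c
#include <stdint.h>

/*
 * All sizes are in KiB. `ratio` is a percentage of RAM; `kbytes` is an
 * absolute limit, with 0 meaning "unset, use the ratio instead".
 */
uint64_t sketch_commit_limit_kb(uint64_t ram_kb, unsigned int ratio,
				uint64_t kbytes)
{
	if (kbytes)
		return kbytes;
	return ram_kb * ratio / 100;
}
```

On a 2 TiB machine (2147483648 KiB), a ratio of 1 yields 21474836 KiB, roughly 20 GiB per percentage point, whereas the absolute value can be set to any KiB count.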
Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mel Gorman [Mon, 16 Dec 2013 23:45:06 +0000 (10:45 +1100)]
mm, show_mem: remove SHOW_MEM_FILTER_PAGE_COUNT
Commit 4b59e6c4 ("mm, show_mem: suppress page counts in non-blockable
contexts") introduced SHOW_MEM_FILTER_PAGE_COUNT to suppress PFN walks on
large memory machines. Commit c78e9363 ("mm: do not walk all of system
memory during show_mem") avoided a PFN walk in the generic show_mem helper
which removes the requirement for SHOW_MEM_FILTER_PAGE_COUNT in that case.
This patch removes PFN walkers from the arch-specific implementations that
report on a per-node or per-zone granularity. ARM and unicore32 still do
a PFN walk as they report memory usage on each bank which is a much finer
granularity where the debugging information may still be of use. As the
remaining arches doing PFN walks have relatively small amounts of memory,
this patch simply removes SHOW_MEM_FILTER_PAGE_COUNT.
Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: James Bottomley <jejb@parisc-linux.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jianyu Zhan [Mon, 16 Dec 2013 23:45:05 +0000 (10:45 +1100)]
mm/vmalloc: interchage the implementation of vmalloc_to_{pfn,page}
Currently vmalloc_to_pfn() is implemented as a wrapper around
vmalloc_to_page(), which works as follows:
1. walks the page tables to generate the corresponding pfn,
2. then converts the pfn to a struct page,
3. returns it.
vmalloc_to_pfn() then wraps vmalloc_to_page() and converts the page back
to a pfn.
This is too circuitous, so this patch reverses the relationship: implement
vmalloc_to_page() as a wrapper around vmalloc_to_pfn(). This makes
vmalloc_to_pfn() and vmalloc_to_page() slightly more efficient.
No functional change.
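The reversed layering can be shown with a toy user-space model. The fixed "page table" array, the mem_map array and all sketch_ names are hypothetical; only the direction of the wrapping reflects the description above.

```c
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_NPAGES 4

struct sketch_page { int dummy; };

/* Toy "page table": index is the virtual page number, value is the pfn. */
static const unsigned long sketch_pgtable[SKETCH_NPAGES] = { 7, 3, 5, 1 };
/* Toy mem_map: one struct page per pfn. */
static struct sketch_page sketch_mem_map[8];

/* Primary lookup: walk the (toy) page table and return the pfn. */
unsigned long sketch_vmalloc_to_pfn(unsigned long vaddr)
{
	return sketch_pgtable[(vaddr >> SKETCH_PAGE_SHIFT) % SKETCH_NPAGES];
}

/* Thin wrapper: convert the pfn to its struct page. */
struct sketch_page *sketch_vmalloc_to_page(unsigned long vaddr)
{
	return &sketch_mem_map[sketch_vmalloc_to_pfn(vaddr)];
}
```

In this arrangement a pfn lookup never touches struct page, and a page lookup does one walk plus one conversion.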
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Cc: Vladimir Murzin <murzin.v@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Mon, 16 Dec 2013 23:45:05 +0000 (10:45 +1100)]
mm, mempolicy: remove unneeded functions for UMA configs
Mempolicies only exist for CONFIG_NUMA configurations. Therefore, a
certain class of functions are unneeded in configurations where
CONFIG_NUMA is disabled such as functions that duplicate existing
mempolicies, lookup existing policies, set certain mempolicy traits, or
test mempolicies for certain attributes.
Remove the unneeded functions so that any future callers get a compile-
time error and protect their code with CONFIG_NUMA as required.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andreas Sandberg [Mon, 16 Dec 2013 23:45:05 +0000 (10:45 +1100)]
mm/hugetlb.c: call MMU notifiers when copying a hugetlb page range
When copy_hugetlb_page_range() is called to copy a range of hugetlb
mappings, the secondary MMUs are not notified if there is a protection
downgrade, which breaks COW semantics in KVM.
This patch adds the necessary MMU notifier calls.
Signed-off-by: Andreas Sandberg <andreas@sandberg.pp.se> Acked-by: Steve Capper <steve.capper@linaro.org> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: create a separate slab for page->ptl allocation
If DEBUG_SPINLOCK and DEBUG_LOCK_ALLOC are enabled, spinlock_t on x86_64 is
72 bytes. For page->ptl it will be allocated from the kmalloc-96 slab, so
we lose 24 bytes on each. An average system can easily allocate a few tens
of thousands of page->ptl, so the overhead is significant.
Let's create a separate slab for page->ptl allocation to solve this.
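The 24-byte figure follows from rounding 72 up to the next general-purpose size class. The sketch below works that out; the list of size classes is an assumed subset chosen for illustration, and the function names are hypothetical.

```c
#include <stddef.h>

/* Assumed subset of general-purpose slab size classes, in bytes. */
static const size_t sketch_classes[] = { 8, 16, 32, 64, 96, 128, 192, 256 };

/* Smallest size class that fits `size`, or 0 if none does. */
size_t sketch_size_class(size_t size)
{
	size_t n = sizeof(sketch_classes) / sizeof(sketch_classes[0]);

	for (size_t i = 0; i < n; i++)
		if (sketch_classes[i] >= size)
			return sketch_classes[i];
	return 0;
}

/* Bytes lost per object when it is served from a shared size class. */
size_t sketch_waste(size_t size)
{
	size_t cls = sketch_size_class(size);

	return cls ? cls - size : 0;
}
```

A dedicated cache sized for the object avoids this rounding, which is the motivation given above.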
To make sure that it really works this time, some numbers from my test
machine (just booted, no load):
mm: get rid of unnecessary pageblock scanning in setup_zone_migrate_reserve
Yasuaki Ishimatsu reported that memory hot-add took more than 5 _hours_ on
a 9TB machine because onlining memory sections is too slow. We found
that setup_zone_migrate_reserve accounted for >90% of the time.
The problem is that setup_zone_migrate_reserve scans all pageblocks
unconditionally, but this is only necessary if the number of reserved
blocks was reduced (i.e. memory hot remove).
Moreover, maximum MIGRATE_RESERVE per zone is currently 2. It means that
the number of reserved pageblocks is almost always unchanged.
This patch adds zone->nr_migrate_reserve_block to maintain the number of
MIGRATE_RESERVE pageblocks and it reduces the overhead of
setup_zone_migrate_reserve dramatically. The following table shows time
of onlining a memory section.
Rik van Riel [Mon, 16 Dec 2013 23:45:04 +0000 (10:45 +1100)]
/proc/meminfo: provide estimated available memory
Many load balancing and workload placing programs check /proc/meminfo to
estimate how much free memory is available. They generally do this by
adding up "free" and "cached", which was fine ten years ago, but is pretty
much guaranteed to be wrong today.
It is wrong because Cached includes memory that is not freeable as page
cache, for example shared memory segments, tmpfs, and ramfs, and it does
not include reclaimable slab memory, which can take up a large fraction of
system memory on mostly idle systems with lots of files.
Currently, the amount of memory that is available for a new workload,
without pushing the system into swap, can be estimated from MemFree,
Active(file), Inactive(file), and SReclaimable, as well as the "low"
watermarks from /proc/zoneinfo.
However, this may change in the future, and user space really should not
be expected to know kernel internals to come up with an estimate for the
amount of free memory.
It is more convenient to provide such an estimate in /proc/meminfo. If
things change in the future, we only have to change it in one place.
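One way the estimate described above could be computed from MemFree, Active(file), Inactive(file), SReclaimable and the "low" watermarks is sketched below. The exact weighting (keeping back the smaller of half the cache or the low watermark) is an assumption for illustration, and the struct and function names are hypothetical.

```c
/* All quantities are in pages. */
struct sketch_meminfo {
	long mem_free;
	long active_file;
	long inactive_file;
	long sreclaimable;
	long wmark_low;		/* sum of the per-zone "low" watermarks */
};

static long sketch_min(long a, long b)
{
	return a < b ? a : b;
}

long sketch_mem_available(const struct sketch_meminfo *m)
{
	/* Free memory below the low watermark is not really usable. */
	long available = m->mem_free - m->wmark_low;
	long pagecache = m->active_file + m->inactive_file;

	/* Assume part of the page cache has to stay resident. */
	pagecache -= sketch_min(pagecache / 2, m->wmark_low);
	available += pagecache;

	/* Likewise, assume only part of the reclaimable slab can go. */
	available += m->sreclaimable -
		     sketch_min(m->sreclaimable / 2, m->wmark_low);

	return available < 0 ? 0 : available;
}
```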
Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Erik Mouw <erik.mouw_2@nxp.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Mon, 16 Dec 2013 23:45:02 +0000 (10:45 +1100)]
mm: tail page refcounting optimization for slab and hugetlbfs
This skips the _mapcount mangling for slab and hugetlbfs pages.
The main trouble in doing this is to guarantee that PageSlab and
PageHeadHuge remain constant for all get_page/put_page calls run on the
tail of slab or hugetlbfs compound pages. Otherwise if they're set
during get_page but not set during put_page, the _mapcount of the tail
page would underflow.
PageHeadHuge will remain true until the compound page is released and
enters the buddy allocator, so there is no risk of it changing even if the
tail page is the last reference left on the page.
PG_slab instead is cleared before the slab frees the head page with
put_page, so if the tail pin is released after the slab freed the
page, we would have a problem. But in the slab case the tail pin
cannot be the last reference left on the page. This is because the
slab code is free to reuse the compound page after a
kfree/kmem_cache_free without having to check if there's any tail pin
left. In turn all tail pins must be always released while the head is
still pinned by the slab code and so we know PG_slab will be still set
too.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Mon, 16 Dec 2013 23:45:02 +0000 (10:45 +1100)]
mm: thp: optimize compound_trans_huge
Currently we don't clobber page_tail->first_page during split_huge_page,
so compound_trans_head can be set to compound_head without adverse
effects, and this mostly optimizes away a smp_rmb.
It looks worthwhile to keep around the implementation that doesn't rely
on page_tail->first_page not being clobbered, because it would be
necessary if we decide to enforce page->private to zero at all times
whenever PG_private is not set, also for anonymous pages. For anonymous
pages enforcing such an invariant doesn't matter as anonymous pages don't
use page->private so we can get away with this microoptimization.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Mon, 16 Dec 2013 23:45:02 +0000 (10:45 +1100)]
mm: hugetlbfs: move the put/get_page slab and hugetlbfs optimization in a faster path
We don't actually need a reference on the head page in the slab and
hugetlbfs paths, as long as we add a smp_rmb() which should be faster
than get_page_unless_zero.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Arcangeli [Mon, 16 Dec 2013 23:45:02 +0000 (10:45 +1100)]
mm: hugetlb: use get_page_foll() in follow_hugetlb_page()
get_page_foll() is more optimal and is always safe to use under the PT
lock. More so for hugetlbfs as there's no risk of race conditions with
split_huge_page regardless of the PT lock.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Tested-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: Pravin Shelar <pshelar@nicira.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Mon, 16 Dec 2013 23:45:02 +0000 (10:45 +1100)]
mm, memcg: avoid oom notification when current needs access to memory reserves
When current has a pending SIGKILL or is already in the exit path, it only
needs access to memory reserves to fully exit. In that sense, the memcg
is not actually oom for current; it simply needs to bypass memory charges
to exit and free its memory, which is itself a guarantee that memory will
be freed.
We only want to notify userspace for actionable oom conditions where
something needs to be done (and all oom handling can already be deferred
to userspace through this method by disabling the memcg oom killer with
memory.oom_control), not simply when a memcg has reached its limit, which
would actually have to happen before memcg reclaim actually frees memory
for charges.
Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dave Hansen [Mon, 16 Dec 2013 23:45:01 +0000 (10:45 +1100)]
mm: hugetlbfs: Add some VM_BUG_ON()s to catch non-hugetlbfs pages
Dave Jiang reported that he was seeing oopses when running NUMA systems
and default_hugepagesz=1G. I traced the issue down to migrate_page_copy()
trying to use the same code for hugetlb pages and transparent hugepages.
It should not have been trying to pass thp pages in there.
So, add some VM_BUG_ON()s for the next hapless VM developer that tries the
same thing.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Tested-by: Dave Jiang <dave.jiang@intel.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sasha Levin [Mon, 16 Dec 2013 23:45:01 +0000 (10:45 +1100)]
watchdog: trigger all-cpu backtrace when locked up and going to panic
Send an NMI to all CPUs when a lockup is detected and the lockup watchdog
code is configured to panic. This gives us a fairly up-to-date snapshot of
all CPUs in the system.
This lets us get a stack trace of all CPUs, which makes life easier when
trying to debug a deadlock, and the NMI doesn't change anything since the
next step is a kernel panic.
Dan Carpenter [Mon, 16 Dec 2013 23:45:01 +0000 (10:45 +1100)]
fs/compat_ioctl.c: fix an underflow issue (harmless)
We cap "nmsgs" at I2C_RDRW_IOCTL_MAX_MSGS (42) but the current code allows
negative values. It's harmless but it makes my static checker upset so
I've made nmsgs unsigned.
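The effect of the signedness change on an upper-bound check can be seen in a small stand-alone sketch. The function names and the macro are hypothetical; only the cap of 42 comes from the text above.

```c
#include <stdbool.h>

#define SKETCH_MAX_MSGS 42	/* the cap named in the text */

/* With a signed count, a negative value passes the upper-bound check. */
bool sketch_count_ok_signed(int nmsgs)
{
	return !(nmsgs > SKETCH_MAX_MSGS);
}

/* With an unsigned count, the same bit pattern is huge and is rejected. */
bool sketch_count_ok_unsigned(unsigned int nmsgs)
{
	return !(nmsgs > SKETCH_MAX_MSGS);
}
```

Passing -1 to the signed variant is accepted, while the unsigned variant sees the largest unsigned value and rejects it.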
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Josh Hunt [Mon, 16 Dec 2013 23:45:00 +0000 (10:45 +1100)]
block: restore /proc/partitions to not display non-partitionable removable devices
We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before.
Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduces this change in behavior. It's not
clear to me from the commit's changelog if this change was intentional or
not. This comment still remains: /* Don't show non-partitionable
removeable devices or empty devices */ so I've decided to send a patch to
restore the behavior of not printing unpartitionable removable devices.
Signed-off-by: Josh Hunt <johunt@akamai.com> Cc: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
CaiZhiyong [Mon, 16 Dec 2013 23:45:00 +0000 (10:45 +1100)]
block: remove unrelated header files and export symbol
Fix up the following items:
- remove unrelated header files.
- export interface function.
- modify function cmdline_parts_parse return value, this will make
it more friendly for the caller.
Signed-off-by: CaiZhiyong <caizhiyong@huawei.com> Cc: Ezequiel Garcia <ezequiel.garcia@free-electrons.com> CC: Brian Norris <computersforpeace@gmail.com> Cc: "Wanglin (Albert)" <albert.wanglin@hisilicon.com> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Karel Zak <kzak@redhat.com> Cc: Shmulik Ladkani <shmulik.ladkani@gmail.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Olaf Hering [Mon, 16 Dec 2013 23:45:00 +0000 (10:45 +1100)]
drivers/block/loop.c: fix comment typo in loop_config_discard
Discard requests are ignored if the encryption is enabled for the given
loop device. Update comment to match the code, and similar comments
elsewhere in the file.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Mon, 16 Dec 2013 23:45:00 +0000 (10:45 +1100)]
block/blk-mq-cpu.c: use hotcpu_notifier()
Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Now that __smp_call_function_single is available for all builds and uses
llists to queue up items without taking a lock or disabling interrupts
there is no need to wrap around it in the block code.
Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The driver core clears the driver data to NULL after device_release or on
probe failure. Thus, it is not needed to manually clear the device driver
data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Mon, 16 Dec 2013 23:44:58 +0000 (10:44 +1100)]
ocfs2: update inode size after zeroing the hole
fs-writeback releases, without taking the page lock, dirty pages whose
offsets are beyond the inode size; the release happens in
block_write_full_page_endio(). If the inode size is not updated, dirty
pages in file holes may be released before they are flushed to disk. The
holes will then contain non-zero data, which causes md5sum errors on
sparse files.
To reproduce the bug, take a big sparse file with many holes, such as a VM
image file. Its actual size should be larger than the available memory so
that writeback runs more frequently. Tar it with the -S option, then
repeatedly untar it and check its md5sum until you get a wrong md5sum.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Younger Liu <younger.liu@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Younger Liu [Mon, 16 Dec 2013 23:44:58 +0000 (10:44 +1100)]
ocfs2: fix issue that ocfs2_setattr() does not deal with new_i_size==i_size
The issue scenario is as follows:
- Create a small file and fallocate a large amount of disk space for it
with the FALLOC_FL_KEEP_SIZE option.
- ftruncate the file back to its original size. The free disk space
does not change back. This is a real bug that is fixed in this
patch.
To solve the issue above, we modified ocfs2_setattr(): if
attr->ia_size != i_size_read(inode), it calls ocfs2_truncate_file() and
truncates the disk space to attr->ia_size.
Signed-off-by: Younger Liu <younger.liu@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Tested-by: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Reviewed-by: Jensen <shencanquan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jensen [Mon, 16 Dec 2013 23:44:58 +0000 (10:44 +1100)]
ocfs2: llseek requires ocfs2 inode lock for the file in SEEK_END
llseek requires the ocfs2 inode lock to update the file size for SEEK_END,
because the file size may be updated on another node.
This bug can be reproduced with the following scenario: first, we dd a
test fileA whose size is 10k.
on NodeA:
---------
1) open the test fileA, lseek to the end of the file and print the position.
2) close the test fileA
on NodeB:
---------
1) open the test fileA, append 5k of data to it.
2) lseek to the end of the file and print the position.
3) close the file.
First we run test program1 on NodeA; the result is 10k. Then we run
test program2 on NodeB; the result is 15k. Finally, we run test program1
on NodeA again; the result is 10k.
After applying this patch, the result of the third step is 15k.
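The NodeA test program from the scenario above might look roughly like the following userspace sketch. The function name end_position() is hypothetical; the original test programs are not part of this message.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open the file, seek to its end and return the resulting offset,
 * or -1 on error. This mirrors "test program1" from the scenario.
 */
off_t end_position(const char *path)
{
    off_t pos;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    pos = lseek(fd, 0, SEEK_END);
    close(fd);
    return pos;
}
```

On a cluster filesystem the returned offset is only correct if the size cached on the local node is refreshed before SEEK_END is evaluated, which is what taking the inode lock is for.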
Signed-off-by: Jensen <shencanquan@huawei.com> Cc: Jie Liu <jeff.liu@oracle.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Younger Liu [Mon, 16 Dec 2013 23:44:57 +0000 (10:44 +1100)]
ocfs2: should call ocfs2_journal_access_di() before ocfs2_delete_entry() in ocfs2_orphan_del()
While deleting a file from the orphan dir in ocfs2_orphan_del(), it calls
ocfs2_delete_entry() before ocfs2_journal_access_di(). If
ocfs2_delete_entry() succeeded and ocfs2_journal_access_di() failed, there
would be an inconsistency: the file is deleted from the orphan dir, but the
orphan dir dinode is not updated.
So we need to call ocfs2_journal_access_di() before ocfs2_delete_entry().
Signed-off-by: Younger Liu <younger.liu@huawei.com> Reviewed-by: Jensen <shencanquan@huawei.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yiwen Jiang [Mon, 16 Dec 2013 23:44:57 +0000 (10:44 +1100)]
ocfs2: fix a tiny race when running dirop_fileop_racer
When running dirop_fileop_racer we found a deadlock case.
2 nodes, say Node A and Node B, mount the same ocfs2 volume. Create
/race/16/1 in the filesystem, and let the inode number of dir 16 be less
than the inode number of dir race.
Node A                          Node B
mv /race/16/1 /race/
                                right after Node A has got the
                                EX mode of /race/16/, and tries to
                                get EX mode of /race
                                ls /race/16/
In this case, Node A has got the EX mode of /race/16/, and wants to get EX
mode of /race/. Node B has got the PR mode of /race/, and wants to get
the PR mode of /race/16/. Since EX and PR are mutually exclusive, a
deadlock occurs.
This patch fixes this case by locking in ancestor order before trying
inode number order.
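The ordering rule can be sketched in plain C as below. The struct dir_node type and both functions are hypothetical stand-ins, not ocfs2 code, and the fallback of locking the lower inode number first is an assumption based on the scenario above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory directory: inode number plus parent link. */
struct dir_node {
    unsigned long ino;
    const struct dir_node *parent;  /* NULL for the root */
};

/* Return true if 'a' is a proper ancestor of 'd'. */
bool is_ancestor(const struct dir_node *a, const struct dir_node *d)
{
    const struct dir_node *p;

    for (p = d->parent; p != NULL; p = p->parent)
        if (p == a)
            return true;
    return false;
}

/*
 * Pick the directory to lock first: an ancestor always goes before its
 * descendant; unrelated directories fall back to inode number order
 * (lower number first, an assumption of this sketch).
 */
const struct dir_node *lock_first(const struct dir_node *a,
                                  const struct dir_node *b)
{
    assert(a != NULL && b != NULL);
    if (is_ancestor(a, b))
        return a;
    if (is_ancestor(b, a))
        return b;
    return a->ino <= b->ino ? a : b;
}
```

With this rule both the rename and the directory listing take /race before /race/16, so neither can hold one lock while waiting for the other in the opposite order.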
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jie Liu [Mon, 16 Dec 2013 23:44:57 +0000 (10:44 +1100)]
ocfs2: adjust minlen with discard_granularity in the FITRIM ioctl
Adjust minlen with discard_granularity for the FITRIM ioctl(2) if the given
minimum size in bytes is less than it, because discard granularity tells
us the minimum size of an extent that the storage device can discard.
This is inspired by ext4 commit 5c2ed62fd4 ("ext4: Adjust minlen with
discard_granularity in the FITRIM ioctl") from Lukas Czerner.
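A minimal sketch of the adjustment, assuming both values are byte counts. The function name adjust_minlen() is hypothetical and does not come from the ocfs2 source.

```c
#include <assert.h>
#include <stdint.h>

/* Raise the requested minimum extent length to the device granularity. */
uint64_t adjust_minlen(uint64_t minlen, uint64_t discard_granularity)
{
    return minlen < discard_granularity ? discard_granularity : minlen;
}

void example_usage(void)
{
    /* A 512-byte request on a device with 4096-byte granularity. */
    assert(adjust_minlen(512, 4096) == 4096);
    /* A request already above the granularity is left alone. */
    assert(adjust_minlen(8192, 4096) == 8192);
}
```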
Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jie Liu [Mon, 16 Dec 2013 23:44:56 +0000 (10:44 +1100)]
ocfs2: return EINVAL if the given range to discard is less than block size
For the FITRIM ioctl(2), we should not stay silent if the given range
length is less than a block size, as no data blocks would be discarded.
Hence it should return EINVAL instead. This issue can be verified via
xfstests/generic/288, which is used for FITRIM argument handling tests.
Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jie Liu [Mon, 16 Dec 2013 23:44:56 +0000 (10:44 +1100)]
ocfs2: return EOPNOTSUPP if the device does not support discard
For the FITRIM ioctl(2), we should return EOPNOTSUPP to inform the user
that the storage device does not support discard when that is the case.
Otherwise, returning success would confuse the user, since no free blocks
were trimmed at all.
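The check described here, together with the block-size check from the preceding FITRIM entry, could be modelled as below. The function check_trim_request() and the order of the two checks are assumptions of this sketch, not taken from the ocfs2 source.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Validate a trim request before doing any work.
 * Returns 0 if the request may proceed, or a negative errno value.
 */
int check_trim_request(bool device_supports_discard,
                       uint64_t len, uint64_t block_size)
{
    if (!device_supports_discard)
        return -EOPNOTSUPP;     /* nothing could ever be trimmed */
    if (len < block_size)
        return -EINVAL;         /* range too short to hold a block */
    return 0;
}

void example_usage(void)
{
    assert(check_trim_request(false, 1 << 20, 4096) == -EOPNOTSUPP);
    assert(check_trim_request(true, 512, 4096) == -EINVAL);
    assert(check_trim_request(true, 1 << 20, 4096) == 0);
}
```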
Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Younger Liu [Mon, 16 Dec 2013 23:44:56 +0000 (10:44 +1100)]
ocfs2: remove redundant ocfs2_alloc_dinode_update_counts() and ocfs2_block_group_set_bits()
ocfs2_alloc_dinode_update_counts() and ocfs2_block_group_set_bits() are
already provided in suballoc.c. So, the same functions in move_extents.c
are not needed any more.
Declare the functions in suballoc.h and remove redundant functions in
move_extents.c.
Signed-off-by: Younger Liu <liuyiyang@hisense.com> Cc: Younger Liu <younger.liucn@gmail.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ocfs2: use the new DLM operation callbacks while requesting new lockspace
Attempt to use the new DLM operations. If it is not supported, use the
traditional ocfs2_controld.
To exchange ocfs2 versioning, we use the LVB of the version dlm lock. It
first attempts to take the lock in EX mode (non-blocking). If successful
(which means it is the first mount), it writes the version number and
downconverts to PR lock. If it is unsuccessful, it reads the version from
the lock.
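The exchange can be modelled with a toy lock as in the sketch below. All types and functions here are hypothetical stand-ins; a real implementation would go through the fs/dlm locking calls, which are not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct proto_version {
    uint8_t major, minor;
};

/* Hypothetical stand-in for a DLM lock with a lock value block (LVB). */
struct version_lock {
    int pr_holders;             /* nodes holding the lock in PR mode */
    struct proto_version lvb;   /* value stored alongside the lock */
};

/* Non-blocking EX attempt: succeeds only if nobody holds the lock yet. */
static bool try_lock_ex(const struct version_lock *l)
{
    return l->pr_holders == 0;
}

/*
 * First mounter: take EX, publish its version, downconvert to PR.
 * Later mounters: EX fails, so take PR and read the published version.
 */
struct proto_version negotiate(struct version_lock *l,
                               struct proto_version mine)
{
    assert(l != NULL);
    if (try_lock_ex(l))
        l->lvb = mine;
    l->pr_holders++;
    return l->lvb;
}
```

In this model the second caller gets back the version written by the first, regardless of the version it passed in.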
If this becomes the standard (with o2cb as well), it could simplify
userspace tools to check if the filesystem is mounted on other nodes.
Dan: Since ocfs2_protocol_version are two u8 values, the additional
checks with LONG* don't make sense.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ocfs2: shift allocation ocfs2_live_connection to user_connect()
We perform this because the DLM recovery callbacks will require the
ocfs2_live_connection structure to record the node information when
dlm_new_lockspace() is updated (in the last patch of the series).
Before calling dlm_new_lockspace(), we need the structure ready for the
.recover_done() callback, which would set oc_this_node. This is the
reason we allocate ocfs2_live_connection beforehand in user_connect().
[AKPM] rc initialization is not required because it is assigned in case of
errors. It will be cleared by the compiler anyway.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
These are the callbacks called by the fs/dlm code in case the membership
changes. If there is a failure while/during calling any of these, the DLM
creates a new membership and relays to the rest of the nodes.
recover_prep() is called when DLM understands a node is down.
recover_slot() is called once all nodes have acknowledged recover_prep and
recovery can begin. recover_done() is called once the recovery is
complete. It returns the new membership.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This is an effort of removing ocfs2_controld.pcmk and getting ocfs2 DLM
handling up to the times with respect to DLM (>=4.0.1) and corosync
(2.3.x). AFAIK, cman also is being phased out for a unified corosync
cluster stack.
fs/dlm performs all the functions with respect to fencing and node
management and provides the API's to do so for ocfs2. For all future
references, DLM stands for fs/dlm code.
The advantages are:
+ No need to run an additional userspace daemon (ocfs2_controld)
+ No controld device handling and controld protocol
+ Shifting responsibilities of node management to DLM layer
For backward compatibility, we are keeping the controld handling code. Once
enough time has passed we can remove a significant portion of the code. This
was tested by using the kernel with changes on older unmodified tools. The
kernel used ocfs2_controld as expected, and displayed the appropriate
warning message.
This feature requires modification in the userspace ocfs2-tools. The
changes can be found at: https://github.com/goldwynr/ocfs2-tools branch:
nocontrold Currently, not many checks are present in the userspace code,
but that would change soon.
This patch (of 6):
Add clustername to cluster connection.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tariq Saeed [Mon, 16 Dec 2013 23:44:54 +0000 (10:44 +1100)]
ocfs2/o2net: incorrect to terminate accepting connections loop upon rejecting an invalid one
When o2net_accept_one() rejects an illegal connection, it terminates the
loop picking up the remaining queued connections. This fix will continue
accepting connections until the queue is empty.
Zongxun Wang [Mon, 16 Dec 2013 23:44:54 +0000 (10:44 +1100)]
ocfs2: free allocated clusters if error occurs after ocfs2_claim_clusters
Even if using the same jbd2 handle, we cannot roll back a transaction. So
once some error occurs after successfully allocating clusters, the
allocated clusters will never be used, which means they are lost. For
example, ocfs2_claim_clusters is called successfully when expanding a
file, but ocfs2_insert_extent then fails. So we need to free the
allocated clusters if they are not actually used.
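The cleanup pattern can be illustrated with a toy allocator, as sketched below. Every name here (struct allocator, claim_clusters(), insert_extent(), extend_file()) is a hypothetical stand-in and does not reflect the real ocfs2 signatures.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical allocator state: number of free clusters. */
struct allocator {
    unsigned int free_clusters;
};

static int claim_clusters(struct allocator *a, unsigned int n)
{
    if (a->free_clusters < n)
        return -ENOSPC;
    a->free_clusters -= n;
    return 0;
}

static void free_clusters(struct allocator *a, unsigned int n)
{
    a->free_clusters += n;
}

/* Stand-in for the step that may fail after allocation succeeded. */
static int insert_extent(bool fail)
{
    return fail ? -EIO : 0;
}

/* Extend a file by n clusters, returning unused clusters on failure. */
int extend_file(struct allocator *a, unsigned int n, bool insert_fails)
{
    int ret;

    assert(a != NULL);
    ret = claim_clusters(a, n);
    if (ret)
        return ret;
    ret = insert_extent(insert_fails);
    if (ret)
        free_clusters(a, n);    /* give back what was never used */
    return ret;
}
```

Without the free_clusters() call on the error path, a failed insert would leave the free count permanently reduced.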
Signed-off-by: Zongxun Wang <wangzongxun@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Li Zefan <lizefan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The versioning information is confusing for end-users. The numbers are
stuck at 1.5.0 when the tools version has moved to 1.8.2. Remove the
versioning system in the OCFS2 modules and let the kernel version be the
guide to debug issues.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Acked-by: Sunil Mushran <sunil.mushran@gmail.com> Cc: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Younger Liu [Mon, 16 Dec 2013 23:44:54 +0000 (10:44 +1100)]
ocfs2: fix ocfs2_sync_file() if filesystem is readonly
If filesystem is readonly, there is no need to flush drive's caches or
force any uncommitted transactions.
Signed-off-by: Younger Liu <younger.liucn@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Samuel Thibault [Mon, 16 Dec 2013 23:44:53 +0000 (10:44 +1100)]
input: route kbd LEDs through the generic LEDs layer
This makes it possible to reassign keyboard LEDs to something other than
the keyboard "leds" state, by adding keyboard LED and modifier triggers
connected to a series of VT input LEDs, themselves connected to VT input
triggers, which per-input device LEDs use by default. Userland can thus
easily change the LED behavior of (a priori) all input devices, or of
particular input devices.
This also makes it possible to fix #7063 from userland by using a modifier
to implement proper CapsLock behavior and have the keyboard caps lock LED
show that modifier state.
[ebroder@mokafive.com: Rebased to 3.2-rc1 or so, cleaned up some includes, and fixed some constants]
[blogic@openwrt.org: CONFIG_INPUT_LEDS stubs should be static inline] Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Signed-off-by: Evan Broder <evan@ebroder.net> Reviewed-by: David Herrmann <dh.herrmann@gmail.com> Tested-by: Pavel Machek <pavel@ucw.cz> Acked-by: Peter Korsgaard <jacmet@sunsite.dk> Cc: Pavel Machek <pavel@ucw.cz> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Bryan Wu <cooloney@gmail.com> Cc: Arnaud Patard <arnaud.patard@rtp-net.org> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Matt Sealey <matt@genesi-usa.com> Cc: Rob Clark <robdclark@gmail.com> Cc: Niels de Vos <devos@fedoraproject.org> Cc: Steev Klimaszewski <steev@genesi-usa.com> Signed-off-by: John Crispin <blogic@openwrt.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stephen Boyd [Mon, 16 Dec 2013 23:44:52 +0000 (10:44 +1100)]
sched_clock: document 4Mhz vs 1Mhz decision
Bo Shen sent a patch to change this to 1Mhz instead of 4Mhz but according
to Russell King the use of 4Mhz was intentional. Add a comment to this
effect so that others don't try to change the code as well.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Bo Shen <voice.shen@atmel.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jeff Mahoney [Mon, 16 Dec 2013 23:44:51 +0000 (10:44 +1100)]
drm/nouveau: make vga_switcheroo code depend on VGA_SWITCHEROO
Commit 8116188fdef594 ("nouveau/acpi: hook up to the MXM method for mux
switching.") broke the build on non-x86 architectures due to the new
dependency on MXM and MXM being an x86 platform driver.
It built previously since the vga switcheroo registration routines were
zeroed out on !X86. The code was built in but unused.
This patch makes all of the DSM code depend on CONFIG_VGA_SWITCHEROO,
allowing it to build on non-x86 and shrinking the module size as well.
[rdunlap@infradead.org: fix build error when VGA_SWITCHEROO is not enabled] Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: David Airlie <airlied@linux.ie> Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Takashi Iwai [Mon, 16 Dec 2013 23:44:51 +0000 (10:44 +1100)]
drm/cirrus: correct register values for 16bpp
When the mode is set with 16bpp on QEMU, the output gets totally broken.
The culprit is the bogus register values set for 16bpp, which were likely
copied from a wrong place.
Daniel Vetter [Mon, 16 Dec 2013 23:44:51 +0000 (10:44 +1100)]
drm/fb-helper: don't sleep for screen unblank when an oops is in progress
Otherwise the system will burn even brighter and worse, leave the user
wondering what's going on exactly.
Since we already have a panic handler which will (try) to restore the
entire fbdev console mode, we can just bail out. Inspired by a patch from
Konstantin Khlebnikov. The callchain leading to this, cut&pasted from
Konstantin's original patch:
Note that the entire locking in the fb helper around panic/sysrq and kdbg
is ... non-existent. So we have a decent chance of blowing up
everything. But since reworking this ties in with funny concepts like the
fbdev notifier chain or the impressive things which happen around
console_lock while oopsing, I'll leave that as an exercise for braver
souls than me.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Dave Airlie <airlied@gmail.com> Reviewed-by: Rob Clark <robdclark@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jan Kara [Mon, 16 Dec 2013 23:44:51 +0000 (10:44 +1100)]
fsnotify: remove pointless NULL initializers
We usually rely on the fact that struct members not specified in the
initializer are set to NULL. So do that with fsnotify function pointers
as well.
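The C rule being relied on can be shown with a small operations table. The struct ops type below is a hypothetical example and not the real fsnotify operations structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operations table with two optional callbacks. */
struct ops {
    int (*handle_event)(int mask);
    void (*free_group_priv)(void *group);
    void (*freeing_mark)(void *mark);
};

static int handle(int mask)
{
    return mask;
}

/* Only .handle_event is named; the other members are implicitly NULL. */
const struct ops example_ops = {
    .handle_event = handle,
};

void example_usage(void)
{
    assert(example_ops.handle_event(3) == 3);
    assert(example_ops.free_group_priv == NULL);
    assert(example_ops.freeing_mark == NULL);
}
```

Members omitted from an initializer are initialized as if they had static storage duration, so pointer members become null pointers and explicit "= NULL" entries add nothing.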
Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jan Kara [Mon, 16 Dec 2013 23:44:50 +0000 (10:44 +1100)]
fsnotify: remove .should_send_event callback
After removing event structure creation from the generic layer there is no
reason for separate .should_send_event and .handle_event callbacks. So
just remove the first one.
Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jan Kara [Mon, 16 Dec 2013 23:44:50 +0000 (10:44 +1100)]
fsnotify: do not share events between notification groups
Currently fsnotify framework creates one event structure for each
notification event and links this event into all interested notification
groups. This is done so that we save memory when several notification
groups are interested in the event. However the need for event structure
shared between inotify & fanotify bloats the event structure so the result
is often higher memory consumption.
Another problem is that fsnotify framework keeps path references with
outstanding events so that fanotify can return open file descriptors with
its events. This has the undesirable effect that filesystem cannot be
unmounted while there are outstanding events - a regression for inotify
compared to a situation before it was converted to fsnotify framework.
For fanotify this problem is hard to avoid and users of fanotify should
kind of expect this behavior when they ask for file descriptors from
notified files.
This patch changes fsnotify and its users to create separate event
structure for each group. This allows for much simpler code (~400 lines
removed by this patch) and also smaller event structures. For example on
64-bit system original struct fsnotify_event consumes 120 bytes, plus
additional space for file name, additional 24 bytes for second and each
subsequent group linking the event, and additional 32 bytes for each
inotify group for private data. After the conversion inotify event
consumes 48 bytes plus space for file name which is considerably less
memory unless file names are long and there are several groups interested
in the events (both of which are uncommon). Fanotify event fits in 56
bytes after the conversion (fanotify doesn't care about file names so its
events don't have to have it allocated). A win unless there are four or
more fanotify groups interested in the event.
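The byte counts quoted above can be turned into a small calculation, sketched below with file name storage left out. The function and constant names are hypothetical; only the numbers come from the message.

```c
#include <assert.h>

/* Byte counts quoted in the commit message (64-bit, file name excluded). */
enum {
    OLD_EVENT = 120,              /* shared struct fsnotify_event */
    OLD_PER_EXTRA_GROUP = 24,     /* second and each subsequent group */
    OLD_PER_INOTIFY_GROUP = 32,   /* inotify private data */
    NEW_INOTIFY_EVENT = 48,
    NEW_FANOTIFY_EVENT = 56
};

/* Old scheme: one shared event linked into all interested groups. */
unsigned int old_bytes(unsigned int inotify_groups,
                       unsigned int fanotify_groups)
{
    unsigned int groups = inotify_groups + fanotify_groups;

    assert(groups > 0);
    return OLD_EVENT + OLD_PER_EXTRA_GROUP * (groups - 1) +
           OLD_PER_INOTIFY_GROUP * inotify_groups;
}

/* New scheme: one private event per group. */
unsigned int new_bytes(unsigned int inotify_groups,
                       unsigned int fanotify_groups)
{
    return NEW_INOTIFY_EVENT * inotify_groups +
           NEW_FANOTIFY_EVENT * fanotify_groups;
}
```

For fanotify alone, one group costs 120 bytes before and 56 after, three groups cost 168 bytes either way, and four groups cost 192 before against 224 after, which matches the "four or more" break-even point stated above.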
The conversion also solves the problem with unmount when only inotify is
used as we don't have to grab path references for inotify events.
Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jan Kara [Mon, 16 Dec 2013 23:44:50 +0000 (10:44 +1100)]
inotify: provide function for name length rounding
Rounding of name length when passing it to userspace was done in several
places. Provide a function to do it and use it in all places.
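A helper of this kind might look like the sketch below. It assumes that the name is padded to a multiple of the 16-byte fixed part of struct inotify_event, with one byte reserved for the terminating NUL, and that an empty name takes no space; the name round_name_len() and the constant are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed alignment unit: the 16-byte fixed header of struct inotify_event. */
#define EVENT_HEADER_SIZE 16u

/*
 * Round a name length up for userspace: reserve one byte for the
 * terminating NUL, then pad to a multiple of the header size.
 * An event without a name takes no extra space (assumption).
 */
size_t round_name_len(size_t name_len)
{
    if (name_len == 0)
        return 0;
    return (name_len + 1 + EVENT_HEADER_SIZE - 1) / EVENT_HEADER_SIZE *
           EVENT_HEADER_SIZE;
}

void example_usage(void)
{
    assert(round_name_len(1) == 16);
    assert(round_name_len(16) == 32);
}
```

Keeping the rounding in one place means the size reported to userspace and the size actually copied cannot drift apart.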
Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Eric Paris <eparis@parisplace.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shuah Khan [Mon, 16 Dec 2013 23:44:50 +0000 (10:44 +1100)]
dma-debug: enhance dma_debug_device_change() to check for mapping errors
dma-debug checks whether a driver validated the address returned by the
dma mapping routines when the driver does an unmap. If a driver doesn't
call unmap, failure to check mapping errors isn't detected and reported.
Enhance the existing bus notifier_call dma_debug_device_change() to check
for mapping errors at the same time it detects leaked dma buffers for the
BUS_NOTIFY_UNBOUND_DRIVER event. It scans for mapping errors and, if any
are found, prints one warning message that includes the mapping error count.
Signed-off-by: Shuah Khan <shuah.kh@samsung.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Joerg Roedel <joro@8bytes.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vladimir Murzin [Mon, 16 Dec 2013 23:44:49 +0000 (10:44 +1100)]
arm: move arm_dma_limit to setup_dma_zone
Since 4dcfa600 ("ARM: DMA-API: better handing of DMA masks for coherent
allocations") arm_dma_limit_pfn has almost substituted the arm_dma_limit.
The remaining user is dma_contiguous_reserve(). It is also referenced in
setup_dma_zone() to calculate arm_dma_limit_pfn.
Kill the global arm_dma_limit and equip setup_dma_zone with the local one.
Signed-off-by: Vladimir Murzin <murzin.v@gmail.com> Reported-by: Vassili Karpov <av1474@comtv.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Toshi Kani [Mon, 16 Dec 2013 23:44:49 +0000 (10:44 +1100)]
arch/x86/mm/srat.c: skip NUMA_NO_NODE while parsing SLIT
When ACPI SLIT table has an I/O locality (i.e. a locality unique to an
I/O device), numa_set_distance() emits the warning message below.
NUMA: Warning: node ids are out of bound, from=-1 to=-1 distance=10
acpi_numa_slit_init() calls numa_set_distance() with pxm_to_node(), which
assumes that all localities have been parsed with SRAT previously. SRAT
does not list I/O localities, whereas SLIT lists all localities including
I/Os. Hence, pxm_to_node() returns NUMA_NO_NODE (-1) for an I/O locality.
I/O localities are not supported and are ignored today, but emitting such
a warning message leads to unnecessary confusion.
Change acpi_numa_slit_init() to avoid calling numa_set_distance() with
NUMA_NO_NODE.
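The skip can be sketched over a small distance matrix as below. The table, the function names and the fixed sizes are hypothetical; the real code uses pxm_to_node() and numa_set_distance(), whose internals are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)
#define NR_LOCALITIES 3

/* Hypothetical PXM-to-node table: locality 2 is an I/O locality. */
static const int pxm_node[NR_LOCALITIES] = { 0, 1, NUMA_NO_NODE };

/* Stand-in for pxm_to_node(): map a proximity domain to a node id. */
static int pxm_to_node_sketch(int pxm)
{
    return pxm_node[pxm];
}

/*
 * Walk the distance matrix and record only entries where both
 * endpoints map to real nodes. Returns the number of entries recorded.
 */
int parse_slit(unsigned char slit[NR_LOCALITIES][NR_LOCALITIES],
               int distance[2][2])
{
    int i, j, recorded = 0;

    assert(slit != NULL && distance != NULL);
    for (i = 0; i < NR_LOCALITIES; i++) {
        int from = pxm_to_node_sketch(i);

        if (from == NUMA_NO_NODE)
            continue;
        for (j = 0; j < NR_LOCALITIES; j++) {
            int to = pxm_to_node_sketch(j);

            if (to == NUMA_NO_NODE)
                continue;
            distance[from][to] = slit[i][j];
            recorded++;
        }
    }
    return recorded;
}
```

With three localities of which one is I/O-only, only the four node-to-node entries are recorded and no out-of-bound node id is ever passed on.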
Signed-off-by: Toshi Kani <toshi.kani@hp.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
sh: add EXPORT_SYMBOL(min_low_pfn) and EXPORT_SYMBOL(max_low_pfn) to sh_ksyms_32.c
min_low_pfn and max_low_pfn are used in the pfn_valid macro if
CONFIG_FLATMEM is defined. When a function that uses pfn_valid is used in
a driver module, max_low_pfn and min_low_pfn are undefined and the build
fails.
Jianguo Wu [Mon, 16 Dec 2013 23:44:48 +0000 (10:44 +1100)]
mm/memory-failure.c: recheck PageHuge() after hugetlb page migrate successfully
After a successful hugetlb page migration by soft offline, the source page
will either be freed into hugepage_freelists or buddy (over-commit page).
If page is in buddy, page_hstate(page) will be NULL. It will hit a NULL
pointer dereference in dequeue_hwpoisoned_huge_page().
Joonsoo Kim [Mon, 16 Dec 2013 23:44:48 +0000 (10:44 +1100)]
mm/compaction: respect ignore_skip_hint in update_pageblock_skip
update_pageblock_skip() only fits compaction that tries to isolate by
pageblock unit. If isolate_migratepages_range() is called by CMA, it tries
to isolate regardless of pageblock unit and, because of ignore_skip_hint,
it doesn't reference get_pageblock_skip(). We should also respect it in
update_pageblock_skip() to avoid setting wrong information.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: <stable@vger.kernel.org> [3.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It's caused by uninitialized persistent_keyring_register_sem.
The bug was introduced by commit f36f8c75 ("KEYS: Add per-user_namespace
registers for persistent per-UID kerberos caches"). Two typos are in that
commit: CONFIG_KEYS_KERBEROS_CACHE should be CONFIG_PERSISTENT_KEYRINGS
and krb_cache_register_sem should be persistent_keyring_register_sem.
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: David Howells <dhowells@redhat.com> Acked-by: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
sh: always link in helper functions extracted from libgcc
E.g. landisk_defconfig, which has CONFIG_NTFS_FS=m:
ERROR: "__ashrdi3" [fs/ntfs/ntfs.ko] undefined!
For "lib-y", if no symbols in a compilation unit are referenced by other
units, the compilation unit will not be included in vmlinux.
This breaks modules that do reference those symbols.
This doesn't fix all cases. There are others, e.g. udivsi3.
This is also not limited to sh, many architectures handle this in the same
way.
A simple solution is to unconditionally include all helper functions.
A more complex solution is to make the choice of "lib-y" or "obj-y" depend
on CONFIG_MODULES:
obj-$(CONFIG_MODULES) += ...
lib-y($CONFIG_MODULES) += ...
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Paul Mundt <lethal@linux-sh.org> Tested-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com> Reviewed-by: Nobuhiro Iwamatsu <nobuhiro.iwamatsu.yj@renesas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This happens because the futex test in futex_init() lacks a switch to the
USER_DS address space, while cmpxchg_futex_value_locked() and
futex_atomic_cmpxchg_inatomic() operate on userspace pointers (albeit NULL
for this particular test).
Fix this by switching to USER_DS before running the test, and restoring
the old address space afterwards.
Bisected by Finn Thain.
Reported-by: Tuxist <tuxist@tuxist.de> Reported-by: Patrick McCarthy <patrickjmc@gmail.com> Suggested-by: Andreas Schwab <schwab@linux-m68k.org> Tested-by: Finn Thain <fthain@telegraphics.com.au> Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Darren Hart <dvhart@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>