Bob Liu [Tue, 5 Nov 2013 05:55:36 +0000 (16:55 +1100)]
mm: thp: khugepaged: add policy for finding target node
Khugepaged will scan/free HPAGE_PMD_NR normal pages and replace them with a
hugepage which is allocated from the node of the first scanned normal
page, but this policy is too rough and may end with unexpected results for
upper users.
The problem is that the original page-balancing among all nodes will be
broken after khugepaged starts. Consider the case where the first scanned
normal page is allocated from node A while most of the other scanned normal
pages are allocated from node B or C. Khugepaged will still always allocate
the hugepage from node A, which causes extra memory pressure on node A
that was not there before khugepaged started.
This patch tries to fix this problem by making khugepaged allocate the
hugepage from the node which has the max record of scanned normal pages hit,
so that the effect on the original page-balancing can be minimized.
The other problem is that if the scanned normal pages are equally allocated
from Node A, B and C, Node A will still suffer extra memory pressure after
khugepaged starts.
Andrew Davidoff reported a related issue several days ago. He wanted his
application interleaving among all nodes and "numactl --interleave=all
./test" was used to run the testcase, but the result was not as
expected.
cat /proc/2814/numa_maps:
7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435 N3=50098
The end result showed that most pages are from Node3 instead of being
interleaved among node0-3, which was unreasonable.
This patch also fixes this issue by allocating the hugepage round robin from
all nodes that have the same record. After this patch the result was as
expected:
7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235 N3=12722
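The target-node policy described in this commit message (take the node with the most scanned-page hits, and rotate among nodes that tie) could look roughly like the user-space sketch below. It is an illustration only, not the patch itself: find_target_node(), NO_NODE, and the load array are hypothetical names, and the caller is assumed to keep the previous choice in *last.

```c
#define NO_NODE (-1)	/* hypothetical marker: no node has been chosen yet */

/*
 * Pick the node with the highest count of scanned pages. When several
 * nodes share that maximum, rotate among them using *last, which holds
 * the node chosen by the previous call (NO_NODE before the first call).
 */
int find_target_node(const int *load, int nr_nodes, int *last)
{
	int nid, target = 0, max_value = 0;

	/* First pass: the lowest-numbered node with the maximum hit count. */
	for (nid = 0; nid < nr_nodes; nid++) {
		if (load[nid] > max_value) {
			max_value = load[nid];
			target = nid;
		}
	}

	/* Second pass: if that node was used already, try the next tied one. */
	if (target <= *last) {
		for (nid = *last + 1; nid < nr_nodes; nid++) {
			if (load[nid] == max_value) {
				target = nid;
				break;
			}
		}
	}

	*last = target;
	return target;
}
```

With equal loads on four nodes, successive calls walk through nodes 0, 1, 2, 3 and then wrap to 0; with one clearly dominant node, every call returns that node.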
The simple testcase is like this:
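#include <stdio.h>
#include <stdlib.h>

int main() {
	char *p;
	int i;
	int j;

	for (i=0; i < 200; i++) {
		/* allocate 1 MiB per iteration */
		p = (char *)malloc(1048576);
		printf("malloc done\n");
		if (p == 0) {
			printf("Out of memory\n");
			return 1;
		}
		/* touch every byte so the pages are actually faulted in */
		for (j=0; j < 1048576; j++) {
			p[j] = 'A';
		}
		printf("touched memory\n");
	}
	return 0;
}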
Reported-by: Andrew Davidoff <davidoff@qedmf.net> Tested-by: Andrew Davidoff <davidoff@qedmf.net> Signed-off-by: Bob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bob Liu [Tue, 5 Nov 2013 05:55:35 +0000 (16:55 +1100)]
mm: thp: cleanup: mv alloc_hugepage to better place
Move alloc_hugepage() to a better place; there is no need for a separate
#ifndef CONFIG_NUMA block.
Signed-off-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andrew Davidoff <davidoff@qedmf.net> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wanpeng Li [Tue, 5 Nov 2013 05:55:34 +0000 (16:55 +1100)]
revert mm/vmalloc.c: emit the failure message before return
Don't warn twice, in __vmalloc_area_node() and in __vmalloc_node_range(), if
the allocation in __vmalloc_area_node() fails. This patch reverts commit 46c001a2 ("mm/vmalloc.c: emit the failure message before return").
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wanpeng Li [Tue, 5 Nov 2013 05:55:34 +0000 (16:55 +1100)]
mm/vmalloc: revert "mm/vmalloc.c: check VM_UNINITIALIZED flag in s_show instead of show_numa_info"
The VM_UNINITIALIZED/VM_UNLIST flag introduced by f5252e00 ("mm:
avoid null pointer access in vm_struct via /proc/vmallocinfo") is used to
avoid accessing the pages field with unallocated pages when
show_numa_info() is called. This patch moves the check to just before
show_numa_info() so that some messages can still be dumped via
/proc/vmallocinfo. This patch reverts commit d157a558 ("mm/vmalloc.c: check
VM_UNINITIALIZED flag in s_show instead of show_numa_info").
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The assumption that !VM_VM_AREA represents a vm_map_ram allocation was
introduced by d4033afd ("mm, vmalloc: iterate vmap_area_list, instead of
vmlist, in vmallocinfo()"). However, !VM_VM_AREA also represents a vmap_area
that is being torn down in the race window mentioned above. This patch fixes
it by not dumping any information for the !VM_VM_AREA case and also removes
the (VM_LAZY_FREE | VM_LAZY_FREEING) check since those flags are not possible
in the !VM_VM_AREA case.
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Tue, 5 Nov 2013 05:55:32 +0000 (16:55 +1100)]
mm, mempolicy: make mpol_to_str robust and always succeed
mpol_to_str() should not fail. Currently, it either fails because the
string buffer is too small or because a string hasn't been defined for a
mempolicy mode.
If a new mempolicy mode is introduced and no string is defined for it,
just warn and return "unknown".
If the buffer is too small, just truncate the string and return, the same
behavior as snprintf().
This also fixes a bug where there was no NUL-byte termination when doing
*p++ = '=' and *p++ = ':' and maxlen has been reached.
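The "never fail, truncate like snprintf()" behaviour described above can be illustrated with a small user-space sketch. Everything in it is an assumption made for illustration: policy_to_str(), the enum, and the name table are hypothetical and are not the kernel's mpol_to_str() or its mode definitions.

```c
#include <stdio.h>

/* Hypothetical policy modes for illustration; not the kernel's enum. */
enum policy_mode { MODE_DEFAULT, MODE_PREFER, MODE_BIND, MODE_INTERLEAVE, MODE_MAX };

static const char *const mode_names[MODE_MAX] = {
	[MODE_DEFAULT]    = "default",
	[MODE_PREFER]     = "prefer",
	[MODE_BIND]       = "bind",
	[MODE_INTERLEAVE] = "interleave",
};

/*
 * Format "mode" or "mode:nodes" into buf. It cannot fail: a mode without
 * a defined string becomes "unknown", and output that does not fit is
 * truncated by snprintf(), which always NUL-terminates when maxlen > 0.
 */
void policy_to_str(char *buf, size_t maxlen, int mode, const char *nodes)
{
	const char *name = "unknown";

	if (mode >= 0 && mode < MODE_MAX && mode_names[mode])
		name = mode_names[mode];

	if (nodes && *nodes)
		snprintf(buf, maxlen, "%s:%s", name, nodes);
	else
		snprintf(buf, maxlen, "%s", name);
}
```

Because every byte goes through snprintf(), there is no separate code path that appends a single character without checking the remaining space.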
Signed-off-by: David Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Chen Gang <gang.chen@asianux.com> Cc: Rik van Riel <riel@redhat.com> Cc: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chen Gong pointed out that set/unset_migratetype_isolate() was done in
different functions in mm/memory-failure.c, which makes the code less
readable/maintainable. So this patch does it in soft_offline_page().
With this patch, we get to hold lock_memory_hotplug() longer but it's not
a problem because races between memory hotplug and soft offline are very
rare.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Chen, Gong <gong.chen@linux.intel.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Toshi Kani [Tue, 5 Nov 2013 05:55:30 +0000 (16:55 +1100)]
cpu/mem hotplug: add try_online_node() for cpu_up()
cpu_up() has #ifdef CONFIG_MEMORY_HOTPLUG code blocks, which call
mem_online_node() to put its node online if offlined and then call
build_all_zonelists() to initialize the zone list. These steps are
specific to memory hotplug, and should be managed in mm/memory_hotplug.c.
lock_memory_hotplug() should also be held across all of these steps.
For this reason, this patch replaces mem_online_node() with
try_online_node(), which performs all of these steps with
lock_memory_hotplug() held. try_online_node() is named after
try_offline_node() as they have a similar purpose.
Robin Holt [Tue, 5 Nov 2013 05:55:30 +0000 (16:55 +1100)]
mm/nobootmem.c: have __free_pages_memory() free in larger chunks.
On large memory machines it can take a few minutes to get through
free_all_bootmem().
Currently, when free_all_bootmem() calls __free_pages_memory(), the number
of contiguous pages that __free_pages_memory() passes to the buddy
allocator is limited to BITS_PER_LONG. BITS_PER_LONG was originally
chosen to keep things similar to mm/nobootmem.c. But it is more efficient
to limit it to MAX_ORDER.
        base    new     change
8TB     202s    172s    30s
16TB    401s    351s    50s
That is around 1%-3% improvement on total boot time.
This patch was spun off from the boot time rfc Robin and I had been
working on.
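As a rough illustration of freeing a page range in the largest chunks that alignment and MAX_ORDER allow, here is a user-space sketch. It is not the patch: free_range(), the callback, and MAX_ORDER_SKETCH (set to 11 here as an assumed value) are hypothetical, and page frame numbers are just plain integers.

```c
#define MAX_ORDER_SKETCH 11	/* assumed: largest chunk is 2^(11-1) pages */

/* Index of the lowest set bit; pfn must be non-zero. */
static int lowest_set_bit(unsigned long pfn)
{
	int bit = 0;

	while (!(pfn & 1UL)) {
		pfn >>= 1;
		bit++;
	}
	return bit;
}

/*
 * Walk [start, end) and hand out naturally aligned power-of-two chunks,
 * each as large as the alignment of start, MAX_ORDER_SKETCH and the
 * remaining length allow. Returns the number of chunks handed out.
 * free_chunk may be a null pointer if only the count is wanted.
 */
unsigned long free_range(unsigned long start, unsigned long end,
			 void (*free_chunk)(unsigned long pfn, int order))
{
	unsigned long chunks = 0;

	while (start < end) {
		int order = MAX_ORDER_SKETCH - 1;

		/* pfn 0 is aligned to every order, so only check non-zero pfns */
		if (start != 0 && lowest_set_bit(start) < order)
			order = lowest_set_bit(start);
		/* shrink the chunk until it no longer runs past the end */
		while (start + (1UL << order) > end)
			order--;
		if (free_chunk)
			free_chunk(start, order);
		start += 1UL << order;
		chunks++;
	}
	return chunks;
}
```

Fewer, larger chunks mean fewer calls into the allocator for the same range, which is the effect the timings above are measuring.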
Signed-off-by: Robin Holt <robin.m.holt@gmail.com> Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Cc: Robin Holt <robinmholt@linux.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mike Travis <travis@sgi.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xishi Qiu [Tue, 5 Nov 2013 05:55:28 +0000 (16:55 +1100)]
mm/memory_hotplug.c: rename the function is_memblock_offlined_cb()
A return value of 1 from is_memblock_offlined() means the memory block is
offlined, but is_memblock_offlined_cb() returning 1 means the memory block is
not offlined. This is confusing, so rename the function.
Jianguo Wu [Tue, 5 Nov 2013 05:55:26 +0000 (16:55 +1100)]
mm/huge_memory.c: fix stale comments of transparent_hugepage_flags
Since commit 13ece886d9 ("thp: transparent hugepage config choice"),
transparent hugepage support is disabled by default, and
TRANSPARENT_HUGEPAGE_ALWAYS is configured when TRANSPARENT_HUGEPAGE=y.
And since commit d39d33c332 ("thp: enable direct defrag"), defrag is
enabled for all transparent hugepage page faults by default, not only in
MADV_HUGEPAGE regions.
Signed-off-by: Jianguo Wu <wujianguo@huawei.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xishi Qiu [Tue, 5 Nov 2013 05:55:24 +0000 (16:55 +1100)]
mm/arch: use __free_reserved_page() to simplify the code
Use __free_reserved_page() to simplify the code in arch.
split_page() is used in consistent_alloc()/__dma_alloc_coherent()/dma_alloc_coherent(),
so page->_count == 1, and we can free the page safely.
__free_reserved_page()
	ClearPageReserved()
	init_page_count()	// it won't change the value
	__free_page()
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jerome Marchand [Tue, 5 Nov 2013 05:55:24 +0000 (16:55 +1100)]
mm/compaction.c: update comment about zone lock in isolate_freepages_block
Since commit f40d1e4 ("mm: compaction: acquire the zone->lock as late as
possible"), isolate_freepages_block() takes the zone->lock itself. The
function description however still states that the zone->lock must be
held.
Joe Perches [Tue, 5 Nov 2013 05:55:23 +0000 (16:55 +1100)]
ksm: Remove redundant __GFP_ZERO from kcalloc
kcalloc returns zeroed memory. There's no need to use this flag.
Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sasha Levin [Tue, 5 Nov 2013 05:55:22 +0000 (16:55 +1100)]
watchdog: trigger all-cpu backtrace when locked up and going to panic
Send an NMI to all CPUs when a lockup is detected and the lockup watchdog
code is configured to panic. This gives us a fairly up-to-date snapshot of
all CPUs in the system.
This lets us get stack trace of all CPUs which makes life easier trying to
debug a deadlock, and the NMI doesn't change anything since the next step
is a kernel panic.
Oleg Nesterov [Tue, 5 Nov 2013 05:55:21 +0000 (16:55 +1100)]
anon_inodefs: forbid open via /proc
open("/proc/pid/$anon-fd") should fail; we can't create the new file with
the correct f_op/etc. Currently this creates a bogus file with
the empty anon_inode_fops, which is harmless but still wrong and
misleading.
Add anon_inode_fops->anon_open() which simply returns ENXIO like
sock_no_open() does in this case.
Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Josh Hunt [Tue, 5 Nov 2013 05:55:20 +0000 (16:55 +1100)]
block: restore /proc/partitions to not display non-partitionable removable devices
We found with newer kernels we started seeing the cdrom device showing
up in /proc/partitions, but it was not there before.
Looking into this I found that commit d27769ec ("block: add
GENHD_FL_NO_PART_SCAN") introduces this change in behavior. It's not
clear to me from the commit's changelog if this change was intentional or
not. This comment still remains: /* Don't show non-partitionable
removeable devices or empty devices */ so I've decided to send a patch to
restore the behavior of not printing unpartitionable removable devices.
Signed-off-by: Josh Hunt <johunt@akamai.com> Cc: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
block: do not call sector_div() with a 64-bit divisor
do_div() (called by sector_div() if CONFIG_LBDAF=y) is meant for divisions
of a 64-bit number by a 32-bit number. Passing 64-bit divisor types caused
issues in the past on 32-bit platforms, cf. commit ea077b1b96e073ea
("m68k: Truncate base in do_div()").
As queue_limits.max_discard_sectors and .discard_granularity are unsigned
int, max_discard_sectors and granularity should be unsigned int. As
bdev_discard_alignment() returns int, alignment should be int. Now 2
calls to sector_div() can be replaced by 32-bit arithmetic:
- The 64-bit modulo operation can become a 32-bit modulo operation,
- The 64-bit division and multiplication can be replaced by a 32-bit
modulo operation and a subtraction.
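The second replacement in the list above relies on the identity (x / g) * g == x - x % g. A minimal sketch with unsigned int operands follows; the function names are hypothetical and this is not the code from the patch.

```c
/* Round down to a multiple of granularity with a division and a multiplication. */
unsigned int round_down_div(unsigned int sectors, unsigned int granularity)
{
	return (sectors / granularity) * granularity;
}

/* Same result with one modulo operation and one subtraction. */
unsigned int round_down_mod(unsigned int sectors, unsigned int granularity)
{
	return sectors - sectors % granularity;
}
```

Both functions assume granularity is non-zero. With 32-bit operands neither form needs do_div() or sector_div() at all.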
drivers/block/cciss.c: return 0 from driver probe function on success, not 1
A return value of 1 is interpreted as an error.
Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hpsa: return 0 from driver probe function on success, not 1
A return value of 1 is interpreted as an error. See local_pci_probe() in
drivers/pci/pci-driver.c. If you're wondering how this ever could have worked,
it's because it used to be the case that only return values less than zero
were interpreted as failure. But even in the current kernel if the driver
registers its various entry points with the kernel, and then returns a
value which is interpreted as failure, those registrations aren't undone,
so the driver still mostly works. However, the driver's remove function
wouldn't be called on rmmod, and pci power management functions wouldn't
work. In the case of Smart Array, since it has a battery backed cache (or
else no cache) even if the driver is not shut down properly as long as
there is no outstanding i/o, nothing too bad happens, which is why it took
so long to notice.
Requesting backport to stable because the change to pci-driver.c which
requires driver probe functions to return 0 occurred between 2.6.35 and
2.6.36 (the pci power management breakage) and again between 3.7 and 3.8
(pci_dev->driver getting set to NULL in local_pci_probe() preventing
driver remove function from being called on rmmod.)
Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The driver core clears the driver data to NULL after device_release or on
probe failure. Thus, it is not needed to manually clear the device driver
data to NULL.
Signed-off-by: Jingoo Han <jg1.han@samsung.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gu Zheng [Tue, 5 Nov 2013 05:55:16 +0000 (16:55 +1100)]
fs/bio-integrity.c: remove duplicated code
Most of the code in bio_integrity_verify() and bio_integrity_generate() is
the same, so introduce a common function bio_integrity_generate_verify()
to remove the duplicated code.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
do_div() is meant for divisions of a 64-bit number by a 32-bit number.
Passing 64-bit divisor types caused issues in the past on 32-bit
platforms, cf. commit ea077b1b96e073eac5c ("m68k: Truncate base in
do_div()").
As scsi_device.sector_size is unsigned (int), factor should be unsigned
int, too.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Tue, 5 Nov 2013 05:55:14 +0000 (16:55 +1100)]
mm/readahead.c:do_readahead(): don't check for ->readpage
The callee force_page_cache_readahead() already does this and unlike
do_readahead(), force_page_cache_readahead() remembers to check for
->readpages() as well.
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
drivers/pci/pci-driver.c: warn on driver probe return value greater than zero
Ages ago, drivers could return values greater than zero from their probe
function and this would be regarded as success. Commit f3ec4f87d607f40497
"PCI: change device runtime PM settings for probe and remove" slightly
altered this in 2010, and commit 967577b062417b4e4b8e27b ("PCI/PM: Keep
runtime PM enabled for unbound PCI devices") in late 2012 altered it more
significantly, setting pci_dev->driver to NULL if the driver's probe
function returned a value greater than zero, which would for example
prevent the driver's remove function from being called on rmmod.
Neither of those changes would necessarily make the driver fail in an
obvious way though, and so at least a couple drivers (cciss, hpsa) fell
into this hole since they were returning 1, and this situation went
unnoticed for quite some time.
If a driver's probe function returns a value greater than zero, issue a
warning, but otherwise treat this as success.
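The policy in the last paragraph can be modelled in user space as follows. This is only a sketch of the described behaviour, not the code in pci-driver.c; normalize_probe_result() and its warning text are hypothetical.

```c
#include <stdio.h>

/*
 * Model of the described policy: a negative value is an error and is
 * passed through, zero is success, and a positive value produces a
 * warning but is then treated as success (0).
 */
int normalize_probe_result(int rc, const char *driver_name)
{
	if (rc < 0)
		return rc;
	if (rc > 0)
		fprintf(stderr,
			"warning: driver %s probe returned %d, expected 0 or a negative error\n",
			driver_name, rc);
	return 0;
}
```

A driver that returns 1, as cciss and hpsa did, would therefore keep working while the warning points at the probe function that needs fixing.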
Signed-off-by: Stephen M. Cameron <scameron@beardog.cce.hp.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Tue, 5 Nov 2013 05:55:13 +0000 (16:55 +1100)]
ocfs2: update inode size after zeroing the hole
fs-writeback will release, without taking the page lock, dirty pages whose
offsets are beyond the inode size; the release happens at
block_write_full_page_endio(). If the inode size is not updated, dirty pages
in file holes may be released before they are flushed to the disk, and the
file holes will then contain some non-zero data, which causes a sparse file
md5sum error.
To reproduce the bug, find a big sparse file with many holes, like a vm
image file. Its actual size should be bigger than the available memory size
to make writeback run more frequently. Tar it with the -S option, then
untar it and check its md5sum again and again until you get a wrong
md5sum.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Younger Liu <younger.liu@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Younger Liu [Tue, 5 Nov 2013 05:55:13 +0000 (16:55 +1100)]
ocfs2: fix issue that ocfs2_setattr() does not deal with new_i_size==i_size
The issue scenario is as follows:
- Create a small file and fallocate a large disk space for the file with
the FALLOC_FL_KEEP_SIZE option.
- ftruncate the file back to the original size again, but the disk free
space is not changed back. This is a real bug that is fixed in this
patch.
In order to solve the issue above, we modified ocfs2_setattr(): if
attr->ia_size != i_size_read(inode), it calls ocfs2_truncate_file() and
truncates disk space to attr->ia_size.
Signed-off-by: Younger Liu <younger.liu@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Tested-by: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Reviewed-by: Jensen <shencanquan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jensen [Tue, 5 Nov 2013 05:55:12 +0000 (16:55 +1100)]
ocfs2: llseek requires ocfs2 inode lock for the file in SEEK_END
llseek requires the ocfs2 inode lock for updating the file size in SEEK_END,
because the file size may be updated on another node.
This bug can be reproduced with the following scenario: at first, we dd a
test fileA, the file size is 10k.
on NodeA:
---------
1) open the test fileA, lseek to the end of file, and print the position.
2) close the test fileA
on NodeB:
---------
1) open the test fileA, append 5k of data to test fileA.
2) lseek to the end of file, and print the position.
3) close file.
At first we run the test program1 on NodeA, the result is 10k. And then
run the test program2 on NodeB, the result is 15k. At last, we run the
test program1 on NodeA again, the result is 10k.
After applying this patch the result of the third step is 15k.
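The NodeA steps (open, lseek to the end, report the position, close) could be written as the sketch below. The original test programs are not shown in this log, so end_position() is a hypothetical helper and only standard POSIX calls are used.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Open path, seek to the end, and return that position (-1 on error). */
off_t end_position(const char *path)
{
	int fd = open(path, O_RDONLY);
	off_t pos;

	if (fd < 0)
		return -1;
	/* With SEEK_END and offset 0 the result is the file size as seen here. */
	pos = lseek(fd, 0, SEEK_END);
	close(fd);
	return pos;
}
```

On a local filesystem this always returns the current size; the bug described above is about a cluster node returning a stale size after another node has appended to the file.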
Signed-off-by: Jensen <shencanquan@huawei.com> Cc: Jie Liu <jeff.liu@oracle.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Younger Liu [Tue, 5 Nov 2013 05:55:11 +0000 (16:55 +1100)]
ocfs2: should call ocfs2_journal_access_di() before ocfs2_delete_entry() in ocfs2_orphan_del()
While deleting a file from the orphan dir in ocfs2_orphan_del(), it calls
ocfs2_delete_entry() before ocfs2_journal_access_di(). If
ocfs2_delete_entry() succeeded and ocfs2_journal_access_di() failed, there
would be an inconsistency: the file is deleted from the orphan dir, but the
orphan dir dinode is not updated.
So we need to call ocfs2_journal_access_di() before ocfs2_delete_entry().
Signed-off-by: Younger Liu <younger.liu@huawei.com> Reviewed-by: Jensen <shencanquan@huawei.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ocfs2: use the new DLM operation callbacks while requesting new lockspace
Attempt to use the new DLM operations. If it is not supported, use the
traditional ocfs2_controld.
To exchange ocfs2 versioning, we use the LVB of the version dlm lock. It
first attempts to take the lock in EX mode (non-blocking). If successful
(which means it is the first mount), it writes the version number and
downconverts to PR lock. If it is unsuccessful, it reads the version from
the lock.
If this becomes the standard (with o2cb as well), it could simplify
userspace tools to check if the filesystem is mounted on other nodes.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ocfs2: pass ocfs2_cluster_connection to ocfs2_this_node
This is done to differentiate between using and not using controld and use
the connection information accordingly. We need to be backward
compatible. So, we use a new enum ocfs2_connection_type to identify when
controld is used and when it is not.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ocfs2: shift allocation ocfs2_live_connection to user_connect()
We perform this because the DLM recovery callbacks will require the
ocfs2_live_connection structure to record the node information when
dlm_new_lockspace() is updated.
[AKPM] rc initialization is not required because it is assigned in case of
errors. It will be cleared by the compiler anyway.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
These are the callbacks called by the fs/dlm code in case the membership
changes. If there is a failure while/during calling any of these, the DLM
creates a new membership and relays to the rest of the nodes.
recover_prep() is called when DLM understands a node is down.
recover_slot() is called once all nodes have acknowledged recover_prep and
recovery can begin. recover_done() is called once the recovery is
complete. It returns the new membership.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This is an effort to remove ocfs2_controld.pcmk and bring ocfs2 DLM
handling up to date with respect to DLM (>=4.0.1) and corosync
(2.3.x). AFAIK, cman is also being phased out in favor of a unified corosync
cluster stack.
fs/dlm performs all the functions with respect to fencing and node
management and provides the API's to do so for ocfs2. For all future
references, DLM stands for fs/dlm code.
The advantages are:
+ No need to run an additional userspace daemon (ocfs2_controld)
+ No controld device handling and controld protocol
+ Shifting responsibilities of node management to DLM layer
For backward compatibility, we are keeping the controld handling code.
Once enough time has passed we can remove a significant portion of the
code.
This feature requires modification in the userspace ocfs2-tools. The
changes can be found at: https://github.com/goldwynr/ocfs2-tools branch:
nocontrold
Currently, not many checks are present in the userspace code,
but that will change soon.
These changes were developed on linux-stable 3.11.y. However, the changes
are applicable to the current upstream as well. If you wish to give the
entire kernel a spin, the link is:
Xue jiufei [Tue, 5 Nov 2013 05:55:08 +0000 (16:55 +1100)]
ocfs2: fix possible double free in ocfs2_write_begin_nolock
When ocfs2_write_cluster_by_desc() fails in ocfs2_write_begin_nolock()
because of ENOSPC, it goes to out_quota, freeing data_ac (meta_ac). Then
it calls ocfs2_try_to_free_truncate_log() to free space. If enough space
is freed, it will try to write again. Unfortunately, if some error happens
before ocfs2_lock_allocators(), it goes to out and frees data_ac (meta_ac)
again.
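The shape of this bug, and one common way to avoid it, can be shown with a user-space sketch. It is not the ocfs2 code and may not match how the patch fixes it: struct alloc_ctx, release_ctx(), write_with_retry() and the simulated errors are all hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for an allocation context such as data_ac. */
struct alloc_ctx { int unused; };

int frees;	/* counts release calls so the sketch can be checked */

static void release_ctx(struct alloc_ctx *ac)
{
	free(ac);
	frees++;
}

/*
 * The first attempt fails with ENOSPC after allocating the context; the
 * retry fails before it allocates again. Resetting the pointer after the
 * first release keeps the common exit path from releasing it twice.
 */
int write_with_retry(void)
{
	struct alloc_ctx *ac = NULL;
	int tried_to_free_space = 0;
	int ret;

try_again:
	if (tried_to_free_space) {
		ret = -EIO;		/* simulated early error on the retry */
		goto out;
	}
	ac = malloc(sizeof(*ac));
	if (!ac)
		return -ENOMEM;
	ret = -ENOSPC;			/* simulated failure of the write step */

	/* cleanup comparable to the out_quota label */
	release_ctx(ac);
	ac = NULL;			/* forget the freed pointer */

	if (ret == -ENOSPC && !tried_to_free_space) {
		tried_to_free_space = 1;
		goto try_again;
	}
out:
	if (ac)
		release_ctx(ac);
	return ret;
}
```

Without the ac = NULL line, the out label would release the same pointer a second time on the retry path.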
Signed-off-by: joyce <xuejiufei@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Younger Liu [Tue, 5 Nov 2013 05:55:07 +0000 (16:55 +1100)]
ocfs2: add missing errno in ocfs2_ioctl_move_extents()
If the file is not regular or writeable, it should return errno(EPERM).
This patch is based on 85a258b70d ("ocfs2: fix error handling in
ocfs2_ioctl_move_extents()").
Signed-off-by: Younger Liu <younger.liu@huawei.com> Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Younger Liu [Tue, 5 Nov 2013 05:55:07 +0000 (16:55 +1100)]
ocfs2: do not call brelse() if group_bh is not initialized in ocfs2_group_add()
If group_bh is not initialized, there is no need to release. This problem
does not cause anything wrong, but the patch would make the code more
logical.
Signed-off-by: Younger Liu <younger.liu@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Younger Liu [Tue, 5 Nov 2013 05:55:06 +0000 (16:55 +1100)]
ocfs2: rollback transaction in ocfs2_group_add()
If ocfs2_journal_access_di() fails, group->bg_next_group should be rolled back.
Otherwise, there would be an inconsistency between group_bh and main_bm_bh.
Signed-off-by: Younger Liu <younger.liu@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Acked-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Tue, 5 Nov 2013 05:55:06 +0000 (16:55 +1100)]
ocfs2: break useless while loop
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Akinobu Mita [Tue, 5 Nov 2013 05:55:05 +0000 (16:55 +1100)]
ocfs2: use find_last_bit()
We already have find_last_bit(). So just use it as described in the
comment.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xue jiufei [Tue, 5 Nov 2013 05:55:05 +0000 (16:55 +1100)]
ocfs2: delay migration when the lockres is in migration state
We trigger a bug in __dlm_lockres_reserve_ast() when we umount 4
nodes in parallel. The situation is as follows:
1) Node A migrates all lockres it owns (e.g. lockres A) to other nodes,
say node B, when it umounts.
2) Receiving the MIG_LOCKRES message from A, Node B masters lockres A
with the DLM_LOCK_RES_MIGRATING state set.
3) Then we umount ocfs2 on node B. It also should migrate lockres A to
another node, say node C. But now, the DLM_LOCK_RES_MIGRATING state of
lockres A is not cleared. Node B triggers the BUG on the lockres with
state DLM_LOCK_RES_MIGRATING.
Signed-off-by: Xuejiufei <xuejiufei@huawei.com> Signed-off-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Tariq Saeed <tariq.x.saeed@oracle.com> Cc: Srinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Xue jiufei [Tue, 5 Nov 2013 05:55:04 +0000 (16:55 +1100)]
ocfs2: skip locks in the blocked list
A parallel umount on 4 nodes triggered a bug in
dlm_process_recovery_data(). Here's the situation:
Receiving a MIG_LOCKRES message, a node processes the locks in the migratable
lockres. It copies the lvb from the migratable lockres when processing the
first valid lock.
If there is a lock in the blocked list with the EX level, it triggers the
BUG. Since valid lvbs are set when locks are granted with EX or PR
levels, locks in the blocked list cannot have valid lvbs. Therefore I
think we should skip the locks in the blocked list.
Signed-off-by: Xuejiufei <xuejiufei@huawei.com> Signed-off-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Akinobu Mita [Tue, 5 Nov 2013 05:55:03 +0000 (16:55 +1100)]
ocfs2: use bitmap_weight()
Use bitmap_weight() instead of reinventing the wheel.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joel Becker [Tue, 5 Nov 2013 05:55:03 +0000 (16:55 +1100)]
ocfs2: don't spam on -EDQUOT
-EDQUOT is a user-visible error, not a logic problem. Teach mlog_errno()
to ignore it like it ignores -ENOSPC, etc.
Signed-off-by: Joel Becker <jlbec@evilplan.org> Reviewed-by: Jan Kara <jack@suse.cz> Reported-by: Marek Królikowski <admin@wset.edu.pl> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
We also need to add a check after calling sb_getblk() in ocfs2_create_xattr_block().
Cc: Rui Xiang <rui.xiang@huawei.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rui Xiang [Tue, 5 Nov 2013 05:55:02 +0000 (16:55 +1100)]
ocfs2: add necessary check in case sb_getblk() fails
sb_getblk() may fail, so add a check for bh.
Signed-off-by: Rui Xiang <rui.xiang@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joseph Qi [Tue, 5 Nov 2013 05:55:01 +0000 (16:55 +1100)]
ocfs2-return-enomem-when-sb_getblk-fails-update
ocfs2_symlink_get_block() in aops.c, and ocfs2_read_blocks_sync() and
ocfs2_read_blocks() in buffer_head_io.c, need the same change for
consistency.
Cc: Rui Xiang <rui.xiang@huawei.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Rui Xiang [Tue, 5 Nov 2013 05:55:01 +0000 (16:55 +1100)]
ocfs2: return ENOMEM when sb_getblk() fails
The only reason for sb_getblk() failing is if it can't allocate the
buffer_head. So return ENOMEM instead when it fails.
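A minimal userspace sketch of the pattern, assuming a hypothetical sketch_getblk() that, like sb_getblk(), returns NULL when it cannot allocate its buffer. All names here are invented for illustration and are not ocfs2 code.

```c
#include <errno.h>
#include <stdlib.h>

struct sketch_buffer {
    unsigned long long blocknr;
};

/* Hypothetical stand-in for sb_getblk(): NULL means the allocation failed. */
static struct sketch_buffer *sketch_getblk(unsigned long long blocknr,
                                           int simulate_failure)
{
    struct sketch_buffer *bh;

    if (simulate_failure)
        return NULL;
    bh = malloc(sizeof(*bh));
    if (bh)
        bh->blocknr = blocknr;
    return bh;
}

/* Caller: a NULL buffer is reported as -ENOMEM rather than another code. */
static int sketch_read_block(unsigned long long blocknr, int simulate_failure)
{
    struct sketch_buffer *bh = sketch_getblk(blocknr, simulate_failure);

    if (!bh)
        return -ENOMEM;

    /* ... the buffer would be filled and used here ... */
    free(bh);
    return 0;
}
```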
Signed-off-by: Rui Xiang <rui.xiang@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Junxiao Bi [Tue, 5 Nov 2013 05:55:00 +0000 (16:55 +1100)]
fs/ocfs2/file.c: fix wrong comment
Unwritten extents only exist on file systems which support holes. But
the comment said the opposite, and it was also not very clear, so
rephrase it.
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jamie Iles [Tue, 5 Nov 2013 05:54:58 +0000 (16:54 +1100)]
scripts/sortextable: support objects with more than 64K sections.
Building with a large config and -ffunction-sections results in a large
number of sections and sortextable needs to be able to handle that.
Implement support for > 64K sections as modpost does.
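For background, the ELF format handles large section counts through fields of section header 0: when e_shnum is 0 the real count is stored in sh_size, and when e_shstrndx is SHN_XINDEX the real string table index is stored in sh_link. The sketch below shows that lookup for 64-bit ELF using the types from <elf.h>; the function names are hypothetical and this is not the sortextable patch itself.

```c
#include <elf.h>
#include <stdint.h>

/*
 * Real number of sections. If e_shnum is 0 but a section header table
 * exists, the count lives in sh_size of section header 0.
 */
static uint64_t sketch_section_count(const Elf64_Ehdr *ehdr,
                                     const Elf64_Shdr *shdr0)
{
    if (ehdr->e_shnum == SHN_UNDEF && ehdr->e_shoff != 0)
        return shdr0->sh_size;
    return ehdr->e_shnum;
}

/*
 * Real index of the section name string table. SHN_XINDEX in e_shstrndx
 * means the index lives in sh_link of section header 0.
 */
static uint32_t sketch_shstrndx(const Elf64_Ehdr *ehdr,
                                const Elf64_Shdr *shdr0)
{
    if (ehdr->e_shstrndx == SHN_XINDEX)
        return shdr0->sh_link;
    return ehdr->e_shstrndx;
}
```

A tool that reads e_shnum and e_shstrndx directly sees 0 and 0xffff for such objects, which is why the indirection is needed.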
Signed-off-by: Jamie Iles <jamie.iles@oracle.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Michal Marek <mmarek@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Link input/leds.c along with input/input.c instead of as a separate module
input.c needs to call leds.c and vice-versa, so it is simpler to stuff
them together. INPUT_LEDS thus now depends on LEDS_CLASS being available
enough for input.ko.
This also documents the new leds field.
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Samuel Thibault [Tue, 5 Nov 2013 05:54:57 +0000 (16:54 +1100)]
input: route kbd LEDs through the generic LEDs layer
This allows reassigning keyboard LEDs to something other than the keyboard
"leds" state, by adding keyboard LED and modifier triggers connected to a
series of VT input LEDs, themselves connected to VT input triggers, which
per-input device LEDs use by default. Userland can thus easily change the
LED behavior of (a priori) all input devices, or of particular input
devices.
This also makes it possible to fix #7063 from userland by using a modifier
to implement proper CapsLock behavior and have the keyboard Caps Lock LED
show that modifier state.
[ebroder@mokafive.com: Rebased to 3.2-rc1 or so, cleaned up some includes, and fixed some constants] Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Signed-off-by: Evan Broder <evan@ebroder.net> Reviewed-by: David Herrmann <dh.herrmann@gmail.com> Tested-by: Pavel Machek <pavel@ucw.cz> Acked-by: Peter Korsgaard <jacmet@sunsite.dk> Cc: Pavel Machek <pavel@ucw.cz> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Bryan Wu <cooloney@gmail.com> Cc: Arnaud Patard <arnaud.patard@rtp-net.org> Cc: Sascha Hauer <s.hauer@pengutronix.de> Cc: Matt Sealey <matt@genesi-usa.com> Cc: Rob Clark <robdclark@gmail.com> Cc: Niels de Vos <devos@fedoraproject.org> Cc: Steev Klimaszewski <steev@genesi-usa.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zhao Hongjiang [Tue, 5 Nov 2013 05:54:55 +0000 (16:54 +1100)]
drivers/infiniband/core/cm.c: convert to using idr_alloc_cyclic()
Commit 3e6628c4b347 ("idr: introduce idr_alloc_cyclic()") adds a new
idr_alloc_cyclic() routine and converts several users to it. This one
was missed, so convert it too.
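As a rough illustration of cyclic allocation: the search for a free ID starts just after the most recently allocated one and wraps around, so a freed ID is not handed out again immediately. The fixed-size pool and every name below are assumptions made for this sketch; it does not use or reproduce the kernel IDR API.

```c
#include <stdbool.h>

#define SKETCH_MAX_IDS 8

struct sketch_cyclic_ids {
    bool used[SKETCH_MAX_IDS];
    int next;  /* where the next search starts */
};

/*
 * Returns an unused ID, searching from ids->next and wrapping around,
 * or -1 if every ID is in use.
 */
static int sketch_alloc_cyclic(struct sketch_cyclic_ids *ids)
{
    int i;

    for (i = 0; i < SKETCH_MAX_IDS; i++) {
        int id = (ids->next + i) % SKETCH_MAX_IDS;

        if (!ids->used[id]) {
            ids->used[id] = true;
            ids->next = (id + 1) % SKETCH_MAX_IDS;
            return id;
        }
    }
    return -1;
}

/* Releases an ID; the search position is deliberately left untouched. */
static void sketch_free_id(struct sketch_cyclic_ids *ids, int id)
{
    ids->used[id] = false;
}
```

With a lowest-free-first allocator, freeing ID 0 and allocating again would return 0; here the next allocation continues from the cursor instead.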
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com> Cc: Roland Dreier <roland@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stephen Boyd [Tue, 5 Nov 2013 05:54:54 +0000 (16:54 +1100)]
sched_clock: document 4Mhz vs 1Mhz decision
Bo Shen sent a patch to change this to 1Mhz instead of 4Mhz but according
to Russell King the use of 4Mhz was intentional. Add a comment to this
effect so that others don't try to change the code as well.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Bo Shen <voice.shen@atmel.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
genirq: correct fuzzy and fragile IRQ_RETVAL() definition
commit bedd30d986a0 ("genirq: make irqreturn_t an enum") blindly replaced
"0" by "IRQ_NONE" in the "IRQ_RETVAL(x)" macro definition.
However, as "x" is a condition, "0" meant "boolean false", not an
irqreturn_t value.
All of this worked, and kept working after the addition of IRQ_WAKE_THREAD,
as
- both "boolean false" and "IRQ_NONE" are "0" (for the comparison),
- "boolean true" and "boolean false" nicely map to the correct values of
"IRQ_HANDLED" and "IRQ_NONE" (for the return value).
Correct the macro definition for clarity and future-proofness.
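The difference can be sketched in userspace as follows. The enum mirrors the shape of irqreturn_t with SKETCH_ prefixes; the "old" macro follows the description above (a condition compared against IRQ_NONE), and the "new" macro is an assumed form of the correction that maps the condition explicitly onto the two intended values.

```c
/* Userspace copy of the shape of irqreturn_t, for illustration only. */
enum sketch_irqreturn {
    SKETCH_IRQ_NONE        = (0 << 0),
    SKETCH_IRQ_HANDLED     = (1 << 0),
    SKETCH_IRQ_WAKE_THREAD = (1 << 1),
};

/*
 * Fragile form: compares a boolean condition against an enum value and
 * yields a boolean, which only happens to equal the right enum values.
 */
#define SKETCH_IRQ_RETVAL_OLD(x) ((x) != SKETCH_IRQ_NONE)

/* Explicit form: selects one of the two intended enum values. */
#define SKETCH_IRQ_RETVAL(x) ((x) ? SKETCH_IRQ_HANDLED : SKETCH_IRQ_NONE)
```

Both forms produce the same numbers today, because IRQ_NONE is 0 and IRQ_HANDLED is 1; the explicit form simply stops relying on that coincidence.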
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Randy Dunlap [Tue, 5 Nov 2013 05:54:52 +0000 (16:54 +1100)]
nouveau: fix build error when VGA_SWITCHEROO is not enabled
Fix nouveau build error on x86, when ACPI is enabled but VGA_SWITCHEROO is
not enabled, by providing a stub function.
drivers/built-in.o: In function `nouveau_pmops_runtime_suspend':
nouveau_drm.c:(.text+0x3aac89): undefined reference to `nouveau_switcheroo_optimus_dsm'
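The general stub pattern looks roughly like this. SKETCH_CONFIG_FEATURE and the function names are hypothetical stand-ins, not the nouveau symbols, and the return type is chosen only to make the sketch easy to exercise.

```c
/* Hypothetical feature switch, standing in for a Kconfig option. */
/* #define SKETCH_CONFIG_FEATURE 1 */

#ifdef SKETCH_CONFIG_FEATURE
/* Real implementation lives in a file built only when the option is set. */
int sketch_feature_call(int arg);
#else
/* Stub: keeps callers compiling and linking when the feature is off. */
static inline int sketch_feature_call(int arg)
{
    (void)arg;
    return 0;
}
#endif

/* Caller needs no #ifdef of its own. */
static int sketch_caller(void)
{
    return sketch_feature_call(42);
}
```

Without the stub branch, the caller would reference a symbol that no object file defines, which is the kind of link failure quoted above.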
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: David Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jeff Mahoney [Tue, 5 Nov 2013 05:54:52 +0000 (16:54 +1100)]
drm/nouveau: make vga_switcheroo code depend on VGA_SWITCHEROO
Commit 8116188fdef594 ("nouveau/acpi: hook up to the MXM method for mux
switching.") broke the build on non-x86 architectures due to the new
dependency on MXM and MXM being an x86 platform driver.
It built previously since the vga switcheroo registration routines were
zeroed out on !X86. The code was built in but unused.
This patch makes all of the DSM code depend on CONFIG_VGA_SWITCHEROO,
allowing it to build on non-x86 and shrinking the module size as well.
Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: David Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Takashi Iwai [Tue, 5 Nov 2013 05:54:51 +0000 (16:54 +1100)]
drm/cirrus: correct register values for 16bpp
When the mode is set with 16bpp on QEMU, the output gets totally broken.
The culprit is the bogus register values set for 16bpp, which were likely
copied from the wrong place.
Daniel Vetter [Tue, 5 Nov 2013 05:54:50 +0000 (16:54 +1100)]
drm/fb-helper: don't sleep for screen unblank when an oops is in progress
Otherwise the system will burn even brighter and worse, leave the user
wondering what's going on exactly.
Since we already have a panic handler which will (try) to restore the
entire fbdev console mode, we can just bail out. Inspired by a patch from
Konstantin Khlebnikov. The callchain leading to this, cut&pasted from
Konstantin's original patch:
Note that the entire locking in the fb helper around panic/sysrq and kdbg
is ... non-existent. So we have a decent chance of blowing up
everything. But since reworking this ties in with funny concepts like the
fbdev notifier chain or the impressive things which happen around
console_lock while oopsing, I'll leave that as an exercise for braver
souls than me.
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Dave Airlie <airlied@gmail.com> Reviewed-by: Rob Clark <robdclark@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
drivers/media/platform/sh_veu.c: In function 'sh_veu_reg_read':
drivers/media/platform/sh_veu.c:228:2: error: implicit declaration of function 'ioread32' [-Werror=implicit-function-declaration]
drivers/media/platform/sh_veu.c: In function 'sh_veu_reg_write':
drivers/media/platform/sh_veu.c:234:2: error: implicit declaration of function 'iowrite32' [-Werror=implicit-function-declaration]
drivers/media/platform/vsp1/vsp1.h: In function 'vsp1_read':
drivers/media/platform/vsp1/vsp1.h:66:2: error: implicit declaration of function 'ioread32' [-Werror=implicit-function-declaration]
drivers/media/platform/vsp1/vsp1.h: In function 'vsp1_write':
drivers/media/platform/vsp1/vsp1.h:71:2: error: implicit declaration of function 'iowrite32' [-Werror=implicit-function-declaration]
drivers/media/platform/soc_camera/rcar_vin.c: In function 'rcar_vin_setup':
drivers/media/platform/soc_camera/rcar_vin.c:284:3: error: implicit declaration of function 'iowrite32' [-Werror=implicit-function-declaration]
drivers/media/platform/soc_camera/rcar_vin.c: In function 'rcar_vin_request_capture_stop':
drivers/media/platform/soc_camera/rcar_vin.c:353:2: error: implicit declaration of function 'ioread32' [-Werror=implicit-function-declaration]
Yet, they're available, as CONFIG_GENERIC_IOMAP is defined. What happens
is that asm/io.h was not including asm-generic/iomap.h.
Suggested-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com> Cc: Mikael Starvik <starvik@axis.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Gao feng [Tue, 5 Nov 2013 05:54:49 +0000 (16:54 +1100)]
kernel/audit.c: remove duplicated comment
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Richard Guy Briggs <rgb@redhat.com> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Matthew noticed that hugetlb doesn't participate in ASLR on x86-64. The
reason is the generic hugetlb_get_unmapped_area(), which is used on x86-64.
It doesn't support randomization and uses bottom-up unmapped area lookup
instead of the usual top-down lookup on x86-64.
x86 has an arch-specific hugetlb_get_unmapped_area(), but it's used only on
x86-32.
Let's use the arch-specific hugetlb_get_unmapped_area() on x86-64 too. It
fixes the issue and makes hugetlb use top-down unmapped area lookup.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tetsuo Handa [Tue, 5 Nov 2013 05:54:46 +0000 (16:54 +1100)]
kthread: make kthread_create() killable
Any user process calling wait_for_completion(), except the global init
process, might be chosen by the OOM killer while waiting for a complete()
call by some other process which does memory allocation. See
CVE-2012-4398 "kernel: request_module() OOM local DoS" for how this can
happen.
When such users are chosen by the OOM killer while they are waiting for
completion in TASK_UNINTERRUPTIBLE, the system will be kept stressed due
to memory starvation because the OOM killer cannot kill such users.
kthread_create() is one such user, and this patch fixes the problem for
kthreadd by making kthread_create() killable, the same approach used for
fixing CVE-2012-4398.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull networking fixes from David Miller:
"I'm sending a pull request of these lingering bug fixes for networking
before the normal merge window material because some of this stuff I'd
like to get to -stable ASAP"
1) cxgb3 stopped working on 32-bit machines, fix from Ben Hutchings.
2) Structures passed via netlink for netfilter logging are not fully
initialized. From Mathias Krause.
3) Properly unlink upper openvswitch device during notifications, from
Alexei Starovoitov.
4) Fix race conditions involving access to the IP compression scratch
buffer, from Michal Kubrecek.
5) We don't handle the expiration of MTU information contained in ipv6
routes sometimes, fix from Hannes Frederic Sowa.
6) With Fast Open we can miscompute the TCP SYN/ACK RTT, from Yuchung
Cheng.
7) Don't take TCP RTT sample when an ACK doesn't acknowledge new data,
also from Yuchung Cheng.
8) The decreased IPSEC garbage collection threshold causes problems for
some people, bump it back up. From Steffen Klassert.
9) Fix skb->truesize calculated by tcp_tso_segment(), from Eric
Dumazet.
10) flow_dissector doesn't validate packet lengths sufficiently, from
Jason Wang
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
net/mlx4_core: Fix call to __mlx4_unregister_mac
net: sctp: do not trigger BUG_ON in sctp_cmd_delete_tcb
net: flow_dissector: fail on evil iph->ihl
xfrm: Fix null pointer dereference when decoding sessions
can: kvaser_usb: fix usb endpoints detection
can: c_can: Fix RX message handling, handle lost message before EOB
doc:net: Fix typo in Documentation/networking
bgmac: don't update slot on skb alloc/dma mapping error
ibm emac: Fix locking for enable/disable eob irq
ibm emac: Don't call napi_complete if napi_reschedule failed
virtio-net: correctly handle cpu hotplug notifier during resuming
bridge: pass correct vlan id to multicast code
net: x25: Fix dead URLs in Kconfig
netfilter: xt_NFQUEUE: fix --queue-bypass regression
xen-netback: use jiffies_64 value to calculate credit timeout
cxgb3: Fix length calculation in write_ofld_wr() on 32-bit architectures
bnx2x: Disable VF access on PF removal
bnx2x: prevent FW assert on low mem during unload
tcp: gso: fix truesize tracking
xfrm: Increase the garbage collector threshold
...
In function mlx4_master_deactivate_admin_state() __mlx4_unregister_mac was
called using the MAC index. It should be called with the value of the MAC itself.
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 4 Nov 2013 05:48:00 +0000 (00:48 -0500)]
Merge branch 'fixes-for-3.12' of git://gitorious.org/linux-can/linux-can
Marc Kleine-Budde says:
====================
I have two late fixes for the v3.12 release:
The first patch fixes a problem in the c_can's RX message handling, which can
lead to an endless interrupt loop under heavy load if messages are lost. The
second patch is by Olivier Sobrie and fixes the endpoint detection of the
kvaser_usb driver, which is needed for some devices.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Borkmann [Thu, 31 Oct 2013 08:13:32 +0000 (09:13 +0100)]
net: sctp: do not trigger BUG_ON in sctp_cmd_delete_tcb
Introduced in f9e42b853523 ("net: sctp: sideeffect: throw BUG if
primary_path is NULL"), we intended to find a buggy assoc that's
part of the assoc hash table with a primary_path that is NULL.
However, we better remove the BUG_ON for now and find a more
suitable place to assert for these things as Mark reports that
this also triggers the bug when duplication cookie processing
happens, and the assoc is not part of the hash table (so all
good in this case). Such a situation can for example easily be
reproduced by:
tc qdisc add dev eth0 root handle 1: prio bands 2 priomap 1 1 1 1 1 1
tc qdisc add dev eth0 parent 1:2 handle 20: netem loss 20%
tc filter add dev eth0 protocol ip parent 1: prio 2 u32 match ip \
protocol 132 0xff match u8 0x0b 0xff at 32 flowid 1:2
This drops 20% of COOKIE-ACK packets. After some follow-up
discussion with Vlad we came to the conclusion that for now we
should still better remove this BUG_ON() assertion, and come up
with two follow-ups later on, that is, i) find a more suitable
place for this assertion, and possibly ii) have a special
allocator/initializer for such kind of temporary assocs.
Reported-by: Mark Thomas <Mark.Thomas@metaswitch.com> Signed-off-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Sun, 3 Nov 2013 19:36:41 +0000 (11:36 -0800)]
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull MIPS fixes from Ralf Baechle:
"Three fixes across arch/mips with the most complex one being the GIC
interrupt fix - at nine lines still not monster. I'm confident this
are the final MIPS patches even if there should go for an rc8"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MIPS: ralink: fix return value check in rt_timer_probe()
MIPS: malta: Fix GIC interrupt offsets
MIPS: Perf: Fix 74K cache map
Mathias Krause [Sun, 3 Nov 2013 11:36:28 +0000 (12:36 +0100)]
ipc, msg: forbid negative values for "msg{max,mnb,mni}"
Negative message lengths make no sense -- so don't do negative queue
lengths or identifier counts. Prevent them from getting negative.
Also change the underlying data types to be unsigned to avoid hairy
surprises with sign extensions in cases where those variables get
evaluated in unsigned expressions with bigger data types, e.g. size_t.
In case a user still wants to have "unlimited" sizes she could just use
INT_MAX instead.
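The kind of surprise meant here can be shown with a small userspace sketch. The function names are hypothetical and the checks are not taken from the ipc code; they only demonstrate what happens when a negative int limit is evaluated in a size_t expression.

```c
#include <stddef.h>

/*
 * Limit kept in a signed int. In a comparison with size_t the limit is
 * converted to size_t (made explicit here), so -1 becomes SIZE_MAX and
 * the check never fires.
 */
static int sketch_exceeds_signed(size_t len, int limit)
{
    return len > (size_t)limit;
}

/* Limit kept unsigned: a negative value cannot be stored to begin with. */
static int sketch_exceeds_unsigned(size_t len, unsigned int limit)
{
    return len > limit;
}
```

With a limit of -1, sketch_exceeds_signed() treats every length as acceptable, which silently turns the limit off; an explicit INT_MAX states the "unlimited" intent without that trap.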
Vineet Gupta [Sat, 2 Nov 2013 12:17:49 +0000 (17:47 +0530)]
ARC: Incorrect mm reference used in vmalloc fault handler
A vmalloc fault needs to sync up the PGD/PTE entry from init_mm to the
current task's "active_mm". The ARC vmalloc fault handler, however, was
using mm.
A vmalloc fault in a non-user task context (actually pre-userland, from
the init thread's open of /dev/console) caused the handler to dereference
a NULL mm (for mm->pgd).
The reason it worked so far is amazing:
1. By default (!SMP), vmalloc fault handler uses a cached value of PGD.
In SMP that MMU register is repurposed hence need for mm pointer deref.
2. In pre-3.12 SMP kernel, the problem triggering vmalloc didn't exist in
pre-userland code path - it was introduced with commit 20bafb3d23d108bc
"n_tty: Move buffers into n_tty_data"