Christopher Yeoh [Wed, 24 Aug 2011 23:46:41 +0000 (09:46 +1000)]
- Add x86_64 specific wire up
- Change behaviour so process_vm_readv and process_vm_writev return
the number of bytes successfully read or written even if an error
occurs
- Add more kernel doc interface comments
- rename some internal functions (process_vm_rw_check_iovecs,
process_vm_rw) so they make more sense.
- Add licence message
- Fix kernel-doc comment format
Still need to do benchmarking to see if the optimisation for small copies
using a local on-stack array in process_vm_rw_core is worth it.
Signed-off-by: Chris Yeoh <cyeoh@au1.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christopher Yeoh [Wed, 24 Aug 2011 23:46:40 +0000 (09:46 +1000)]
The basic idea behind cross memory attach is to allow MPI programs doing
intra-node communication to do a single copy of the message rather than a
double copy of the message via shared memory.
The following patch attempts to achieve this by allowing a destination
process, given an address and size from a source process, to copy memory
directly from the source process into its own address space via a system
call. There is also a symmetrical ability to copy from the current
process's address space into a destination process's address space.
- Use of /proc/pid/mem has been considered, but there are issues with
using it:
- Does not allow for specifying iovecs for both src and dest, assuming
preadv or pwritev was implemented either the area read from or
written to would need to be contiguous.
- Currently mem_read allows only processes who are currently
ptrace'ing the target and are still able to ptrace the target to read
from the target. This check could possibly be moved to the open call,
but its not clear exactly what race this restriction is stopping
(reason appears to have been lost)
- Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
domain socket is a bit ugly from a userspace point of view,
especially when you may have hundreds if not (eventually) thousands
of processes that all need to do this with each other
- Doesn't allow for some future use of the interface we would like to
consider adding in the future (see below)
- Interestingly reading from /proc/pid/mem currently actually
involves two copies! (But this could be fixed pretty easily)
As mentioned previously use of vmsplice instead was considered, but has
problems. Since you need the reader and writer working co-operatively if
the pipe is not drained then you block. Which requires some wrapping to
do non blocking on the send side or polling on the receive. In all to all
communication it requires ordering otherwise you can deadlock. And in the
example of many MPI tasks writing to one MPI task vmsplice serialises the
copying.
There are some cases of MPI collectives where even a single copy interface
does not get us the performance gain we could. For example in an
MPI_Reduce rather than copy the data from the source we would like to
instead use it directly in a mathops (say the reduce is doing a sum) as
this would save us doing a copy. We don't need to keep a copy of the data
from the source. I haven't implemented this, but I think this interface
could in the future do all this through the use of the flags - eg could
specify the math operation and type and the kernel rather than just
copying the data would apply the specified operation between the source
and destination and store it in the destination.
Although we don't have a "second user" of the interface (though I've had
some nibbles from people who may be interested in using it for intra
process messaging which is not MPI). This interface is something which
hardware vendors are already doing for their custom drivers to implement
fast local communication. And so in addition to this being useful for
OpenMPI it would mean the driver maintainers don't have to fix things up
when the mm changes.
There was some discussion about how much faster a true zero copy would
go. Here's a link back to the email with some testing I did on that:
There is a basic man page for the proposed interface here:
http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
This has been implemented for x86 and powerpc, other architecture should
mainly (I think) just need to add syscall numbers for the process_vm_readv
and process_vm_writev. There are 32 bit compatibility versions for
64-bit kernels.
For arch maintainers there are some simple tests to be able to quickly
verify that the syscalls are working correctly here:
Stephen Boyd [Wed, 24 Aug 2011 23:46:34 +0000 (09:46 +1000)]
Instead of open coding this function use kstrtoul_from_user() directly.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Doug Gilbert <dgilbert@interlog.com> Cc: Douglas Gilbert <dougg@torque.net> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jesper Juhl [Wed, 24 Aug 2011 23:46:32 +0000 (09:46 +1000)]
We leak in drivers/scsi/aacraid/commctrl.c::aac_send_raw_srb() :
We allocate memory:
...
struct user_sgmap* usg;
usg = kmalloc(actual_fibsize - sizeof(struct aac_srb)
+ sizeof(struct sgmap), GFP_KERNEL);
and then neglect to free it:
...
for (i = 0; i < usg->count; i++) {
u64 addr;
void* p;
if (usg->sg[i].count >
((dev->adapter_info.options &
AAC_OPT_NEW_COMM) ?
(dev->scsi_host_ptr->max_sectors << 9) :
65536)) {
rcode = -EINVAL;
goto cleanup;
... this 'goto' makes 'usg' go out of scope and leak the memory we
allocated.
Other exits properly kfree(usg), it's just here it is neglected.
Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Randy Dunlap [Wed, 24 Aug 2011 23:46:31 +0000 (09:46 +1000)]
Fix sparse warnings of right shift bigger than source value size:
drivers/scsi/megaraid.c:311:65: warning: right shift by bigger than source value
drivers/scsi/megaraid.c:313:65: warning: right shift by bigger than source value
drivers/scsi/megaraid.c:317:67: warning: right shift by bigger than source value
drivers/scsi/megaraid.c:319:67: warning: right shift by bigger than source value
Patch suggestion from email by Al Viro:
"Since both are claimed to be strings, I really suspect that this >> 8 is
misspelled >> 4 and they have a character followed by pair of two-digit
packed decimals in there..."
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Neela Syam Kolli <megaraidlinux@lsi.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
For headers that get exported to userland and make use of u32 style
type names, it is advised to include linux/types.h.
This fixes a headers_check warning.
Signed-off-by: Alexander Shishkin <virtuoso@slind.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jean Delvare [Wed, 24 Aug 2011 23:46:29 +0000 (09:46 +1000)]
The current implementation of dmi_name_in_vendors() is an invitation to
lazy coding and false positives [1]. Searching for a string in 8 know
what you're looking for, so you should know where to look. strstr isn't
fast, especially when it fails, so we should avoid calling it when it just
can't succeed.
Looking at the current users of the function, it seems clear to me that
they are looking for a system or board vendor name, so let's limit
dmi_name_in_vendors to these two DMI fields. This much better matches the
function name, BTW.
[1] We currently have code looking for short names in DMI data, such
as "IBM", "ASUS" or "Acer". I let you guess what will happen the day
other vendors ship products named, for example, "SCHREIBMEISTER",
"PEGASUS" or "Acerola".
Signed-off-by: Jean Delvare <khali@linux-fr.org> Cc: Andi Kleen <andi@firstfloor.org> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yinghai Lu [Wed, 24 Aug 2011 23:46:28 +0000 (09:46 +1000)]
When do pci remove/rescan on system that have more iommus, got
[ 894.089745] Set context mapping for c4:00.0
[ 894.110890] mpt2sas3: Allocated physical memory: size(4293 kB)
[ 894.112556] mpt2sas3: Current Controller Queue Depth(1883), Max Controller Queue Depth(2144)
[ 894.127278] mpt2sas3: Scatter Gather Elements per IO(128)
[ 894.361295] DRHD: handling fault status reg 2
[ 894.364053] DMAR:[DMA Read] Request device [c4:00.0] fault addr fffbe000
[ 894.364056] DMAR:[fault reason 02] Present bit in context entry is cl
it turns out when remove/rescan, pci dev will be freed and will get
another new dev. but drhd units still keep old one... so
dmar_find_matched_drhd_unit will return wrong drhd and iommu for the
device that is not on first iommu.
So need to update devices in drhd_units during pci remove/rescan.
Could save domain/bus/device/function aside in the list and according that
info restore new dev to drhd_units later. Then
dmar_find_matched_drdh_unit and device_to_iommu could return right drhd
and iommu.
Add remove_dev_from_drhd/restore_dev_to_drhd functions to do the real
work. call them in device ADD_DEVICE and UNBOUND_DRIVER
Need to do the samething to atsr. (expose dmar_atsr_units and add
atsru->segment)
After patch, will right iommu for the new dev and will not get DMAR error
any more.
Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Jesse Barnes <jbarnes@virtuousgeek.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Vinod Koul <vinod.koul@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Akinobu Mita [Wed, 24 Aug 2011 23:46:27 +0000 (09:46 +1000)]
The dqc_bitmap field of struct ocfs2_local_disk_chunk is 32-bit aligned,
but not 64-bit aligned. The dqc_bitmap is accessed by ocfs2_set_bit(),
ocfs2_clear_bit(), ocfs2_test_bit(), or ocfs2_find_next_zero_bit(). These
are wrapper macros for ext2_*_bit() which need to take an unsigned long
aligned address (though some architectures are able to handle unaligned
address correctly)
So some 64bit architectures may not be able to access the dqc_bitmap
correctly.
This avoids such unaligned access by using another wrapper functions for
ext2_*_bit(). The code is taken from fs/ext4/mballoc.c which also need to
handle unaligned bitmap access.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Akinobu Mita [Wed, 24 Aug 2011 23:46:26 +0000 (09:46 +1000)]
ext4_{set,clear}_bit() is defined as __test_and_{set,clear}_bit_le() for
ext4. Only two ext4_{set,clear}_bit() calls check the return value. The
rest of calls ignore the return value and they can be replaced with
__{set,clear}_bit_le().
This changes ext4_{set,clear}_bit() from __test_and_{set,clear}_bit_le()
to __{set,clear}_bit_le() and introduces ext4_test_and_{set,clear}_bit()
for the two places where old bit needs to be returned.
This ext4_{set,clear}_bit() change is considered safe, because if someone
uses these macros without noticing the change, new ext4_{set,clear}_bit
don't have return value and causes compiler errors where the return value
is used.
This also removes unused ext4_find_first_zero_bit().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christine Chan [Wed, 24 Aug 2011 23:46:25 +0000 (09:46 +1000)]
del_timer_sync() calls debug_object_assert_init() to assert that a timer
has been initialized before calling lock_timer_base(). lock_timer_base()
would spin forever on a NULL(uninit-ed) base. The check is added to
del_timer() to prevent silent failure, even though it would not get stuck
in an infinite loop.
Signed-off-by: Christine Chan <cschan@codeaurora.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jonathan Cameron [Wed, 24 Aug 2011 23:46:22 +0000 (09:46 +1000)]
A straightforward looking use of idr for a device id.
Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Cc: Guenter Roeck <guenter.roeck@ericsson.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Acked-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrea Righi [Wed, 24 Aug 2011 23:46:21 +0000 (09:46 +1000)]
fb_set_suspend() must be called with the console semaphore held, which
means the code path coming in here will first take the console_lock() and
then call lock_fb_info().
However several framebuffer ioctl commands acquire these locks in reverse
order (lock_fb_info() and then console_lock()). This gives rise to
potential AB-BA deadlock.
Fix this by changing the order of acquisition in the ioctl commands that
make use of console_lock().
Signed-off-by: Andrea Righi <arighi@develer.com> Reported-by: Peter Nordström (Palm GBU) <peter.nordstrom@palm.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jesper Juhl [Wed, 24 Aug 2011 23:46:19 +0000 (09:46 +1000)]
A call to va_copy() should always be followed by a call to va_end() in the
same function. In kernel/autit.c::audit_log_vformat() this is not always
done. This patch makes sure va_end() is always called.
Signed-off-by: Jesper Juhl <jj@chaosbits.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mathias Krause [Wed, 24 Aug 2011 23:46:19 +0000 (09:46 +1000)]
The address limit is already set in flush_old_exec() so this
set_fs(USER_DS) is redundant.
Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The current interrupt traces from irq_handler_entry and irq_handler_exit
provide when an interrupt is handled. They provide good data about when
the system has switched to kernel space and how it affects the currently
running processes.
There are some IRQ vectors which trigger the system into kernel space,
which are not handled in generic IRQ handlers. Tracing such events gives
us the information about IRQ interaction with other system events.
The trace also tells where the system is spending its time. We want to
know which cores are handling interrupts and how they are affecting other
processes in the system. Also, the trace provides information about when
the cores are idle and which interrupts are changing that state.
The following patch adds the event definition and trace instrumentation
for interrupt vectors. For x86, a lookup table is provided to print out
readable IRQ vector names. The template can be used to provide interrupt
vector lookup tables on other architectures.
Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Michael Rubin <mrubin@google.com> Cc: David Sharp <dhsharp@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ed Wildgoose [Wed, 24 Aug 2011 23:46:16 +0000 (09:46 +1000)]
This new driver replaces the old PCEngines Alix 2/3 LED driver with a new
driver that controls the LEDs through the leds-gpio driver. The old
driver accessed GPIOs directly, which created a conflict and prevented
also loading the cs5535-gpio driver to read other GPIOs on the Alix board.
With this new driver, we hook into leds-gpio which in turn uses GPIO to
control the LEDs and therefore it's possible to control both the LEDs and
access onboard GPIOs
Driver is moved to platform/geode and any other geode initialisation
modules should move here also.
This driver is inspired by leds-net5501.c by Alessandro Zummo.
Ideally, leds-net5501.c should also be moved to platform/geode.
Additionally the driver relies on parts of the patch: 7f131cf3ed ("leds:
leds-alix2c - take port address from MSR) by Daniel Mack to perform
detection of the Alix board.
Signed-off-by: Ed Wildgoose <kernel@wildgooses.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Daniel Mack <daniel@caiaq.de> Reviewed-by: Grant Likely <grant.likely@secretlab.ca> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ludwig Nussel [Wed, 24 Aug 2011 23:46:16 +0000 (09:46 +1000)]
On x86_32 casting the unsigned int result of get_random_int() to long may
result in a negative value. On x86_32 the range of mmap_rnd() therefore
was -255 to 255. The 32bit mode on x86_64 used 0 to 255 as intended.
The bug was introduced by 675a081 ("x86: unify mmap_{32|64}.c") in January
2008.
Signed-off-by: Ludwig Nussel <ludwig.nussel@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Whitcroft [Wed, 24 Aug 2011 23:46:15 +0000 (09:46 +1000)]
Since the commit below which added O_PATH support to the *at() calls, the
error return for readlink/readlinkat for the empty pathname has switched
from ENOENT to EINVAL:
readlinkat(), fchownat() and fstatat() with empty relative pathnames
This is both unexpected for userspace and makes readlink/readlinkat
inconsistant with all other interfaces; and inconsistant with our stated
return for these pathnames.
As the readlinkat call does not have a flags parameter we cannot use the
AT_EMPTY_PATH approach used in the other calls. Therefore expose whether
the original path is infact entry via a new user_path_at_empty() path
lookup function. Use this to determine whether to default to EINVAL or
ENOENT for failures.
BugLink: http://bugs.launchpad.net/bugs/817187 Signed-off-by: Andy Whitcroft <apw@canonical.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
hank [Wed, 24 Aug 2011 23:46:14 +0000 (09:46 +1000)]
The parameter's origin type is long. On an i386 architecture, it can
easily be larger than 0x80000000, causing this function to convert it to a
sign-extended u64 type. Change the type to unsigned long so we get the
correct result.
[akpm@linux-foundation.org: build fix] Signed-off-by: hank <pyu@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: John Stultz <john.stultz@linaro.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Because of x86-implement-strict-user-copy-checks-for-x86_64.patch
When compiling mm/mempolicy.c the following warning is shown.
In file included from arch/x86/include/asm/uaccess.h:572,
from include/linux/uaccess.h:5,
from include/linux/highmem.h:7,
from include/linux/pagemap.h:10,
from include/linux/mempolicy.h:70,
from mm/mempolicy.c:68:
In function `copy_from_user',
inlined from `compat_sys_get_mempolicy' at mm/mempolicy.c:1415:
arch/x86/include/asm/uaccess_64.h:64: warning: call to `copy_from_user_overflow' declared with attribute warning: copy_from_user() buffer size is not provably correct
LD mm/built-in.o
Fix this by passing correct buffer size value.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>