Andrew Morton [Wed, 28 Sep 2011 00:51:03 +0000 (10:51 +1000)]
sysctl-add-support-for-poll-fix
s/declare/define/ for definitions
Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Greg KH <gregkh@suse.de> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi> Signed-off-by: Andrew Morton <>
Lucas De Marchi [Wed, 28 Sep 2011 00:51:03 +0000 (10:51 +1000)]
sysctl: add support for poll()
Adding support for poll() in sysctl fs allows userspace to receive
notifications of changes in sysctl entries. This adds a infrastructure to
allow files in sysctl fs to be pollable and implements it for hostname and
domainname.
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: Greg KH <gregkh@suse.de> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <>
drivers/net/rionet.c: fix ethernet address macros for LE platforms
Modify Ethernet addess macros to be compatible with BE/LE platforms
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Chul Kim <chul.kim@idt.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Cc: <stable@kernel.org> [2.6.39+] Signed-off-by: Andrew Morton <>
RapidIO: fix potential null deref in rio_setup_device()
The "goto cleanup" path can deference "rswitch" when it is NULL.
Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Chul Kim <chul.kim@idt.com> Signed-off-by: Andrew Morton <>
RapidIO: Tsi721 driver - fixes for the initial release
- address comments made by Andrew Morton,
see http://marc.info/?l=linux-kernel&m=131361256714116&w=2
- add spinlock for IB_MSG handler
- rename private BDMA channel structure to avoid conflict with DMA engine
- fix endianess bug in outbound message interrupt handler
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Chul Kim <chul.kim@idt.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Signed-off-by: Andrew Morton <>
Add RapidIO mport driver for IDT TSI721 PCI Express-to-SRIO bridge device.
The driver provides full set of callback functions defined for mport
devices in RapidIO subsystem. It also is compatible with current version
of RIONET driver (Ethernet over RapidIO messaging services).
This patch is applicable to kernel versions starting from 2.6.39.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Signed-off-by: Chul Kim <chul.kim@idt.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Signed-off-by: Andrew Morton <>
Liu Gang [Wed, 28 Sep 2011 00:51:01 +0000 (10:51 +1000)]
arch/powerpc/sysdev/fsl_rio.c: release rapidio port I/O region resource if port failed to initialize
The "struct rio_mport" contains a member of master port I/O memory
resource structure "struct resource iores". This resource will be read
from device tree and be used for rapidio R/W transaction memory space.
Rapidio requests the port I/O memory resource under the root resource
"iomem_resource".
struct rio_mport *port;
port = kzalloc(sizeof(struct rio_mport), GFP_KERNEL);
request_resource(&iomem_resource, &port->iores);
When port failed to initialize, allocated "rio_mport" structure memory
will be freed, and the port I/O memory resource structure pointer
"&port->iores" will be invalid. If other requests resource under
"iomem_resource", "&port->iores" node may be operated in the child
resources list and this will cause the system to crash.
So the requested port I/O memory resource should be released before
freeing allocated "rio_mport" structure.
Signed-off-by: Liu Gang <Gang.Liu@freescale.com> Acked-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <>
Liu Gang [Wed, 28 Sep 2011 00:51:00 +0000 (10:51 +1000)]
drivers/rapidio/rio-scan.c: use discovered bit to test if enumeration is complete
The discovered bit in PGCCSR register indicates if the device has been
discovered by system host. In Rapidio systems, some agent devices can also
be master devices. They can issue requests into the system.
Signed-off-by: Liu Gang <Gang.Liu@freescale.com> Acked-by: Alexandre Bounine <alexandre.bounine@idt.com> Signed-off-by: Andrew Morton <>
After merging the akpm tree, today's linux-next build (lost of them)
produced this warning:
WARNING: init/mounts.o(.text+0x192): Section mismatch in reference from the function devt_from_partuuid() to the variable .init.data:root_wait
The function devt_from_partuuid() references
the variable __initdata root_wait.
This is often because devt_from_partuuid lacks a __initdata
annotation or the annotation of root_wait is wrong.
Commit 185237e4cfab ("This patch makes two changes:"
init-add-root=partuuid=uuid-partnroff=%d-support-update.patch) adds the
reference to root_wait from the non-init function devt_from_partuuid().
The easiest thing to do is to remove __init_date from root_wait.
I have applied this patch as a merge fixup for today:
From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Wed, 10 Aug 2011 12:09:35 +1000
Subject: [PATCH] do_mounts: remove __init_data from root_wait
as it is now used from a non init routine.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Will Drewry <wad@chromium.org> Signed-off-by: Andrew Morton <>
Will Drewry [Wed, 28 Sep 2011 00:51:00 +0000 (10:51 +1000)]
init: add root=PARTUUID=UUID/PARTNROFF=%d support
Expand root=PARTUUID=UUID syntax to support selecting a root partition by
integer offset from a known, unique partition. This approach provides
similar properties to specifying a device and partition number, but using
the UUID as the unique path prior to evaluating the offset.
For example,
root=PARTUUID=99DE9194-FC15-4223-9192-FC243948F88B/PARTNROFF=1
selects the partition with UUID 99DE.. then select the next
partition.
This change is motivated by a particular usecase in Chromium OS where the
bootloader can easily determine what partition it is on (by UUID) but
doesn't perform general partition table walking.
That said, support for this model provides a direct mechanism for the user
to modify the root partition to boot without specifically needing to
extract each UUID or update the bootloader explicitly when the root
partition UUID is changed (if it is recreated to be larger, for instance).
Pinning to a /boot-style partition UUID allows the arbitrary root
partition reconfiguration/modifications with slightly less ambiguity than
just [dev][partition] and less stringency than the specific root partition
UUID.
Signed-off-by: Will Drewry <wad@chromium.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <>
The patch "proc: fix races against execve() of /proc/PID/fd**" is still a
partial fix for a setxid problem. link(2) is a yet another way to
identify whether a specific fd is opened by a privileged process. By
calling link(2) against /proc/PID/fd/* an attacker may identify whether
the fd number is valid for PID by analysing link(2) return code.
Both getattr() and link() can be used by the attacker iff the dentry is
present in the dcache. In this case ->lookup() is not called and the only
way to check ptrace permissions is either operation handler or
->revalidate(). The easiest solution to prevent any unauthorized access
to /proc/PID/fd*/ files is to force the dentry drop on each unauthorized
access attempt.
If an attacker keeps opened fd of /proc/PID/fd/ and dcache contains a
specific dentry for some /proc/PID/fd/XXX, any future attemp to use the
dentry by the attacker would lead to the dentry drop as a result of a
failed ptrace check in ->revalidate(). Then the attacker cannot spawn a
dentry for the specific fd number because of ptrace check in ->lookup().
The dentry drop can be still observed by an attacker by analysing
information from /proc/slabinfo, which is addressed in the successive
patch.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <>
proc: fix races against execve() of /proc/PID/fd**
fd* files are restricted to the task's owner, and other users may not get
direct access to them. But one may open any of these files and run any
setuid program, keeping opened file descriptors. As there are permission
checks on open(), but not on readdir() and read(), operations on the kept
file descriptors will not be checked. It makes it possible to violate
procfs permission model.
Reading fdinfo/* may disclosure current fds' position and flags, reading
directory contents of fdinfo/ and fd/ may disclosure the number of opened
files by the target task. This information is not sensible per se, but it
can reveal some private information (like length of a password stored in a
file) under certain conditions.
Used existing (un)lock_trace functions to check for ptrace_may_access(),
but instead of using EPERM return code from it use EACCES to be consistent
with existing proc_pid_follow_link()/proc_pid_readlink() return code. If
they differ, attacker can guess what fds exist by analyzing stat() return
code. Patched handlers: stat() for fd/*, stat() and read() for fdindo/*,
readdir() and lookup() for fd/ and fdinfo/.
David Rientjes [Wed, 28 Sep 2011 00:50:57 +0000 (10:50 +1000)]
cpusets: avoid looping when storing to mems_allowed if one node remains set
{get,put}_mems_allowed() exist so that general kernel code may locklessly
access a task's set of allowable nodes without having the chance that a
concurrent write will cause the nodemask to be empty on configurations
where MAX_NUMNODES > BITS_PER_LONG.
This could incur a significant delay, however, especially in low memory
conditions because the page allocator is blocking and reclaim requires
get_mems_allowed() itself. It is not atypical to see writes to
cpuset.mems take over 2 seconds to complete, for example. In low memory
conditions, this is problematic because it's one of the most imporant
times to change cpuset.mems in the first place!
The only way a task's set of allowable nodes may change is through cpusets
by writing to cpuset.mems and when attaching a task to a generic code is
not reading the nodemask with get_mems_allowed() at the same time, and
then clearing all the old nodes. This prevents the possibility that a
reader will see an empty nodemask at the same time the writer is storing a
new nodemask.
If at least one node remains unchanged, though, it's possible to simply
set all new nodes and then clear all the old nodes. Changing a task's
nodemask is protected by cgroup_mutex so it's guaranteed that two threads
are not changing the same task's nodemask at the same time, so the
nodemask is guaranteed to be stored before another thread changes it and
determines whether a node remains set or not.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Paul Menage <paul@paulmenage.org> Signed-off-by: Andrew Morton <>
Steven Rostedt [Wed, 28 Sep 2011 00:50:57 +0000 (10:50 +1000)]
memcg: Fix race condition in memcg_check_events() with this_cpu usage
Various code in memcontrol.c () calls this_cpu_read() on the calculations
to be done from two different percpu variables, or does an open-coded
read-modify-write on a single percpu variable.
Disable preemption throughout these operations so that the writes go to
the correct palces.
[ Added this_cpu to __this_cpu conversion by Johannes ]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Greg Thelen <gthelen@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <>
Johannes Weiner [Wed, 28 Sep 2011 00:50:56 +0000 (10:50 +1000)]
memcg: close race between charge and putback
There is a potential race between a thread charging a page and another
thread putting it back to the LRU list:
charge: putback:
SetPageCgroupUsed SetPageLRU
PageLRU && add to memcg LRU PageCgroupUsed && add to memcg LRU
The order of setting one flag and checking the other is crucial, otherwise
the charge may observe !PageLRU while the putback observes !PageCgroupUsed
and the page is not linked to the memcg LRU at all.
Global memory pressure may fix this by trying to isolate and putback the
page for reclaim, where that putback would link it to the memcg LRU again.
Without that, the memory cgroup is undeletable due to a charge whose
physical page can not be found and moved out.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Cc: Ying Han <yinghan@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <>
Johannes Weiner [Wed, 28 Sep 2011 00:50:56 +0000 (10:50 +1000)]
memcg: skip scanning active lists based on individual size
Reclaim decides to skip scanning an active list when the corresponding
inactive list is above a certain size in comparison to leave the assumed
working set alone while there are still enough reclaim candidates around.
The memcg implementation of comparing those lists instead reports whether
the whole memcg is low on the requested type of inactive pages,
considering all nodes and zones.
This can lead to an oversized active list not being scanned because of the
state of the other lists in the memcg, as well as an active list being
scanned while its corresponding inactive list has enough pages.
Not only is this wrong, it's also a scalability hazard, because the global
memory state over all nodes and zones has to be gathered for each memcg
and zone scanned.
Make these calculations purely based on the size of the two LRU lists
that are actually affected by the outcome of the decision.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <>
Igor Mammedov [Wed, 28 Sep 2011 00:50:55 +0000 (10:50 +1000)]
memcg: do not expose uninitialized mem_cgroup_per_node to world
If somebody is touching data too early, it might be easier to diagnose a
problem when dereferencing NULL at mem->info.nodeinfo[node] than trying to
understand why mem_cgroup_per_zone is [un|partly]initialized.
Signed-off-by: Igor Mammedov <imammedo@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <>
While back-porting Johannes Weiner's patch "mm: memcg-aware global
reclaim" for an internal effort, we noticed a significant performance
regression during page-reclaim heavy workloads due to high contention of
the ss->id_lock. This lock protects idr map, and serializes calls to
idr_get_next() in css_get_next() (which is used during the memcg hierarchy
walk). Since idr_get_next() is just doing a look up, we need only
serialize it with respect to idr_remove()/idr_get_new(). By making the
ss->id_lock a rwlock, contention is greatly reduced and performance
improves.
Tested: cat a 256m file from a ramdisk in a 128m container 50 times on
each core (one file + container per core) in parallel on a NUMA machine.
Result is the time for the test to complete in 1 of the containers. Both
kernels included Johannes' memcg-aware global reclaim patches.
Before rwlock patch: 1710.778s
After rwlock patch: 152.227s
Signed-off-by: Andrew Bresticker <abrestic@google.com> Cc: Paul Menage <menage@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <>
Raghavendra K T [Wed, 28 Sep 2011 00:50:54 +0000 (10:50 +1000)]
memcg: rename mem variable to memcg
The memcg code sometimes uses "struct mem_cgroup *mem" and sometimes uses
"struct mem_cgroup *memcg". Rename all mem variables to memcg in source
file.
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <>
Ben Blum [Wed, 28 Sep 2011 00:50:54 +0000 (10:50 +1000)]
cgroups: don't attach task to subsystem if migration failed
If a task has exited to the point it has called cgroup_exit() already,
then we can't migrate it to another cgroup anymore.
This can happen when we are attaching a task to a new cgroup between the
call to ->can_attach_task() on subsystems and the migration that is
eventually tried in cgroup_task_migrate().
In this case cgroup_task_migrate() returns -ESRCH and we don't want to
attach the task to the subsystems because the attachment to the new cgroup
itself failed.
Fix this by only calling ->attach_task() on the subsystems if the cgroup
migration succeeded.
Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <>
Ben Blum [Wed, 28 Sep 2011 00:50:54 +0000 (10:50 +1000)]
cgroups: more safe tasklist locking in cgroup_attach_proc
Fix unstable tasklist locking in cgroup_attach_proc.
According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is
not sufficient to guarantee the tasklist is stable w.r.t. de_thread and
exit. Taking tasklist_lock for reading, instead of rcu_read_lock, ensures
proper exclusion.
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <>
David Anders [Wed, 28 Sep 2011 00:50:52 +0000 (10:50 +1000)]
rtc: add initial support for mcp7941x parts
Add initial support for the microchip mcp7941x series of real time clocks.
The mcp7941x series is generally compatible with the ds1307 and ds1337 rtc
devices from dallas semiconductor. minor differences include a backup
battery enable bit, and the polarity of the oscillator enable bit.
Signed-off-by: David Anders <danders.dev@gmail.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <>
Mike Waychison [Wed, 28 Sep 2011 00:50:52 +0000 (10:50 +1000)]
oprofilefs: handle zero-length writes
Currently in oprofilefs, files that use ulong_fops mis-handle writes of
zero length. A count of 0 causes oprofilefs_ulong_from_user to return 0
(success), which then leads to oprofile_set_ulong being called to stuff
"value" into file->private_data without it being initialized.
Fix this by moving the check for a zero-length write up into
ulong_write_file.
Signed-off-by: Mike Waychison <mikew@google.com> Cc: Robert Richter <robert.richter@amd.com> Signed-off-by: Andrew Morton <>
Neil Armstrong [Wed, 28 Sep 2011 00:50:51 +0000 (10:50 +1000)]
init/do_mounts_rd.c: fix ramdisk identification for padded cramfs
When a cramfs ramdisk padded with 512 bytes is given to the kernel, the
current identify_ramdisk_image function fails to identify it.
Tested with a padded cramfs image on an ARM based board.
Signed-off-by: Neil Armstrong <narmstrong@neotion.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Davidlohr Bueso <dave@gnu.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <>
Jason Baron [Wed, 28 Sep 2011 00:50:51 +0000 (10:50 +1000)]
epoll: limit paths
The current epoll code can be tickled to run basically indefinitely in
both loop detection path check (on ep_insert()), and in the wakeup paths.
The programs that tickle this behavior set up deeply linked networks of
epoll file descriptors that cause the epoll algorithms to traverse them
indefinitely. A couple of these sample programs have been previously
posted in this thread: https://lkml.org/lkml/2011/2/25/297.
To fix the loop detection path check algorithms, I simply keep track of
the epoll nodes that have been already visited. Thus, the loop detection
becomes proportional to the number of epoll file descriptor and links.
This dramatically decreases the run-time of the loop check algorithm. In
one diabolical case I tried it reduced the run-time from 15 mintues (all
in kernel time) to .3 seconds.
Fixing the wakeup paths could be done at wakeup time in a similar manner
by keeping track of nodes that have already been visited, but the
complexity is harder, since there can be multiple wakeups on different
cpus...Thus, I've opted to limit the number of possible wakeup paths when
the paths are created.
This is accomplished, by noting that the end file descriptor points that
are found during the loop detection pass (from the newly added link), are
actually the sources for wakeup events. I keep a list of these file
descriptors and limit the number and length of these paths that emanate
from these 'source file descriptors'. In the current implemetation I
allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
length 4 and 10 of length 5. Note that it is sufficient to check the
'source file descriptors' reachable from the newly added link, since no
other 'source file descriptors' will have newly added links. This allows
us to check only the wakeup paths that may have gotten too long, and not
re-check all possible wakeup paths on the system.
In terms of the path limit selection, I think its first worth noting that
the most common case for epoll, is probably the model where you have 1
epoll file descriptor that is monitoring n number of 'source file
descriptors'. In this case, each 'source file descriptor' has a 1 path of
length 1. Thus, I believe that the limits I'm proposing are quite
reasonable and in fact may be too generous. Thus, I'm hoping that the
proposed limits will not prevent any workloads that currently work to
fail.
In terms of locking, I have extended the use of the 'epmutex' to all
epoll_ctl add and remove operations. Currently its only used in a subset
of the add paths. I need to hold the epmutex, so that we can correctly
traverse a coherent graph, to check the number of paths. I believe that
this additional locking is probably ok, since its in the setup/teardown
paths, and doesn't affect the running paths, but it certainly is going to
add some extra overhead. Also, worth noting is that the epmuex was
recently added to the ep_ctl add operations in the initial path loop
detection code using the argument that it was not on a critical path.
Another thing to note here, is the length of epoll chains that is allowed.
Currently, eventpoll.c defines:
/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4
This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
+ 1). However, this limit is currently only enforced during the loop
check detection code, and only when the epoll file descriptors are added
in a certain order. Thus, this limit is currently easily bypassed. The
newly added check for wakeup paths, stricly limits the wakeup paths to a
length of 5, regardless of the order in which ep's are linked together.
Thus, a side-effect of the new code is a more consistent enforcement of
the graph depth.
Thus far, I've tested this, using the sample programs previously
mentioned, which now either return quickly or return -EINVAL. I've also
testing using the piptest.c epoll tester, which showed no difference in
performance. I've also created a number of different epoll networks and
tested that they behave as expectded.
I believe this solves the original diabolical test cases, while still
preserving the sane epoll nesting.
Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Nelson Elhage <nelhage@ksplice.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <>
Nelson Elhage [Wed, 28 Sep 2011 00:50:51 +0000 (10:50 +1000)]
epoll: fix spurious lockdep warnings
epoll can acquire recursively acquire ep->mtx on multiple "struct
eventpoll"s at once in the case where one epoll fd is monitoring another
epoll fd. This is perfectly OK, since we're careful about the lock
ordering, but it causes spurious lockdep warnings. Annotate the recursion
using mutex_lock_nested, and add a comment explaining the nesting rules
for good measure.
Recent versions of systemd are triggering this, and it can also be
demonstrated with the following trivial test program:
--------------------8<--------------------
int main(void) {
int e1, e2;
struct epoll_event evt = {
.events = EPOLLIN
};
Reported-by: Paul Bolle <pebolle@tiscali.nl> Tested-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Nelson Elhage <nelhage@nelhage.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <>
Andrew Morton [Wed, 28 Sep 2011 00:50:50 +0000 (10:50 +1000)]
lib-crc-add-slice-by-8-algorithm-to-crc32c-fix
don't include asm/msr.h
Cc: Bob Pearson <rpearson@systemfabricworks.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Roland Dreier <roland@kernel.org> Cc: frank zago <fzago@systemfabricworks.com> Signed-off-by: Andrew Morton <>
frank zago [Wed, 28 Sep 2011 00:50:50 +0000 (10:50 +1000)]
lib/crc: add slice by 8 algorithm to crc32.c
Add support for slice by 8 to existing crc32 algorithm. Also modify
gen_crc32table.c to only produce table entries that are actually used.
The parameters CRC_LE_BITS and CRC_BE_BITS determine the number of bits in
the input array that are processed during each step. Generally the more
bits the faster the algorithm is but the more table data required.
Using an x86_64 Opteron machine running at 2100MHz the following table was
collected with a pre-warmed cache by computing the crc 1000 times on a
buffer of 4096 bytes.
BITS is the value of CRC_LE_BITS or CRC_BE_BITS. The old
default was 8 which actually selected the 32 bit algorithm. In
this version the value 8 is used to select the standard
8 bit algorithm and two new values: 32 and 64 are introduced
to select the slice by 4 and slice by 8 algorithms respectively.
Where Size is the size of crc32.o's text segment which includes
code and table data when both LE and BE versions are set to BITS.
The current version of crc32.c by default uses the slice by 4 algorithm
which requires about 2.8 cycles per byte. The slice by 8 algorithm is
roughly 2X faster and enables packet processing at over 1GB/sec on a
typical 2-3GHz system.
Signed-off-by: Bob Pearson <rpearson@systemfabricworks.com> Cc: Roland Dreier <roland@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Andrew Morton <>
If the first cmpxchg call succeeds, it is not necessary to use cpu_relax
before cmpxchg. So cpu_relax in a busy loop involving cmpxchg should go
after cmpxchg instead of before that. This patch fixes this in llist.
Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <>
Glauber Costa [Wed, 28 Sep 2011 00:50:46 +0000 (10:50 +1000)]
lib/percpu_counter.c: enclose hotplug only variables in hotplug ifdef
These variables are only used when CONFIG_HOTPLUG_CPU is enabled, they are
ifdef'ed everywhere else. So don't define them when CONFIG_HOTPLUG_CPU is
not enabled.
Signed-off-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <>
Cc: Andi Kleen <ak@linux.intel.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: H Hartley Sweeten <hartleys@visionengravers.com> Cc: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Len Brown <len.brown@intel.com> Signed-off-by: Andrew Morton <>
lib/bitmap.c: quiet sparse noise about address space
__bitmap_parse() and __bitmap_parselist() both take a pointer to a kernel
buffer as a parameter and then cast it to a pointer to user buffer for use
in cases when the parameter is_user indicates that the buffer is actually
located in user space. This casting, and the casts in the callers,
results in sparse noise like the following:
warning: incorrect type in initializer (different address spaces)
expected char const [noderef] <asn:1>*ubuf
got char const *buf
warning: cast removes address space of expression
Since these casts are intentional, use __force to quiet the noise.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Len Brown <len.brown@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <>
Akinobu Mita [Wed, 28 Sep 2011 00:50:45 +0000 (10:50 +1000)]
lib/spinlock_debug.c: print owner on spinlock lockup
When SPIN_BUG_ON is triggered, the lock owner information is reported.
But it is omitted when spinlock lockup is detected.
This information is useful especially on the architectures which don't
implement trigger_all_cpu_backtrace() that is called just after detecting
lockup. So report it and also avoid message format duplication.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <>
lib/kstrtox: common code between kstrto*() and simple_strto*() functions
Currently termination logic (\0 or \n\0) is hardcoded in _kstrtoull(),
avoid that for code reuse between kstrto*() and simple_strtoull().
Essentially, make them different only in termination logic.
simple_strtoull() (and scanf(), BTW) ignores integer overflow, that's a
bug we currently don't have guts to fix, making KSTRTOX_OVERFLOW hack
necessary.
Almost forgot: patch shrinks code size by about ~80 bytes on x86_64.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <>
My GPIOs are on an I2C port expander, so we must use the *_cansleep()
variant of the GPIO functions. This is was not being done in
create_gpio_led().
We can change gpio_get_value() to gpio_get_value_cansleep() because it is
only called from the platform_driver probe function, which is a context
where we can sleep.
Only tested on my gpio_cansleep() system, but it seems safe for all
systems.
Signed-off-by: David Daney <david.daney@cavium.com> Cc: Richard Purdie <rpurdie@rpsys.net> Acked-by: Trent Piepho <tpiepho@gmail.com> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <>
Magnus Damm [Wed, 28 Sep 2011 00:50:44 +0000 (10:50 +1000)]
drivers/leds/leds-renesas-tpu.c: move Renesas TPU LED driver platform data
Use the platform_data include directory for the TPU LED driver, as
suggested by Paul Mundt.
Signed-off-by: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <>
Magnus Damm [Wed, 28 Sep 2011 00:50:44 +0000 (10:50 +1000)]
drivers/leds/leds-renesas-tpu.c: update driver to use workqueue
Use a workqueue in the Renesas TPU LED driver to allow the Runtime PM code
to sleep.
Signed-off-by: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <>
Wolfram Sang [Wed, 28 Sep 2011 00:50:43 +0000 (10:50 +1000)]
drivers/leds/leds-lm3530.c: remove obsolete cleanup for clientdata
A few new i2c-drivers came into the kernel which clear the
clientdata-pointer on exit or error. This is obsolete meanwhile, the core
will do it.
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de> Cc: Richard Purdie <rpurdie@rpsys.net> Acked-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <>
Axel Lin [Wed, 28 Sep 2011 00:50:43 +0000 (10:50 +1000)]
leds-renesas-tpu-led-driver-v2-fix
include linux/module.h
Signed-off-by: Axel Lin <axel.lin@gmail.com> Cc: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <>
Magnus Damm [Wed, 28 Sep 2011 00:50:42 +0000 (10:50 +1000)]
leds: Renesas TPU LED driver
Add V2 of the LED driver for a single timer channel for the TPU hardware
block commonly found in Renesas SoCs.
The driver has been written with optimal Power Management in mind, so to
save power the LED is driven as a regular GPIO pin in case of maximum
brightness and power off which allows the TPU hardware to be idle and
which in turn allows the clocks to be stopped and the power domain to be
turned off transparently.
Any other brightness level requires use of the TPU hardware in PWM mode.
TPU hardware device clocks and power are managed through Runtime PM.
System suspend and resume is known to be working - during suspend the LED
is set to off by the generic LED code.
The TPU hardware timer is equipeed with a 16-bit counter together with an
up-to-divide-by-64 prescaler which makes the hardware suitable for
brightness control. Hardware blink is unsupported.
The LED PWM waveform has been verified with a Fluke 123 Scope meter on a
sh7372 Mackerel board. Tested with experimental sh7372 A3SP power domain
patches. Platform device bind/unbind tested ok.
V2 has been tested on the DS2 LED of the sh73a0-based AG5EVM.
Signed-off-by: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <>
Mark Brown [Wed, 28 Sep 2011 00:50:42 +0000 (10:50 +1000)]
backlight: fix broken regulator API usage in l4f00242t03
The regulator support in the l4f00242t03 is very non-idiomatic. Rather
than requesting the regulators based on the device name and the supply
names used by the device the driver requires boards to pass system
specific supply names around through platform data. The driver also
conditionally requests the regulators based on this platform data, adding
unneeded conditional code to the driver.
Fix this by removing the platform data and converting to the standard
idiom, also updating all in tree users of the driver. As no datasheet
appears to be available for the LCD I'm guessing the names for the
supplies based on the existing users and I've no ability to do anything
more than compile test.
The use of regulator_set_voltage() in the driver is also problematic,
since fixed voltages are required the expectation would be that the
voltages would be fixed in the constraints set by the machines rather than
manually configured by the driver, but is less problematic.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Tested-by: Fabio Estevam <fabio.estevam@freescale.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <>
Wolfram Sang [Wed, 28 Sep 2011 00:50:41 +0000 (10:50 +1000)]
video/backlight: remove obsolete cleanup for clientdata
A few new i2c-drivers came into the kernel which clear the
clientdata-pointer on exit or error. This is obsolete meanwhile, the core
will do it.
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <>
Hans Verkuil [Wed, 28 Sep 2011 00:50:40 +0000 (10:50 +1000)]
poll: add poll_requested_events() function
In some cases the poll() implementation in a driver has to do different
things depending on the events the caller wants to poll for. An example
is when a driver needs to start a DMA engine if the caller polls for
POLLIN, but doesn't want to do that if POLLIN is not requested but instead
only POLLOUT or POLLPRI is requested. This is something that can happen
in the video4linux subsystem.
Unfortunately, the current epoll/poll/select implementation doesn't
provide that information reliably. The poll_table_struct does have it: it
has a key field with the event mask. But once a poll() call matches one
or more bits of that mask any following poll() calls are passed a NULL
poll_table_struct pointer.
The solution is to set the qproc field to NULL in poll_table_struct once
poll() matches the events, not the poll_table_struct pointer itself. That
way drivers can obtain the mask through a new poll_requested_events
inline.
The poll_table_struct can still be NULL since some kernel code calls it
internally (netfs_state_poll() in ./drivers/staging/pohmelfs/netfs.h). In
that case poll_requested_events() returns ~0 (i.e. all events).
Since eventpoll always leaves the key field at ~0 instead of using the
requested events mask, that source was changed as well to properly fill in
the key field.
Signed-off-by: Hans Verkuil <hans.verkuil@cisco.com> Reviewed-by: Jonathan Corbet <corbet@lwn.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <>
WARNING: externs should be avoided in .c files
#99: FILE: arch/alpha/boot/misc.c:28:
+extern __printf(1, 2) long srm_printk(const char *, ...);
ERROR: space required after that ';' (ctx:VxV)
#178: FILE: arch/powerpc/boot/ps3.c:39:
+static inline __printf(1, 2) int DBG(const char *fmt, ...) {return 0;}
^
ERROR: "foo* bar" should be "foo *bar"
#225: FILE: arch/s390/include/asm/debug.h:175:
+debug_sprintf_event(debug_info_t* id, int level, char *string, ...);
ERROR: space required after that ',' (ctx:VxV)
#237: FILE: arch/s390/include/asm/debug.h:216:
+debug_sprintf_exception(debug_info_t *id, int level, char *string,...);
^
WARNING: space prohibited between function name and open parenthesis '('
#494: FILE: fs/ext2/ext2.h:139:
+void ext2_error (struct super_block *, const char *, const char *, ...);
WARNING: printk() should include KERN_ facility level
#719: FILE: fs/partitions/ldm.c:63:
+ printk("%s%s(): %pV\n", level, function, &vaf);
WARNING: space prohibited between function name and open parenthesis '('
#721: FILE: fs/partitions/ldm.c:65:
+ va_end (args);
WARNING: space prohibited between function name and open parenthesis '('
#750: FILE: fs/ufs/ufs.h:121:
+void ufs_warning (struct super_block *, const char *, const char *, ...);
WARNING: space prohibited between function name and open parenthesis '('
#752: FILE: fs/ufs/ufs.h:123:
+void ufs_error (struct super_block *, const char *, const char *, ...);
WARNING: space prohibited between function name and open parenthesis '('
#754: FILE: fs/ufs/ufs.h:125:
+void ufs_panic (struct super_block *, const char *, const char *, ...);
WARNING: space prohibited between function name and open parenthesis '('
#1074: FILE: include/linux/ext3_fs.h:941:
+void ext3_error (struct super_block *, const char *, const char *, ...);
WARNING: space prohibited between function name and open parenthesis '('
#1083: FILE: include/linux/ext3_fs.h:944:
+void ext3_abort (struct super_block *, const char *, const char *, ...);
WARNING: space prohibited between function name and open parenthesis '('
#1085: FILE: include/linux/ext3_fs.h:946:
+void ext3_warning (struct super_block *, const char *, const char *, ...);
WARNING: do not add new typedefs
#1178: FILE: include/linux/kdb.h:119:
+typedef __printf(1, 2) int (*kdb_printf_t)(const char *, ...);
ERROR: "foo * bar" should be "foo *bar"
#1203: FILE: include/linux/kernel.h:299:
+extern __printf(2, 3) int sprintf(char * buf, const char * fmt, ...);
William Douglas [Wed, 28 Sep 2011 00:50:39 +0000 (10:50 +1000)]
printk: remove bounds checking for log_prefix
Currently log_prefix is testing that the first character of the log level
and facility is less than '0' and greater than '9' (which is always
false).
Since the code being updated works because strtoul bombs out (endp isn't
updated) and 0 is returned anyway just remove the check and don't change
the behavior of the function.
Signed-off-by: William Douglas <william.douglas@intel.com> Signed-off-by: Andrew Morton <>
William Douglas [Wed, 28 Sep 2011 00:50:39 +0000 (10:50 +1000)]
printk: fix bounds checking for log_prefix
Currently log_prefix is testing that the first character of the log level
and facility is less than '0' and greater than '9' (which is always
false). It should be testing to see if the character less than '0' or
greater than '9' instead. This patch makes that change.
The code being changed worked because strtoul bombs out (endp isn't
updated) and 0 is returned anyway.
Signed-off-by: William Douglas <william.douglas@intel.com> Signed-off-by: Andrew Morton <>
Jason Baron [Wed, 28 Sep 2011 00:50:37 +0000 (10:50 +1000)]
dynamic_debug: fix undefined reference to `__netdev_printk'
Dynamic debug recently added support for netdev_printk. It uses
__netdev_printk() to support this functionality. However, when CONFIG_NET
is not set, we get the following error:
lib/built-in.o: In function `__dynamic_netdev_dbg':
(.text+0x9fda): undefined reference to `__netdev_printk'
Fix this by making the call to netdev_printk() contingent upon CONFIG_NET.
We could have fixed this by defining netdev_printk() to a 'no-op' in the
!CONFIG_NET case. However, this is not consistent with how the networking
layer uses netdev_printk. For example, CONFIG_NET is not set,
netdev_printk() does not have a 'no-op' definition defined.
Signed-off-by: Jason Baron <jbaron@redhat.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Cc: Greg KH <greg@kroah.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <>
Jason Baron [Wed, 28 Sep 2011 00:50:37 +0000 (10:50 +1000)]
dynamic_debug: use a single printk() to emit messages
We were using KERN_CONT to combine messages with their prefix. However,
KERN_CONT is not smp safe, in the sense that it can interleave messages.
This interleaving can result in printks coming out at the wrong loglevel.
With the high frequency of printks that dynamic debug can produce this is
not desirable.
So make dynamic_emit_prefix() fill a char buf[64] instead of doing a
printk directly. If we enable printing out of function, module, line, or
pid info, they are placed in this 64 byte buffer. In my testing 64 bytes
was enough size to fulfill all requests. Even if it's not, we can match
up the printk itself to see where it's from, so to me this is no big deal.
[akpm@linux-foundation.org: convert dangerous macro to C] Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Greg KH <greg@kroah.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <>
Mark Brown [Wed, 28 Sep 2011 00:50:35 +0000 (10:50 +1000)]
lis3lv02d: make regulator API usage unconditional
The regulator API contains a range of features for stubbing itself out
when not in use and for transparently restricting the actual effect of
regulator API calls where they can't be supported on a particular system
so that drivers don't need to individually implement this. Simplify the
driver slightly by making use of this idiom.
The only in tree user is ecovec24 which does not use the regulator API.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Cc: Éric Piel <eric.piel@tremplin-utc.net> Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com> Signed-off-by: Andrew Morton <>
>From my POV, it looks like the hardware is not working as expected
and returns a bogus data rate. The driver doesn't check the result
and directly uses it as some sort of divisor in some places:
msleep(lis3->pwron_delay / lis3lv02d_get_odr());
Under this circumstances, this could very well cause the
"divide by zero" exception from above.
For now, I fixed it the easiest and most obvious way:
Check if the result is sane and if it isn't use a sane default
instead. I went for "100" in the latter case, simply because
/sys/devices/platform/lis3lv02d/rate returns it on a successful
boot.
Signed-off-by: Christian Lamparter <chunkeey@googlemail.com> Signed-off-by: Éric Piel <eric.piel@tremplin-utc.net> Cc: Matthew Garrett <mjg@redhat.com> Cc: Witold Pilat <witold.pilat@gmail.com> Cc: Lyall Pearce <lyall.pearce@hp.com> Cc: Malte Starostik <m-starostik@versanet.de> Cc: Ilkka Koskinen <ilkka.koskinen@nokia.com> Cc: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com> Cc: Christian Lamparter <chunkeey@googlemail.com> Signed-off-by: Andrew Morton <>
Jonathan Cameron [Wed, 28 Sep 2011 00:50:32 +0000 (10:50 +1000)]
drivers/hwmon/hwmon.c: convert idr to ida and use ida_simple_get()
A straightforward looking use of idr for a device id.
Signed-off-by: Jonathan Cameron <jic23@cam.ac.uk> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Tejun Heo <tj@kernel.org> Cc: Guenter Roeck <guenter.roeck@ericsson.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Acked-by: Darrick J. Wong <djwong@us.ibm.com> Signed-off-by: Andrew Morton <>
Pavel Emelyanov [Wed, 28 Sep 2011 00:50:30 +0000 (10:50 +1000)]
fs/pipe.c: add ->statfs callback for pipefs
Currently a statfs on a pipe's /proc/<pid>/fd/ link returns -ENOSYS. Wire
pipfs up so that the statfs succeeds.
This is required by checkpoint-restart in the userspace to make it
possible to distinguish pipes from fifos.
When we dump information about task's open files we use the /proc/pid/fd
directoy's symlinks and the fact that opening any of them gives us exactly
the same dentry->inode pair as the original process has. Now if a task
we're dumping has opened pipe and fifo we need to detect this and act
accordingly. Knowing that an fd with type S_ISFIFO resides on a pipefs is
the most precise way.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Andrew Morton <>
Stephen Boyd [Wed, 28 Sep 2011 00:50:30 +0000 (10:50 +1000)]
Consolidate CONFIG_DEBUG_STRICT_USER_COPY_CHECKS
The help text for this config is duplicated across the x86, parisc, and
s390 Kconfig.debug files. Arnd Bergman noted that the help text was
slightly misleading and should be fixed to state that enabling this option
isn't a problem when using pre 4.4 gcc.
To simplify the rewording, consolidate the text into lib/Kconfig.debug and
modify it there to be more explicit about when you should say N to this
config.
Also, make the text a bit more generic by stating that this option enables
compile time checks so we can cover architectures which emit warnings vs.
ones which emit errors. The details of how an architecture decided to
implement the checks isn't as important as the concept of compile time
checking of copy_from_user() calls.
While we're doing this, remove all the copy_from_user_overflow() code
that's duplicated many times and place it into lib/ so that any
architecture supporting this option can get the function for free.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Helge Deller <deller@gmx.de> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <>
Stephen Boyd [Wed, 28 Sep 2011 00:50:29 +0000 (10:50 +1000)]
x86: implement strict user copy checks for x86_64
Strict user copy checks are only really supported on x86_32 even though
the config option is selectable on x86_64. Add the necessary support to
the 64 bit code to trigger copy_from_user() warnings at compile time.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Andrew Morton <>
Enabling DEBUG_STRICT_USER_COPY_CHECKS causes the following warning:
In file included from arch/x86/include/asm/uaccess.h:573,
from kernel/kprobes.c:55:
In function 'copy_from_user',
inlined from 'write_enabled_file_bool' at
kernel/kprobes.c:2191:
arch/x86/include/asm/uaccess_64.h:65:
warning: call to 'copy_from_user_overflow' declared with
attribute warning: copy_from_user() buffer size is not provably
correct
presumably due to buf_size being signed causing GCC to fail to see that
buf_size can't become negative.
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David S. Miller <davem@davemloft.net> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Signed-off-by: Andrew Morton <>
Shaohua Li [Wed, 28 Sep 2011 00:50:28 +0000 (10:50 +1000)]
intel_idle: fix API misuse
smp_call_function() only lets all other CPUs execute a specific function,
while we expect all CPUs do in intel_idle. Without the fix, we could have
one cpu which has auto_demotion enabled or has no boradcast timer setup.
Usually we don't see impact because auto demotion just harms power and the
intel_idle init is called in CPU 0, where boradcast timer delivers
interrupt, but this still could be a problem.
Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Len Brown <lenb@kernel.org> Signed-off-by: Andrew Morton <>
Michael Cree [Wed, 28 Sep 2011 00:50:28 +0000 (10:50 +1000)]
alpha: wire up sendmmsg syscall
Signed-off-by: Michael Cree <mcree@orcon.net.nz> Reviewed-by: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Signed-off-by: Andrew Morton <>
Michael Cree [Wed, 28 Sep 2011 00:50:27 +0000 (10:50 +1000)]
alpha: wire up accept4 syscall
Somehow wiring up the accept4 syscall on Alpha was missed long ago.
This commit rectifies that oversight.
Signed-off-by: Michael Cree <mcree@orcon.net.nz> Reviewed-by: Matt Turner <mattst88@gmail.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Signed-off-by: Andrew Morton <>
Magnus Lynch [Wed, 28 Sep 2011 00:50:27 +0000 (10:50 +1000)]
hpet: factor timer allocate from open
The current implementation of the /dev/hpet driver couples opening the
device with allocating one of the (scarce) timers (aka comparators). This
is a limitation in that the main counter may be valuable to applications
seeking a high-resolution timer who have no use for the interrupt
generating functionality of the comparators.
This patch alters the open semantics so that when the device is opened, no
timer is allocated. Operations that depend on a timer being in context
implicitly attempt allocating a timer, to maintain backward compatibility.
There is also an IOCTL (HPET_ALLOC_TIMER _IO) added so that the
allocation may be done explicitly. (I prefer the explicit open then
allocate pattern but don't know how practical it would be to require all
existing code to be changed.)
/dev/hpet is accessed via mmap(). This is the only interface of /dev/hpet
that is actually used in practice.
[akpm@linux-foundation.org: coding-style tweaks]
[arnd@arndb.de: fix build] Signed-off-by: Magnus Lynch <maglyx@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: john stultz <johnstul@us.ibm.com> Acked-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <>