include/linux/bio.h: use a static inline function for bio_integrity_clone()
When CONFIG_BLK_DEV_INTEGRITY is not set, we get these warnings:
drivers/md/dm.c: In function 'split_bvec':
drivers/md/dm.c:1061:3: warning: statement with no effect
drivers/md/dm.c: In function 'clone_bio':
drivers/md/dm.c:1088:3: warning: statement with no effect
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@google.com>
Andi Kleen [Wed, 5 Oct 2011 00:44:07 +0000 (11:44 +1100)]
dio: optimize cache misses in the submission path
Some investigation of a transaction processing workload showed that a
major consumer of cycles in __blockdev_direct_IO is the cache miss while
accessing the block size. This is because it has to walk the chain from
block_dev to gendisk to queue.
The block size is needed early on to check alignment and sizes. It's only
needed if the check against the inode block size fails, but the costly
block device state was fetched unconditionally.
- Reorganize the code to only fetch block dev state when actually
needed.
Then do a prefetch on the block dev early on in the direct IO path. This
is worth it, because there is substantial code run before we actually
touch the block dev now.
- I also added some unlikely()s to make it clear to the compiler that the
block device fetch code is not normally executed.
This gave a small but measurable improvement on a large database
benchmark (about 0.3%).
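A compilable sketch of the two techniques, using stand-in types (struct
bdev_state, device_blkbits() and direct_io() are invented for
illustration; the real code operates on struct block_device):

	/* minimal stand-ins for the kernel structures involved */
	struct bdev_state { unsigned blkbits; };
	struct inode { unsigned i_blkbits; };

	/* models the cache-missing bdev -> gendisk -> queue walk */
	static unsigned device_blkbits(const struct bdev_state *bdev)
	{
		return bdev->blkbits;
	}

	long direct_io(const struct inode *inode,
		       const struct bdev_state *bdev, long long offset)
	{
		unsigned blkbits;

		__builtin_prefetch(bdev);	/* start the likely miss early */

		blkbits = inode->i_blkbits;	/* hot: the inode is cached */
		if (__builtin_expect((offset & ((1u << blkbits) - 1)) != 0, 0))
			blkbits = device_blkbits(bdev);	/* rare fallback */

		/* alignment checks and bio submission would continue here */
		return blkbits;
	}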
Signed-off-by: Andi Kleen <ak@linux.intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andi Kleen [Wed, 5 Oct 2011 00:44:05 +0000 (11:44 +1100)]
dio: inline the complete submission path
Add inlines to all the submission path functions. While this increases
code size it also gives gcc a lot of optimization opportunities in this
critical hotpath.
In particular -- together with some other changes -- this allows gcc to
get rid of the unnecessary clearing of sdio at the beginning and optimize
the messy parameter passing. Any non-inlining of a function which takes
an sdio parameter would break this optimization because it cannot be done
once the address of a structure is taken.
Note that benefits are only seen with CONFIG_OPTIMIZE_INLINING and
CONFIG_CC_OPTIMIZE_FOR_SIZE both set to off.
This gives about 2.2% improvement on a large database benchmark with a
high IOPS rate.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andi Kleen [Wed, 5 Oct 2011 00:44:04 +0000 (11:44 +1100)]
dio: separate map_bh from dio
Only a single b_private field in the map_bh buffer head is needed after
the submission path. Pass map_bh around separately to avoid storing this
information in the long-term slab.
This avoids the weird 104-byte hole in struct dio_submit which also needed
to be memset early.
Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andi Kleen [Wed, 5 Oct 2011 00:44:02 +0000 (11:44 +1100)]
dio: separate fields only used in the submission path from struct dio
This large, but largely mechanical, patch moves all fields in struct dio
that are only used in the submission path into a separate on stack data
structure. This has the advantage that the memory is very likely cache
hot, which is not guaranteed for memory fresh out of kmalloc.
This also gives gcc more optimization potential because it can more
easily determine that there are no external aliases for these variables.
The sdio variable is now set up with a structure initializer instead of a
memset. This allows gcc to break sdio into individual fields and optimize
away unnecessary zeroing (after all the functions are inlined).
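A minimal compilable sketch of the pattern, with invented field names
(the real struct dio_submit in fs/direct-io.c carries many more fields):

	typedef long long loff_t;

	/* illustrative stand-in; the real structure lives in fs/direct-io.c */
	struct dio { int rw; };

	/* submission-only state, kept on the submitter's stack */
	struct dio_submit {
		loff_t block_in_file;		/* current block cursor */
		unsigned blocks_available;	/* mapped-but-unconsumed blocks */
	};

	int do_submission(struct dio *dio, loff_t offset)
	{
		/* designated initializer instead of memset: the compiler can
		 * track individual fields and drop dead stores once the
		 * helpers taking &sdio are inlined */
		struct dio_submit sdio = { .block_in_file = offset };

		(void)dio;
		return sdio.blocks_available;	/* placeholder for real work */
	}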
Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tao Ma [Wed, 5 Oct 2011 00:44:02 +0000 (11:44 +1100)]
fs/direct-io.c: calculate fs_count correctly in get_more_blocks()
In get_more_blocks(), we use dio_count to calculate fs_count and do some
tricky things to increase fs_count if dio_count isn't aligned. But
actually it still has some corner cases that can't be covered. See the
following example:
dio_write foo -s 1024 -w 4096
(direct write 4096 bytes at offset 1024). The same goes if the offset
isn't aligned to fs_blocksize.
In this case, the old calculation counts fs_count to be 1, but actually we
will write into 2 different blocks (if fs_blocksize=4096). The old code
just works, since it will call get_block twice (and may have to allocate
and create extents twice for filesystems like ext4). So we'd better call
get_block just once with the proper fs_count.
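A small standalone model of the corrected calculation (the real code in
get_more_blocks() works on sdio block indices rather than byte offsets):

	#include <stdio.h>

	/* count fs blocks from the first to the last block touched, so an
	 * unaligned offset correctly accounts for the extra straddled block */
	static unsigned long long fs_count(unsigned long long offset,
					   unsigned long long size,
					   unsigned blkbits)
	{
		unsigned long long start = offset >> blkbits;
		unsigned long long end = (offset + size - 1) >> blkbits;
		return end - start + 1;
	}

	int main(void)
	{
		/* the changelog example: 4096 bytes at offset 1024 with a
		 * 4096-byte fs block straddle two blocks */
		printf("%llu\n", fs_count(1024, 4096, 12));	/* prints 2 */
		return 0;
	}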
Signed-off-by: Tao Ma <boyu.mt@taobao.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jeff Moyer [Wed, 5 Oct 2011 00:44:02 +0000 (11:44 +1100)]
aio: allocate kiocbs in batches
In testing aio on a fast storage device, I found that the context lock
takes up a fair amount of cpu time in the I/O submission path. The reason
is that we take it for every I/O submitted (see __aio_get_req). Since we
know how many I/Os are passed to io_submit, we can preallocate the kiocbs
in batches, reducing the number of times we take and release the lock.
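A toy userspace model of that amortization, with invented names (the
kernel batches kiocbs per io_submit() call and publishes them under the
context lock):

	#include <pthread.h>
	#include <stdlib.h>

	struct kiocb { struct kiocb *next; };

	struct ctx {
		pthread_mutex_t lock;
		struct kiocb *active;
	};

	void submit_batch(struct ctx *ctx, int nr)
	{
		struct kiocb *head = NULL, *req;
		int i;

		for (i = 0; i < nr; i++) {	/* no lock needed here */
			req = calloc(1, sizeof(*req));
			req->next = head;
			head = req;
		}

		pthread_mutex_lock(&ctx->lock);	/* one round-trip per batch */
		while ((req = head) != NULL) {
			head = req->next;
			req->next = ctx->active;
			ctx->active = req;
		}
		pthread_mutex_unlock(&ctx->lock);
	}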
In my testing, I was able to reduce the amount of time spent in
_raw_spin_lock_irq by .56% (average of 3 runs). The command I used to
test this was:
I also tested the patch with various numbers of events passed to
io_submit, and I ran the xfstests aio group of tests to ensure I didn't
break anything.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Cc: Daniel Ehrenberg <dehrenberg@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Florian Faber [Wed, 5 Oct 2011 00:44:01 +0000 (11:44 +1100)]
drivers/w1/w1_int.c: multiple masters used same init_name
When using multiple masters, w1_int.c would use the .init_name from w1.c
for all entities, which will fail when creating a corresponding sysfs
entry. This patch uses the unique name previously generated.
James Nuss [Wed, 5 Oct 2011 00:43:59 +0000 (11:43 +1100)]
pps: new client driver using GPIO
This client driver allows you to use a GPIO pin as a source for PPS
signals. Platform data [1] are used to specify the GPIO pin number,
label, assert event edge type, and whether clear events are captured.
This driver is based on the work by Ricardo Martins who submitted an
initial implementation [2] of a PPS IRQ client driver to the linuxpps
mailing-list on Dec 3 2010.
James Nuss [Wed, 5 Oct 2011 00:43:58 +0000 (11:43 +1100)]
pps: default echo function
A default echo function has been provided so it is no longer an error when
you specify PPS_ECHOASSERT or PPS_ECHOCLEAR without an explicit echo
function. This allows some code re-use and also makes it easier to write
client drivers since the default echo function does not normally need to
change.
Signed-off-by: James Nuss <jamesnuss@nanometrics.ca> Reviewed-by: Ben Gardiner <bengardiner@nanometrics.ca> Acked-by: Rodolfo Giometti <giometti@linux.it> Cc: Ricardo Martins <rasm@fe.up.pt> Cc: Alexander Gordeev <lasaine@lvk.cs.msu.su> Cc: Igor Plyatov <plyatov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
WANG Cong [Wed, 5 Oct 2011 00:43:58 +0000 (11:43 +1100)]
sysctl: make CONFIG_SYSCTL_SYSCALL default to n
When I tried to send a patch to remove it, Andi told me we still need to
keep compatibility for old libc, so we can't remove this completely. Then
just make it default to n and remove the doc from
feature-removal-schedule.txt.
Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Wed, 5 Oct 2011 00:43:57 +0000 (11:43 +1100)]
sysctl-add-support-for-poll-fix
s/declare/define/ for definitions
Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Greg KH <gregkh@suse.de> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lucas De Marchi [Wed, 5 Oct 2011 00:43:57 +0000 (11:43 +1100)]
sysctl: add support for poll()
Adding support for poll() in sysctl fs allows userspace to receive
notifications of changes in sysctl entries. This adds an infrastructure to
allow files in sysctl fs to be pollable and implements it for hostname and
domainname.
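A minimal userspace consumer sketch, assuming the pollable entries signal
a change as POLLERR|POLLPRI once the current value has been read (error
handling omitted):

	#include <fcntl.h>
	#include <poll.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[128];
		struct pollfd pfd;

		pfd.fd = open("/proc/sys/kernel/hostname", O_RDONLY);
		read(pfd.fd, buf, sizeof(buf));	/* consume the current value */
		pfd.events = POLLERR | POLLPRI;

		if (poll(&pfd, 1, -1) > 0)
			printf("hostname changed\n");

		close(pfd.fd);
		return 0;
	}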
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi> Cc: Greg KH <gregkh@suse.de> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
drivers/net/rionet.c: fix ethernet address macros for LE platforms
Modify Ethernet address macros to be compatible with BE/LE platforms
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Chul Kim <chul.kim@idt.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Cc: <stable@kernel.org> [2.6.39+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
RapidIO: fix potential null deref in rio_setup_device()
The "goto cleanup" path can deference "rswitch" when it is NULL.
Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Chul Kim <chul.kim@idt.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
RapidIO: Tsi721 driver - fixes for the initial release
- address comments made by Andrew Morton,
see http://marc.info/?l=linux-kernel&m=131361256714116&w=2
- add spinlock for IB_MSG handler
- rename private BDMA channel structure to avoid conflict with DMA engine
- fix endianness bug in outbound message interrupt handler
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Chul Kim <chul.kim@idt.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add RapidIO mport driver for IDT TSI721 PCI Express-to-SRIO bridge device.
The driver provides a full set of the callback functions defined for mport
devices in the RapidIO subsystem. It is also compatible with the current
version of the RIONET driver (Ethernet over RapidIO messaging services).
This patch is applicable to kernel versions starting from 2.6.39.
Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Signed-off-by: Chul Kim <chul.kim@idt.com> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Li Yang <leoli@freescale.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liu Gang [Wed, 5 Oct 2011 00:43:55 +0000 (11:43 +1100)]
arch/powerpc/sysdev/fsl_rio.c: release rapidio port I/O region resource if port failed to initialize
The "struct rio_mport" contains a member of master port I/O memory
resource structure "struct resource iores". This resource will be read
from device tree and be used for rapidio R/W transaction memory space.
Rapidio requests the port I/O memory resource under the root resource
"iomem_resource".
struct rio_mport *port;
port = kzalloc(sizeof(struct rio_mport), GFP_KERNEL);
request_resource(&iomem_resource, &port->iores);
When the port fails to initialize, the allocated "rio_mport" structure is
freed, leaving the port I/O memory resource structure pointer
"&port->iores" invalid. If something else then requests a resource under
"iomem_resource", the stale "&port->iores" node may be touched while
walking the child resource list, and this will cause the system to crash.
So the requested port I/O memory resource should be released before
freeing the allocated "rio_mport" structure.
Signed-off-by: Liu Gang <Gang.Liu@freescale.com> Acked-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liu Gang [Wed, 5 Oct 2011 00:43:55 +0000 (11:43 +1100)]
drivers/rapidio/rio-scan.c: use discovered bit to test if enumeration is complete
The discovered bit in the PGCCSR register indicates whether the device has
been discovered by the system host. In RapidIO systems, some agent
devices can also be master devices and can issue requests into the system.
Signed-off-by: Liu Gang <Gang.Liu@freescale.com> Acked-by: Alexandre Bounine <alexandre.bounine@idt.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After merging the akpm tree, today's linux-next build (lots of them)
produced this warning:
WARNING: init/mounts.o(.text+0x192): Section mismatch in reference from the function devt_from_partuuid() to the variable .init.data:root_wait
The function devt_from_partuuid() references
the variable __initdata root_wait.
This is often because devt_from_partuuid lacks a __initdata
annotation or the annotation of root_wait is wrong.
Commit 185237e4cfab ("This patch makes two changes:"
init-add-root=partuuid=uuid-partnroff=%d-support-update.patch) adds the
reference to root_wait from the non-init function devt_from_partuuid().
The easiest thing to do is to remove __initdata from root_wait.
I have applied this patch as a merge fixup for today:
From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Wed, 10 Aug 2011 12:09:35 +1000
Subject: [PATCH] do_mounts: remove __initdata from root_wait
as it is now used from a non-init routine.
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Will Drewry <wad@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Will Drewry [Wed, 5 Oct 2011 00:43:54 +0000 (11:43 +1100)]
init: add root=PARTUUID=UUID/PARTNROFF=%d support
Expand root=PARTUUID=UUID syntax to support selecting a root partition by
integer offset from a known, unique partition. This approach provides
similar properties to specifying a device and partition number, but using
the UUID as the unique path prior to evaluating the offset.
For example,
root=PARTUUID=99DE9194-FC15-4223-9192-FC243948F88B/PARTNROFF=1
selects the partition with UUID 99DE.. then selects the next
partition.
This change is motivated by a particular usecase in Chromium OS where the
bootloader can easily determine what partition it is on (by UUID) but
doesn't perform general partition table walking.
That said, support for this model provides a direct mechanism for the user
to modify the root partition to boot without specifically needing to
extract each UUID or update the bootloader explicitly when the root
partition UUID is changed (if it is recreated to be larger, for instance).
Pinning to a /boot-style partition UUID allows arbitrary root
partition reconfiguration/modification with slightly less ambiguity than
just [dev][partition] and less stringency than the specific root partition
UUID.
Signed-off-by: Will Drewry <wad@chromium.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vasiliy Kulikov [Wed, 5 Oct 2011 00:43:53 +0000 (11:43 +1100)]
proc: force dcache drop on unauthorized access
The patch "proc: fix races against execve() of /proc/PID/fd**" is still a
partial fix for a setxid problem. link(2) is a yet another way to
identify whether a specific fd is opened by a privileged process. By
calling link(2) against /proc/PID/fd/* an attacker may identify whether
the fd number is valid for PID by analysing link(2) return code.
Both getattr() and link() can be used by the attacker iff the dentry is
present in the dcache. In this case ->lookup() is not called and the only
way to check ptrace permissions is either operation handler or
->revalidate(). The easiest solution to prevent any unauthorized access
to /proc/PID/fd*/ files is to force the dentry drop on each unauthorized
access attempt.
If an attacker keeps opened fd of /proc/PID/fd/ and dcache contains a
specific dentry for some /proc/PID/fd/XXX, any future attempt to use the
dentry by the attacker would lead to the dentry drop as a result of a
failed ptrace check in ->revalidate(). Then the attacker cannot spawn a
dentry for the specific fd number because of ptrace check in ->lookup().
The dentry drop can be still observed by an attacker by analysing
information from /proc/slabinfo, which is addressed in the successive
patch.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vasiliy Kulikov [Wed, 5 Oct 2011 00:43:52 +0000 (11:43 +1100)]
proc: fix races against execve() of /proc/PID/fd**
fd* files are restricted to the task's owner, and other users may not get
direct access to them. But one may open any of these files and run any
setuid program, keeping opened file descriptors. As there are permission
checks on open(), but not on readdir() and read(), operations on the kept
file descriptors will not be checked. This makes it possible to violate
the procfs permission model.
Reading fdinfo/* may disclose the current fds' positions and flags, and
reading the directory contents of fdinfo/ and fd/ may disclose the number
of files opened by the target task. This information is not sensitive per
se, but it can reveal some private information (like the length of a
password stored in a file) under certain conditions.
Use the existing (un)lock_trace functions to check for
ptrace_may_access(), but instead of using the EPERM return code from it,
use EACCES to be consistent with the existing
proc_pid_follow_link()/proc_pid_readlink() return code. If they differ,
an attacker can guess what fds exist by analyzing stat() return codes.
Patched handlers: stat() for fd/*, stat() and read() for fdinfo/*,
readdir() and lookup() for fd/ and fdinfo/.
David Rientjes [Wed, 5 Oct 2011 00:43:52 +0000 (11:43 +1100)]
cpusets: avoid looping when storing to mems_allowed if one node remains set
{get,put}_mems_allowed() exist so that general kernel code may locklessly
access a task's set of allowable nodes without having the chance that a
concurrent write will cause the nodemask to be empty on configurations
where MAX_NUMNODES > BITS_PER_LONG.
This could incur a significant delay, however, especially in low memory
conditions because the page allocator is blocking and reclaim requires
get_mems_allowed() itself. It is not atypical to see writes to
cpuset.mems take over 2 seconds to complete, for example. In low memory
conditions, this is problematic because it's one of the most important
times to change cpuset.mems in the first place!
The only way a task's set of allowable nodes may change is through
cpusets, by writing to cpuset.mems or by attaching the task to a different
cpuset. The nodemask is stored by first setting all the new nodes, then
waiting until generic code is not reading the nodemask with
get_mems_allowed() at the same time, and finally clearing all the old
nodes. This prevents the possibility that a reader will see an empty
nodemask at the same time the writer is storing a new nodemask.
If at least one node remains unchanged, though, it's possible to simply
set all new nodes and then clear all the old nodes. Changing a task's
nodemask is protected by cgroup_mutex so it's guaranteed that two threads
are not changing the same task's nodemask at the same time, so the
nodemask is guaranteed to be stored before another thread changes it and
determines whether a node remains set or not.
Signed-off-by: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Paul Menage <paul@paulmenage.org> Signed-off-by: Andrew Morton <akpm@google.com>
Steven Rostedt [Wed, 5 Oct 2011 00:43:51 +0000 (11:43 +1100)]
memcg: Fix race condition in memcg_check_events() with this_cpu usage
Various code in memcontrol.c calls this_cpu_read() on the calculations
to be done from two different percpu variables, or does an open-coded
read-modify-write on a single percpu variable.
Disable preemption throughout these operations so that the writes go to
the correct places.
[ Added this_cpu to __this_cpu conversion by Johannes ]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Greg Thelen <gthelen@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 5 Oct 2011 00:43:51 +0000 (11:43 +1100)]
memcg: close race between charge and putback
There is a potential race between a thread charging a page and another
thread putting it back to the LRU list:
charge:                          putback:
SetPageCgroupUsed                SetPageLRU
PageLRU && add to memcg LRU      PageCgroupUsed && add to memcg LRU
The order of setting one flag and checking the other is crucial, otherwise
the charge may observe !PageLRU while the putback observes !PageCgroupUsed
and the page is not linked to the memcg LRU at all.
Global memory pressure may fix this by trying to isolate and putback the
page for reclaim, where that putback would link it to the memcg LRU again.
Without that, the memory cgroup is undeletable due to a charge whose
physical page can not be found and moved out.
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Cc: Ying Han <yinghan@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Wed, 5 Oct 2011 00:43:50 +0000 (11:43 +1100)]
memcg: skip scanning active lists based on individual size
Reclaim decides to skip scanning an active list when the corresponding
inactive list is above a certain size relative to it, so as to leave the
assumed working set alone while there are still enough reclaim candidates
around.
The memcg implementation of comparing those lists instead reports whether
the whole memcg is low on the requested type of inactive pages,
considering all nodes and zones.
This can lead to an oversized active list not being scanned because of the
state of the other lists in the memcg, as well as an active list being
scanned while its corresponding inactive list has enough pages.
Not only is this wrong, it's also a scalability hazard, because the global
memory state over all nodes and zones has to be gathered for each memcg
and zone scanned.
Make these calculations purely based on the size of the two LRU lists
that are actually affected by the outcome of the decision.
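In essence, the decision reduces to a per-list-pair comparison along
these lines (a sketch; inactive_ratio stands in for the zone-size-derived
factor the kernel computes):

	/* scan the active list only when the inactive list is small
	 * relative to it */
	int inactive_list_is_low(unsigned long active, unsigned long inactive,
				 unsigned int inactive_ratio)
	{
		return inactive * inactive_ratio < active;
	}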
Signed-off-by: Johannes Weiner <jweiner@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <bsingharora@gmail.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Igor Mammedov [Wed, 5 Oct 2011 00:43:50 +0000 (11:43 +1100)]
memcg: do not expose uninitialized mem_cgroup_per_node to world
If somebody is touching data too early, it might be easier to diagnose a
problem when dereferencing NULL at mem->info.nodeinfo[node] than trying to
understand why mem_cgroup_per_zone is [un|partly]initialized.
Signed-off-by: Igor Mammedov <imammedo@redhat.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
While back-porting Johannes Weiner's patch "mm: memcg-aware global
reclaim" for an internal effort, we noticed a significant performance
regression during page-reclaim heavy workloads due to high contention of
the ss->id_lock. This lock protects idr map, and serializes calls to
idr_get_next() in css_get_next() (which is used during the memcg hierarchy
walk). Since idr_get_next() is just doing a look up, we need only
serialize it with respect to idr_remove()/idr_get_new(). By making the
ss->id_lock a rwlock, contention is greatly reduced and performance
improves.
Tested: cat a 256m file from a ramdisk in a 128m container 50 times on
each core (one file + container per core) in parallel on a NUMA machine.
Result is the time for the test to complete in 1 of the containers. Both
kernels included Johannes' memcg-aware global reclaim patches.
Before rwlock patch: 1710.778s
After rwlock patch: 152.227s
Signed-off-by: Andrew Bresticker <abrestic@google.com> Cc: Paul Menage <menage@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Raghavendra K T [Wed, 5 Oct 2011 00:43:49 +0000 (11:43 +1100)]
memcg: rename mem variable to memcg
The memcg code sometimes uses "struct mem_cgroup *mem" and sometimes uses
"struct mem_cgroup *memcg". Rename all mem variables to memcg in source
file.
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a new subsystem to limit the number of running tasks, similar to the
RLIMIT_NPROC rlimit but in the scope of a cgroup.
The user can set an upper bound limit that is checked every time a task
forks in a cgroup or is moved into a cgroup with that subsystem bound.
The primary goal is to protect against forkbombs that explode inside a
container. The traditional RLIMIT_NPROC rlimit is not efficient in that
case because if we run containers in parallel under the same user, one of
them could starve all the others by spawning a high number of tasks close
to the user-wide limit.
This is a prevention against forkbombs, so it's not meant to cure the
effects of a forkbomb when the system is already in a state where it's not
responsive. It's aimed at preventing the system from ever reaching that
state and at stopping the spread of tasks early. While defining the limit
on the allowed number of tasks, it's up to the user to find the right
balance between the resources its containers may need and what it can
afford to provide.
As it's totally dissociated from RLIMIT_NPROC, both can be
complementary: the cgroup task counter can set an upper bound per
container and the rlimit can be an upper bound on the overall set of
containers.
This subsystem can also be used to kill all the tasks in a cgroup without
racing against concurrent forks: by setting the limit of tasks to 0, any
further forks are rejected. This is a good way to kill a forkbomb in a
container, or simply to kill any container without the need to retry an
unbound number of times.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@google.com>
Let the subsystems' fork callbacks return an error value so that they can
cancel a fork. This is going to be used by the task counter subsystem to
implement the limit.
Suggested-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@google.com>
cgroups: pull up res counter charge failure interpretation to caller
res_counter_charge() always returns -ENOMEM when the limit is reached and
the charge thus can't happen.
However it's up to the caller to interpret this failure and return the
appropriate error value. The task counter subsystem will need to report
to the user that a fork() has been cancelled because some limit was
reached, not because we are too short on memory.
Fix this by returning -1 when res_counter_charge() fails.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@google.com>
cgroups: ability to stop res charge propagation on bounded ancestor
Moving a task from one cgroup to another may require subtracting its
resource charge from the old cgroup and adding it to the new one.
For this to happen, the uncharge/charge propagation can just stop when we
reach the common ancestor of the two cgroups. For performance reasons, we
also want to avoid temporarily overloading the common ancestors with a
non-accurate resource counter usage if we charge the new cgroup first and
uncharge the old one afterwards. This is going to be a requirement for
the coming max-number-of-tasks subsystem.
To solve this, provide a pair of new APIs that can charge/uncharge a
resource counter until we reach a given ancestor.
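A self-contained model of the bounded walk (the kernel version takes each
counter's lock and handles charge failures; the function name follows the
changelog's description, not necessarily the final API):

	struct res_counter {
		unsigned long usage;
		struct res_counter *parent;
	};

	/* uncharge every counter from @counter up to, but not including,
	 * @top (NULL means walk all the way to the root) */
	void res_counter_uncharge_until(struct res_counter *counter,
					struct res_counter *top,
					unsigned long val)
	{
		struct res_counter *c;

		for (c = counter; c != top; c = c->parent)
			c->usage -= val;
	}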
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@google.com>
cgroups: new cancel_attach_task() subsystem callback
To cancel a process attachment on a subsystem, we only call the
cancel_attach() callback once on the leader but we have no way to cancel
the attachment individually for each member of the process group.
This is going to be needed for the max number of tasks subsystem that is
coming.
To prepare for this integration, call a new cancel_attach_task() callback
on each task of the group until we reach the member that failed to attach.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@google.com>
Provide an API to inherit a counter value from a parent. This can be
useful to implement cgroup.clone_children on a resource counter.
Still, the resources of the children are limited by those of the parent,
so this only provides a default setting behaviour when clone_children is
set.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@google.com>
Extend the resource counter API with a mirror of res_counter_read_u64() to
make it handy to update a resource counter value from a cgroup subsystem
u64 value file.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Aditya Kali <adityakali@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tim Hockin <thockin@hockin.org> Cc: Tejun Heo <htejun@gmail.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@google.com>
Steven Rostedt [Wed, 5 Oct 2011 00:43:45 +0000 (11:43 +1100)]
cgroup/kmemleak: Annotate alloc_page() for cgroup allocations
When the cgroup base was allocated with kmalloc, it was necessary to
annotate the variable with kmemleak_not_leak(). But it has recently been
changed to be allocated with alloc_page() (which skips kmemleak checks),
and the now-stale annotation causes a warning on boot up.
I was triggering this output:
allocated 8388608 bytes of page_cgroup
please try 'cgroup_disable=memory' option if you don't want memory cgroups
kmemleak: Trying to color unknown object at 0xf5840000 as Grey
Pid: 0, comm: swapper Not tainted 3.0.0-test #12
Call Trace:
[<c17e34e6>] ? printk+0x1d/0x1f^M
[<c10e2941>] paint_ptr+0x4f/0x78
[<c178ab57>] kmemleak_not_leak+0x58/0x7d
[<c108ae9f>] ? __rcu_read_unlock+0x9/0x7d
[<c1cdb462>] kmemleak_init+0x19d/0x1e9
[<c1cbf771>] start_kernel+0x346/0x3ec
[<c1cbf1b4>] ? loglevel+0x18/0x18
[<c1cbf0aa>] i386_start_kernel+0xaa/0xb0
After a bit of debugging I tracked the object 0xf5840000 (and others) down
to the cgroup code. The change from allocating base with kmalloc to
alloc_page() has the base not calling kmemleak_alloc() which adds the
pointer to the object_tree_root, but kmemleak_not_leak() adds it to the
crt_early_log[] table. On kmemleak_init(), the entry is found in the
early_log[] but not the object_tree_root, and this error message is
displayed.
If alloc_page() fails then it defaults back to vmalloc(), which still
uses kmemleak_alloc() and thus still needs the kmemleak_not_leak() call.
The solution is to call kmemleak_alloc() directly when alloc_page()
succeeds.
Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Andrew Morton <akpm@google.com>
Ben Blum [Wed, 5 Oct 2011 00:43:44 +0000 (11:43 +1100)]
cgroups: don't attach task to subsystem if migration failed
If a task has exited to the point it has called cgroup_exit() already,
then we can't migrate it to another cgroup anymore.
This can happen when we are attaching a task to a new cgroup between the
call to ->can_attach_task() on subsystems and the migration that is
eventually tried in cgroup_task_migrate().
In this case cgroup_task_migrate() returns -ESRCH and we don't want to
attach the task to the subsystems because the attachment to the new cgroup
itself failed.
Fix this by only calling ->attach_task() on the subsystems if the cgroup
migration succeeded.
Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ben Blum [Wed, 5 Oct 2011 00:43:44 +0000 (11:43 +1100)]
cgroups: more safe tasklist locking in cgroup_attach_proc
Fix unstable tasklist locking in cgroup_attach_proc.
According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is
not sufficient to guarantee the tasklist is stable w.r.t. de_thread and
exit. Taking tasklist_lock for reading, instead of rcu_read_lock, ensures
proper exclusion.
Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Anders [Wed, 5 Oct 2011 00:43:43 +0000 (11:43 +1100)]
rtc: add initial support for mcp7941x parts
Add initial support for the Microchip MCP7941x series of real-time clocks.
The MCP7941x series is generally compatible with the DS1307 and DS1337 RTC
devices from Dallas Semiconductor. Minor differences include a backup
battery enable bit and the polarity of the oscillator enable bit.
Signed-off-by: David Anders <danders.dev@gmail.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Reviewed-by: Wolfram Sang <w.sang@pengutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mike Waychison [Wed, 5 Oct 2011 00:43:42 +0000 (11:43 +1100)]
oprofilefs: handle zero-length writes
Currently in oprofilefs, files that use ulong_fops mishandle writes of
zero length. A count of 0 causes oprofilefs_ulong_from_user to return 0
(success), which then leads to oprofile_set_ulong being called to stuff
"value" into file->private_data without it being initialized.
Fix this by moving the check for a zero-length write up into
ulong_write_file.
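A sketch of the reordered checks, assuming signatures along the lines of
the 3.x oprofilefs code:

	static ssize_t ulong_write_file(struct file *file,
					char const __user *buf,
					size_t count, loff_t *offset)
	{
		unsigned long value;
		int retval;

		if (*offset)
			return -EINVAL;
		if (!count)
			return 0;	/* nothing to parse; "value" stays untouched */

		retval = oprofilefs_ulong_from_user(&value, buf, count);
		if (retval)
			return retval;

		retval = oprofile_set_ulong(file->private_data, value);
		if (retval)
			return retval;
		return count;
	}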
Signed-off-by: Mike Waychison <mikew@google.com> Cc: Robert Richter <robert.richter@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Neil Armstrong [Wed, 5 Oct 2011 00:43:42 +0000 (11:43 +1100)]
init/do_mounts_rd.c: fix ramdisk identification for padded cramfs
When a cramfs ramdisk padded with 512 bytes is given to the kernel, the
current identify_ramdisk_image function fails to identify it.
Tested with a padded cramfs image on an ARM based board.
Signed-off-by: Neil Armstrong <narmstrong@neotion.com> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Davidlohr Bueso <dave@gnu.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiri Kosina [Wed, 5 Oct 2011 00:43:41 +0000 (11:43 +1100)]
binfmt_elf: fix PIE execution with randomization disabled
The case of address space randomization being disabled in runtime through
randomize_va_space sysctl is not treated properly in load_elf_binary(),
resulting in SIGKILL coming at exec() time for certain PIE-linked binaries
in case the randomization has been disabled at runtime prior to calling
exec().
Handle the randomize_va_space == 0 case the same way as if we were not
supporting .text randomization at all.
Based on original patch by H.J. Lu and Josh Boyer.
Signed-off-by: Jiri Kosina <jkosina@suse.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: Russell King <rmk@arm.linux.org.uk> Cc: H.J. Lu <hongjiu.lu@intel.com> Cc: <stable@kernel.org> Tested-by: Josh Boyer <jwboyer@redhat.com> Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org> Signed-off-by: Andrew Morton <akpm@google.com>
Jason Baron [Wed, 5 Oct 2011 00:43:41 +0000 (11:43 +1100)]
epoll: limit paths
The current epoll code can be tickled to run basically indefinitely in
both loop detection path check (on ep_insert()), and in the wakeup paths.
The programs that tickle this behavior set up deeply linked networks of
epoll file descriptors that cause the epoll algorithms to traverse them
indefinitely. A couple of these sample programs have been previously
posted in this thread: https://lkml.org/lkml/2011/2/25/297.
To fix the loop detection path check algorithms, I simply keep track of
the epoll nodes that have already been visited. Thus, the loop detection
becomes proportional to the number of epoll file descriptors and links.
This dramatically decreases the run-time of the loop check algorithm. In
one diabolical case I tried, it reduced the run-time from 15 minutes (all
in kernel time) to 0.3 seconds.
Fixing the wakeup paths could be done at wakeup time in a similar manner
by keeping track of nodes that have already been visited, but the
complexity is harder, since there can be multiple wakeups on different
cpus...Thus, I've opted to limit the number of possible wakeup paths when
the paths are created.
This is accomplished by noting that the end file descriptors found during
the loop detection pass (from the newly added link) are actually the
sources for wakeup events. I keep a list of these file descriptors and
limit the number and length of the paths that emanate from these 'source
file descriptors'. In the current implementation I allow 1000 paths of
length 1, 500 of length 2, 100 of length 3, 50 of length 4 and 10 of
length 5. Note that it is sufficient to check the
'source file descriptors' reachable from the newly added link, since no
other 'source file descriptors' will have newly added links. This allows
us to check only the wakeup paths that may have gotten too long, and not
re-check all possible wakeup paths on the system.
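The per-depth budgets, as they might be encoded (array name illustrative):

	/* allowed wakeup paths per path length 1..5 */
	static const int path_limits[5] = { 1000, 500, 100, 50, 10 };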
In terms of the path limit selection, I think it's first worth noting
that the most common case for epoll is probably the model where you have 1
epoll file descriptor that is monitoring n 'source file descriptors'. In
this case, each 'source file descriptor' has 1 path of length 1. Thus, I
believe that the limits I'm proposing are quite reasonable and in fact may
be too generous. I'm hoping that the proposed limits will not cause any
workloads that currently work to fail.
In terms of locking, I have extended the use of the 'epmutex' to all
epoll_ctl add and remove operations. Currently it's only used in a subset
of the add paths. I need to hold the epmutex so that we can correctly
traverse a coherent graph, to check the number of paths. I believe that
this additional locking is probably ok, since its in the setup/teardown
paths, and doesn't affect the running paths, but it certainly is going to
add some extra overhead. Also worth noting is that the epmutex was
recently added to the epoll_ctl add operations in the initial path loop
detection code, using the argument that it was not on a critical path.
Another thing to note here, is the length of epoll chains that is allowed.
Currently, eventpoll.c defines:
/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4
This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
+ 1). However, this limit is currently only enforced during the loop
check detection code, and only when the epoll file descriptors are added
in a certain order. Thus, this limit is currently easily bypassed. The
newly added check for wakeup paths strictly limits the wakeup paths to a
length of 5, regardless of the order in which ep's are linked together.
Thus, a side-effect of the new code is a more consistent enforcement of
the graph depth.
Thus far, I've tested this using the sample programs previously
mentioned, which now either return quickly or return -EINVAL. I've also
tested using the piptest.c epoll tester, which showed no difference in
performance. I've also created a number of different epoll networks and
tested that they behave as expected.
I believe this solves the original diabolical test cases, while still
preserving the sane epoll nesting.
Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Nelson Elhage <nelhage@ksplice.com> Cc: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nelson Elhage [Wed, 5 Oct 2011 00:43:41 +0000 (11:43 +1100)]
epoll: fix spurious lockdep warnings
epoll can recursively acquire ep->mtx on multiple "struct eventpoll"s at
once in the case where one epoll fd is monitoring another epoll fd. This
is perfectly OK, since we're careful about the lock ordering, but it
causes spurious lockdep warnings. Annotate the recursion using
mutex_lock_nested, and add a comment explaining the nesting rules for good
measure.
Recent versions of systemd are triggering this, and it can also be
demonstrated with the following trivial test program:
--------------------8<--------------------
int main(void) {
int e1, e2;
struct epoll_event evt = {
.events = EPOLLIN
};
Reported-by: Paul Bolle <pebolle@tiscali.nl> Tested-by: Paul Bolle <pebolle@tiscali.nl> Signed-off-by: Nelson Elhage <nelhage@nelhage.com> Acked-by: Jason Baron <jbaron@redhat.com> Cc: Dave Jones <davej@redhat.com> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrew Morton [Wed, 5 Oct 2011 00:43:40 +0000 (11:43 +1100)]
lib-crc-add-slice-by-8-algorithm-to-crc32c-fix
don't include asm/msr.h
Cc: Bob Pearson <rpearson@systemfabricworks.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Roland Dreier <roland@kernel.org> Cc: frank zago <fzago@systemfabricworks.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
frank zago [Wed, 5 Oct 2011 00:43:40 +0000 (11:43 +1100)]
lib/crc: add slice by 8 algorithm to crc32.c
Add support for slice by 8 to existing crc32 algorithm. Also modify
gen_crc32table.c to only produce table entries that are actually used.
The parameters CRC_LE_BITS and CRC_BE_BITS determine the number of bits in
the input array that are processed during each step. Generally the more
bits the faster the algorithm is but the more table data required.
Using an x86_64 Opteron machine running at 2100MHz the following table was
collected with a pre-warmed cache by computing the crc 1000 times on a
buffer of 4096 bytes.
BITS is the value of CRC_LE_BITS or CRC_BE_BITS. The old
default was 8 which actually selected the 32 bit algorithm. In
this version the value 8 is used to select the standard
8 bit algorithm and two new values: 32 and 64 are introduced
to select the slice by 4 and slice by 8 algorithms respectively.
Where Size is the size of crc32.o's text segment which includes
code and table data when both LE and BE versions are set to BITS.
The current version of crc32.c by default uses the slice by 4 algorithm
which requires about 2.8 cycles per byte. The slice by 8 algorithm is
roughly 2X faster and enables packet processing at over 1GB/sec on a
typical 2-3GHz system.
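For reference, a self-contained demonstration of the slice by 8 technique
(standard CRC-32 table construction and check value; the kernel's
generated tables and endian handling differ in detail, and a little-endian
host is assumed):

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	#define POLY 0xedb88320u	/* reflected CRC-32 polynomial */

	static uint32_t t[8][256];

	static void init_tables(void)
	{
		int i, k;

		for (i = 0; i < 256; i++) {
			uint32_t c = i;
			for (k = 0; k < 8; k++)
				c = (c >> 1) ^ ((c & 1) ? POLY : 0);
			t[0][i] = c;
		}
		for (k = 1; k < 8; k++)
			for (i = 0; i < 256; i++)
				t[k][i] = (t[k - 1][i] >> 8) ^
					  t[0][t[k - 1][i] & 0xff];
	}

	static uint32_t crc32_slice8(uint32_t crc, const uint8_t *p, size_t len)
	{
		crc = ~crc;
		while (len >= 8) {	/* 8 bytes, 8 lookups per iteration */
			uint32_t lo, hi;

			memcpy(&lo, p, 4);
			memcpy(&hi, p + 4, 4);
			lo ^= crc;
			crc = t[7][lo & 0xff] ^ t[6][(lo >> 8) & 0xff] ^
			      t[5][(lo >> 16) & 0xff] ^ t[4][lo >> 24] ^
			      t[3][hi & 0xff] ^ t[2][(hi >> 8) & 0xff] ^
			      t[1][(hi >> 16) & 0xff] ^ t[0][hi >> 24];
			p += 8;
			len -= 8;
		}
		while (len--)	/* byte-at-a-time tail */
			crc = (crc >> 8) ^ t[0][(crc ^ *p++) & 0xff];
		return ~crc;
	}

	int main(void)
	{
		init_tables();
		/* CRC-32 of "123456789" is the classic check value 0xcbf43926 */
		printf("%08x\n",
		       crc32_slice8(0, (const uint8_t *)"123456789", 9));
		return 0;
	}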
Signed-off-by: Bob Pearson <rpearson@systemfabricworks.com> Cc: Roland Dreier <roland@kernel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Glauber Costa [Wed, 5 Oct 2011 00:43:36 +0000 (11:43 +1100)]
lib/percpu_counter.c: enclose hotplug only variables in hotplug ifdef
These variables are only used when CONFIG_HOTPLUG_CPU is enabled; they are
ifdef'ed everywhere else. So don't define them when CONFIG_HOTPLUG_CPU is
not enabled.
Signed-off-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
lib/bitmap.c: quiet sparse noise about address space
__bitmap_parse() and __bitmap_parselist() both take a pointer to a kernel
buffer as a parameter and then cast it to a pointer to user buffer for use
in cases when the parameter is_user indicates that the buffer is actually
located in user space. This casting, and the casts in the callers,
results in sparse noise like the following:
warning: incorrect type in initializer (different address spaces)
expected char const [noderef] <asn:1>*ubuf
got char const *buf
warning: cast removes address space of expression
Since these casts are intentional, use __force to quiet the noise.
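The pattern in question is a one-line cast, roughly (variable names taken
from the warning above):

	const char __user *ubuf = (const char __force __user *)buf;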
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Len Brown <len.brown@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Akinobu Mita [Wed, 5 Oct 2011 00:43:35 +0000 (11:43 +1100)]
lib/spinlock_debug.c: print owner on spinlock lockup
When SPIN_BUG_ON is triggered, the lock owner information is reported.
But it is omitted when spinlock lockup is detected.
This information is useful especially on the architectures which don't
implement trigger_all_cpu_backtrace() that is called just after detecting
lockup. So report it and also avoid message format duplication.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Alexey Dobriyan [Wed, 5 Oct 2011 00:43:35 +0000 (11:43 +1100)]
lib/kstrtox: common code between kstrto*() and simple_strto*() functions
Currently the termination logic (\0 or \n\0) is hardcoded in
_kstrtoull(); avoid that, for code reuse between kstrto*() and
simple_strtoull(). Essentially, make them differ only in termination
logic.
simple_strtoull() (and scanf(), BTW) ignores integer overflow; that's a
bug we currently don't have the guts to fix, making the KSTRTOX_OVERFLOW
hack necessary.
Almost forgot: the patch shrinks code size by ~80 bytes on x86_64.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
My GPIOs are on an I2C port expander, so we must use the *_cansleep()
variant of the GPIO functions. This was not being done in
create_gpio_led().
We can change gpio_get_value() to gpio_get_value_cansleep() because it is
only called from the platform_driver probe function, which is a context
where we can sleep.
Only tested on my gpio_cansleep() system, but it seems safe for all
systems.
Signed-off-by: David Daney <david.daney@cavium.com> Cc: Richard Purdie <rpurdie@rpsys.net> Acked-by: Trent Piepho <tpiepho@gmail.com> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Magnus Damm [Wed, 5 Oct 2011 00:43:34 +0000 (11:43 +1100)]
drivers/leds/leds-renesas-tpu.c: move Renesas TPU LED driver platform data
Use the platform_data include directory for the TPU LED driver, as
suggested by Paul Mundt.
Signed-off-by: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Magnus Damm [Wed, 5 Oct 2011 00:43:34 +0000 (11:43 +1100)]
drivers/leds/leds-renesas-tpu.c: update driver to use workqueue
Use a workqueue in the Renesas TPU LED driver to allow the Runtime PM code
to sleep.
Signed-off-by: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wolfram Sang [Wed, 5 Oct 2011 00:43:33 +0000 (11:43 +1100)]
drivers/leds/leds-lm3530.c: remove obsolete cleanup for clientdata
A few new i2c-drivers came into the kernel which clear the
clientdata-pointer on exit or error. This is obsolete by now; the core
will do it.
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de> Cc: Richard Purdie <rpurdie@rpsys.net> Acked-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Axel Lin [Wed, 5 Oct 2011 00:43:33 +0000 (11:43 +1100)]
leds-renesas-tpu-led-driver-v2-fix
include linux/module.h
Signed-off-by: Axel Lin <axel.lin@gmail.com> Cc: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Magnus Damm [Wed, 5 Oct 2011 00:43:32 +0000 (11:43 +1100)]
leds: Renesas TPU LED driver
Add V2 of the LED driver for a single timer channel for the TPU hardware
block commonly found in Renesas SoCs.
The driver has been written with optimal power management in mind: to
save power, the LED is driven as a regular GPIO pin in the
maximum-brightness and power-off cases, which allows the TPU hardware to
be idle and which in turn allows the clocks to be stopped and the power
domain to be turned off transparently.
Any other brightness level requires use of the TPU hardware in PWM mode.
TPU hardware device clocks and power are managed through Runtime PM.
System suspend and resume is known to be working - during suspend the LED
is set to off by the generic LED code.
The TPU hardware timer is equipped with a 16-bit counter together with an
up-to-divide-by-64 prescaler which makes the hardware suitable for
brightness control. Hardware blink is unsupported.
The LED PWM waveform has been verified with a Fluke 123 Scope meter on a
sh7372 Mackerel board. Tested with experimental sh7372 A3SP power domain
patches. Platform device bind/unbind tested ok.
V2 has been tested on the DS2 LED of the sh73a0-based AG5EVM.
Signed-off-by: Magnus Damm <damm@opensource.se> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Grant Likely <grant.likely@secretlab.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Mark Brown [Wed, 5 Oct 2011 00:43:31 +0000 (11:43 +1100)]
backlight: fix broken regulator API usage in l4f00242t03
The regulator support in the l4f00242t03 is very non-idiomatic. Rather
than requesting the regulators based on the device name and the supply
names used by the device the driver requires boards to pass system
specific supply names around through platform data. The driver also
conditionally requests the regulators based on this platform data, adding
unneeded conditional code to the driver.
Fix this by removing the platform data and converting to the standard
idiom, also updating all in-tree users of the driver. As no datasheet
appears to be available for the LCD I'm guessing the names for the
supplies based on the existing users and I've no ability to do anything
more than compile test.
The use of regulator_set_voltage() in the driver is also questionable:
since fixed voltages are required, the expectation would be that the
voltages are fixed by the constraints set by the machines rather than
manually configured by the driver. This is less problematic, though.
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Tested-by: Fabio Estevam <fabio.estevam@freescale.com> Cc: Richard Purdie <rpurdie@rpsys.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wolfram Sang [Wed, 5 Oct 2011 00:43:31 +0000 (11:43 +1100)]
video/backlight: remove obsolete cleanup for clientdata
A few new i2c-drivers came into the kernel which clear the
clientdata-pointer on exit or error. This is obsolete by now; the core
will do it.
Signed-off-by: Wolfram Sang <w.sang@pengutronix.de> Cc: Richard Purdie <rpurdie@rpsys.net> Cc: Paul Mundt <lethal@linux-sh.org> Acked-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>